    _              _         ____
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 

SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model

In this work, we propose SceneMaker, a decoupled 3D scene generation framework. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion in open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation and enhance it by leveraging image datasets and collected de-occlusion datasets covering much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.

Updated: 2025-12-11 18:59:56

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2512.10957v1

Bidirectional Normalizing Flow: From Data to Noise and Back

Normalizing Flows (NFs) have been established as a principled framework for generative modeling. Standard NFs consist of a forward process and a reverse process: the forward process maps data to noise, while the reverse process generates samples by inverting it. Typical NF forward transformations are constrained by explicit invertibility, ensuring that the reverse process can serve as their exact analytic inverse. Recent developments in TARFlow and its variants have revitalized NF methods by combining Transformers and autoregressive flows, but have also exposed causal decoding as a major bottleneck. In this work, we introduce Bidirectional Normalizing Flow ($\textbf{BiFlow}$), a framework that removes the need for an exact analytic inverse. BiFlow learns a reverse model that approximates the underlying noise-to-data inverse mapping, enabling more flexible loss functions and architectures. Experiments on ImageNet demonstrate that BiFlow, compared to its causal decoding counterpart, improves generation quality while accelerating sampling by up to two orders of magnitude. BiFlow yields state-of-the-art results among NF-based methods and competitive performance among single-evaluation ("1-NFE") methods. Following recent encouraging progress on NFs, we hope our work will draw further attention to this classical paradigm.

Updated: 2025-12-11 18:59:55

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2512.10953v1

Hierarchical Dataset Selection for High-Quality Data Sharing

The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training are therefore critical decisions, yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.

Updated: 2025-12-11 18:59:55

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2512.10952v1

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Reinforcement learning (RL), previously proven effective in large language and multi-modal models, has recently been extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, which progresses from coarse shape generation to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.

Updated: 2025-12-11 18:59:52

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2512.10949v1

ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning

Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.

Updated: 2025-12-11 18:59:46

Categories: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2512.10946v1

AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while integrating seamlessly with the pretrained video generation model's positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen the binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT

Updated: 2025-12-11 18:59:34

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2512.10943v1

Mull-Tokens: Modality-Agnostic Latent Thinking

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.

Updated: 2025-12-11 18:59:08

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2512.10941v1

OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis

Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, and image-to-video, among others. These fragmented approaches are therefore trained on disjoint slices of the available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, and 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/

Updated: 2025-12-11 18:59:05

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2512.10940v1

Stronger Normalization-Free Transformers

Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
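
For intuition, a Derf layer can be written as a drop-in replacement for LayerNorm in a Transformer block. The sketch below is a minimal PyTorch rendering assuming per-channel learnable alpha and s plus a DyT-style affine output; these choices are our assumptions, not the authors' reference implementation.

    import torch
    import torch.nn as nn

    class Derf(nn.Module):
        """Point-wise Derf(x) = erf(alpha * x + s), sketched as a LayerNorm substitute."""
        def __init__(self, dim: int, alpha_init: float = 1.0, shift_init: float = 0.0):
            super().__init__()
            self.alpha = nn.Parameter(torch.full((dim,), alpha_init))  # scale inside erf
            self.shift = nn.Parameter(torch.full((dim,), shift_init))  # the paper's s
            self.weight = nn.Parameter(torch.ones(dim))                # affine scale
            self.bias = nn.Parameter(torch.zeros(dim))                 # affine shift

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # erf is bounded in (-1, 1), so extreme activations are constrained,
            # much like tanh in DyT, but with Gaussian-CDF-shaped saturation.
            return self.weight * torch.erf(self.alpha * x + self.shift) + self.bias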

Updated: 2025-12-11 18:58:49

Categories: cs.LG,cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2512.10938v1

On Decision-Making Agents and Higher-Order Causal Processes

We establish a precise correspondence between decision-making agents in partially observable Markov decision processes (POMDPs) and one-input process functions, the classical limit of higher-order quantum operations. In this identification an agent's policy and memory update combine into a process function w that interacts with a POMDP environment via the link product. This suggests a dual interpretation: in the physics view, the process function acts as the environment into which local operations (agent interventions) are inserted, whereas in the AI view it encodes the agent and the inserted functions represent environments. We extend this perspective to multi-agent systems by identifying observation-independent decentralized POMDPs as natural domains for multi-input process functions.

Updated: 2025-12-11 18:58:33

Categories: cs.AI,quant-ph

Download: http://arxiv.org/abs/2512.10937v1

Empirical evaluation of the Frank-Wolfe methods for constructing white-box adversarial attacks

The construction of adversarial attacks for neural networks is a crucial challenge for their deployment in various services. To estimate the adversarial robustness of a neural network, a fast and efficient approach to constructing adversarial attacks is needed. Since the formalization of adversarial attack construction involves solving a specific optimization problem, we consider the problem of constructing an efficient and effective adversarial attack from a numerical optimization perspective. Specifically, we suggest utilizing advanced projection-free methods, known as modified Frank-Wolfe methods, to construct white-box adversarial attacks on the given input data. We perform a theoretical and numerical evaluation of these methods and compare them with standard approaches based on projection operations or geometric intuition. Numerical experiments are performed on the MNIST and CIFAR-10 datasets, utilizing a multiclass logistic regression model, convolutional neural networks (CNNs), and the Vision Transformer (ViT).
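
To make the projection-free idea concrete, here is a minimal textbook Frank-Wolfe ascent of the loss over an L-infinity ball around the input; the step-size schedule and loop are the classical ones, not the paper's modified variants, and the epsilon budget is an illustrative choice.

    import torch

    def frank_wolfe_linf_attack(model, loss_fn, x, y, eps=8/255, steps=20):
        """Maximize loss over {x' : ||x' - x||_inf <= eps} without projections:
        each iterate is a convex combination of feasible points, so it stays feasible."""
        x_adv = x.clone()
        for t in range(steps):
            x_adv = x_adv.detach().requires_grad_(True)
            loss = loss_fn(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            # Linear maximization oracle over the L-inf ball: a corner of the box.
            s = x + eps * grad.sign()
            gamma = 2.0 / (t + 2.0)  # classical Frank-Wolfe step size
            x_adv = (1.0 - gamma) * x_adv.detach() + gamma * s
        return x_adv.detach()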

Updated: 2025-12-11 18:58:17

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2512.10936v1

Any4D: Unified Feed-Forward Metric 4D Reconstruction

We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.

Updated: 2025-12-11 18:57:39

Categories: cs.CV,cs.AI,cs.LG,cs.RO

Download: http://arxiv.org/abs/2512.10935v1

Curriculum-Based Reinforcement Learning for Autonomous UAV Navigation in Unknown Curved Tubular Conduit

Autonomous drone navigation in confined tubular environments remains a major challenge due to the constraining geometry of the conduits, the proximity of the walls, and the perceptual limitations inherent to such scenarios. We propose a reinforcement learning approach enabling a drone to navigate unknown three-dimensional tubes without any prior knowledge of their geometry, relying solely on local observations from LiDAR and a conditional visual detection of the tube center. In contrast, the Pure Pursuit algorithm, used as a deterministic baseline, benefits from explicit access to the centerline, creating an information asymmetry designed to assess the ability of RL to compensate for the absence of a geometric model. The agent is trained through a progressive Curriculum Learning strategy that gradually exposes it to increasingly curved geometries, where the tube center frequently disappears from the visual field. A turning-negotiation mechanism, based on the combination of direct visibility, directional memory, and LiDAR symmetry cues, proves essential for ensuring stable navigation under such partial observability conditions. Experiments show that the PPO policy acquires robust and generalizable behavior, consistently outperforming the deterministic controller despite its limited access to geometric information. Validation in a high-fidelity 3D environment further confirms the transferability of the learned behavior to continuous physical dynamics. The proposed approach thus provides a complete framework for autonomous navigation in unknown tubular environments and opens perspectives for industrial, underground, or medical applications where progressing through narrow conduits with weak perceptual cues represents a central challenge.

Updated: 2025-12-11 18:57:29

Categories: cs.RO,cs.LG

Download: http://arxiv.org/abs/2512.10934v1

BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

The developmental trajectories of young children set a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that substantially improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, the DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early childhood capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research into developmentally plausible pretraining of vision foundation models.

Updated: 2025-12-11 18:57:05

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2512.10932v1

Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities and safety, but it also makes models less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embedded assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with purely sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about a problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of rotary embeddings to enable LLMs built for sequential interactions to simultaneously think, listen, and generate outputs. We evaluate our approach on math, commonsense, and safety reasoning and find that it can generate accurate thinking-augmented answers in real time, reducing the time to the first non-thinking token from minutes to <= 5 s and overall real-time delays by 6-11x.
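
The rotary-embedding property being exploited is that attention scores under RoPE depend only on position differences, so listening, thinking, and output tokens can be laid out on separate position tracks without retraining. A generic rotate-half RoPE at arbitrary integer positions (our sketch of the mechanism, not the paper's code):

    import torch

    def rope(x, positions, base=10000.0):
        """Apply rotary embeddings to x of shape (..., seq, dim) at the given
        integer positions of shape (seq,); positions need not be contiguous,
        which is what lets asynchronous streams share one attention window."""
        half = x.shape[-1] // 2
        freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
        angles = positions[..., None].float() * freqs      # (..., seq, half)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)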

Updated: 2025-12-11 18:57:02

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2512.10931v1

Noisy Quantum Learning Theory

We develop a framework for learning from noisy quantum experiments, focusing on fault-tolerant devices accessing uncharacterized systems through noisy couplings. Our starting point is the complexity class $\textsf{NBQP}$ ("noisy BQP"), modeling noisy fault-tolerant quantum computers that cannot, in general, error-correct the oracle systems they query. Using this class, we show that for natural oracle problems, noise can eliminate exponential quantum learning advantages of ideal noiseless learners while preserving a superpolynomial gap between NISQ and fault-tolerant devices. Beyond oracle separations, we study concrete noisy learning tasks. For purity testing, the exponential two-copy advantage collapses under a single application of local depolarizing noise. Nevertheless, we identify a setting motivated by AdS/CFT in which noise-resilient structure restores a quantum learning advantage in a noisy regime. We then analyze noisy Pauli shadow tomography, deriving lower bounds that characterize how instance size, quantum memory, and noise control sample complexity, and design algorithms with parametrically similar scalings. Together, our results show that the Bell-basis and SWAP-test primitives underlying most exponential quantum learning advantages are fundamentally fragile to noise unless the experimental system has latent noise-robust structure. Thus, realizing meaningful quantum advantages in future experiments will require understanding how noise-robust physical properties interface with available algorithmic techniques.

Updated: 2025-12-11 18:56:32

Categories: quant-ph,cs.CC,cs.IT,cs.LG

Download: http://arxiv.org/abs/2512.10929v1

Decoupled Q-Chunking

Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but such a self-bootstrapping mechanism is prone to bootstrapping bias, where errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed to use chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: policies must output the entire action chunk open-loop, which can be sub-optimal in environments that require policy reactivity and is also challenging to model, especially as the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action-chunking policies for long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: github.com/ColinQiyangLi/dqc.
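
A sketch of the optimistic backup that distills a partial-chunk critic: extend each length-k prefix with sampled continuations and take the max of the full-chunk critic over them. The interfaces q_chunk and policy.sample_suffix are hypothetical stand-ins for illustration, not the released API.

    import torch

    def partial_chunk_targets(q_chunk, policy, s, a_prefix, h, n_samples=8):
        """Optimistic value target for a partial action chunk a_prefix of
        shape (batch, k, action_dim), approximating the best value achievable
        when the prefix is completed to a full chunk of length h."""
        k = a_prefix.shape[1]
        candidates = []
        for _ in range(n_samples):
            suffix = policy.sample_suffix(s, a_prefix, h - k)  # (batch, h-k, action_dim)
            candidates.append(q_chunk(s, torch.cat([a_prefix, suffix], dim=1)))
        # Max over sampled completions = optimistic backup from the chunked critic.
        return torch.stack(candidates).max(dim=0).values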

Updated: 2025-12-11 18:52:51

Categories: cs.LG,cs.AI,cs.RO,stat.ML

Download: http://arxiv.org/abs/2512.10926v1

LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding

Designing state encoders for reinforcement learning (RL) with multiple information sources -- such as sensor measurements, time-series signals, image observations, and textual instructions -- remains underexplored and often requires manual design. We formalize this challenge as a problem of composite neural architecture search (NAS), where multiple source-specific modules and a fusion module are jointly optimized. Existing NAS methods overlook useful side information from the intermediate outputs of these modules -- such as their representation quality -- limiting sample efficiency in multi-source RL settings. To address this, we propose an LLM-driven NAS pipeline in which the LLM serves as a neural architecture design agent, leveraging language-model priors and intermediate-output signals to guide sample-efficient search for high-performing composite state encoders. On a mixed-autonomy traffic control task, our approach discovers higher-performing architectures with fewer candidate evaluations than traditional NAS baselines and the LLM-based GENIUS framework.

Updated: 2025-12-11 18:52:44

Categories: cs.LG,eess.SY

Download: http://arxiv.org/abs/2512.06982v2

Digital Twin Supervised Reinforcement Learning Framework for Autonomous Underwater Navigation

Autonomous navigation in underwater environments remains a major challenge due to the absence of GPS, degraded visibility, and the presence of submerged obstacles. This article investigates these issues through the case of the BlueROV2, an open platform widely used for scientific experimentation. We propose a deep reinforcement learning approach based on the Proximal Policy Optimization (PPO) algorithm, using an observation space that combines target-oriented navigation information, a virtual occupancy grid, and ray-casting along the boundaries of the operational area. The learned policy is compared against a reference deterministic kinematic planner, the Dynamic Window Approach (DWA), commonly employed as a robust baseline for obstacle avoidance. The evaluation is conducted in a realistic simulation environment and complemented by validation on a physical BlueROV2 supervised by a 3D digital twin of the test site, helping to reduce risks associated with real-world experimentation. The results show that the PPO policy consistently outperforms DWA in highly cluttered environments, notably thanks to better local adaptation and reduced collisions. Finally, the experiments demonstrate the transferability of the learned behavior from simulation to the real world, confirming the relevance of deep RL for autonomous navigation in underwater robotics.

Updated: 2025-12-11 18:52:42

Categories: cs.LG,cs.RO

Download: http://arxiv.org/abs/2512.10925v1

PlanetServe: A Decentralized, Scalable, and Privacy-Preserving Overlay for Democratizing Large Language Model Serving

While significant progress has been made in research and development on open-source and cost-efficient large language models (LLMs), serving scalability remains a critical challenge, particularly for small organizations and individuals seeking to deploy and test their LLM innovations. Inspired by peer-to-peer networks that leverage decentralized overlay nodes to increase throughput and availability, we propose GenTorrent, an LLM serving overlay that harnesses computing resources from decentralized contributors. We identify four key research problems inherent to enabling such a decentralized infrastructure: 1) overlay network organization; 2) LLM communication privacy; 3) overlay forwarding for resource efficiency; and 4) verification of serving quality. This work presents the first systematic study of these fundamental problems in the context of decentralized LLM serving. Evaluation results from a prototype implemented on a set of decentralized nodes demonstrate that GenTorrent achieves a latency reduction of over 50% compared to the baseline design without overlay forwarding. Furthermore, the security features introduce minimal overhead to serving latency and throughput. We believe this work pioneers a new direction for democratizing and scaling future AI serving capabilities.

Updated: 2025-12-11 18:49:32

Categories: cs.DC,cs.AI

Download: http://arxiv.org/abs/2504.20101v4

SparseSwaps: Tractable LLM Pruning Mask Refinement at Scale

The resource requirements of Neural Networks can be significantly reduced through pruning -- the removal of seemingly less important parameters. However, with the rise of Large Language Models (LLMs), full retraining to recover pruning-induced performance degradation is often prohibitive and classical approaches such as global magnitude pruning are suboptimal on Transformer architectures. State-of-the-art methods hence solve a layer-wise mask selection problem, the problem of finding a pruning mask which minimizes the per-layer pruning error on a small set of calibration data. Exactly solving this problem to optimality using Integer Programming (IP) solvers is computationally infeasible due to its combinatorial nature and the size of the search space, and existing approaches therefore rely on approximations or heuristics. In this work, we demonstrate that the mask selection problem can be made drastically more tractable at LLM scale. To that end, we decouple the rows by enforcing equal sparsity levels per row. This allows us to derive optimal 1-swaps (exchanging one kept and one pruned weight) that can be computed efficiently using the Gram matrix of the calibration data. Using these observations, we propose a tractable and simple 1-swap algorithm that warm starts from any pruning mask, runs efficiently on GPUs at LLM scale, and is essentially hyperparameter-free. We demonstrate that our approach reduces per-layer pruning error by up to 60% over Wanda (Sun et al., 2023) and consistently improves perplexity and zero-shot accuracy across state-of-the-art GPT architectures.
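
To see why 1-swaps become cheap, note that the per-row calibration error is the quadratic form d^T H d in the pruning residual d = w - m * w, with H = X^T X the Gram matrix of the calibration inputs. The exact error change of any swap then needs only a few entries of H, as in this numpy sketch (our notation, not the released implementation):

    import numpy as np

    def row_swap_deltas(w, mask, X):
        """Exact change in ||X w - X (mask * w)||^2 for every 1-swap that
        restores a pruned weight i and prunes a kept weight j instead."""
        H = X.T @ X                    # Gram matrix of calibration inputs
        d = (1.0 - mask) * w           # current pruning residual
        Hd = H @ d
        kept = np.flatnonzero(mask == 1)
        pruned = np.flatnonzero(mask == 0)
        deltas = {}
        for i in pruned:               # candidate to restore
            for j in kept:             # candidate to prune instead
                cross = w[j] * Hd[j] - w[i] * Hd[i]
                quad = (w[j] ** 2) * H[j, j] + (w[i] ** 2) * H[i, i] \
                       - 2.0 * w[i] * w[j] * H[i, j]
                deltas[(i, j)] = 2.0 * cross + quad   # negative = improvement
        return deltas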

Updated: 2025-12-11 18:47:48

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2512.10922v1

If generative AI is the answer, what is the question?

Beginning with text and images, generative AI has expanded to audio, video, computer code, and molecules. Yet, if generative AI is the answer, what is the question? We explore the foundations of generation as a distinct machine learning task with connections to prediction, compression, and decision-making. We survey five major generative model families: autoregressive models, variational autoencoders, normalizing flows, generative adversarial networks, and diffusion models. We then introduce a probabilistic framework that emphasizes the distinction between density estimation and generation. We review a game-theoretic framework with a two-player adversary-learner setup to study generation. We discuss post-training modifications that prepare generative models for deployment. We end by highlighting some important topics in socially responsible generation such as privacy, detection of AI-generated content, and copyright and IP. We adopt a task-first framing of generation, focusing on what generation is as a machine learning problem, rather than only on how models implement it.

Updated: 2025-12-11 18:45:18

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2509.06120v2

Hermitian Yang--Mills connections on general vector bundles: geometry and physical Yukawa couplings

We compute solutions to the Hermitian Yang-Mills equations on holomorphic vector bundles $V$ via an alternating optimisation procedure founded on geometric machine learning. The proposed method is fully general with respect to the rank and structure group of $V$, requiring only the ability to enumerate a basis of global sections for a given bundle. This enables us to compute the physically normalised Yukawa couplings in a broad class of heterotic string compactifications. Using this method, we carry out this computation in full for a heterotic compactification incorporating a gauge bundle with non-Abelian structure group.

Updated: 2025-12-11 18:38:10

Categories: hep-th,cs.LG

Download: http://arxiv.org/abs/2512.10907v1

Distributionally Robust Regret Optimal Control Under Moment-Based Ambiguity Sets

In this paper, we consider a class of finite-horizon, linear-quadratic stochastic control problems, where the probability distribution governing the noise process is unknown but assumed to belong to an ambiguity set consisting of all distributions whose mean and covariance lie within norm balls centered at given nominal values. To address the distributional ambiguity, we explore the design of causal affine control policies to minimize the worst-case expected regret over all distributions in the given ambiguity set. The resulting minimax optimal control problem is shown to admit an equivalent reformulation as a tractable convex program that corresponds to a regularized version of the nominal linear-quadratic stochastic control problem. While this convex program can be recast as a semidefinite program, semidefinite programs are typically solved using primal-dual interior point methods that scale poorly with the problem size in practice. To address this limitation, we propose a scalable dual projected subgradient method to compute optimal controllers to an arbitrary accuracy. Numerical experiments are presented to benchmark the proposed method against state-of-the-art data-driven and distributionally robust control design approaches.

Updated: 2025-12-11 18:36:15

Categories: math.OC,cs.LG,eess.SY

Download: http://arxiv.org/abs/2512.10906v1

Multi-Granular Node Pruning for Circuit Discovery

Circuit discovery aims to identify minimal subnetworks that are responsible for specific behaviors in large language models (LLMs). Existing approaches primarily rely on iterative edge pruning, which is computationally expensive and limited to coarse-grained units such as attention heads or MLP blocks, overlooking finer structures like individual neurons. We propose a node-level pruning framework for circuit discovery that addresses both the scalability and granularity limitations. Our method introduces learnable masks across multiple levels of granularity, from entire blocks to individual neurons, within a unified optimization objective. Granularity-specific sparsity penalties guide the pruning process, allowing comprehensive compression in a single fine-tuning run. Empirically, our approach identifies circuits with fewer nodes than those discovered by prior methods; moreover, we demonstrate that many neurons deemed important by coarse methods are actually irrelevant, while task performance is still maintained. Furthermore, our method has a 5-10x lower memory footprint, as it does not require keeping intermediate activations in memory.
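
A minimal sketch of the multi-granularity idea: attach learnable soft masks at both block and neuron level to one MLP activation, with a larger sparsity penalty on the coarser gate. The parameterization and penalty values are our illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class MultiGranularMask(nn.Module):
        """Soft block-level and neuron-level gates applied to one hidden layer."""
        def __init__(self, n_neurons, block_penalty=1e-2, neuron_penalty=1e-3):
            super().__init__()
            self.block_logit = nn.Parameter(torch.zeros(1))
            self.neuron_logits = nn.Parameter(torch.zeros(n_neurons))
            self.block_penalty = block_penalty
            self.neuron_penalty = neuron_penalty

        def forward(self, h):
            # The effective mask is the product of gates across granularities.
            return h * torch.sigmoid(self.block_logit) * torch.sigmoid(self.neuron_logits)

        def sparsity_loss(self):
            # Granularity-specific L1 pressure pushes gates toward zero.
            return (self.block_penalty * torch.sigmoid(self.block_logit).sum()
                    + self.neuron_penalty * torch.sigmoid(self.neuron_logits).sum())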

Updated: 2025-12-11 18:32:15

Categories: cs.AI

Download: http://arxiv.org/abs/2512.10903v1

LLMs Can Assist with Proposal Selection at Large User Facilities

We explore how large language models (LLMs) can enhance the proposal selection process at large user facilities, offering a scalable, consistent, and cost-effective alternative to traditional human review. Proposal selection depends on assessing the relative strength among submitted proposals; however, traditional human scoring often suffers from weak inter-proposal correlations and is subject to reviewer bias and inconsistency. A pairwise preference-based approach is logically superior, providing a more rigorous and internally consistent basis for ranking, but its quadratic workload makes it impractical for human reviewers. We address this limitation using LLMs. Leveraging the uniquely well-curated proposals and publication records from three beamlines at the Spallation Neutron Source (SNS), Oak Ridge National Laboratory (ORNL), we show that the LLM rankings correlate strongly with the human rankings (Spearman $\rho \simeq 0.2$-$0.8$, improving to $\geq 0.5$ after 10% outlier removal). Moreover, LLM performance is no worse than that of human reviewers in identifying proposals with high publication potential, while costing over two orders of magnitude less. Beyond ranking, LLMs enable advanced analyses that are challenging for humans, such as quantitative assessment of proposal similarity via embedding models, which provides information crucial for review committees.
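
The quadratic pairwise protocol itself is simple; what the LLM removes is the human cost of the O(n^2) comparisons. A sketch with a hypothetical prefer(a, b) judge (in practice an LLM call returning which proposal it favors), ranked here by Copeland-style win counts:

    import itertools
    import numpy as np

    def rank_by_pairwise_preference(proposals, prefer):
        """Rank proposals from all pairwise judgments; prefer(a, b) -> True
        if the judge favors a over b. Quadratic in len(proposals)."""
        n = len(proposals)
        wins = np.zeros(n)
        for i, j in itertools.combinations(range(n), 2):
            if prefer(proposals[i], proposals[j]):
                wins[i] += 1
            else:
                wins[j] += 1
        return np.argsort(-wins)       # indices, strongest proposal first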

Updated: 2025-12-11 18:23:56

Categories: cs.AI

Download: http://arxiv.org/abs/2512.10895v1

Iterative Compositional Data Generation for Robot Control

Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.

Updated: 2025-12-11 18:20:49

Categories: cs.RO,cs.LG

Download: http://arxiv.org/abs/2512.10891v1

AI Through the Human Lens: Investigating Cognitive Theories in Machine Psychology

We investigate whether Large Language Models (LLMs) exhibit human-like cognitive patterns under four established frameworks from psychology: the Thematic Apperception Test (TAT), Framing Bias, Moral Foundations Theory (MFT), and Cognitive Dissonance. We evaluated several proprietary and open-source models using structured prompts and automated scoring. Our findings reveal that these models often produce coherent narratives, show susceptibility to positive framing, exhibit moral judgments aligned with Liberty/Oppression concerns, and demonstrate self-contradictions tempered by extensive rationalization. Such behaviors mirror human cognitive tendencies yet are shaped by their training data and alignment methods. We discuss the implications for AI transparency, ethical deployment, and future work that bridges cognitive psychology and AI safety.

Updated: 2025-12-11 18:18:42

Categories: cs.AI

Download: http://arxiv.org/abs/2506.18156v3

Physics-Informed Learning of Flow Distribution and Receiver Heat Losses in Parabolic Trough Solar Fields

Parabolic trough Concentrating Solar Power (CSP) plants operate large hydraulic networks of collector loops that must deliver a uniform outlet temperature despite spatially heterogeneous optical performance, heat losses, and pressure drops. While loop temperatures are measured, loop-level mass flows and receiver heat-loss parameters are unobserved, making it impossible to diagnose hydraulic imbalances or receiver degradation using standard monitoring tools. We present a physics-informed learning framework that infers (i) loop-level mass-flow ratios and (ii) time-varying receiver heat-transfer coefficients directly from routine operational data. The method exploits nocturnal homogenization periods -- when hot oil is circulated through a non-irradiated field -- to isolate hydraulic and thermal-loss effects. A differentiable conjugate heat-transfer model is discretized and embedded into an end-to-end learning pipeline optimized using historical plant data from the 50 MW Andasol 3 solar field. The model accurately reconstructs loop temperatures (RMSE $<2^\circ$C) and produces physically meaningful estimates of loop imbalances and receiver heat losses. Comparison against drone-based infrared thermography (QScan) shows strong correspondence, correctly identifying all areas with high-loss receivers. This demonstrates that noisy real-world CSP operational data contain enough information to recover latent physical parameters when combined with appropriate modeling and differentiable optimization.

Updated: 2025-12-11 18:16:26

Categories: cs.LG,cs.CE

Download: http://arxiv.org/abs/2512.10886v1

Classifier Reconstruction Through Counterfactual-Aware Wasserstein Prototypes

Counterfactual explanations provide actionable insights by identifying minimal input changes required to achieve a desired model prediction. Beyond their interpretability benefits, counterfactuals can also be leveraged for model reconstruction, where a surrogate model is trained to replicate the behavior of a target model. In this work, we demonstrate that model reconstruction can be significantly improved by recognizing that counterfactuals, which typically lie close to the decision boundary, can serve as informative though less representative samples for both classes. This is particularly beneficial in settings with limited access to labeled data. We propose a method that integrates original data samples with counterfactuals to approximate class prototypes using the Wasserstein barycenter, thereby preserving the underlying distributional structure of each class. This approach enhances the quality of the surrogate model and mitigates the issue of decision boundary shift, which commonly arises when counterfactuals are naively treated as ordinary training instances. Empirical results across multiple datasets show that our method improves fidelity between the surrogate and target models, validating its effectiveness.
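
In one dimension the Wasserstein-2 barycenter has a closed form: its quantile function is the weighted average of the inputs' quantile functions. That special case is enough to sketch how original samples and counterfactuals could be blended into a prototype distribution, with the boundary-hugging counterfactuals down-weighted; the weights here are illustrative, not the paper's.

    import numpy as np

    def w2_barycenter_1d(orig_samples, counterfactuals, w_orig=0.7, n_q=100):
        """1-D W2 barycenter of the empirical distributions of original and
        counterfactual samples, returned as n_q barycenter quantiles."""
        qs = np.linspace(0.0, 1.0, n_q)
        q_orig = np.quantile(orig_samples, qs)
        q_cf = np.quantile(counterfactuals, qs)
        # Weighted quantile average = W2 barycenter in one dimension.
        return w_orig * q_orig + (1.0 - w_orig) * q_cf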

Updated: 2025-12-11 18:06:49

Categories: cs.LG

Download: http://arxiv.org/abs/2512.10878v1

Guided Transfer Learning for Discrete Diffusion Models

Discrete diffusion models achieve strong performance across language and other discrete domains, providing a powerful alternative to autoregressive models. However, their strong performance relies on large training datasets, which are costly or risky to obtain, especially when adapting to new domains. Transfer learning is the natural way to adapt pretrained discrete diffusion models, but current methods require fine-tuning large diffusion models, which is computationally expensive and often impractical. Building on ratio-based transfer learning for continuous diffusion, we introduce Guided Transfer Learning for discrete diffusion models (GTL). This enables sampling from a target distribution without modifying the pretrained denoiser. The same guidance formulation applies to both discrete-time diffusion and continuous-time score-based discrete diffusion, yielding a unified treatment. Guided discrete diffusion often requires many forward passes of the guidance network, which becomes impractical for large vocabularies and long sequences. To address this, we further present an efficient guided sampler that concentrates evaluations on planner-selected positions and top candidate tokens, thus lowering sampling time and computation. This makes guided language modeling practical at scale for large vocabularies and long sequences. We evaluate GTL on sequential data, including synthetic Markov chains and language modeling, and provide empirical analyses of its behavior.

Updated: 2025-12-11 18:05:55

标题: 离散扩散模型的引导式迁移学习

摘要: 离散扩散模型在语言和其他离散领域取得了强大的性能,为自回归模型提供了一个强大的替代方案。然而,它们的强大性能依赖于大型训练数据集,而获取这些数据集成本高或风险大,特别是在适应新领域时。迁移学习是适应预训练离散扩散模型的自然方式,但当前的方法需要对大型扩散模型进行微调,这在计算上代价高昂且通常不切实际。基于连续扩散的比例转移学习,我们提出了离散扩散模型的引导迁移学习(GTL)。这使得从目标分布中采样而不修改预训练去噪器成为可能。相同的引导公式适用于离散时间扩散和基于得分的连续时间离散扩散,实现了统一处理。引导离散扩散通常需要对引导网络进行多次前向传递,对于大词汇量和长序列来说这是不切实际的。为了解决这个问题,我们进一步提供了一种高效的引导采样器,集中评估规划选择的位置和前几位候选标记,从而降低采样时间和计算量。这使得大规模的引导语言建模对于大词汇量和长序列来说变得切实可行。我们在序列数据上评估了GTL,包括合成马尔可夫链和语言建模,并对其行为进行了实证分析。

更新时间: 2025-12-11 18:05:55

领域: cs.LG

下载: http://arxiv.org/abs/2512.10877v1

Nonasymptotic CLT and Error Bounds for Two-Time-Scale Stochastic Approximation

We consider linear two-time-scale stochastic approximation algorithms driven by martingale noise. Recent applications in machine learning motivate the need to understand finite-time error rates, but conventional stochastic approximation analyses focus on either asymptotic convergence in distribution or finite-time bounds that are far from optimal. Prior work on asymptotic central limit theorems (CLTs) suggests that two-time-scale algorithms may be able to achieve $1/\sqrt{n}$ error in expectation, with a constant given by the expected norm of the limiting Gaussian vector. However, the best known finite-time rates are much slower. We derive the first nonasymptotic central limit theorem with respect to the Wasserstein-1 distance for two-time-scale stochastic approximation with Polyak-Ruppert averaging. As a corollary, we show that the expected error achieved by Polyak-Ruppert averaging decays at rate $1/\sqrt{n}$, which significantly improves on the rates of convergence in prior works.
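
A toy sketch of the algorithm being analyzed, assuming an illustrative stable linear system and step-size exponents: the fast iterate tracks the slow one, and the Polyak-Ruppert average of the slow iterate is the object of the paper's nonasymptotic CLT.

    import numpy as np

    rng = np.random.default_rng(0)
    # Linear two-time-scale SA with martingale noise:
    #   x_{n+1} = x_n + a_n (A11 x_n + A12 y_n + xi_n)   (slow iterate)
    #   y_{n+1} = y_n + b_n (A21 x_n + A22 y_n + eta_n)  (fast iterate, b_n >> a_n)
    # A22 < 0 and the reduced slow drift A11 - A12*A21/A22 = -0.75 < 0, so (x, y) -> 0.
    A11, A12, A21, A22 = -1.0, 0.5, 1.0, -2.0
    x, y, x_bar = 1.0, 1.0, 0.0
    for n in range(1, 100_001):
        a_n, b_n = n ** -0.8, n ** -0.6          # a_n / b_n -> 0: two time scales
        x += a_n * (A11 * x + A12 * y + 0.1 * rng.standard_normal())
        y += b_n * (A21 * x + A22 * y + 0.1 * rng.standard_normal())
        x_bar += (x - x_bar) / n                 # Polyak-Ruppert running average
    print(f"last iterate |x|={abs(x):.2e}   averaged |x_bar|={abs(x_bar):.2e}")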

Updated: 2025-12-11 18:05:48

标题: 非渐近中心极限定理和两时间尺度随机逼近的误差界

摘要: 我们考虑由鞅噪声驱动的线性双时间尺度随机逼近算法。最近在机器学习中的应用激发了理解有限时间误差率的需求,但传统的随机逼近分析要么集中在渐近分布收敛,要么给出远非最优的有限时间界限。先前关于渐近中心极限定理(CLTs)的工作表明,双时间尺度算法可能能够实现期望意义下的$1/\sqrt{n}$误差,其中常数由极限高斯向量的期望范数给定。然而,已知的最佳有限时间速率要慢得多。我们推导了关于Wasserstein-1距离的第一个非渐近中心极限定理,用于具有Polyak-Ruppert平均的双时间尺度随机逼近。作为一个推论,我们表明Polyak-Ruppert平均实现的期望误差以$1/\sqrt{n}$的速率衰减,较先前工作的收敛速率有了显著改进。

更新时间: 2025-12-11 18:05:48

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2502.09884v3

A Differentiable Digital Twin of Distributed Link Scheduling for Contention-Aware Networking

Many routing and flow optimization problems in wired networks can be solved efficiently using minimum cost flow formulations. However, this approach does not extend to wireless multi-hop networks, where the assumptions of fixed link capacity and linear cost structure collapse due to contention for shared spectrum resources. The key challenge is that the long-term capacity of a wireless link becomes a non-linear function of its network context, including network topology, link quality, and the traffic assigned to neighboring links. In this work, we pursue a new direction for modeling wireless networks under randomized medium access control by developing an analytical network digital twin (NDT) that predicts link duty cycles from network context. We generalize randomized contention as finding a Maximal Independent Set (MIS) on the conflict graph using weighted Luby's algorithm, derive an analytical model of link duty cycles, and introduce an iterative procedure that resolves the circular dependency among duty cycle, link capacity, and contention probability. Our numerical experiments show that the proposed NDT accurately predicts link duty cycles and congestion patterns with up to a 5000x speedup over packet-level simulation, and enables us to optimize link scheduling using gradient descent for reduced congestion and radio footprint.
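
A small sketch of the contention model described above, with a hypothetical five-link conflict graph: each round of weighted Luby's algorithm samples a maximal independent set, and Monte Carlo frequencies estimate the link duty cycles that the paper's analytical NDT predicts in closed form.

    import random

    def weighted_luby_mis(adj, weights, rng):
        """One run of weighted Luby: every surviving link draws a random
        priority scaled by its weight; links beating all surviving
        conflict-graph neighbours join the independent set, and their
        neighbours are knocked out. The result is always maximal."""
        alive, mis = set(adj), set()
        while alive:
            draw = {v: rng.random() * weights[v] for v in alive}
            winners = {v for v in alive
                       if all(draw[v] > draw[u] for u in adj[v] if u in alive)}
            mis |= winners
            alive -= winners
            for v in winners:
                alive -= adj[v]
        return mis

    # Hypothetical conflict graph: an edge means "cannot transmit together".
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
    weights = {0: 1.0, 1: 2.0, 2: 0.5, 3: 1.5, 4: 1.0}
    rng = random.Random(7)
    samples = [weighted_luby_mis(adj, weights, rng) for _ in range(10_000)]
    duty = {v: sum(v in s for s in samples) / len(samples) for v in adj}
    print(duty)   # Monte Carlo duty cycles; the NDT predicts these analytically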

Updated: 2025-12-11 18:04:30

标题: 一个可微的数字孪生:用于争用感知网络的分布式链路调度

摘要: 在有线网络中,许多路由和流优化问题可以通过最小成本流公式有效解决。然而,这种方法无法推广到无线多跳网络,因为固定链路容量和线性成本结构的假设会由于共享频谱资源的争用而失效。关键挑战在于,无线链路的长期容量成为其网络上下文的非线性函数,包括网络拓扑、链路质量和分配给相邻链路的流量。在这项工作中,我们通过开发一个能够从网络上下文预测链路工作周期的解析网络数字孪生 (NDT),探索了在随机介质访问控制下对无线网络进行建模的新方向。我们将随机争用泛化为使用加权Luby算法在冲突图上寻找极大独立集 (MIS),推导出链路工作周期的解析模型,并引入一个迭代过程来解决工作周期、链路容量和争用概率之间的循环依赖关系。我们的数值实验表明,所提出的NDT可以准确预测链路工作周期和拥塞模式,比分组级仿真快达5000倍,并使我们能够使用梯度下降优化链路调度,以减少拥塞和无线电足迹。

更新时间: 2025-12-11 18:04:30

领域: cs.NI,cs.LG,eess.SP,eess.SY

下载: http://arxiv.org/abs/2512.10874v1

Beyond Basic A/B testing: Improving Statistical Efficiency for Business Growth

In large-scale industry applications, standard A/B testing approaches are mostly based on the t-test. These approaches, however, suffer from low statistical power in business settings due to small sample sizes, non-Gaussian distributions, or return-on-investment (ROI) considerations. In this paper, we (i) show the statistical efficiency of using estimating equations and U statistics, which can address these issues separately; and (ii) propose a novel doubly robust generalized U statistic that allows a flexible definition of the treatment effect and can handle small samples, distributional robustness, ROI, and confounding in one framework. We provide theoretical results on asymptotics and efficiency bounds, together with insights into the efficiency gains from theoretical analysis. We further conduct comprehensive simulation studies, apply the methods to multiple real A/B tests at a large SaaS company, and share results and learnings that are broadly useful.

Updated: 2025-12-11 18:04:04

标题: 超越基础A/B测试:提高商业增长的统计效率

摘要: 标准的A/B测试方法在大规模工业应用中主要基于t检验。然而,在商业环境中,这些标准方法由于样本量小、非高斯分布或投资回报(ROI)考虑等原因而具有较低的统计功效。本文中,我们(i)展示了使用估计方程和U统计量的统计效率,可以分别解决这些问题;并且(ii)提出了一种新颖的双重稳健广义U统计量,允许灵活定义处理效果,并且可以在一个框架中处理小样本、分布鲁棒性、ROI和混杂因素考虑。我们提供了渐近性和效率界限的理论结果,以及从理论分析中获得的效率增益的见解。我们进一步进行了全面的模拟研究,将这些方法应用于一个大型SaaS公司的多个真实A/B测试,并分享了广泛有用的结果和经验教训。

更新时间: 2025-12-11 18:04:04

领域: stat.ME,cs.LG,math.ST,stat.CO

下载: http://arxiv.org/abs/2505.08128v2

Physics-informed Polynomial Chaos Expansion with Enhanced Constrained Optimization Solver and D-optimal Sampling

Physics-informed polynomial chaos expansions (PC$^2$) provide an efficient physically constrained surrogate modeling framework by embedding governing equations and other physical constraints into the standard data-driven polynomial chaos expansions (PCE) and solving via the Karush-Kuhn-Tucker (KKT) conditions. This approach improves the physical interpretability of surrogate models while achieving high computational efficiency and accuracy. However, the performance and efficiency of PC$^2$ can still be degraded with high-dimensional parameter spaces, limited data availability, or unrepresentative training data. To address this problem, this study explores two complementary enhancements to the PC$^2$ framework. First, a numerically efficient constrained optimization solver, straightforward updating of Lagrange multipliers (SULM), is adopted as an alternative to the conventional KKT solver. The SULM method significantly reduces computational cost when solving physically constrained problems with high-dimensionality and derivative boundary conditions that require a large number of virtual points. Second, a D-optimal sampling strategy is utilized to select informative virtual points to improve the stability and achieve the balance of accuracy and efficiency of the PC$^2$. The proposed methods are integrated into the PC$^2$ framework and evaluated through numerical examples of representative physical systems governed by ordinary or partial differential equations. The results demonstrate that the enhanced PC$^2$ has better comprehensive capability than standard PC$^2$, and is well-suited for high-dimensional uncertainty quantification tasks.
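
A minimal sketch of greedy D-optimal selection of virtual points, assuming a hypothetical polynomial basis: since det(M + pp^T) = det(M)(1 + p^T M^{-1} p), the greedy step picks the candidate maximizing p^T M^{-1} p and updates M^{-1} by Sherman-Morrison. The paper's exact selection procedure may differ.

    import numpy as np

    def greedy_d_optimal(phi, k, ridge=1e-8):
        """Greedily pick k rows of the design matrix phi (candidates x features)
        maximizing det(Phi_S^T Phi_S) via rank-one determinant updates."""
        n, d = phi.shape
        M_inv = np.eye(d) / ridge          # inverse information matrix (ridge-seeded)
        chosen, avail = [], set(range(n))
        for _ in range(k):
            gains = {i: phi[i] @ M_inv @ phi[i] for i in avail}
            best = max(gains, key=gains.get)
            u = M_inv @ phi[best]          # Sherman-Morrison update of M_inv
            M_inv -= np.outer(u, u) / (1.0 + phi[best] @ u)
            chosen.append(best)
            avail.remove(best)
        return chosen

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, size=(500, 1))             # candidate virtual points
    phi = np.hstack([x**p for p in range(6)])         # degree-5 polynomial basis
    print(greedy_d_optimal(phi, k=12)[:5])            # indices of informative points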

Updated: 2025-12-11 18:03:29

标题: 具有增强约束优化求解器和D-最优采样的基于物理信息的多项式混沌展开

摘要: 物理信息的多项式混沌扩展(PC$^2$)通过将控制方程和其他物理约束嵌入标准数据驱动的多项式混沌扩展(PCE)中,并通过Karush-Kuhn-Tucker(KKT)条件求解,提供了一种高效的受物理约束的代理建模框架。这种方法提高了代理模型的物理可解释性,同时实现了高计算效率和准确性。然而,PC$^2$的性能和效率在高维参数空间、有限的数据可用性或非代表性的训练数据情况下仍可能受到影响。为解决这个问题,本研究探讨了对PC$^2$框架的两种补充增强。首先,采用了一种数值高效的约束优化求解器,即直接更新拉格朗日乘子(SULM),作为传统KKT求解器的替代方法。SULM方法在解决高维度和需要大量虚拟点的导数边界条件的物理约束问题时显著降低了计算成本。其次,采用了一种D-最优抽样策略来选择信息丰富的虚拟点,以改善PC$^2$的稳定性,并实现准确性和效率的平衡。所提出的方法被整合到PC$^2$框架中,并通过受普通或偏微分方程控制的代表性物理系统的数值示例进行评估。结果表明,增强的PC$^2$具有比标准PC$^2$更好的综合能力,并且非常适用于高维度不确定性量化任务。

更新时间: 2025-12-11 18:03:29

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2512.10873v1

UrbanAI 2025 Challenge: Linear vs Transformer Models for Long-Horizon Exogenous Temperature Forecasting

We study long-horizon exogenous-only temperature forecasting - a challenging univariate setting where only the past values of the indoor temperature are used for prediction - using linear and Transformer-family models. We evaluate Linear, NLinear, DLinear, Transformer, Informer, and Autoformer under standardized train, validation, and test splits. Results show that linear baselines (Linear, NLinear, DLinear) consistently outperform more complex Transformer-family architectures, with DLinear achieving the best overall accuracy across all splits. These findings highlight that carefully designed linear models remain strong baselines for time series forecasting in challenging exogenous-only settings.
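
For concreteness, a sketch in the spirit of the DLinear baseline evaluated above: decompose the input window into a moving-average trend and a remainder, and forecast each with its own linear layer. Kernel size and window lengths are illustrative.

    import torch
    import torch.nn as nn

    class DLinear(nn.Module):
        """Decomposition-Linear baseline: split the series into a moving-average
        trend and a remainder, then forecast each part with its own linear map."""
        def __init__(self, seq_len, pred_len, kernel=25):
            super().__init__()
            self.pool = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                     count_include_pad=False)
            self.linear_trend = nn.Linear(seq_len, pred_len)
            self.linear_resid = nn.Linear(seq_len, pred_len)

        def forward(self, x):                 # x: (batch, seq_len)
            trend = self.pool(x.unsqueeze(1)).squeeze(1)[..., :x.shape[-1]]
            resid = x - trend
            return self.linear_trend(trend) + self.linear_resid(resid)

    model = DLinear(seq_len=336, pred_len=96)
    y_hat = model(torch.randn(8, 336))        # 8 windows of past temperature
    print(y_hat.shape)                        # torch.Size([8, 96])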

Updated: 2025-12-11 17:59:44

标题: UrbanAI 2025挑战:线性模型与Transformer模型在长时程外生温度预测中的比较

摘要: 我们研究了长时间跨度的外生温度预测 - 这是一个具有挑战性的单变量设置,仅使用室内温度的过去值进行预测 - 使用线性和Transformer家族模型。我们在标准化的训练、验证和测试分割下评估了Linear、NLinear、DLinear、Transformer、Informer和Autoformer。结果表明,线性基线模型(Linear、NLinear、DLinear)始终优于更复杂的Transformer家族架构,其中DLinear在所有分割中实现了最佳的整体准确性。这些发现突显了在具有挑战性的仅外生设置中,精心设计的线性模型仍然是时间序列预测的强大基线。

更新时间: 2025-12-11 17:59:44

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.10866v1

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.

Updated: 2025-12-11 17:57:24

标题: MMSI-Video-Bench:基于视频的空间智能综合基准测试

摘要: 对连续视觉输入的空间理解,对于MLLMs演变为物理环境中的通用助手至关重要。然而,目前仍然没有一个全面评估朝着这个目标进展的综合基准。在这项工作中,我们介绍了MMSI-Video-Bench,这是一个完全由人类标注的基准,用于评估MLLMs中基于视频的空间智能。它通过基于来自25个数据集和内部视频的1,278个剪辑的1,106个问题,实现了一个四级框架:感知、规划、预测和跨视频推理。每个条目均由3DV专家精心设计和审查,并附有解释性理由,以确保精确、无歧义的依据。利用其多样的数据来源和全面的任务覆盖,MMSI-Video-Bench还支持三个面向领域的子基准(室内场景感知基准、机器人基准和基础基准),用于有针对性的能力评估。我们评估了25个强大的开源和专有MLLMs,揭示了人类与AI之间的显著差距:许多模型的表现接近随机水平,而最佳推理模型落后于人类近60%。我们进一步发现,经过空间微调的模型仍然无法在我们的基准上有效泛化。细粒度的错误分析揭示了几何推理、运动定位、长期预测和跨视频对应方面的系统性失败。我们还展示了典型的帧采样策略在我们的推理密集型基准上迁移效果不佳,而3D空间线索和思维链提示都没有带来实质性的增益。我们期望我们的基准可以建立一个坚实的测试平台,促进基于视频的空间智能的发展。

更新时间: 2025-12-11 17:57:24

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.10863v1

Scaling Behavior of Discrete Diffusion Language Models

Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making them a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
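
A sketch of one way to interpolate the forward corruption between masked and uniform noise; the mixing parameter lam and this exact parameterization are assumptions, not the paper's definition.

    import numpy as np

    def corrupt(x0, t, vocab_size, mask_id, lam, rng):
        """Forward noising interpolating between masked (lam=1) and uniform
        (lam=0) discrete diffusion: each token is corrupted with probability t;
        a corrupted token becomes [MASK] with probability lam, otherwise it is
        resampled uniformly over the vocabulary."""
        x0 = np.asarray(x0)
        hit = rng.random(x0.shape) < t
        to_mask = hit & (rng.random(x0.shape) < lam)
        to_unif = hit & ~to_mask
        xt = x0.copy()
        xt[to_mask] = mask_id
        xt[to_unif] = rng.integers(0, vocab_size, size=to_unif.sum())
        return xt

    rng = np.random.default_rng(0)
    tokens = rng.integers(0, 100, size=16)
    print(corrupt(tokens, t=0.5, vocab_size=100, mask_id=100, lam=0.5, rng=rng))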

Updated: 2025-12-11 17:54:10

标题: 离散扩散语言模型的尺度行为

摘要: 现代LLM预训练消耗大量的计算资源和训练数据,使得不同模型的扩展行为或扩展定律成为一个关键的区分因素。离散扩散语言模型(DLMs)被提出作为自回归语言模型(ALMs)的替代方案。然而,它们的扩展行为尚未完全探索,先前的研究表明,它们需要更多的数据和计算资源来匹配ALMs的性能。我们研究了DLMs在不同噪声类型上的扩展行为,通过在掩蔽和均匀扩散之间平滑插值,同时密切关注关键的超参数,如批处理大小和学习率。我们的实验表明,DLMs的扩展行为强烈依赖于噪声类型,并且与ALMs有很大的不同。在计算受限的扩展中,所有噪声类型收敛到类似的损失值,我们发现相比于掩蔽扩散,均匀扩散需要更多的参数和更少的数据进行高效的训练,使其成为数据受限环境中的一个有前途的候选项。我们将我们的均匀扩散模型扩展到100亿(10B)参数,并以$10^{22}$ FLOPs进行训练,证实了预测的扩展行为,并使其成为迄今为止已知的最大的公开均匀扩散模型。

更新时间: 2025-12-11 17:54:10

领域: cs.LG

下载: http://arxiv.org/abs/2512.10858v1

Generative Modeling from Black-box Corruptions via Self-Consistent Stochastic Interpolants

Transport-based methods have emerged as a leading paradigm for building generative models from large, clean datasets. However, in many scientific and engineering domains, clean data are often unavailable: instead, we only observe measurements corrupted through a noisy, ill-conditioned channel. A generative model for the original data thus requires solving an inverse problem at the level of distributions. In this work, we introduce a novel approach to this task based on Stochastic Interpolants: we iteratively update a transport map between corrupted and clean data samples using only access to the corrupted dataset as well as black box access to the corruption channel. Under appropriate conditions, this iterative procedure converges towards a self-consistent transport map that effectively inverts the corruption channel, thus enabling a generative model for the clean data. We refer to the resulting method as the self-consistent stochastic interpolant (SCSI). It (i) is computationally efficient compared to variational alternatives, (ii) highly flexible, handling arbitrary nonlinear forward models with only black-box access, and (iii) enjoys theoretical guarantees. We demonstrate superior performance on inverse problems in natural image processing and scientific reconstruction, and establish convergence guarantees of the scheme under appropriate assumptions.

Updated: 2025-12-11 17:53:38

标题: 通过自一致随机插值从黑盒损坏中生成建模

摘要: 运输基础方法已经成为从大规模干净数据集构建生成模型的主要范式。然而,在许多科学和工程领域,干净数据通常不可用:相反,我们只能观察到通过嘈杂、病态信道损坏的测量。因此,原始数据的生成模型需要在分布的级别解决一个逆问题。在这项工作中,我们介绍了一种基于随机插值的新方法来处理这个任务:我们通过仅访问受损数据集以及黑盒访问损坏通道,迭代更新损坏和干净数据样本之间的传输映射。在适当条件下,这种迭代过程会收敛到一个自洽的传输映射,有效地反转损坏通道,从而实现对干净数据的生成模型。我们将结果方法称为自洽随机插值(SCSI)。它(i)与变分替代方案相比具有计算效率,(ii)高度灵活,只需黑盒访问即可处理任意非线性正向模型,(iii)享有理论保证。我们在自然图像处理和科学重建中展示了优越性能,并在适当假设下建立了方案的收敛保证。

更新时间: 2025-12-11 17:53:38

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2512.10857v1

Bayesian Symbolic Regression via Posterior Sampling

Symbolic regression is a powerful tool for discovering governing equations directly from data, but its sensitivity to noise hinders its broader application. This paper introduces a Sequential Monte Carlo (SMC) framework for Bayesian symbolic regression that approximates the posterior distribution over symbolic expressions, enhancing robustness and enabling uncertainty quantification for symbolic regression in the presence of noise. Differing from traditional genetic programming approaches, the SMC-based algorithm combines probabilistic selection, adaptive tempering, and the use of normalized marginal likelihood to efficiently explore the search space of symbolic expressions, yielding parsimonious expressions with improved generalization. When compared to standard genetic programming baselines, the proposed method better deals with challenging, noisy benchmark datasets. The reduced tendency to overfit and enhanced ability to discover accurate and interpretable equations paves the way for more robust symbolic regression in scientific discovery and engineering design applications.
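
A compact sketch of SMC with tempering for Bayesian symbolic regression, under strong simplifications: expression structure is restricted to a tiny fixed library and only coefficients are rejuvenated by a Metropolis move, whereas the paper also explores expression space with probabilistic selection and adaptive (rather than fixed linear) tempering.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-2, 2, 60)
    y = 1.5 * x**2 + rng.normal(0, 0.3, x.size)          # noisy data

    library = {"c*x": lambda c: c * x, "c*x**2": lambda c: c * x**2,
               "c*sin(x)": lambda c: c * np.sin(x)}      # candidate structures

    def log_lik(name, c):                                # Gaussian likelihood
        return -np.sum((y - library[name](c)) ** 2) / (2 * 0.3**2)

    def log_prior(c):
        return -c**2 / 8.0                               # N(0, 2^2) coefficient prior

    P = 300
    names = rng.choice(list(library), size=P)            # particle: structure ...
    coefs = rng.normal(0, 2, P)                          # ... plus coefficient
    betas = np.linspace(0, 1, 11)                        # tempering ladder
    for b0, b1 in zip(betas[:-1], betas[1:]):
        # 1) reweight by the tempered-likelihood increment, then resample
        logw = (b1 - b0) * np.array([log_lik(n, c) for n, c in zip(names, coefs)])
        w = np.exp(logw - logw.max()); w /= w.sum()
        idx = rng.choice(P, size=P, p=w)
        names, coefs = names[idx], coefs[idx]
        # 2) rejuvenate with a Metropolis move invariant for prior * lik^b1
        prop = coefs + rng.normal(0, 0.2, P)
        log_acc = np.array([b1 * (log_lik(n, cp) - log_lik(n, c))
                            + log_prior(cp) - log_prior(c)
                            for n, cp, c in zip(names, prop, coefs)])
        coefs = np.where(np.log(rng.random(P)) < log_acc, prop, coefs)

    best = max(library, key=lambda n: np.mean(names == n))
    print("posterior mode:", best, "coef ~", coefs[names == best].mean())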

Updated: 2025-12-11 17:38:20

标题: 通过后验抽样的贝叶斯符号回归

摘要: 符号回归是一种从数据中直接发现控制方程的强大工具,但其对噪声的敏感性限制了其更广泛的应用。本文介绍了一种基于顺序蒙特卡洛(SMC)框架的贝叶斯符号回归,该框架近似计算了符号表达式的后验分布,增强了稳健性,并在存在噪声的情况下实现了对符号回归的不确定性量化。与传统的遗传编程方法不同,基于SMC的算法结合了概率选择、自适应调节和使用规范化边际似然来有效地探索符号表达式的搜索空间,产生简洁的表达式并改进了泛化能力。与标准遗传编程基线相比,所提出的方法更好地处理具有挑战性的嘈杂基准数据集。减少过拟合的倾向和增强发现准确且可解释方程的能力为科学发现和工程设计应用中更稳健的符号回归铺平了道路。

更新时间: 2025-12-11 17:38:20

领域: cs.LG

下载: http://arxiv.org/abs/2512.10849v1

T-SHRED: Symbolic Regression for Regularization and Model Discovery with Transformer Shallow Recurrent Decoders

SHallow REcurrent Decoders (SHRED) are effective for system identification and forecasting from sparse sensor measurements. Such models are light-weight and computationally efficient, allowing them to be trained on consumer laptops. SHRED-based models rely on Recurrent Neural Networks (RNNs) and a simple Multi-Layer Perceptron (MLP) for the temporal encoding and spatial decoding respectively. Despite their relatively simple structure, SHRED models are able to predict chaotic dynamical systems on different physical, spatial, and temporal scales directly from a sparse set of sensor measurements. In this work, we modify SHRED by leveraging transformers (T-SHRED) embedded with symbolic regression for the temporal encoding, circumventing auto-regressive long-term forecasting for physical data. This is achieved by incorporating a new sparse identification of nonlinear dynamics (SINDy) attention mechanism into T-SHRED, which imposes sparsity regularization on the latent space and also allows for immediate symbolic interpretation. Symbolic regression improves model interpretability by learning and regularizing the dynamics of the latent space during training. We analyze the performance of T-SHRED on three different dynamical systems ranging from low-data to high-data regimes.

Updated: 2025-12-11 17:28:30

标题: T-SHRED:具有Transformer浅递归解码器的正则化和模型发现的符号回归

摘要: SHallow REcurrent Decoders (SHRED)对于利用稀疏传感器测量进行系统识别和预测非常有效。这种模型轻量且计算效率高,可以在消费级笔记本电脑上进行训练。基于SHRED的模型分别依赖于循环神经网络(RNNs)和简单的多层感知器(MLP)进行时间编码和空间解码。尽管SHRED的结构相对简单,但它们能够直接从稀疏的传感器测量中预测不同物理、空间和时间尺度上的混沌动力系统。在这项工作中,我们通过利用嵌入符号回归的Transformer (T-SHRED)对SHRED进行修改,用于时间编码,避免针对物理数据的自回归长期预测。这通过将新的非线性动力学稀疏辨识(SINDy)注意力机制整合到T-SHRED中实现,对潜在空间施加稀疏正则化,同时支持即时的符号化解释。符号回归通过在训练过程中学习和正则化潜在空间的动力学来提高模型的可解释性。我们分析了T-SHRED在从低数据到高数据范围内的三种不同动力系统上的性能。

更新时间: 2025-12-11 17:28:30

领域: cs.LG

下载: http://arxiv.org/abs/2506.15881v3

Faster Results from a Smarter Schedule: Reframing Collegiate Cross Country through Analysis of the National Running Club Database

Collegiate cross country teams often build their season schedules on intuition rather than evidence, partly because large-scale performance datasets are not publicly accessible. To address this limitation, we introduce the National Running Club Database (NRCD), the first openly available dataset to aggregate 23,725 race results from 7,594 collegiate club athletes across the 2023-2025 seasons. Unlike existing resources, NRCD includes detailed course metadata, allowing us to develop two standardized performance metrics: Converted Only (distance correction) and Standardized (distance, weather, and elevation adjusted). Using these standardized measures, we find that athletes with slower initial performances exhibit the greatest improvement within a season, and that race frequency is the strongest predictor of improvement. Using six machine learning models, random forest achieves the highest accuracy (r squared equals 0.92), revealing that athletes who race more frequently progress significantly faster than those who do not. At the team level, programs whose athletes race at least four times during the regular season have substantially higher odds of placing in the top 15 at nationals (chi-squared less than 0.01). These results challenge common coaching practices that favor minimal racing before championship meets. Our findings demonstrate that a data-informed scheduling strategy improves both individual development and team competitiveness. The NRCD provides a new foundation for evidence-based decision-making in collegiate cross country and opens opportunities for further research on standardized, longitudinal athlete performance modeling.

Updated: 2025-12-11 17:28:17

标题: 更聪明的计划带来更快的结果:通过分析全国跑步俱乐部数据库重新构建大学越野跑

摘要: 大学越野队通常根据直觉而非证据来制定赛季计划,部分原因是大规模的表现数据集并不公开可访问。为了解决这一限制,我们引入了国家跑步俱乐部数据库(NRCD),这是第一个公开可用的数据集,汇总了来自2023-2025赛季的7,594名大学俱乐部运动员的23,725次比赛成绩。与现有资源不同,NRCD包括详细的赛道元数据,使我们能够开发两种标准化绩效指标:仅转换(校正距离)和标准化(校正距离、天气和海拔)。使用这些标准化指标,我们发现初始表现较慢的运动员在赛季内有最大的提高,并且比赛频率是提高的最强预测因素。使用六种机器学习模型,随机森林获得了最高的准确率(r平方等于0.92),显示出比赛频率更高的运动员进步速度明显快于那些不比赛频繁的运动员。在团队层面上,那些在常规赛季至少比赛四次的项目在国家比赛中排名前15的几率大大提高(卡方小于0.01)。这些结果挑战了在锦标赛之前更倾向于最少比赛的常见教练做法。我们的研究结果表明,数据驱动的赛程策略既可以改善个人发展,也可以提高团队竞争力。NRCD为大学越野运动提供了一个新的基础,为基于证据的决策打开了机会,并为标准化、纵向运动员表现建模的进一步研究开辟了机会。

更新时间: 2025-12-11 17:28:17

领域: cs.CY,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.10600v3

Learning Controllable and Diverse Player Behaviors in Multi-Agent Environments

This paper introduces a reinforcement learning framework that enables controllable and diverse player behaviors without relying on human gameplay data. Existing approaches often require large-scale player trajectories, train separate models for different player types, or provide no direct mapping between interpretable behavioral parameters and the learned policy, limiting their scalability and controllability. We define player behavior in an N-dimensional continuous space and uniformly sample target behavior vectors from a region that encompasses the subset representing real human styles. During training, each agent receives both its current and target behavior vectors as input, and the reward is based on the normalized reduction in distance between them. This allows the policy to learn how actions influence behavioral statistics, enabling smooth control over attributes such as aggressiveness, mobility, and cooperativeness. A single PPO-based multi-agent policy can reproduce new or unseen play styles without retraining. Experiments conducted in a custom multi-player Unity game show that the proposed framework produces significantly greater behavioral diversity than a win-only baseline and reliably matches specified behavior vectors across diverse targets. The method offers a scalable solution for automated playtesting, game balancing, human-like behavior simulation, and replacing disconnected players in online games.
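
A minimal sketch of the reward described above; the per-dimension normalization and the 3-D behavior space are assumptions.

    import numpy as np

    def behavior_reward(prev_stats, curr_stats, target, lo, hi):
        """Reward = normalized reduction in distance between the agent's running
        behavior statistics and its target behavior vector."""
        span = hi - lo                                   # per-dimension scale
        d_prev = np.linalg.norm((prev_stats - target) / span)
        d_curr = np.linalg.norm((curr_stats - target) / span)
        return d_prev - d_curr                           # positive if we moved closer

    # Hypothetical 3-D behavior space: (aggressiveness, mobility, cooperativeness).
    lo, hi = np.zeros(3), np.array([10.0, 5.0, 1.0])
    target = np.array([7.0, 2.0, 0.8])                   # sampled once per episode
    prev = np.array([3.0, 4.0, 0.2])                     # stats before the step
    curr = np.array([3.5, 3.8, 0.25])                    # stats after the step
    print(behavior_reward(prev, curr, target, lo, hi))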

Updated: 2025-12-11 17:26:24

标题: 学习在多智能体环境中可控和多样化的玩家行为

摘要: 本文介绍了一种强化学习框架,该框架可以实现可控和多样化的玩家行为,而无需依赖于人类游戏数据。现有方法通常需要大规模的玩家轨迹,为不同玩家类型训练单独的模型,或者不提供可解释的行为参数和学习策略之间的直接映射,从而限制了它们的可扩展性和可控性。我们将玩家行为定义在一个N维连续空间中,并从一个包含代表真实人类风格子集的区域中均匀采样目标行为向量。在训练过程中,每个智能体都接收其当前和目标行为向量作为输入,奖励基于它们之间距离的归一化减少。这使得策略可以学习行动如何影响行为统计数据,从而实现对属性(如攻击性、移动性和合作性)的平滑控制。单一的基于PPO的多智能体策略可以在无需重新训练的情况下复现新的或未见过的游玩风格。在一个自定义的多人Unity游戏中进行的实验证明,所提出的框架产生了比仅以获胜为目标的基线更大的行为多样性,并且能可靠地匹配各种目标上指定的行为向量。该方法提供了一个可扩展的解决方案,用于自动化游戏测试、游戏平衡、类人行为模拟和在线游戏中取代断开连接的玩家。

更新时间: 2025-12-11 17:26:24

领域: cs.LG

下载: http://arxiv.org/abs/2512.10835v1

MaskedManipulator: Versatile Whole-Body Manipulation

We tackle the challenges of synthesizing versatile, physically simulated human motions for full-body object manipulation. Unlike prior methods that are focused on detailed motion tracking, trajectory following, or teleoperation, our framework enables users to specify versatile high-level objectives such as target object poses or body poses. To achieve this, we introduce MaskedManipulator, a generative control policy distilled from a tracking controller trained on large-scale human motion capture data. This two-stage learning process allows the system to perform complex interaction behaviors, while providing intuitive user control over both character and object motions. MaskedManipulator produces goal-directed manipulation behaviors that expand the scope of interactive animation systems beyond task-specific solutions.

Updated: 2025-12-11 17:25:33

标题: "掩饰的操纵者:多功能全身操纵"

摘要: 我们致力于解决合成多功能、物理模拟的全身对象操作的挑战。与之前关注详细动作跟踪、轨迹跟随或遥控的方法不同,我们的框架使用户能够指定多功能的高级目标,如目标对象姿势或身体姿势。为了实现这一目标,我们引入了MaskedManipulator,这是一个从大规模人体动作捕捉数据训练出的跟踪控制器提炼出的生成控制策略。这种两阶段学习过程使系统能够执行复杂的交互行为,同时为用户提供直观的对角色和对象动作的控制。MaskedManipulator产生了目标导向的操作行为,拓展了交互式动画系统的范围,超越了特定任务的解决方案。

更新时间: 2025-12-11 17:25:33

领域: cs.RO,cs.AI,cs.GR

下载: http://arxiv.org/abs/2505.19086v3

An Elementary Proof of the Near Optimality of LogSumExp Smoothing

We consider the design of smoothings of the (coordinate-wise) max function in $\mathbb{R}^d$ in the infinity norm. The LogSumExp function $f(x)=\ln(\sum^d_i\exp(x_i))$ provides a classical smoothing, differing from the max function in value by at most $\ln(d)$. We provide an elementary construction of a lower bound, establishing that every overestimating smoothing of the max function must differ by at least $\sim 0.8145\ln(d)$. Hence, LogSumExp is optimal up to constant factors. However, in small dimensions, we provide stronger, exactly optimal smoothings attaining our lower bound, showing that the entropy-based LogSumExp approach to smoothing is not exactly optimal.
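
A short numerical check of the classical bound: for the all-equal input, the LogSumExp overestimate attains its worst-case gap of exactly ln(d), against which the paper's ~0.8145 ln(d) lower bound for any overestimating smoothing can be compared.

    import numpy as np

    def logsumexp_smooth(x):
        """LogSumExp overestimate of max(x); the gap is at most ln(d),
        with equality exactly when all coordinates are equal."""
        m = x.max()
        return m + np.log(np.exp(x - m).sum())   # numerically stabilized

    for d in (2, 16, 1024):
        gap = logsumexp_smooth(np.zeros(d))      # max = 0, so the gap is exact
        print(f"d={d:5d}  LSE worst-case gap={gap:.4f}  ln(d)={np.log(d):.4f}  "
              f"paper's lower bound ~{0.8145 * np.log(d):.4f}")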

Updated: 2025-12-11 17:17:48

标题: 一个关于LogSumExp平滑近乎最优性的初等证明

摘要: 我们考虑在无穷范数意义下对$\mathbb{R}^d$中(逐坐标)最大函数的平滑设计。LogSumExp函数$f(x)=\ln(\sum^d_i\exp(x_i))$提供了一种经典的平滑,其与最大函数的值之间最多相差$\ln(d)$。我们给出了一个下界的初等构造,证明任何高估最大函数的平滑都必须与其至少相差$\sim 0.8145\ln(d)$。因此,LogSumExp在至多相差常数因子的意义上是最优的。然而,在低维情形下,我们给出了达到该下界的更强的、精确最优的平滑,表明基于熵的LogSumExp平滑方法并非精确最优。

更新时间: 2025-12-11 17:17:48

领域: math.ST,cs.LG,math.OC

下载: http://arxiv.org/abs/2512.10825v1

Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis

Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
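
The abstract does not spell out the MaxAccGap loss, so the following is one plausible differentiable surrogate, assuming soft per-group accuracy via true-class probabilities and a logsumexp smooth maximum over pairwise gaps; the temperature tau and the batch construction are hypothetical.

    import torch

    def max_acc_gap(logits, labels, groups, n_groups, tau=10.0):
        """Differentiable surrogate for the max accuracy gap across demographic
        groups: per-sample 'soft correctness' is the softmax probability of the
        true class, averaged within each group; the loss is a smooth maximum
        (logsumexp) of pairwise gaps, so gradients reach all groups. Assumes
        every group appears in the batch."""
        p_true = torch.softmax(logits, dim=-1).gather(1, labels[:, None]).squeeze(1)
        group_acc = torch.stack([p_true[groups == g].mean()
                                 for g in range(n_groups)])
        gaps = (group_acc[:, None] - group_acc[None, :]).abs()
        return torch.logsumexp(tau * gaps.flatten(), dim=0) / tau  # ~ max gap

    logits = torch.randn(32, 2, requires_grad=True)
    labels = torch.randint(0, 2, (32,))
    groups = torch.arange(32) % 3          # guarantees all 3 groups are present
    loss = max_acc_gap(logits, labels, groups, n_groups=3)
    loss.backward()
    print(loss.item(), logits.grad.abs().sum().item())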

Updated: 2025-12-11 17:17:07

标题: 公平意识对医学青光眼诊断视觉-语言模型的微调

摘要: 视觉语言模型在医学成像任务上实现了专家级性能,但在不同人群之间存在显著的诊断准确度差异。我们引入了针对医学VLMs的公平感知低秩适应性,将参数效率与明确的公平优化相结合。我们的关键算法贡献是可微分的MaxAccGap损失,它使得在不同人群之间的准确性平衡的端到端优化成为可能。我们提出了三种方法:FR-LoRA将MaxAccGap正则化集成到训练目标中,GR-LoRA应用逆频率加权来平衡梯度贡献,而Hybrid-LoRA结合了这两种机制。在对1万张青光眼底图像进行评估时,GR-LoRA将诊断准确度差异降低了69%,同时保持了53.15%的总体准确度。消融研究表明,强正则化强度实现了最佳的公平性,且几乎没有准确性的权衡,而针对不同人种的优化可以减少60%的差距。我们的方法仅需要0.24%的可训练参数,使得在资源受限的医疗保健环境中实现公平医学人工智能的实际部署成为可能。

更新时间: 2025-12-11 17:17:07

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2512.03477v2

Luxical: High-Speed Lexical-Dense Text Embeddings

Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF--IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
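
A toy sketch of the lexical-dense recipe, assuming stand-in teacher embeddings: sparse TF-IDF features feed a small ReLU network trained with a cosine distillation loss. The real system distills from an actual transformer teacher at scale; the corpus, dimensions, and hyperparameters below are illustrative.

    import torch
    import torch.nn as nn
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["cheap fast text embeddings", "large transformer encoders are slow",
            "web scale corpora need curation", "distillation transfers quality"]
    vec = TfidfVectorizer()
    x = torch.tensor(vec.fit_transform(docs).toarray(), dtype=torch.float32)

    teacher_dim = 32                                   # teacher embedding size
    teacher = torch.randn(len(docs), teacher_dim)      # stand-in for a big model
    teacher = nn.functional.normalize(teacher, dim=-1)

    student = nn.Sequential(                           # small ReLU network
        nn.Linear(x.shape[1], 64), nn.ReLU(), nn.Linear(64, teacher_dim))
    opt = torch.optim.Adam(student.parameters(), lr=1e-2)
    for step in range(200):                            # cosine-distillation loop
        opt.zero_grad()
        out = nn.functional.normalize(student(x), dim=-1)
        loss = (1 - (out * teacher).sum(-1)).mean()    # 1 - cosine similarity
        loss.backward()
        opt.step()
    print(f"final distillation loss: {loss.item():.4f}")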

Updated: 2025-12-11 17:14:51

标题: Luxical: 高速词汇密集型文本嵌入

摘要: 前沿语言模型质量越来越依赖于我们组织用于训练的规模庞大的文本语料库的能力。当前主导的工具在速度和灵活性之间进行权衡:词汇分类器(例如,FastText)速度快,但仅限于生成分类输出分数,而Transformer文本嵌入模型的矢量值输出灵活地支持许多工作流程(例如,聚类、分类和检索),但在计算上昂贵。我们介绍了Luxical,这是一个用于高速"词汇密集"文本嵌入的库,旨在在规模庞大的文本组织中兼得这两种方法的最佳特性。Luxical结合了稀疏的TF-IDF特征、一个小的ReLU网络和知识蒸馏训练方案,以其运行成本的一小部分近似大型Transformer嵌入模型。在这份技术报告中,我们描述了Luxical的架构和训练目标,并在两个不同的应用中评估了一个具体的Luxical模型:一个定向的网络爬虫文档检索测试和一个以文本分类为基础的端到端语言模型数据整理任务。在这些任务中,我们展示了与不同规模的神经基线相比从3倍到100倍的加速,并且在数据整理任务中与FastText模型推理速度相当。在这些评估中,经过测试的Luxical模型展示了对于大规模文本组织有利的计算/质量权衡,与神经基线的质量相匹配。Luxical可在https://github.com/datologyai/luxical上作为开源软件使用。

更新时间: 2025-12-11 17:14:51

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2512.09015v2

V-OCBF: Learning Safety Filters from Offline Data via Value-Guided Offline Control Barrier Functions

Ensuring safety in autonomous systems requires controllers that satisfy hard, state-wise constraints without relying on online interaction. While existing Safe Offline RL methods typically enforce soft expected-cost constraints, they do not guarantee forward invariance. Conversely, Control Barrier Functions (CBFs) provide rigorous safety guarantees but usually depend on expert-designed barrier functions or full knowledge of the system dynamics. We introduce Value-Guided Offline Control Barrier Functions (V-OCBF), a framework that learns a neural CBF entirely from offline demonstrations. Unlike prior approaches, V-OCBF does not assume access to the dynamics model; instead, it derives a recursive finite-difference barrier update, enabling model-free learning of a barrier that propagates safety information over time. Moreover, V-OCBF incorporates an expectile-based objective that avoids querying the barrier on out-of-distribution actions and restricts updates to the dataset-supported action set. The learned barrier is then used with a Quadratic Program (QP) formulation to synthesize real-time safe control. Across multiple case studies, V-OCBF yields substantially fewer safety violations than baseline methods while maintaining strong task performance, highlighting its scalability for offline synthesis of safety-critical controllers without online interaction or hand-engineered barriers.

Updated: 2025-12-11 17:14:37

标题: V-OCBF:通过价值引导的离线控制屏障函数从离线数据中学习安全过滤器

摘要: 确保自主系统安全性需要满足严格的状态约束的控制器,而不依赖在线交互。现有的安全离线强化学习方法通常强制执行软的期望成本约束,但不能保证前向不变性。相反,控制障碍函数(CBFs)提供严格的安全性保证,但通常依赖于专家设计的障碍函数或对系统动态的全面了解。我们引入了价值导向的离线控制障碍函数(V-OCBF)框架,该框架完全通过离线演示学习神经网络CBF。与先前的方法不同,V-OCBF不假设可以访问动态模型;相反,它推导出一个递归有限差分障碍更新,实现了障碍物的无模型学习,随着时间的推移传播安全信息。此外,V-OCBF结合了基于期望值的目标,避免在超出分布的操作上查询障碍,并限制更新到数据集支持的操作集。然后,学习的障碍与二次规划(QP)公式一起用于合成实时安全控制。在多个案例研究中,V-OCBF产生的安全违规行为明显少于基线方法,同时保持强大的任务性能,突出了其在没有在线交互或手工设计障碍的情况下离线合成安全关键控制器的可扩展性。

更新时间: 2025-12-11 17:14:37

领域: cs.AI,cs.RO

下载: http://arxiv.org/abs/2512.10822v1

Agile Deliberation: Concept Deliberation for Subjective Visual Classification

From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called "Agile Deliberation" that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.

Updated: 2025-12-11 17:13:09

标题: 敏捷审议:主观视觉分类的概念审议

摘要: 从内容审核到内容策展,需要针对视觉概念的视觉分类器的应用正在迅速扩展。现有的人在环路方法通常假设用户从清晰、稳定的概念理解开始,以便能够提供高质量的监督。实际上,用户经常从一个模糊的想法开始,并通过"概念审议"逐步完善它,这是我们通过与内容审核专家进行结构化访谈发现的一种实践。我们将现实中内容审核人员在审议时使用的常见策略操作化为一个名为"敏捷审议"的人在环路框架,明确支持不断演变的主观概念。该系统通过向用户展示边界案例来支持用户自行定义概念。该系统通过两个审议阶段实现这一点:(1)概念范围界定,将初始概念分解为结构化的子概念层次结构,(2)概念迭代,为用户反思和反馈提供语义上的边界案例,以迭代地将图像分类器与用户不断演变的意图对齐。由于概念审议在本质上是主观的和交互式的,我们通过18个用户会话(每个会话持续1.5小时)而非标准基准数据集进行了细致的评估。我们发现,敏捷审议的F1分数比自动分解基线高7.5%,比手动审议高3%以上,同时参与者报告了更清晰的概念理解和更低的认知负担。

更新时间: 2025-12-11 17:13:09

领域: cs.AI,cs.CV,cs.HC,cs.LG

下载: http://arxiv.org/abs/2512.10821v1

Deep sets and event-level maximum-likelihood estimation for fast pile-up jet rejection in ATLAS

Multiple proton-proton collisions (pile-up) occur at every bunch crossing at the LHC, with the mean number of interactions expected to reach 80 during Run 3 and up to 200 at the High-Luminosity LHC. As a direct consequence, events with multijet signatures will occur at increasingly high rates. To cope with the increased luminosity, being able to efficiently group jets according to their origin along the beamline is crucial, particularly at the trigger level. In this work, a novel uncertainty-aware jet regression model based on a Deep Sets architecture is introduced, DIPz, to regress on a jet origin position along the beamline. The inputs to the DIPz algorithm are the charged particle tracks associated to each jet. An event-level discriminant, the Maximum Log Product of Likelihoods (MLPL), is constructed by combining the DIPz per-jet predictions. MLPL is cut-optimized to select events compatible with targeted multi-jet signature selection. This combined approach provides a robust and computationally efficient method for pile-up rejection in multi-jet final states, applicable to real-time event selections at the ATLAS High Level Trigger.
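
A sketch of the event-level discriminant under a Gaussian assumption on the DIPz per-jet predictions (the paper's exact likelihood parameterization is not given in the abstract): sum per-jet log-likelihoods over a grid of beamline positions and take the maximum.

    import numpy as np

    def mlpl(mu, sigma, z_grid):
        """Maximum Log Product of Likelihoods: for each candidate origin z,
        sum per-jet Gaussian log-likelihoods log N(z; mu_j, sigma_j) predicted
        per jet, then maximize over z."""
        mu, sigma = np.asarray(mu)[:, None], np.asarray(sigma)[:, None]
        loglik = (-0.5 * ((z_grid[None, :] - mu) / sigma) ** 2
                  - np.log(sigma * np.sqrt(2 * np.pi)))
        total = loglik.sum(axis=0)                # product over jets, in log space
        return z_grid[total.argmax()], total.max()

    z_grid = np.linspace(-200, 200, 4001)         # mm along the beamline
    # Three jets from a common hard-scatter vertex near z = 35 mm ...
    z_best, score = mlpl([34.0, 36.5, 35.2], [8.0, 12.0, 9.0], z_grid)
    print(f"hard-scatter candidate z = {z_best:.1f} mm, MLPL = {score:.2f}")
    # ... versus jets scattered by pile-up: the MLPL score drops sharply.
    print(mlpl([-60.0, 35.0, 110.0], [8.0, 12.0, 9.0], z_grid)[1])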

Updated: 2025-12-11 17:09:45

标题: 深度集合和基于事件级极大似然估计在ATLAS中用于快速拒绝堆积喷注

摘要: 在LHC的每次束团交叉点都会发生多次质子-质子碰撞(堆积),预计在Run 3期间,平均相互作用次数将达到80次,在高亮度LHC上可达到200次。由于这一直接后果,具有多重喷注特征的事件将以越来越高的速率发生。为了应对增加的光度,能够根据其在束线沿线的起源有效地将喷注分组对于在触发器级别特别关键。在这项工作中,引入了一种基于深集合结构的新型不确定性感知喷注回归模型,DIPz,用于对喷注沿着束线的起源位置进行回归。DIPz算法的输入是与每个喷注相关联的带电粒子轨迹。通过结合DIPz的每个喷注预测,构建了一个事件级别的判别器,最大对数概率乘积(MLPL)。MLPL经过切割优化,以选择与目标多喷注特征选择兼容的事件。这种综合方法为多重喷注终态中的堆积拒绝提供了一种稳健且计算效率高的方法,适用于ATLAS高级触发器的实时事件选择。

更新时间: 2025-12-11 17:09:45

领域: hep-ex,cs.LG

下载: http://arxiv.org/abs/2512.10819v1

Understanding Outer Optimizers in Local SGD: Learning Rates, Momentum, and Acceleration

Modern machine learning often requires training with large batch size, distributed data, and massively parallel compute hardware (like mobile and other edge devices or distributed data centers). Communication becomes a major bottleneck in such settings but methods like Local Stochastic Gradient Descent (Local SGD) show great promise in reducing this additional communication overhead. Local SGD consists of three parts: a local optimization process, an aggregation mechanism, and an outer optimizer that uses the aggregated updates from the nodes to produce a new model. While there exists an extensive literature on understanding the impact of hyperparameters in the local optimization process, the choice of outer optimizer and its hyperparameters is less clear. We study the role of the outer optimizer in Local SGD, and prove new convergence guarantees for the algorithm. In particular, we show that tuning the outer learning rate allows us to (a) trade off between optimization error and stochastic gradient noise variance, and (b) make up for ill-tuning of the inner learning rate. Our theory suggests that the outer learning rate should sometimes be set to values greater than $1$. We extend our results to settings where we use momentum in the outer optimizer, and we show a similar role for the momentum-adjusted outer learning rate. We also study acceleration in the outer optimizer and show that it improves the convergence rate as a function of the number of communication rounds, improving upon the convergence rate of prior algorithms that apply acceleration locally. Finally, we also introduce a novel data-dependent analysis of Local SGD that yields further insights on outer learning rate tuning. We conduct comprehensive experiments with standard language models and various outer optimizers to validate our theory.
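
A toy sketch of the setup on a least-squares problem, with hypothetical hyperparameters: workers run local SGD, their averaged update acts as a pseudo-gradient, and the outer optimizer applies momentum and an outer learning rate deliberately greater than 1, as the theory suggests can help.

    import numpy as np

    rng = np.random.default_rng(0)
    d, workers, H, rounds = 10, 8, 20, 50
    A = rng.normal(size=(200, d)); b = A @ rng.normal(size=d)  # shared quadratic

    def local_sgd_round(x, inner_lr):
        """Each worker runs H local SGD steps from x; return the average update."""
        deltas = []
        for _ in range(workers):
            y = x.copy()
            for _ in range(H):
                i = rng.integers(len(A))                       # one sample
                y -= inner_lr * (A[i] * (A[i] @ y - b[i]))     # stochastic grad
            deltas.append(y - x)
        return np.mean(deltas, axis=0)

    x, m = np.zeros(d), np.zeros(d)
    outer_lr, outer_momentum = 1.5, 0.6        # outer LR deliberately > 1
    for _ in range(rounds):
        delta = local_sgd_round(x, inner_lr=5e-4)
        m = outer_momentum * m + delta         # outer momentum on the pseudo-grad
        x = x + outer_lr * m                   # outer optimizer step
    print("final loss:", 0.5 * np.mean((A @ x - b) ** 2))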

Updated: 2025-12-11 17:09:43

标题: 理解本地SGD中的外部优化器:学习率、动量和加速度

摘要: 现代机器学习通常需要使用大批量训练、分布式数据和大规模并行计算硬件(如移动设备和其他边缘设备或分布式数据中心)。在这种情况下,通信成为一个主要瓶颈,但像本地随机梯度下降(Local SGD)这样的方法在减少这种额外通信开销方面表现出很大的潜力。本地SGD由三部分组成:本地优化过程、聚合机制和使用节点的聚合更新生成新模型的外部优化器。虽然已经有大量文献对本地优化过程中超参数的影响进行了研究,但外部优化器及其超参数的选择仍不够清晰。我们研究了外部优化器在本地SGD中的作用,并为算法提供了新的收敛保证。特别是,我们表明调整外部学习率使我们能够(a)在优化误差和随机梯度噪声方差之间进行权衡,以及(b)弥补内部学习率的不良调整。我们的理论表明,外部学习率有时应设置为大于1的值。我们将结果扩展到在外部优化器中使用动量的情况,并展示了经动量调整的外部学习率的类似作用。我们还研究了外部优化器中的加速,并表明它能随通信轮数提高收敛速度,优于先前在本地应用加速的算法的收敛速度。最后,我们还引入了一种针对Local SGD的新颖的数据相关分析,为外部学习率的调整提供了进一步的见解。我们使用标准语言模型和各种外部优化器进行全面实验,以验证我们的理论。

更新时间: 2025-12-11 17:09:43

领域: cs.LG,math.OC,stat.ML

下载: http://arxiv.org/abs/2509.10439v2

Extrapolation of Periodic Functions Using Binary Encoding of Continuous Numerical Values

We report the discovery that binary encoding allows neural networks to extrapolate periodic functions beyond their training bounds. We introduce Normalized Base-2 Encoding (NB2E) as a method for encoding continuous numerical values and demonstrate that, using this input encoding, vanilla multi-layer perceptrons (MLP) successfully extrapolate diverse periodic signals without prior knowledge of their functional form. Internal activation analysis reveals that NB2E induces bit-phase representations, enabling MLPs to learn and extrapolate signal structure independently of position.
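
The abstract does not give the exact encoding, so this is one plausible reading of NB2E: normalize each scalar by the training range and emit its fixed-point base-2 expansion as a bit vector, whose low bits cycle periodically with the input.

    import numpy as np

    def nb2e(values, lo, hi, bits=16):
        """Normalized Base-2 Encoding (sketch): map each scalar into [0, 1)
        using the training range, then emit its fixed-point binary expansion
        MSB-first. Periodic structure in the input becomes periodic bit-phase
        patterns an MLP can latch onto. The normalization details are an
        assumption, not the paper's exact specification."""
        frac = (np.asarray(values, dtype=float) - lo) / (hi - lo)   # -> [0, 1)
        ints = np.floor(frac * (1 << bits)).astype(np.int64)
        shifts = np.arange(bits - 1, -1, -1)
        return ((ints[..., None] >> shifts) & 1).astype(np.float32)

    x = np.linspace(0.0, 4.0, 9)
    enc = nb2e(x, lo=0.0, hi=8.0, bits=8)     # train range wider than samples
    print(enc.shape)                          # (9, 8): one bit vector per value
    print(enc[1], enc[5])                     # low bits cycle rapidly with x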

Updated: 2025-12-11 17:08:28

标题: 使用连续数值的二进制编码对周期函数进行外推

摘要: 我们报道了一个发现:二进制编码使神经网络能够在训练范围之外外推周期函数。我们引入了归一化二进制编码(NB2E)作为一种编码连续数值的方法,并证明,使用这种输入编码,普通多层感知器(MLP)无需事先了解其函数形式,就能成功外推各种周期信号。内部激活分析显示,NB2E诱导出比特相位表示,使MLP能够独立于位置学习并外推信号结构。

更新时间: 2025-12-11 17:08:28

领域: cs.LG,cs.AI,cs.CV,stat.ML

下载: http://arxiv.org/abs/2512.10817v1

Deception Detection in Dyadic Exchanges Using Multimodal Machine Learning: A Study on a Swedish Cohort

This study investigates the efficacy of using multimodal machine learning techniques to detect deception in dyadic interactions, focusing on the integration of data from both the deceiver and the deceived. We compare early and late fusion approaches, utilizing audio and video data - specifically, Action Units and gaze information - across all possible combinations of modalities and participants. Our dataset, newly collected from Swedish native speakers engaged in truth or lie scenarios on emotionally relevant topics, serves as the basis for our analysis. The results demonstrate that incorporating both speech and facial information yields superior performance compared to single-modality approaches. Moreover, including data from both participants significantly enhances deception detection accuracy, with the best performance (71%) achieved using a late fusion strategy applied to both modalities and participants. These findings align with psychological theories suggesting differential control of facial and vocal expressions during initial interactions. As the first study of its kind on a Scandinavian cohort, this research lays the groundwork for future investigations into dyadic interactions, particularly within psychotherapy settings.

Updated: 2025-12-11 17:00:44

标题: 《使用多模态机器学习在二元交流中检测欺骗:对瑞典队列的研究》

摘要: 这项研究调查了使用多模态机器学习技术来检测双人互动中欺骗的有效性,重点关注来自欺骗者和被欺骗者的数据整合。我们比较了早期和晚期融合方法,利用音频和视频数据 - 具体来说,动作单位和凝视信息 - 跨所有可能的模态和参与者的组合。我们的数据集是新收集的,来自参与真实或虚假情景的情感相关话题的瑞典本族人,作为我们分析的基础。结果表明,将语音和面部信息结合起来比单一模态方法表现出更好的性能。此外,包括来自两个参与者的数据显著提高了欺骗检测的准确性,采用晚期融合策略应用于两种模态和参与者时表现最佳(71%)。这些发现与心理学理论一致,表明在初始互动过程中面部和声音表达有不同的控制。作为对斯堪的纳维亚人群的首项研究,这项研究为未来对双人互动的调查奠定了基础,特别是在心理治疗环境中。

更新时间: 2025-12-11 17:00:44

领域: cs.LG

下载: http://arxiv.org/abs/2506.21429v2

Quantum Approaches to Urban Logistics: From Core QAOA to Clustered Scalability

The Traveling Salesman Problem (TSP) is a fundamental challenge in combinatorial optimization, widely applied in logistics and transportation. As the size of TSP instances grows, traditional algorithms often struggle to produce high-quality solutions within reasonable timeframes. This study investigates the potential of the Quantum Approximate Optimization Algorithm (QAOA), a hybrid quantum-classical method, to solve TSP under realistic constraints. We adopt a QUBO-based formulation of TSP that integrates real-world logistical constraints reflecting operational conditions, such as vehicle capacity, road accessibility, and time windows, while ensuring compatibility with the limitations of current quantum hardware. Our experiments are conducted in a simulated environment using high-performance computing (HPC) resources to assess QAOA's performance across different problem sizes and quantum circuit depths. In order to improve scalability, we propose clustering QAOA (Cl-QAOA), a hybrid approach combining classical machine learning with QAOA. This method decomposes large TSP instances into smaller sub-problems, making quantum optimization feasible even on devices with a limited number of qubits. The results offer a comprehensive evaluation of QAOA's strengths and limitations in solving constrained TSP scenarios. This study advances quantum optimization and lays groundwork for future large-scale applications.

Updated: 2025-12-11 17:00:24

标题: 量子方法在城市物流中的应用:从核心QAOA到集群可扩展性

摘要: 旅行推销员问题(TSP)是组合优化中的一个基本挑战,在物流和运输领域得到广泛应用。随着TSP实例规模的增长,传统算法往往难以在合理的时间范围内产生高质量的解决方案。本研究调查了量子近似优化算法(QAOA)在现实约束条件下解决TSP的潜力,这是一种混合量子-经典方法。我们采用基于QUBO的TSP公式,集成了反映运营条件的现实物流约束,如车辆容量、道路可达性和时间窗口,同时确保与当前量子硬件的限制兼容。我们在模拟环境中使用高性能计算(HPC)资源进行实验,评估QAOA在不同问题规模和量子电路深度下的性能。为了提高可伸缩性,我们提出了聚类QAOA(Cl-QAOA),这是一种结合了经典机器学习和QAOA的混合方法。该方法将大型TSP实例分解为较小的子问题,使得即使在具有有限量子比特数的设备上,量子优化也成为可能。结果全面评估了QAOA在解决受约束的TSP场景中的优势和局限性。这项研究推动了量子优化,并为未来大规模应用奠定了基础。

更新时间: 2025-12-11 17:00:24

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2512.10813v1

HAROOD: A Benchmark for Out-of-distribution Generalization in Sensor-based Human Activity Recognition

Sensor-based human activity recognition (HAR) mines activity patterns from the time-series sensory data. In realistic scenarios, variations across individuals, devices, environments, and time introduce significant distributional shifts for the same activities. Recent efforts attempt to solve this challenge by applying or adapting existing out-of-distribution (OOD) algorithms, but only in certain distribution shift scenarios (e.g., cross-device or cross-position), lacking comprehensive insights on the effectiveness of these algorithms. For instance, is OOD necessary to HAR? Which OOD algorithm performs the best? In this paper, we fill this gap by proposing HAROOD, a comprehensive benchmark for HAR in OOD settings. We define 4 OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time, and build a testbed covering 6 datasets, 16 comparative methods (implemented with CNN-based and Transformer-based architectures), and two model selection protocols. Then, we conduct extensive experiments and present several findings for future research, e.g., no single method consistently outperforms others, highlighting substantial opportunity for advancement. Our codebase is highly modular and easy to extend for new datasets, algorithms, comparisons, and analysis, with the hope to facilitate the research in OOD-based HAR. Our implementation is released and can be found at https://github.com/AIFrontierLab/HAROOD.

Updated: 2025-12-11 16:52:50

标题: HAROOD:基于传感器的人体活动识别中的分布外泛化基准

摘要: 基于传感器的人体活动识别(HAR)从时间序列感官数据中挖掘活动模式。在现实场景中,个体、设备、环境和时间的变化为相同的活动引入了显著的分布偏移。最近的努力试图通过应用或调整现有的分布外(OOD)算法来解决这一挑战,但仅限于某些分布偏移场景(例如跨设备或跨位置),缺乏对这些算法有效性的全面洞见。例如,OOD对HAR来说是必要的吗?哪种OOD算法效果最好?在本文中,我们通过提出HAROOD,为OOD设置中的HAR提供了一个全面的基准。我们定义了4个OOD场景:跨人员、跨位置、跨数据集和跨时间,并建立了一个涵盖6个数据集、16种比较方法(采用基于CNN和Transformer的架构实现)和两种模型选择协议的测试平台。然后,我们进行了广泛的实验,并提出了一些未来研究的发现,例如,没有单一方法始终优于其他方法,突显了进步的重大机会。我们的代码库非常模块化,易于扩展到新的数据集、算法、比较和分析,希望能促进基于OOD的HAR研究。我们的实现已发布,可在https://github.com/AIFrontierLab/HAROOD找到。

更新时间: 2025-12-11 16:52:50

领域: cs.AI

下载: http://arxiv.org/abs/2512.10807v1

Interpretable and Steerable Concept Bottleneck Sparse Autoencoders

Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations; (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE) - a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.

Updated: 2025-12-11 16:48:07

标题: 可解释和可操控的概念瓶颈稀疏自动编码器

摘要: 稀疏自动编码器(SAEs)承诺在LLMs和LVLMs中提供一种统一的方法,用于机制可解释性、概念发现和模型引导。然而,实现这一潜力需要学习到的特征既具有可解释性又具有可操控性。为此,我们引入了两种新的计算成本低廉的可解释性和可操控性指标,并对LVLMs进行了系统分析。我们的分析揭示了两个观察结果:(i)大多数SAE神经元表现出较低的可解释性、较低的可操控性或两者兼有,使它们无法有效用于下游应用;(ii)由于SAE的无监督性质,用户所需的概念通常在学习到的词典中缺失,从而限制了它们的实际效用。为了解决这些限制,我们提出了概念瓶颈稀疏自动编码器(CB-SAE)- 一种新颖的事后框架,通过修剪低效用神经元,并在潜在空间中增加一个轻量级概念瓶颈,与用户定义的概念集对齐。由此产生的CB-SAE在LVLMs和图像生成任务中,可将解释性提高32.1%,将可操控性提高14.5%。我们将提供我们的代码和模型权重。

更新时间: 2025-12-11 16:48:07

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2512.10805v1

Equivariant Test-Time Training with Operator Sketching for Imaging Inverse Problems

Equivariant Imaging (EI) regularization has become the de-facto technique for unsupervised training of deep imaging networks, without any need of ground-truth data. Observing that the EI-based unsupervised training paradigm currently has significant computational redundancy leading to inefficiency in high-dimensional applications, we propose a sketched EI regularization which leverages the randomized sketching techniques for acceleration. We apply our sketched EI regularization to develop an accelerated deep internal learning framework, which can be efficiently applied for test-time network adaptation. Additionally, for network adaptation tasks, we propose a parameter-efficient approach to accelerate both EI and Sketched-EI via optimizing only the normalization layers. Our numerical study on X-ray CT and multicoil magnetic resonance image reconstruction tasks demonstrate that our approach can achieve significant computational acceleration over the standard EI counterpart, especially in test-time training tasks.

Updated: 2025-12-11 16:43:35

标题: 基于算子草图的等变测试时训练在成像逆问题中的应用

摘要: 等变成像(EI)正则化已成为深度成像网络无监督训练的事实标准技术,无需任何真实(ground-truth)数据。观察到基于EI的无监督训练范式目前存在显著的计算冗余,导致在高维应用中效率低下,我们提出了一种利用随机草图(sketching)技术加速的草图EI正则化。我们将草图EI正则化应用于开发一个加速的深度内部学习框架,可以有效地用于测试时网络适应。此外,对于网络适应任务,我们提出了一种参数高效的方法,通过仅优化归一化层来加速EI和草图EI。我们对X射线CT和多线圈磁共振图像重建任务进行的数值研究表明,相较于标准EI,我们的方法可以实现显著的计算加速,尤其是在测试时训练任务中。

更新时间: 2025-12-11 16:43:35

领域: eess.IV,cs.CV,cs.LG,math.OC

下载: http://arxiv.org/abs/2411.05771v5

Achieving Trustworthy Real-Time Decision Support Systems with Low-Latency Interpretable AI Models

This paper investigates real-time decision support systems that leverage low-latency AI models, bringing together recent progress in holistic AI-driven decision tools, integration with Edge-IoT technologies, and approaches for effective human-AI teamwork. It looks into how large language models can assist decision-making, especially when resources are limited. The research also examines the effects of technical developments such as DeLLMa, methods for compressing models, and improvements for analytics on edge devices, while also addressing issues like limited resources and the need for adaptable frameworks. Through a detailed review, the paper offers practical perspectives on development strategies and areas of application, adding to the field by pointing out opportunities for more efficient and flexible AI-supported systems. The conclusions set the stage for future breakthroughs in this fast-changing area, highlighting how AI can reshape real-time decision support.

Updated: 2025-12-11 16:41:25

标题: 实现具有低延迟可解释人工智能模型的可信实时决策支持系统

摘要: 本文调查了利用低延迟人工智能模型的实时决策支持系统,将最新的全面人工智能驱动决策工具的进展、与边缘物联网技术的集成以及有效的人工智能团队合作方法结合在一起。它探讨了大型语言模型如何在资源有限的情况下辅助决策制定。研究还考察了技术发展的影响,如DeLLMa、模型压缩方法以及边缘设备上的分析改进,同时还解决了资源有限和需求可调整框架等问题。通过详细的回顾,本文提供了关于发展战略和应用领域的实用观点,指出了更高效、灵活的人工智能支持系统的机会,从而为该领域增添价值。结论为未来在这一快速变化领域的突破奠定了基础,突出了人工智能如何重塑实时决策支持的重要性。

更新时间: 2025-12-11 16:41:25

领域: cs.AI,cs.AR

下载: http://arxiv.org/abs/2506.20018v2

What matters for Representation Alignment: Global Information or Spatial Structure?

Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its \textit{global} semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of \emph{spatial} information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in $<$4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
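
As a sketch of the "spatial structure" view (not iREPA itself, which swaps the projection layer and adds spatial normalization), one can align token-token cosine-similarity matrices rather than raw features; the dimensions below are illustrative.

    import torch
    import torch.nn.functional as F

    def spatial_structure_loss(diff_feats, target_feats):
        """Align pairwise patch-token cosine-similarity structure instead of
        raw features: build each representation's token-token cosine matrix
        and penalize their mismatch. Feature widths may differ, since only the
        N x N similarity matrices are compared."""
        a = F.normalize(diff_feats, dim=-1)       # (B, N, D1) diffusion tokens
        b = F.normalize(target_feats, dim=-1)     # (B, N, D2) encoder tokens
        sim_a = a @ a.transpose(1, 2)             # (B, N, N) cosine structure
        sim_b = b @ b.transpose(1, 2)
        return (sim_a - sim_b).pow(2).mean()

    diff = torch.randn(2, 256, 768, requires_grad=True)   # 16x16 patch grid
    target = torch.randn(2, 256, 1024)                    # frozen encoder tokens
    loss = spatial_structure_loss(diff, target)
    loss.backward()
    print(loss.item())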

Updated: 2025-12-11 16:39:53

标题: 代表对齐的关键因素:全局信息还是空间结构?

摘要: Representation alignment (REPA)通过将来自强大的预训练视觉编码器的表示蒸馏到中间扩散特征来引导生成训练。我们研究了一个基本问题:对于生成而言,目标表示的哪个方面更重要,是其全局语义信息(例如以ImageNet-1K准确率衡量),还是其空间结构(即补丁token之间的成对余弦相似度)?普遍的观点认为,作为目标表示,更强的全局语义性能会带来更好的生成。为了研究这一点,我们首先进行了一项涵盖27种不同视觉编码器和不同模型规模的大规模实证分析。结果令人惊讶:驱动目标表示生成性能的是空间结构,而不是全局性能。为了进一步研究这一点,我们引入了两个简单的修改,专门强化空间信息的传递。我们用一个简单的卷积层替换了REPA中的标准MLP投影层,并为外部表示引入了一个空间归一化层。令人惊讶的是,我们的简单方法(不到4行代码即可实现),称为iREPA,在各种视觉编码器、模型大小和训练变体(如REPA、REPA-E、Meanflow、JiT等)中始终改善了REPA的收敛速度。我们的工作激励重新审视表示对齐的基本工作机制,以及如何利用它来改进生成模型的训练。代码和项目页面可在https://end2end-diffusion.github.io/irepa找到。

更新时间: 2025-12-11 16:39:53

领域: cs.CV,cs.AI,cs.GR,cs.LG,stat.ML

下载: http://arxiv.org/abs/2512.10794v1

LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification

LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
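
A minimal sketch of the fusion step, with hypothetical dimensions: concatenate the backbone embedding with the LLM per-class scores and map the joint representation through a small MLP to final logits.

    import torch
    import torch.nn as nn

    class FusionMLP(nn.Module):
        """Fuse a transformer classifier's embedding with LLM per-class scores
        by concatenation, then map to final class logits."""
        def __init__(self, emb_dim, n_classes, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(emb_dim + n_classes, hidden), nn.ReLU(),
                nn.Linear(hidden, n_classes))

        def forward(self, embedding, llm_scores):
            return self.net(torch.cat([embedding, llm_scores], dim=-1))

    emb = torch.randn(4, 768)                         # e.g. RoBERTa [CLS] embeddings
    llm = torch.softmax(torch.randn(4, 10), dim=-1)   # LLM per-class scores
    fusion = FusionMLP(emb_dim=768, n_classes=10)
    print(fusion(emb, llm).shape)                     # torch.Size([4, 10])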

Updated: 2025-12-11 16:39:07

标题: LabelFusion:学习融合LLMs和Transformer分类器以实现文本分类的稳健性

摘要: LabelFusion是一个用于文本分类的融合集成系统,它学习将传统的基于transformer的分类器(例如RoBERTa)与一个或多个大型语言模型(LLM,如OpenAI GPT、Google Gemini或DeepSeek)结合起来,以在多类和多标签任务中提供准确且成本敏感的预测。该软件包提供了一个简单的高级接口(AutoFusionClassifier),可以以最小的配置训练完整的流水线,并为高级用户提供了灵活的API。在内部,LabelFusion通过将ML主干的嵌入与LLM导出的每个类别得分(通过结构化提示工程策略获得)连接起来,从而整合了来自两个源的向量信号,并将这种联合表示输入到一个紧凑的多层感知器(FusionMLP)中,以产生最终的预测。这种学习的融合方法捕捉了LLM推理和传统基于transformer的分类器的互补优势,实现了跨领域的稳健性能,例如在AG News上达到92.4%的准确度,在10类Reuters 21578主题分类上达到92.3%的准确度,同时实现了准确性、延迟和成本之间的实际权衡。

更新时间: 2025-12-11 16:39:07

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.10793v1

EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce

Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. To this end, we introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated through human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.

Updated: 2025-12-11 16:38:57

标题: EcomBench:朝向电子商务基础代理的整体评估

摘要: 基础代理在推理和与真实环境互动的能力方面迅速取得进展,这使得评估它们的核心能力变得越来越重要。虽然已经开发了许多基准来评估代理的表现,但大多集中在学术环境或人为设计的场景上,忽视了在真实应用中出现的挑战。为了解决这个问题,我们关注一个高度实用的现实世界设置,即电子商务领域,该领域涉及大量多样化的用户互动、动态市场条件以及直接与真实决策过程相关的任务。为此,我们介绍了 EcomBench,一个旨在评估代理在真实电子商务环境中表现的综合性电子商务基准。EcomBench是基于全球领先的电子商务生态系统中嵌入的真实用户需求构建的,并通过人类专家精心策划和注释,以确保清晰、准确和领域相关性。它涵盖了电子商务场景中的多个任务类别,并定义了三个难度级别,评估代理在深度信息检索、多步推理和跨源知识整合等关键能力上的表现。通过将评估基于真实的电子商务环境,EcomBench为衡量现代电子商务中代理的实际能力提供了严格而动态的测试平台。

更新时间: 2025-12-11 16:38:57

领域: cs.AI

下载: http://arxiv.org/abs/2512.08868v2

SyGra: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data

The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT), alignment tasks like Direct Preference Optimization (DPO), etc. In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
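
The dual-stage quality tagging described above can be pictured as a cheap heuristic pass followed by an LLM judge. The sketch below is our reading of that design under stated assumptions; `heuristic_pass`, `llm_score`, and the thresholds are hypothetical stand-ins, not the framework's API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    prompt: str
    response: str

def heuristic_pass(s: Sample) -> bool:
    # Stage 1: cheap structural rules (thresholds are illustrative).
    return len(s.response.split()) >= 5 and s.response.strip() != s.prompt.strip()

def quality_tag(samples: List[Sample],
                llm_score: Callable[[Sample], float],
                min_score: float = 0.7) -> List[Sample]:
    """Two-stage tagging: heuristic rules first, then an LLM-based judge.
    `llm_score` stands in for any judge returning a score in [0, 1]."""
    kept = []
    for s in samples:
        if not heuristic_pass(s):      # stage 1: filter
            continue
        if llm_score(s) >= min_score:  # stage 2: score
            kept.append(s)
    return kept
```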

Updated: 2025-12-11 16:38:00

标题: SyGra:一个统一的基于图的框架,用于大规模生成、质量标记和管理合成数据

摘要: 大型语言模型(LLMs)的发展在很大程度上取决于用于监督微调(SFT)、直接偏好优化(DPO)等对齐任务的高质量数据集的可用性。在这项工作中,我们提出了一个全面的合成数据生成框架,促进了专为这些训练范式量身定制的合成数据的可伸缩、可配置和高保真生成。我们的方法采用了一个模块化和基于配置的管道,能够在最小干预的情况下对复杂的对话流进行建模。该框架使用了一个双阶段质量标记机制,结合启发式规则和LLM评估,自动过滤和评分从OASST格式对话中提取的数据,确保高质量对话样本的筛选。生成的数据集按照一个灵活的架构进行组织,支持SFT和DPO用例,实现了与不同训练工作流的无缝集成。这些创新共同提供了一个强大的解决方案,用于在规模上生成和管理合成对话数据,显著减少了LLM训练管道中数据准备的开销。

更新时间: 2025-12-11 16:38:00

领域: cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2508.15432v3

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
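
Since the abstract states that the suite score is an average of the four sub-leaderboards, the aggregation is a one-liner; the scores below are made-up numbers for illustration only.

```python
def facts_suite_score(multimodal: float, parametric: float,
                      search: float, grounding: float) -> float:
    """The final suite score is the unweighted mean of the four components."""
    return (multimodal + parametric + search + grounding) / 4

# Hypothetical model: (0.80 + 0.70 + 0.75 + 0.85) / 4 = 0.775.
print(facts_suite_score(0.80, 0.70, 0.75, 0.85))  # 0.775
```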

Updated: 2025-12-11 16:35:14

标题: 事实排行榜:大型语言模型真实性的综合基准

摘要: 我们介绍了FACTS排行榜,这是一个在线排行榜套件及相关基准,全面评估语言模型在各种场景下生成事实准确文本的能力。该套件通过汇总模型在四个不同子排行榜上的表现,提供了一个全面的事实性度量:(1)FACTS Multimodal,衡量对基于图像的问题的响应的事实性;(2)FACTS Parametric,通过回答来自内部参数的闭卷事实问题来评估模型的世界知识;(3)FACTS Search,评估信息检索场景中的事实性,在这种情况下,模型必须使用搜索API;以及(4)FACTS Grounding(v2),评估长篇回答是否基于提供的文件,并具有显著改进的判断模型。每个子排行榜都使用自动化的判断模型对模型的响应进行评分,最终套件分数是四个组成部分的平均值,旨在提供对模型整体事实性的强大和平衡的评估。FACTS排行榜套件将得到积极维护,包含公共和私有分割,以允许外部参与同时保护其完整性。您可以在https://www.kaggle.com/benchmarks/google/facts 找到它。

更新时间: 2025-12-11 16:35:14

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.10791v1

Natural Language Interface for Firewall Configuration

This paper presents the design and prototype implementation of a natural language interface for configuring enterprise firewalls. The framework allows administrators to express access control policies in plain language, which are then translated into vendor specific configurations. A compact schema bound intermediate representation separates human intent from device syntax and in the current prototype compiles to Palo Alto PAN OS command line configuration while remaining extensible to other platforms. Large language models are used only as assistive parsers that generate typed intermediate representation objects, while compilation and enforcement remain deterministic. The prototype integrates three validation layers, namely a static linter that checks structural and vendor specific constraints, a safety gate that blocks overly permissive rules such as any to any allows, and a Batfish based simulator that validates configuration syntax and referential integrity against a synthetic device model. The paper describes the architecture, implementation, and test methodology on synthetic network context datasets and discusses how this approach can evolve into a scalable auditable and human centered workflow for firewall policy management.
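
A minimal sketch of the schema-bound intermediate representation and its deterministic compilation, assuming hypothetical field names; the PAN-OS command syntax shown is only indicative of the style of output, not a verified configuration line.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FirewallRule:
    """Hypothetical schema-bound IR: the LLM only emits typed objects like
    this, while compilation and enforcement below stay deterministic."""
    name: str
    src_zone: str
    dst_zone: str
    app: str
    action: str  # "allow" | "deny"

def safety_gate(rule: FirewallRule) -> None:
    # Block overly permissive any-to-any allows, as in the paper's safety layer.
    if rule.action == "allow" and rule.src_zone == "any" and rule.dst_zone == "any":
        raise ValueError(f"rule {rule.name}: any-to-any allow rejected")

def compile_panos(rule: FirewallRule) -> str:
    # Illustrative PAN-OS-style CLI output; exact syntax varies by version.
    return (f"set rulebase security rules {rule.name} "
            f"from {rule.src_zone} to {rule.dst_zone} "
            f"application {rule.app} action {rule.action}")

rule = FirewallRule("web-out", "trust", "untrust", "web-browsing", "allow")
safety_gate(rule)
print(compile_panos(rule))
```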

Updated: 2025-12-11 16:33:33

标题: 防火墙配置的自然语言界面

摘要: 这篇论文介绍了一个用于配置企业防火墙的自然语言接口的设计和原型实现。该框架允许管理员用普通语言表达访问控制策略,然后将其转换为特定厂商的配置。一个紧凑的模式绑定中间表示将人类意图与设备语法分离,在当前原型中编译为Palo Alto PAN OS命令行配置,同时可扩展到其他平台。大型语言模型仅用作辅助解析器,生成类型化的中间表示对象,而编译和强制执行保持确定性。该原型集成了三个验证层,即检查结构和特定厂商约束的静态检查器、阻止过于宽松规则(如任何到任何允许)的安全门,以及基于Batfish的模拟器,验证配置语法和引用完整性与合成设备模型相对应。该论文描述了在合成网络上下文数据集上的架构、实现和测试方法,并讨论了这种方法如何演变为防火墙策略管理的可扩展可审计和以人为中心的工作流程。

更新时间: 2025-12-11 16:33:33

领域: cs.NI,cs.AI

下载: http://arxiv.org/abs/2512.10789v1

Replace, Don't Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly

Retrieval-Augmented Generation (RAG) systems often fail on multi-hop queries when the initial retrieval misses a bridge fact. Prior corrective approaches, such as Self-RAG, CRAG, and Adaptive-$k$, typically address this by \textit{adding} more context or pruning existing lists. However, simply expanding the context window often leads to \textbf{context dilution}, where distractors crowd out relevant information. We propose \textbf{SEAL-RAG}, a training-free controller that adopts a \textbf{``replace, don't expand''} strategy to fight context dilution under a fixed retrieval depth $k$. SEAL executes a (\textbf{S}earch $\rightarrow$ \textbf{E}xtract $\rightarrow$ \textbf{A}ssess $\rightarrow$ \textbf{L}oop) cycle: it performs on-the-fly, entity-anchored extraction to build a live \textit{gap specification} (missing entities/relations), triggers targeted micro-queries, and uses \textit{entity-first ranking} to actively swap out distractors for gap-closing evidence. We evaluate SEAL-RAG against faithful re-implementations of Basic RAG, CRAG, Self-RAG, and Adaptive-$k$ in a shared environment on \textbf{HotpotQA} and \textbf{2WikiMultiHopQA}. On HotpotQA ($k=3$), SEAL improves answer correctness by \textbf{+3--13 pp} and evidence precision by \textbf{+12--18 pp} over Self-RAG. On 2WikiMultiHopQA ($k=5$), it outperforms Adaptive-$k$ by \textbf{+8.0 pp} in accuracy and maintains \textbf{96\%} evidence precision compared to 22\% for CRAG. These gains are statistically significant ($p<0.001$). By enforcing fixed-$k$ replacement, SEAL yields a predictable cost profile while ensuring the top-$k$ slots are optimized for precision rather than mere breadth. We release our code and data at https://github.com/mosherino/SEAL-RAG.
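
The "replace, don't expand" loop can be sketched as follows. All callables (`retrieve`, `extract_gaps`, `score`) are placeholders for the paper's retriever, gap extractor, and entity-first ranker; only the control flow, one Search → Extract → Assess → Loop pass under a fixed budget k, is taken from the abstract.

```python
def seal_step(query, context, retrieve, extract_gaps, score, k):
    """One SEAL iteration: identify missing entities/relations, issue
    targeted micro-queries, and swap weak passages for gap-closing ones
    instead of growing the context past k slots."""
    gaps = extract_gaps(query, context)          # live gap specification
    for gap in gaps:
        candidates = retrieve(f"{query} {gap}")  # targeted micro-query
        for cand in candidates:
            worst = min(context, key=score)
            if score(cand) > score(worst):       # replace, don't expand
                context[context.index(worst)] = cand
    assert len(context) <= k                     # fixed retrieval depth
    return context
```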

Updated: 2025-12-11 16:31:29

标题: 替换,而不是扩展:通过固定预算证据组装来减轻多跳RAG中的上下文稀释

摘要: 检索增强生成(RAG)系统在多跳查询时经常失败,当初始检索错过桥接事实时。先前的纠正方法,如Self-RAG、CRAG和自适应-$k$,通常通过\textit{添加}更多上下文或修剪现有列表来解决这个问题。然而,简单地扩展上下文窗口通常会导致\textbf{上下文稀释},使干扰因素挤占相关信息。我们提出了\textbf{SEAL-RAG},这是一个无需训练的控制器,采用\textbf{``替换,不扩展''}策略来对抗上下文稀释,在固定的检索深度$k$下。SEAL执行一个(\textbf{S}earch $\rightarrow$ \textbf{E}xtract $\rightarrow$ \textbf{A}ssess $\rightarrow$ \textbf{L}oop)循环:它执行实时的、实体锚定的提取来构建一个活动的\textit{缺口规范}(缺失的实体/关系),触发有针对性的微查询,并使用\textit{实体优先排序}来积极地用缺口封闭证据替换干扰因素。我们在\textbf{HotpotQA}和\textbf{2WikiMultiHopQA}上对SEAL-RAG进行评估,与Basic RAG、CRAG、Self-RAG和自适应-$k$的忠实重现在共享环境中进行比较。在HotpotQA($k=3$)上,SEAL将答案正确率提高了\textbf{+3--13 pp},将证据精度提高了\textbf{+12--18 pp},超过了Self-RAG。在2WikiMultiHopQA($k=5$)上,它在准确性方面优于自适应-$k$,提高了\textbf{+8.0 pp},与CRAG的22\%相比,保持了\textbf{96\%}的证据精度。这些收益在统计上是显著的($p<0.001$)。通过执行固定-$k$替换,SEAL产生可预测的成本配置文件,同时确保顶部-$k$槽被优化为精度而不仅仅是广度。我们在https://github.com/mosherino/SEAL-RAG上发布了我们的代码和数据。

更新时间: 2025-12-11 16:31:29

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2512.10787v1

Developing and Evaluating a Large Language Model-Based Automated Feedback System Grounded in Evidence-Centered Design for Supporting Physics Problem Solving

Generative AI offers new opportunities for individualized and adaptive learning, particularly through large language model (LLM)-based feedback systems. While LLMs can produce effective feedback for relatively straightforward conceptual tasks, delivering high-quality feedback for tasks that require advanced domain expertise, such as physics problem solving, remains a substantial challenge. This study presents the design of an LLM-based feedback system for physics problem solving grounded in evidence-centered design (ECD) and evaluates its performance within the German Physics Olympiad. Participants assessed the usefulness and accuracy of the generated feedback, which was generally perceived as useful and highly accurate. However, an in-depth analysis revealed that the feedback contained factual errors in 20% of cases; errors that often went unnoticed by the students. We discuss the risks associated with uncritical reliance on LLM-based feedback systems and outline potential directions for generating more adaptive and reliable LLM-based feedback in the future.

Updated: 2025-12-11 16:29:38

标题: 基于大型语言模型的自动反馈系统的开发与评估:基于证据中心设计支持物理问题解决的文献

摘要: 生成式人工智能为个性化和自适应学习提供了新的机会,特别是通过基于大型语言模型(LLM)的反馈系统。虽然LLMs可以为相对简单的概念任务提供有效的反馈,但为需要高级领域专业知识的任务提供高质量的反馈仍然是一个重大挑战。本研究介绍了一个基于证据中心设计(ECD)的物理问题解决的LLM反馈系统的设计,并在德国物理奥林匹克竞赛中评估了其性能。参与者评估了生成的反馈的有用性和准确性,通常认为是有用和高度准确的。然而,深入分析发现,在20%的情况下,反馈中包含事实错误;这些错误常常被学生忽视。我们讨论了盲目依赖LLM反馈系统所带来的风险,并概述了未来生成更具适应性和可靠性的LLM反馈的潜在方向。

更新时间: 2025-12-11 16:29:38

领域: physics.ed-ph,cs.AI,cs.HC

下载: http://arxiv.org/abs/2512.10785v1

Extrapolating Jet Radiation with Autoregressive Transformers

Generative networks are an exciting tool for fast generation of LHC events with a fixed number of particles. Autoregressive transformers allow us to generate events containing variable numbers of particles, very much in line with the physics of QCD jet radiation, and offer the possibility to generalize to higher multiplicities. We show how transformers can learn a factorized likelihood for jet radiation and extrapolate in terms of the number of generated jets. For this extrapolation, bootstrapping training data and training with modifications of the likelihood loss can be used.
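
To make the factorized likelihood concrete: with a variable number of jets, the joint density over jet momenta x_1, ..., x_n can be written autoregressively, with an explicit stop probability absorbing the variable multiplicity. The notation below is ours, not the paper's.

```latex
p(x_1, \dots, x_n, n) \;=\;
\left[ \prod_{i=1}^{n} p\big(x_i \mid x_1, \dots, x_{i-1}\big) \right]
p\big(\mathrm{stop} \mid x_1, \dots, x_n\big)
```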

Updated: 2025-12-11 16:24:23

标题: 用自回归Transformer外推喷注辐射

摘要: 生成网络是用于快速生成固定粒子数的LHC事件的一种令人兴奋的工具。自回归变压器允许我们生成包含可变粒子数的事件,非常符合QCD喷注辐射的物理规律,并且可以推广到更高的多重性。我们展示了变压器如何学习喷注辐射的因子化似然,并在生成的喷注数量方面进行外推。为了进行这种外推,可以使用自助训练数据和训练具有修改的似然损失。

更新时间: 2025-12-11 16:24:23

领域: hep-ph,cs.LG

下载: http://arxiv.org/abs/2412.12074v2

Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting

Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.

Updated: 2025-12-11 16:15:42

标题: 脚本差距:在真实世界环境中评估印度语言在本地文字与罗马化文字上的LLM分诊

摘要: 大型语言模型(LLMs)在印度的高风险临床应用中越来越多地得到部署。在许多这样的环境中,印度语言的使用者经常使用罗马化文本而不是本地文字进行交流,然而现有研究很少使用真实数据来评估这种文字变体。我们研究了罗马化对LLMs在一个关键领域——产妇和新生儿保健分诊中的可靠性的影响。我们在一个跨越五种印度语言和尼泊尔语的用户生成查询的真实数据集上对领先的LLMs进行了基准测试。我们的结果显示,罗马化消息的性能持续下降,F1得分比本地文字低5-12个点。在我们在印度的合作伙伴母婴保健组织中,这种差距可能导致将近200万个分诊错误。关键是,这种脚本之间的性能差距并不是由临床推理失败引起的。我们证明LLMs经常可以正确推断罗马化查询的语义意图。然而,在罗马化输入中存在文字噪音时,它们的最终分类输出仍然脆弱。我们的研究结果突显了LLM基于健康系统中的一个关键安全盲点:看似理解罗马化输入的模型可能仍然无法可靠地对其进行处理。

更新时间: 2025-12-11 16:15:42

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2512.10780v1

Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation

Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.

Updated: 2025-12-11 16:09:54

标题: 成长和融合:有效语言适应的扩展策略

摘要: 实现包括中低资源语言在内的高性能语言模型仍然是一个挑战。与特定语言适应相比,大规模多语言模型仍然表现不佳,特别是在较小的模型规模下。在这项工作中,我们研究了扩展作为一种有效的策略,用于将预训练模型适应到新的目标语言。通过对大致FLOP匹配的模型进行全面的扩展消融实验,我们测试了放大英语基础模型是否能够实现比标准持续预训练更有效、更节省资源的适应。我们发现,一旦暴露于足够的目标语言数据,更大规模的放大模型可以匹配或超越在更多数据上持续预训练的较小模型的性能,展示了扩展对数据效率的好处。扩展还有助于保留基础模型在英语中的能力,从而减少灾难性遗忘。最后,我们探讨了这种放大的、特定语言的模型是否可以合并以构建模块化和灵活的多语言系统。我们发现,尽管合并仍不如联合多语言训练有效,但放大的合并比较小的合并效果更好。我们观察到合并方法之间存在较大的性能差异,表明通过专门用于语言级集成的合并方法可能存在改进的潜力。

更新时间: 2025-12-11 16:09:54

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2512.10772v1

Template-Free Retrosynthesis with Graph-Prior Augmented Transformers

Retrosynthesis reaction prediction seeks to infer plausible reactant molecules for a given product and is a central problem in computer-aided organic synthesis. Despite recent progress, many existing models still fall short of the accuracy and robustness required for practical deployment. This work studies a template-free, Transformer-based framework that eliminates reliance on handcrafted reaction templates or additional chemical rule engines. The model injects molecular graph information into the attention mechanism to jointly exploit SMILES sequences and structural cues, and further applies a paired data augmentation strategy to enhance training diversity and scale. On the USPTO-50K benchmark, our proposed approach achieves state-of-the-art performance among template-free methods and substantially outperforms a vanilla Transformer baseline.
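
One common way to inject graph structure into attention, and a plausible reading of the mechanism above, is to add a graph-derived bias to the attention logits. The bias construction below is an assumption for illustration; the paper's exact injection scheme may differ.

```python
import torch

def graph_biased_attention(q, k, v, graph_bias):
    """Sketch of graph-prior attention: a bias matrix derived from the
    molecular graph (e.g., bond adjacency or shortest-path distances
    between atom tokens) is added to the attention logits before softmax.
    The bias itself is a placeholder here."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5  # (tokens, tokens)
    logits = logits + graph_bias                 # structural prior
    return torch.softmax(logits, dim=-1) @ v

t, d = 8, 16
q = k = v = torch.randn(t, d)
bias = torch.zeros(t, t)  # hypothetical: fill from SMILES-token/atom adjacency
out = graph_biased_attention(q, k, v, bias)
```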

Updated: 2025-12-11 16:08:32

标题: 基于图先验增强Transformer的无模板反合成

摘要: 反合成反应预测旨在推断给定产物的可能反应物分子,是计算辅助有机合成中的一个核心问题。尽管最近取得了一些进展,许多现有模型仍然无法达到实际部署所需的准确性和稳健性。本文研究了一种无模板、基于Transformer的框架,消除了对手工制作的反应模板或额外化学规则引擎的依赖。该模型将分子图信息注入注意力机制,共同利用SMILES序列和结构线索,并进一步应用成对数据增强策略来增强训练多样性和规模。在USPTO-50K基准测试中,我们提出的方法在无模板方法中达到了最先进的性能,并且明显优于一个基本的Transformer基线。

更新时间: 2025-12-11 16:08:32

领域: cs.LG

下载: http://arxiv.org/abs/2512.10770v1

WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving

We introduce WAM-Flow, a vision-language-action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space. In contrast to autoregressive decoders, WAM-Flow performs fully parallel, bidirectional denoising, enabling coarse-to-fine refinement with a tunable compute-accuracy trade-off. Specifically, the approach combines a metric-aligned numerical tokenizer that preserves scalar geometry via triplet-margin learning, a geometry-aware flow objective and a simulator-guided GRPO alignment that integrates safety, ego progress, and comfort rewards while retaining parallel generation. A multi-stage adaptation converts a pre-trained auto-regressive backbone (Janus-1.5B) from causal decoding to non-causal flow model and strengthens road-scene competence through continued multimodal pretraining. Thanks to the inherent nature of consistency model training and parallel decoding inference, WAM-Flow achieves superior closed-loop performance against autoregressive and diffusion-based VLA baselines, with 1-step inference attaining 89.1 PDMS and 5-step inference reaching 90.3 PDMS on NAVSIM v1 benchmark. These results establish discrete flow matching as a new promising paradigm for end-to-end autonomous driving. The code will be publicly available soon.

Updated: 2025-12-11 16:06:13

标题: WAM-Flow:通过离散流匹配实现自动驾驶的并行粗到细运动规划

摘要: 我们介绍了WAM-Flow,一个将自我轨迹规划视为在结构化令牌空间上的离散流匹配的视觉-语言-动作(VLA)模型。与自回归解码器相比,WAM-Flow执行完全并行、双向去噪,实现了可调节的计算-精度权衡的粗到精的改进。具体而言,该方法结合了一个保留标量几何的度量对齐的数值分词器(通过三元组边缘学习)、一个几何感知流目标和一个模拟器引导的GRPO对齐,该对齐集成了安全性、自我进展和舒适性奖励,同时保持并行生成。多阶段适应将一个预训练的自回归骨干(Janus-1.5B)从因果解码转换为非因果流模型,并通过持续的多模态预训练加强了道路场景能力。由于一致性模型训练和并行解码推理的固有特性,WAM-Flow在与自回归和扩散基础的VLA基线相比实现了优越的闭环性能,1步推理达到89.1 PDMS,5步推理达到90.3 PDMS,达到了NAVSIM v1基准的优秀表现。这些结果将离散流匹配确立为端到端自动驾驶的新的有前途的范例。代码将很快公开。

更新时间: 2025-12-11 16:06:13

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2512.06112v2

FOL-Traces: Verified First-Order Logic Reasoning Traces at Scale

Reasoning in language models is difficult to evaluate: natural-language traces are unverifiable, symbolic datasets too small, and most benchmarks conflate heuristics with inference. We present FOL-Traces, the first large-scale dataset of programmatically verified reasoning traces, enabling rigorous evaluation of structured logical inference. We also propose two challenging and comprehensive diagnostic tasks-masked operation prediction and step completion-that directly probe syntactic awareness and process fidelity. FOL-Traces serves as a scalable testbed for rigorously studying how models perform structured logical inference. Systematic experiments with 5 reasoning LLMs show that the dataset remains challenging: models only reach around 45.7% accuracy on masked operation prediction and around 27% on two-step completion.

Updated: 2025-12-11 16:02:17

标题: FOL-Traces: 在规模上验证的一阶逻辑推理轨迹

摘要: 语言模型中的推理很难评估:自然语言轨迹无法验证,符号数据集太小,而大多数基准测试将启发式与推理混为一谈。我们提出了FOL-Traces,这是第一个经过程序验证的推理轨迹的大规模数据集,可以严格评估结构化逻辑推理。我们还提出了两个具有挑战性且全面的诊断任务(掩码操作预测和步骤补全),直接探测句法意识和过程保真度。FOL-Traces作为一个可扩展的试验平台,用于严格研究模型如何执行结构化逻辑推理。对5个推理LLM进行的系统性实验表明,该数据集仍然具有挑战性:模型在掩码操作预测上的准确率仅约为45.7%,在两步补全上仅约为27%。

更新时间: 2025-12-11 16:02:17

领域: cs.AI

下载: http://arxiv.org/abs/2505.14932v2

Deep Operator BSDE: a Numerical Scheme to Approximate Solution Operators

Motivated by dynamic risk measures and conditional $g$-expectations, in this work we propose a numerical method to approximate the solution operator given by a Backward Stochastic Differential Equation (BSDE). The main ingredients for this are the Wiener chaos decomposition and the classical Euler scheme for BSDEs. We show convergence of this scheme under very mild assumptions, and provide a rate of convergence in more restrictive cases. We then implement it using neural networks, and we present several numerical examples where we can check the accuracy of the method.
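
For readers unfamiliar with the classical Euler scheme mentioned above: for a BSDE with driver f and terminal condition Y_T = ξ on a grid 0 = t_0 < ... < t_N = T, one common backward-in-time discretization reads as follows. This is the standard textbook form; the paper's variant may differ in details.

```latex
Z_{t_n} \approx \frac{1}{\Delta t}\,
  \mathbb{E}\!\left[ Y_{t_{n+1}}\,\Delta W_n \,\middle|\, \mathcal{F}_{t_n} \right],
\qquad
Y_{t_n} \approx
  \mathbb{E}\!\left[ Y_{t_{n+1}} + f\!\left(t_n, Y_{t_{n+1}}, Z_{t_n}\right)\Delta t
  \,\middle|\, \mathcal{F}_{t_n} \right],
\qquad
Y_{t_N} = \xi .
```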

Updated: 2025-12-11 15:57:23

标题: 深度算子BSDE:一种近似解算子的数值方案

摘要: 受动态风险度量和条件$g$-期望的启发,本文提出了一种数值方法来近似由反向随机微分方程(BSDE)给出的解算子。该方法的主要组成部分是Wiener混沌分解和BSDE的经典Euler方案。我们在非常温和的假设下证明了该方案的收敛性,并在更严格的情况下提供了收敛速度。然后我们利用神经网络实施该方法,并呈现了几个数值示例,可以验证该方法的准确性。

更新时间: 2025-12-11 15:57:23

领域: math.NA,cs.LG,math.PR

下载: http://arxiv.org/abs/2412.03405v2

Designing AI-Resilient Assessments Using Interconnected Problems: A Theoretically Grounded and Empirically Validated Framework

The rapid adoption of generative AI has undermined traditional modular assessments in computing education, creating a disconnect between academic evaluation and industry practice. This paper presents a theoretically grounded framework for designing AI-resilient assessments, supported by formal analysis and multi-year empirical validation. We make three contributions. First, we establish two theoretical results: (1) assessments composed of interconnected problems, where outputs feed into subsequent stages, are more AI-resilient than modular assessments because current language models struggle with sustained multi-step reasoning and context; and (2) semi-structured problems with deterministic success criteria provide more reliable measures of student competency than fully open-ended projects, which allow AI systems to default to familiar solution patterns. These results challenge common policy and institutional guidance that promotes open-ended assessments as the primary safeguard for academic integrity. Second, we validate these results using data from four university data science courses (N = 138). While students achieve near-perfect scores on AI-assisted modular homework, performance drops by roughly 30 percentage points on proctored exams, indicating substantial AI score inflation. Interconnected projects remain strongly correlated with modular assessments, suggesting they measure the same underlying skills while resisting AI misuse. Proctored exams show weaker alignment, implying they may assess test-taking ability rather than intended learning outcomes. Third, we translate these findings into a practical assessment design framework. The proposed approach enables educators to create assessments that promote integrative thinking, reflect real-world AI-augmented workflows, and naturally resist trivial delegation to generative AI, thereby helping restore academic integrity.

Updated: 2025-12-11 15:53:19

标题: 利用相互关联的问题设计AI弹性评估:一个有理论基础且经过实证验证的框架

摘要: 快速采用生成式人工智能已经削弱了传统的计算机教育模块化评估,造成了学术评估与行业实践之间的脱节。本文提出了一个理论基础的框架,用于设计AI弹性评估,该框架受到正式分析和多年经验验证的支持。 我们做出了三项贡献。首先,我们建立了两个理论结果:(1)由相互连接的问题组成的评估,其中输出进入后续阶段,比模块化评估更具AI弹性,因为当前语言模型在持续的多步推理和上下文方面存在困难; (2)具有确定性成功标准的半结构化问题提供了比完全开放式项目更可靠的学生能力测量标准,后者允许AI系统默认为熟悉的解决方案模式。这些结果挑战了促进开放式评估作为学术诚信的主要保障的常见政策和机构指导。 其次,我们使用来自四个大学数据科学课程的数据(N = 138)验证了这些结果。虽然学生在AI辅助模块化作业上获得了接近完美的分数,但在监考考试中成绩下降了约30个百分点,表明存在实质性的AI得分膨胀。相互连接的项目与模块化评估之间仍然存在强烈的相关性,表明它们测量了相同的基本技能,同时抵制了AI的误用。监考考试显示较弱的对齐,暗示它们可能评估的是考试能力而不是预期的学习成果。 第三,我们将这些发现转化为实用的评估设计框架。所提出的方法使教育工作者能够创建促进整合思维、反映现实世界AI增强工作流程的评估,并自然地抵制将生成式AI轻易委托的行为,从而有助于恢复学术诚信。

更新时间: 2025-12-11 15:53:19

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2512.10758v1

OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within a synthetic dataset, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
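
One round of the iterative active learning loop, as described, might look like the following sketch; the signatures (`uncertainty`, `annotate`, `train`) are our placeholders, with `train` standing in for the paper's RFT + RLVR stage. Only the selection-by-uncertainty criterion is stated in the abstract.

```python
def active_learning_round(verifier, pool, uncertainty, annotate, train, budget):
    """One OPV iteration: pick the most uncertain unlabeled cases under the
    current best verifier, get expert labels, retrain for the next round."""
    ranked = sorted(pool, key=uncertainty, reverse=True)
    batch = ranked[:budget]                  # most uncertain cases
    labeled = [(case, annotate(case)) for case in batch]
    new_verifier = train(verifier, labeled)  # Rejection Fine-Tuning + RLVR
    remaining = [c for c in pool if c not in batch]
    return new_verifier, remaining
```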

Updated: 2025-12-11 15:47:38

标题: OPV:基于结果的过程验证器,用于高效的长思维链验证

摘要: 大型语言模型(LLMs)通过可验证奖励的强化学习(RLVR)取得了在解决复杂推理任务方面的显著进展。这一进步也离不开可靠验证器自动监督。然而,目前基于结果的验证器(OVs)无法检查长推理链中不可靠的中间步骤。与此同时,当前基于过程的验证器(PVs)在可靠地检测复杂长推理链中的错误方面存在困难,这是由于人工标注的成本过高导致高质量标注的稀缺。因此,我们提出了基于结果的过程验证器(OPV),该验证器验证了从长推理链中总结的结果的理性过程,实现了准确和高效的验证,并能够进行大规模标注。为了增强提出的验证器,我们采用了一个迭代的主动学习框架,通过专家标注逐步提高OPV的验证能力,同时减少标注成本。具体而言,在每个迭代中,对当前最佳OPV的最不确定的案例进行标注,然后通过拒绝微调(RFT)和RLVR训练一个新的OPV用于下一轮。大量实验表明OPV具有卓越的性能和广泛的适用性。它在我们的留存OPV-Bench上取得了新的最新成果,优于Qwen3-Max-Preview等更大的开源模型,F1分数为83.1,而Qwen3-Max-Preview为76.3。此外,OPV在合成数据集中有效地检测到假阳性,与专家评估结果密切相关。与政策模型合作时,OPV始终取得性能提升,例如,在AIME2025上,将DeepSeek-R1-Distill-Qwen-32B的准确率从55.2%提高到73.3%,随着计算预算的增加。

更新时间: 2025-12-11 15:47:38

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2512.10756v1

Adapting to Change: A Comparison of Continual and Transfer Learning for Modeling Building Thermal Dynamics under Concept Drifts

Transfer Learning (TL) is currently the most effective approach for modeling building thermal dynamics when only limited data are available. TL uses a pretrained model that is fine-tuned to a specific target building. However, it remains unclear how to proceed after initial fine-tuning, as more operational measurement data are collected over time. This challenge becomes even more complex when the dynamics of the building change, for example, after a retrofit or a change in occupancy. In Machine Learning literature, Continual Learning (CL) methods are used to update models of changing systems. TL approaches can also address this challenge by reusing the pretrained model at each update step and fine-tuning it with new measurement data. A comprehensive study on how to incorporate new measurement data over time to improve prediction accuracy and address the challenges of concept drifts (changes in dynamics) for building thermal dynamics is still missing. Therefore, this study compares several CL and TL strategies, as well as a model trained from scratch, for thermal dynamics modeling during building operation. The methods are evaluated using 5--7 years of simulated data representative of single-family houses in Central Europe, including scenarios with concept drifts from retrofits and changes in occupancy. We propose a CL strategy, Seasonal Memory Learning (SML), that provides greater accuracy improvements than existing CL and TL methods, while maintaining low computational effort. SML outperformed the benchmark of initial fine-tuning by 28.1\% without concept drifts and 34.9\% with concept drifts.

Updated: 2025-12-11 15:37:19

标题: 适应变化:连续学习与迁移学习在建筑热动力建模中对概念漂移的比较

摘要: 迁移学习(TL)目前是在建筑热力动力学建模时仅有限数据可用时最有效的方法。TL使用一个预训练模型,对特定目标建筑进行微调。然而,如何在初始微调后继续进行仍不清楚,因为随着时间的推移收集到更多的运行测量数据。当建筑物的动态发生变化,例如在改建或入住情况变化后,这一挑战变得更加复杂。在机器学习文献中,连续学习(CL)方法用于更新不断变化系统的模型。TL方法也可以通过在每个更新步骤中重复使用预训练模型,并将其与新的测量数据进行微调来解决这一挑战。目前缺乏关于如何随时间整合新的测量数据以提高预测准确性并解决建筑热力动力学概念漂移(动态变化)挑战的全面性研究。 因此,本研究比较了几种连续学习和迁移学习策略,以及从头开始训练的模型,用于建筑运行期间的热力动力学建模。这些方法使用了代表中欧单户住宅的5-7年模拟数据进行评估,包括具有来自改建和入住情况变化的概念漂移的情景。我们提出了一种连续学习策略(季节性记忆学习),比现有的连续学习和迁移学习方法提供了更大的准确性改进,同时保持较低的计算量。在没有概念漂移的情况下,SML的表现超过了初始微调的基准28.1%,在有概念漂移的情况下超过了34.9%。

更新时间: 2025-12-11 15:37:19

领域: eess.SY,cs.LG

下载: http://arxiv.org/abs/2508.21615v2

State-Space Models for Tabular Prior-Data Fitted Networks

Recent advancements in foundation models for tabular data, such as TabPFN, demonstrated that pretrained Transformer architectures can approximate Bayesian inference with high predictive performance. However, Transformers suffer from quadratic complexity with respect to sequence length, motivating the exploration of more efficient sequence models. In this work, we investigate the potential of using Hydra, a bidirectional linear-time structured state space model (SSM), as an alternative to Transformers in TabPFN. A key challenge lies in SSM's inherent sensitivity to the order of input tokens - an undesirable property for tabular datasets where the row order is semantically meaningless. We investigate to what extent a bidirectional approach can preserve efficiency and enable symmetric context aggregation. Our experiments show that this approach reduces the order-dependence, achieving predictive performance competitive to the original TabPFN model.
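
A simple way to quantify the order sensitivity at issue is to permute the in-context rows and measure how much the predictions move; `predict` below is any PFN-style callable of (X_train, y_train, X_test), and the drift metric is our choice, not the paper's.

```python
import numpy as np

def order_sensitivity(predict, X_train, y_train, X_test, trials=10, seed=0):
    """Mean absolute prediction drift under random permutations of the
    in-context rows -- the quantity a bidirectional SSM should drive
    toward zero, since row order is semantically meaningless."""
    rng = np.random.default_rng(seed)
    base = predict(X_train, y_train, X_test)
    deltas = []
    for _ in range(trials):
        perm = rng.permutation(len(X_train))
        p = predict(X_train[perm], y_train[perm], X_test)
        deltas.append(np.abs(p - base).mean())
    return float(np.mean(deltas))  # 0 for a perfectly order-invariant model
```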

Updated: 2025-12-11 15:36:53

标题: 表格先验数据适配网络的状态空间模型

摘要: 最近关于表格数据基础模型的进展,如TabPFN,表明预训练的Transformer架构可以近似贝叶斯推断,并具有高预测性能。然而,Transformer在序列长度方面存在二次复杂度,促使我们探索更高效的序列模型。在这项工作中,我们研究了使用Hydra作为TabPFN中Transformer的替代方案的潜力,Hydra是一个双向的线性时间结构化状态空间模型(SSM)。一个关键挑战在于SSM对输入标记的顺序敏感的固有特性 - 这对于表格数据集来说是一个不希望的属性,因为行的顺序在语义上没有意义。我们研究双向方法在多大程度上可以保留效率并实现对称上下文聚合。我们的实验表明,这种方法减少了顺序依赖性,实现了与原始TabPFN模型竞争性的预测性能。

更新时间: 2025-12-11 15:36:53

领域: cs.LG

下载: http://arxiv.org/abs/2510.14573v2

The LLM Wears Prada: Analysing Gender Bias and Stereotypes through Online Shopping Data

With the wide and cross-domain adoption of Large Language Models, it becomes crucial to assess to which extent the statistical correlations in training data, which underlie their impressive performance, hide subtle and potentially troubling biases. Gender bias in LLMs has been widely investigated from the perspectives of works, hobbies, and emotions typically associated with a specific gender. In this study, we introduce a novel perspective. We investigate whether LLMs can predict an individual's gender based solely on online shopping histories and whether these predictions are influenced by gender biases and stereotypes. Using a dataset of historical online purchases from users in the United States, we evaluate the ability of six LLMs to classify gender and we then analyze their reasoning and products-gender co-occurrences. Results indicate that while models can infer gender with moderate accuracy, their decisions are often rooted in stereotypical associations between product categories and gender. Furthermore, explicit instructions to avoid bias reduce the certainty of model predictions, but do not eliminate stereotypical patterns. Our findings highlight the persistent nature of gender biases in LLMs and emphasize the need for robust bias-mitigation strategies.

Updated: 2025-12-11 15:33:30

标题: 《LLM穿着Prada:通过在线购物数据分析性别偏见和刻板印象》

摘要: 随着大型语言模型的广泛跨领域应用,评估训练数据中的统计相关性在多大程度上隐藏了微妙且可能令人担忧的偏见变得至关重要,这些相关性是它们出色性能的基础。LLMs中的性别偏见已经从与特定性别通常相关的作品、爱好和情感的角度进行了广泛调查。在这项研究中,我们引入了一个新颖的视角。我们调查LLMs是否可以仅基于在线购物历史预测个人的性别,以及这些预测是否受到性别偏见和刻板印象的影响。利用来自美国用户的历史在线购买数据集,我们评估了六个LLMs分类性别的能力,然后分析它们的推理和产品-性别共现。结果表明,虽然模型能够以中等准确度推断性别,但它们的决策往往根植于产品类别和性别之间的刻板印象关联。此外,明确要求避免偏见会降低模型预测的确定性,但并不能消除刻板印象模式。我们的研究结果突显了LLMs中性别偏见的持久性,并强调了需要强大的偏见缓解策略。

更新时间: 2025-12-11 15:33:30

领域: cs.AI,cs.CL,cs.CY

下载: http://arxiv.org/abs/2504.01951v2

PMB-NN: Physiology-Centred Hybrid AI for Personalized Hemodynamic Monitoring from Photoplethysmography

Continuous monitoring of blood pressure (BP) and hemodynamic parameters such as peripheral resistance (R) and arterial compliance (C) is critical for early vascular dysfunction detection. While photoplethysmography (PPG) wearables have gained popularity, existing data-driven methods for BP estimation lack interpretability. We advance our previously proposed physiology-centered hybrid AI method, the Physiological Model-Based Neural Network (PMB-NN), for blood pressure estimation; it unifies deep learning with a 2-element Windkessel-based model parameterized by R and C acting as physics constraints. The PMB-NN model was trained in a subject-specific manner using PPG-derived timing features, while demographic information was used to infer an intermediate variable: cardiac output. We validated the model on 10 healthy adults performing static and cycling activities across two days to assess day-to-day robustness, benchmarked against deep learning (DL) models (FCNN, CNN-LSTM, Transformer) and a standalone Windkessel-based physiological model (PM). Validation covered three perspectives: accuracy, interpretability, and plausibility. PMB-NN achieved systolic BP accuracy (MAE: 7.2 mmHg) comparable to the DL benchmarks, with diastolic performance (MAE: 3.9 mmHg) lower than the DL models. However, PMB-NN exhibited higher physiological plausibility than both the DL baselines and PM, suggesting that the hybrid architecture unifies and enhances the respective merits of physiological principles and data-driven techniques. Beyond BP, PMB-NN identified R (ME: 0.15 mmHg$\cdot$s/ml) and C (ME: -0.35 ml/mmHg) during training with accuracy similar to PM, demonstrating that the embedded physiological constraints confer interpretability on the hybrid AI framework. These results position PMB-NN as a balanced, physiologically grounded alternative to purely data-driven approaches for daily hemodynamic monitoring.
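
The 2-element Windkessel model named above couples arterial pressure P(t), inflow Q(t) (here, cardiac output), peripheral resistance R, and arterial compliance C; this is its standard form, which the network uses as a physics constraint.

```latex
Q(t) \;=\; \frac{P(t)}{R} \;+\; C\,\frac{\mathrm{d}P(t)}{\mathrm{d}t}
```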

Updated: 2025-12-11 15:32:50

标题: PMB-NN:基于生理学的混合人工智能技术,用于从光电容积脉搏图监测个性化血液动力学

摘要: 持续监测血压(BP)以及外周阻力(R)和动脉顺应性(C)等血流动力学参数对于早期血管功能障碍的检测至关重要。虽然光电容积脉动图(PPG)可穿戴设备已经变得流行,但现有的基于数据驱动的血压估计方法缺乏可解释性。我们在血压估计方面推进了我们先前提出的生理中心混合AI方法-生理模型基础神经网络(PMB-NN),该方法将深度学习与由R和C参数化的二元Windkessel模型相结合,作为物理约束。PMB-NN模型使用基于PPG的时间特征以特定于受试者的方式进行训练,而人口统计信息用于推断一个中间变量:心输出量。我们在两天内对10名健康成年人进行了静止和骑行活动的模型日常稳健性验证,与深度学习(DL)模型(FCNN、CNN-LSTM、Transformer)和独立的基于Windkessel模型的生理模型(PM)进行了对比。验证从三个方面进行:准确性、可解释性和合理性。PMB-NN达到了收缩压准确性(MAE:7.2 mmHg)与DL基准相当,舒张压性能(MAE:3.9 mmHg)低于DL模型。然而,PMB-NN表现出比DL基线和PM更高的生理可行性,表明混合架构统一和增强了生理原理和数据驱动技术的各自优点。除了血压外,PMB-NN在训练时识别了R(ME:0.15 mmHg·s/ml)和C(ME:-0.35 ml/mmHg),准确度与PM相似,表明嵌入的生理约束赋予混合AI框架可解释性。这些结果将PMB-NN定位为平衡的、生理基础的替代方案,用于日常血流动力学监测的纯数据驱动方法。

更新时间: 2025-12-11 15:32:50

领域: physics.med-ph,cs.LG

下载: http://arxiv.org/abs/2512.10745v1

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the \textbf{O}utcome-based \textbf{P}rocess \textbf{V}erifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within a synthetic dataset, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2\% to 73.3\% on AIME2025 as the compute budget scales.

Updated: 2025-12-11 15:26:28

标题: 用于奥林匹克级数学问题求解的长程推理代理

摘要: 大型语言模型(LLMs)通过可验证奖励的强化学习(RLVR)在解决复杂推理任务方面取得了重大进展。这一进步也离不开可靠验证者的自动监督。然而,当前基于结果的验证者(OVs)无法检查长推理链中不可靠的中间步骤。与此同时,当前基于过程的验证者(PVs)在可靠检测复杂长推理链中的错误方面存在困难,这是由于人工标注成本高昂导致高质量注释稀缺。因此,我们提出了基于结果的过程验证者(OPV),该验证者验证长推理链的总结结果的理据过程,以实现准确和高效的验证,同时实现大规模注释。为了增强所提出的验证者,我们采用了一个迭代式主动学习框架,通过专家注释逐步提高OPV的验证能力,减少注释成本。具体而言,在每个迭代中,对当前最佳OPV中最不确定的情况进行注释,然后通过拒绝微调(RFT)和RLVR用于下一轮训练新的OPV。大量实验证明OPV具有卓越的性能和广泛的适用性。它在我们保留的OPV-Bench上取得了新的最先进结果,在F1分数方面优于更大的开源模型,例如Qwen3-Max-Preview,83.1比76.3。此外,OPV有效地检测到合成数据集中的假阳性,与专家评估紧密一致。在与策略模型合作时,OPV始终带来性能提升,例如在AIME2025上,将DeepSeek-R1-Distill-Qwen-32B的准确率从55.2\%提高到73.3\%,随着计算预算的扩大。

更新时间: 2025-12-11 15:26:28

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.10739v1

LGAN: An Efficient High-Order Graph Neural Network via the Line Graph Aggregation

Graph Neural Networks (GNNs) have emerged as a dominant paradigm for graph classification. Specifically, most existing GNNs mainly rely on the message passing strategy between neighbor nodes, where the expressivity is limited by the 1-dimensional Weisfeiler-Lehman (1-WL) test. Although a number of k-WL-based GNNs have been proposed to overcome this limitation, their computational cost increases rapidly with k, significantly restricting the practical applicability. Moreover, since the k-WL models mainly operate on node tuples, these k-WL-based GNNs cannot retain fine-grained node- or edge-level semantics required by attribution methods (e.g., Integrated Gradients), leading to reduced interpretability. To overcome the above shortcomings, in this paper, we propose a novel Line Graph Aggregation Network (LGAN), which constructs a line graph from the induced subgraph centered at each node to perform higher-order aggregation. We theoretically prove that the LGAN not only possesses greater expressive power than the 2-WL under injective aggregation assumptions, but also has lower time complexity. Empirical evaluations on benchmarks demonstrate that the LGAN outperforms state-of-the-art k-WL-based GNNs, while offering better interpretability.
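
The core preprocessing step, building a line graph from a node-centered induced subgraph, can be reproduced with networkx. The 1-hop radius below is our assumption; the abstract says only "induced subgraph centered at each node".

```python
import networkx as nx

def node_centered_line_graph(G: nx.Graph, node, radius: int = 1) -> nx.Graph:
    """Sketch of LGAN's construction: take the induced neighborhood
    around `node` and turn it into a line graph, whose nodes are the
    subgraph's edges (adjacent when edges share an endpoint)."""
    sub = nx.ego_graph(G, node, radius=radius)  # induced subgraph
    return nx.line_graph(sub)                   # edges become nodes

G = nx.cycle_graph(5)
L = node_centered_line_graph(G, 0)
print(L.nodes())  # each node of L is an edge of the induced subgraph
```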

Updated: 2025-12-11 15:23:46

标题: LGAN:通过线图聚合的高效高阶图神经网络

摘要: 图神经网络(GNNs)已成为图分类的主导范式。具体来说,大多数现有的GNN主要依赖于邻居节点之间的消息传递策略,其中表达能力受到一维Weisfeiler-Lehman(1-WL)测试的限制。尽管已提出了一些基于k-WL的GNN来克服这一限制,但它们的计算成本随着k的增加而迅速增加,从而显著限制了实际的适用性。此外,由于k-WL模型主要操作节点元组,这些基于k-WL的GNN无法保留属性方法(例如集成梯度)所需的细粒度节点或边级语义,导致可解释性降低。为了克服上述缺点,在本文中,我们提出了一种新颖的Line Graph Aggregation Network(LGAN),它从以每个节点为中心的诱导子图构造线图来执行高阶聚合。我们在理论上证明了LGAN不仅在单射聚合假设下具有比2-WL更大的表达能力,而且具有更低的时间复杂度。基准测试中的实证评估表明,LGAN优于最先进的基于k-WL的GNN,同时提供更好的可解释性。

更新时间: 2025-12-11 15:23:46

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.10735v1

Textual Data Bias Detection and Mitigation - An Extensible Pipeline with Experimental Evaluation

Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.
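
The abstract does not spell out the Demographic Representation Score, so the following is only a hedged sketch of the general idea: count how often each group is mentioned and normalize to a distribution. The lexicon, token matching, and normalization are all illustrative assumptions, not the paper's definition.

```python
import re
from collections import Counter

def demographic_representation(texts, group_lexicon):
    """Toy representation measure: fraction of group-mentioning documents
    attributed to each group. The paper's actual score may compare this
    distribution against a reference in a different way."""
    counts = Counter()
    for t in texts:
        tokens = set(re.findall(r"[a-z']+", t.lower()))
        for group, words in group_lexicon.items():
            if tokens & set(words):
                counts[group] += 1
    total = sum(counts.values()) or 1
    return {g: counts[g] / total for g in group_lexicon}

lex = {"female": ["she", "her"], "male": ["he", "his"]}  # toy lexicon
print(demographic_representation(["She left.", "He stayed.", "He won."], lex))
```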

Updated: 2025-12-11 15:18:59

标题: 文本数据偏差检测和缓解 - 具有实验评估的可扩展管道

摘要: 使用文本数据训练大型语言模型(LLMs)展现出多方面的偏见表现,包括有害语言和偏斜的人口统计分布。《欧洲人工智能法案》等法规要求在数据中识别和减轻对受保护群体的偏见,最终目标是防止不公平的模型输出。然而,缺乏实用指导和操作化。我们提出了一个全面的数据偏见检测和减轻管道,包括四个组件,涉及两种数据偏见类型,即代表性偏见和(明显的)刻板印象,针对可配置的敏感属性。首先,我们利用基于质量标准创建的LLM生成的单词列表来检测相关的群体标签。其次,使用人口统计代表得分来量化代表性偏见。第三,我们使用社会语言学知识进行过滤来检测和减轻刻板印象。最后,我们通过语法和上下文感知的反事实数据增强来补偿代表性偏见。我们使用性别、宗教和年龄的例子进行了双重评估。首先,通过人工验证和基准比较评估每个单独组件对数据去偏见的有效性。研究结果表明,我们成功在文本数据集中减少了代表性偏见和(明显的)刻板印象。其次,通过在去偏见文本数据集上微调的多个模型(0.6B-8B参数)进行偏见基准测试,评估了数据去偏见对模型偏见减少的影响。这项评估显示,在去偏见数据上微调的LLMs并不总是在偏见基准测试中表现出改善的性能,暴露了当前评估方法中的关键差距,并强调了有针对性的数据操作的必要性,以解决已显现的模型偏见。

更新时间: 2025-12-11 15:18:59

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.10734v1

TriHaRd: Higher Resilience for TEE Trusted Time

Accurately measuring time passing is critical for many applications. However, in Trusted Execution Environments (TEEs) such as Intel SGX, the time source is outside the Trusted Computing Base: a malicious host can manipulate the TEE's notion of time, jumping in time or affecting perceived time speed. Previous work (Triad) proposes protocols for TEEs to maintain a trustworthy time source by building a cluster of TEEs that collaborate with each other and with a remote Time Authority to maintain a continuous notion of passing time. However, such approaches still allow an attacker to control the operating system and arbitrarily manipulate their own TEE's perceived clock speed. An attacker can even propagate faster passage of time to honest machines participating in Triad's trusted time protocol, causing them to skip to timestamps arbitrarily far in the future. We propose TriHaRd, a TEE trusted time protocol achieving high resilience against clock speed and offset manipulations, notably through Byzantine-resilient clock updates and consistency checks. We empirically show that TriHaRd mitigates known attacks against Triad.

Updated: 2025-12-11 15:17:37

标题: TriHaRd:针对TEE可信时间的更高弹性

摘要: 准确测量时间的流逝对许多应用程序至关重要。然而,在可信执行环境(TEE)如Intel SGX中,时间来源在可信计算基础之外:恶意主机可以操纵TEE对时间的概念,跳跃时间或影响感知时间速度。先前的工作(Triad)提出了一种协议,用于TEE维护可信的时间来源,通过建立一个TEE集群,彼此合作,并与远程时间机构一起维护时间的连续概念。然而,这种方法仍然允许攻击者控制操作系统,并任意操纵自己TEE的感知时钟速度。攻击者甚至可以将时间的流逝传播得更快到参与Triad的可信机器,导致它们跳到任意远的未来时间戳。我们提出了TriHaRd,一种TEE可信时间协议,通过拜占庭容错时钟更新和一致性检查,实现高度抗时钟速度和偏移操纵。我们通过实验证明,TriHaRd可以减轻对Triad已知攻击。

更新时间: 2025-12-11 15:17:37

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2512.10732v1

Generalized Spherical Neural Operators: Green's Function Formulation

Neural operators offer powerful approaches for solving parametric partial differential equations, but extending them to spherical domains remains challenging due to the need to preserve intrinsic geometry while avoiding distortions that break rotational consistency. Existing spherical operators rely on rotational equivariance but often lack the flexibility for real-world complexity. We propose a general operator-design framework based on the designable spherical Green's function and its harmonic expansion, establishing a solid operator-theoretic foundation for spherical learning. Based on this, we propose an absolute and relative position-dependent Green's function that enables a flexible balance of equivariance and invariance for real-world modeling. The resulting operator, the Green's-function Spherical Neural Operator (GSNO) with a novel spectral learning method, can adapt to anisotropic, constraint-rich systems while retaining spectral efficiency. To exploit GSNO, we develop GSHNet, a hierarchical architecture that combines multi-scale spectral modeling with spherical up-down sampling, enhancing global feature representation. In evaluations on diffusion MRI, shallow water dynamics, and global weather forecasting, GSNO and GSHNet consistently outperform state-of-the-art methods. Our results position GSNO as a principled and general framework for spherical operator learning, bridging rigorous theory with real-world complexity.

Updated: 2025-12-11 15:05:33

标题: 广义球形神经运算符:格林函数表述

摘要: 神经算子为解决参数化偏微分方程提供了强大的方法,但将其扩展到球形域仍然具有挑战性,因为需要保留固有几何形状,同时避免破坏旋转一致性的扭曲。现有的球形算子依赖于旋转等变性,但通常缺乏处理真实世界复杂性的灵活性。我们提出了一个基于可设计的球形Green函数及其谐波展开的通用算子设计框架,为球形学习建立了坚实的算子理论基础。基于此,我们提出了一种绝对和相对位置依赖的Green函数,可以灵活平衡等变性和不变性,用于真实世界建模。由此产生的算子,Green函数球形神经算子(GSNO)具有一种新颖的谱学习方法,可以适应各向异性、约束丰富的系统,同时保留谱效率。为了利用GSNO,我们开发了GSHNet,这是一种结合多尺度谱建模和球形上下采样的分层架构,增强了全局特征表示。在扩散MRI、浅水动力学和全球天气预测等方面的评估中,GSNO和GSHNet始终优于最先进的方法。我们的结果将GSNO定位为一个有原则且通用的球形算子学习框架,将严格的理论与真实世界复杂性联系起来。

更新时间: 2025-12-11 15:05:33

领域: cs.LG

下载: http://arxiv.org/abs/2512.10723v1

Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality

Deep generative models, while revolutionizing fields like image and text generation, largely operate as opaque black boxes, hindering human understanding, control, and alignment. While methods like sparse autoencoders (SAEs) show remarkable empirical success, they often lack theoretical guarantees, risking subjective insights. Our primary objective is to establish a principled foundation for interpretable generative models. We demonstrate that the principle of causal minimality -- favoring the simplest causal explanation -- can endow the latent representations of diffusion vision and autoregressive language models with clear causal interpretation and robust, component-wise identifiable control. We introduce a novel theoretical framework for hierarchical selection models, where higher-level concepts emerge from the constrained composition of lower-level variables, better capturing the complex dependencies in data generation. Under theoretically derived minimality conditions (manifesting as sparsity or compression constraints), we show that learned representations can be equivalent to the true latent variables of the data-generating process. Empirically, applying these constraints to leading generative models allows us to extract their innate hierarchical concept graphs, offering fresh insights into their internal knowledge organization. Furthermore, these causally grounded concepts serve as levers for fine-grained model steering, paving the way for transparent, reliable systems.

Updated: 2025-12-11 14:59:14

标题: 超越黑匣子:通过因果最小性在生成模型中进行可识别的解释和控制

摘要: 深度生成模型,虽然在图像和文本生成领域产生了革命性的影响,但主要作为不透明的黑匣子运行,阻碍了人类的理解、控制和对齐。虽然像稀疏自编码器(SAEs)这样的方法表现出令人瞩目的实证成功,但它们经常缺乏理论保证,存在主观洞察的风险。我们的主要目标是为可解释的生成模型建立一个有原则的基础。我们展示了因果最小性原则——支持最简单的因果解释——可以赋予扩散视觉和自回归语言模型的潜在表示具有清晰的因果解释和稳健的、组件化可识别的控制。我们引入了一个新颖的层次选择模型的理论框架,其中更高级别的概念是从较低级别变量的受限组合中产生的,更好地捕捉数据生成中的复杂依赖关系。在理论推导的最小性条件下(表现为稀疏性或压缩约束),我们展示了学习得到的表示可以等同于数据生成过程的真实潜在变量。在实证方面,将这些约束应用于领先的生成模型使我们能够提取它们固有的层次概念图,为我们提供对其内部知识组织的新见解。此外,这些因果基础的概念作为微观模型控制的杠杆,为透明、可靠的系统铺平道路。

更新时间: 2025-12-11 14:59:14

领域: cs.LG

下载: http://arxiv.org/abs/2512.10720v1

ENMA: Tokenwise Autoregression for Generative Neural PDE Operators

Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete-as is often the case-a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

Updated: 2025-12-11 14:55:09

标题: ENMA: 基于Token的自回归生成神经PDE操作符

摘要: 解决依赖于时间的参数偏微分方程(PDEs)仍然是神经求解器面临的一个基本挑战,特别是在泛化到各种物理参数和动态范围时。当数据是不确定或不完整的时候,一个自然的方法是转向生成模型。我们介绍了ENMA,一个设计用于建模由物理现象引起的时空动态的生成神经运算符。ENMA使用经过流匹配损失训练的生成掩码自回归变换器在压缩的潜在空间中预测未来动态,从而实现逐标记生成。不规则采样的空间观测通过注意机制编码为统一的潜在表示,并通过时空卷积编码器进一步压缩。这使得ENMA能够在推断时进行上下文学习,通过在目标轨迹的过去状态或具有类似动态的辅助上下文轨迹进行调节。结果是一个稳健且可适应的框架,可以泛化到新的PDE体制,并支持一次性建模依赖于时间的参数PDEs。

更新时间: 2025-12-11 14:55:09

领域: cs.LG

下载: http://arxiv.org/abs/2506.06158v3

PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code

Large Language Model (LLM)-based code assistants have emerged as a powerful application of generative AI, demonstrating impressive capabilities in code generation and comprehension. A key requirement for these systems is their ability to accurately follow user instructions. We present Precise Automatically Checked Instruction Following In Code (PACIFIC), a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities in LLMs, while allowing control over benchmark difficulty. PACIFIC produces benchmark variants with clearly defined expected outputs, enabling straightforward and reliable evaluation through simple output comparisons. In contrast to existing approaches that often rely on tool usage or agentic behavior, our work isolates and evaluates the LLM's intrinsic ability to reason through code behavior step-by-step without execution (dry running) and to follow instructions. Furthermore, our framework mitigates training data contamination by facilitating effortless generation of novel benchmark variations. We validate our framework by generating a suite of benchmarks spanning a range of difficulty levels and evaluating multiple state-of-the-art LLMs. Our results demonstrate that PACIFIC can produce increasingly challenging benchmarks that effectively differentiate instruction-following and dry running capabilities, even among advanced models. Overall, our framework offers a scalable, contamination-resilient methodology for assessing core competencies of LLMs in code-related tasks.
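
Because benchmark variants come with clearly defined expected outputs, evaluation reduces to a direct output comparison, roughly as below; the whitespace normalization and the example task are our assumptions for illustration.

```python
def check_instruction_following(model_output: str, expected: str) -> bool:
    """PACIFIC-style scoring: the model's final answer is compared against
    a precomputed deterministic expected output."""
    return model_output.strip() == expected.strip()

# Hypothetical task: "dry-run this code and report the final value of x".
expected_output = "42"
print(check_instruction_following(" 42\n", expected_output))  # True
```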

Updated: 2025-12-11 14:49:56

标题: PACIFIC:生成基准的框架,用于检查代码中精确自动检查的指令跟随

摘要: 基于大型语言模型(LLM)的代码助手已被证明是生成式人工智能强大应用的一个重要领域,展现出在代码生成和理解方面的令人印象深刻的能力。这些系统的一个关键要求是它们能够准确地遵循用户指令。我们提出了一种新颖的框架——Precise Automatically Checked Instruction Following In Code(PACIFIC),旨在自动生成严格评估LLM中顺序指令遵循和代码干运行能力的基准,同时允许对基准难度进行控制。PACIFIC生成具有清晰定义的期望输出的基准变体,通过简单的输出比较实现直观可靠的评估。与现有方法不同,这些方法通常依赖于工具使用或主动行为,我们的工作隔离并评估LLM无需执行(干运行)就能逐步推理代码行为的内在能力,以及遵循指令的能力。此外,我们的框架通过促进轻松生成新的基准变体,减少了训练数据污染。我们通过生成一系列涵盖多种难度级别的基准,并评估多个最先进的LLM来验证我们的框架。我们的结果表明,PACIFIC能够生成越来越具有挑战性的基准,有效区分指令遵循和干运行能力,即使对于先进模型也是如此。总的来说,我们的框架为评估LLM在与代码相关任务中的核心能力提供了一种可扩展、抗污染的方法论。

更新时间: 2025-12-11 14:49:56

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2512.10713v1

AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge

While current AI-driven methods excel at deriving empirical models from individual experiments, a significant challenge remains in uncovering the common fundamental physics that underlie these models -- a task at which human physicists are adept. To bridge this gap, we introduce AI-Newton, a novel framework for concept-driven scientific discovery. Our system autonomously derives general physical laws directly from raw, multi-experiment data, operating without supervision or prior physical knowledge. Its core innovations are twofold: (1) proposing interpretable physical concepts to construct laws, and (2) progressively generalizing these laws to broader domains. Applied to a large, noisy dataset of mechanics experiments, AI-Newton successfully rediscovers foundational and universal laws, such as Newton's second law, the conservation of energy, and the universal gravitation. This work represents a significant advance toward autonomous, human-like scientific discovery.

Updated: 2025-12-11 14:46:15

标题: AI-Newton:一种基于概念驱动的物理定律发现系统,无需先前的物理知识

摘要: 当前的人工智能驱动方法擅长从单个实验中推导经验模型,但揭示这些模型背后的共同基本物理仍然是一个重要挑战——这是人类物理学家擅长的任务。为了弥合这一差距,我们引入了AI-Newton,这是一个用于概念驱动科学发现的新型框架。我们的系统可以自主从原始的多实验数据中直接推导出一般物理定律,而无需监督或先前的物理知识。其核心创新点有两个:(1)提出可解释的物理概念来构建定律,以及(2)逐渐将这些定律推广到更广泛的领域。应用于一个大规模、嘈杂的力学实验数据集,AI-Newton成功地重新发现了基础和普适的定律,如牛顿第二定律、能量守恒定律和普遍引力定律。这项工作代表了朝着自主、类似人类的科学发现迈出的重要一步。

更新时间: 2025-12-11 14:46:15

领域: cs.AI,cs.LG,cs.SC,hep-ph,physics.class-ph

下载: http://arxiv.org/abs/2504.01538v2

COMPARE: Clinical Optimization with Modular Planning and Assessment via RAG-Enhanced AI-OCT: Superior Decision Support for Percutaneous Coronary Intervention Compared to ChatGPT-5 and Junior Operators

Background: While intravascular imaging, particularly optical coherence tomography (OCT), improves percutaneous coronary intervention (PCI) outcomes, its interpretation is operator-dependent. General-purpose artificial intelligence (AI) shows promise but lacks domain-specific reliability. We evaluated the performance of CA-GPT, a novel large model deployed on an AI-OCT system, against that of the general-purpose ChatGPT-5 and junior physicians for OCT-guided PCI planning and assessment. Methods: In this single-center analysis of 96 patients who underwent OCT-guided PCI, the procedural decisions generated by the CA-GPT, ChatGPT-5, and junior physicians were compared with an expert-derived procedural record. Agreement was assessed using ten pre-specified metrics across pre-PCI and post-PCI phases. Results: For pre-PCI planning, CA-GPT demonstrated significantly higher median agreement scores (5[IQR 3.75-5]) compared to both ChatGPT-5 (3[2-4], P<0.001) and junior physicians (4[3-4], P<0.001). CA-GPT significantly outperformed ChatGPT-5 across all individual pre-PCI metrics and showed superior performance to junior physicians in stent diameter (90.3% vs. 72.2%, P<0.05) and length selection (80.6% vs. 52.8%, P<0.01). In post-PCI assessment, CA-GPT maintained excellent overall agreement (5[4.75-5]), significantly higher than both ChatGPT-5 (4[4-5], P<0.001) and junior physicians (5[4-5], P<0.05). Subgroup analysis confirmed CA-GPT's robust performance advantage in complex scenarios. Conclusion: The CA-GPT-based AI-OCT system achieved superior decision-making agreement versus a general-purpose large language model and junior physicians across both PCI planning and assessment phases. This approach provides a standardized and reliable method for intravascular imaging interpretation, demonstrating significant potential to augment operator expertise and optimize OCT-guided PCI.

Updated: 2025-12-11 14:41:37

标题: 对比:通过RAG增强AI-OCT的模块化规划和评估进行临床优化:与ChatGPT-5和初级操作员相比,对经皮冠状动脉介入的优越决策支持

摘要: 背景:尽管血管内成像,特别是光学相干断层扫描(OCT),改善了经皮冠状动脉介入(PCI)的结果,但其解释取决于操作者。通用人工智能(AI)显示出潜力,但缺乏领域特定的可靠性。我们评估了CA-GPT,一种新型大型模型在AI-OCT系统上部署,针对通用的ChatGPT-5和初级医生进行OCT引导PCI规划和评估的性能。 方法:在这项对96名接受OCT引导PCI的患者进行的单中心分析中,CA-GPT、ChatGPT-5和初级医生生成的程序决策与专家制定的程序记录进行了比较。使用十个预先指定的指标评估了在PCI前后阶段的协议。 结果:对于PCI规划,CA-GPT显示出显著更高的中位一致性得分(5[IQR 3.75-5]),与ChatGPT-5(3[2-4],P<0.001)和初级医生(4[3-4],P<0.001)相比。CA-GPT在所有个体PCI前指标方面明显优于ChatGPT-5,并在支架直径(90.3% vs. 72.2%,P<0.05)和长度选择(80.6% vs. 52.8%,P<0.01)方面表现优越于初级医生。在PCI后评估中,CA-GPT保持了卓越的整体一致性(5[4.75-5]),显著高于ChatGPT-5(4[4-5],P<0.001)和初级医生(5[4-5],P<0.05)。亚组分析证实了CA-GPT在复杂情况下的稳健性能优势。 结论:基于CA-GPT的AI-OCT系统在PCI规划和评估阶段的决策一致性优于通用的大型语言模型和初级医生。这种方法提供了一种标准化和可靠的方法来解释血管内成像,显示出增强操作者专业知识和优化OCT引导PCI的重要潜力。

更新时间: 2025-12-11 14:41:37

领域: cs.AI

下载: http://arxiv.org/abs/2512.10702v1

HybridVFL: Disentangled Feature Learning for Edge-Enabled Vertical Federated Multimodal Classification

Vertical Federated Learning (VFL) offers a privacy-preserving paradigm for Edge AI scenarios like mobile health diagnostics, where sensitive multimodal data reside on distributed, resource-constrained devices. Yet, standard VFL systems often suffer performance limitations due to simplistic feature fusion. This paper introduces HybridVFL, a novel framework designed to overcome this bottleneck by employing client-side feature disentanglement paired with a server-side cross-modal transformer for context-aware fusion. Through systematic evaluation on the multimodal HAM10000 skin lesion dataset, we demonstrate that HybridVFL significantly outperforms standard federated baselines, validating the criticality of advanced fusion mechanisms in robust, privacy-preserving systems.
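
A minimal PyTorch sketch of this split, under assumed dimensions and layer counts (the paper's exact architecture may differ): each client disentangles its modality into shared and specific parts, and the server fuses the resulting tokens with a cross-modal transformer.

```python
import torch
import torch.nn as nn

class ClientEncoder(nn.Module):
    """Client-side disentanglement into shared and modality-specific features."""
    def __init__(self, in_dim, dim=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU())
        self.specific = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU())

    def forward(self, x):
        return self.shared(x), self.specific(x)

class ServerFusion(nn.Module):
    """Server-side cross-modal transformer over the uploaded client tokens."""
    def __init__(self, dim=64, n_classes=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, tokens):               # tokens: (B, n_tokens, dim)
        return self.head(self.fusion(tokens).mean(dim=1))

img_client, meta_client = ClientEncoder(128), ClientEncoder(16)
server = ServerFusion()
img, meta = torch.randn(8, 128), torch.randn(8, 16)   # raw data stays on clients
tokens = torch.stack([*img_client(img), *meta_client(meta)], dim=1)  # (8, 4, 64)
print(server(tokens).shape)                            # torch.Size([8, 7])
```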

Updated: 2025-12-11 14:41:19

标题: 混合VFL:面向边缘的垂直联邦多模态分类的特征解耦学习

摘要: Vertical Federated Learning(VFL)为移动健康诊断等边缘AI场景提供了一种保护隐私的范式,其中敏感的多模态数据存储在分布式、资源受限的设备上。然而,标准的VFL系统通常由于简单的特征融合而遭受性能限制。本文介绍了HybridVFL,这是一个新颖的框架,旨在通过采用客户端特征解缠合并服务器端的跨模态变换器进行上下文感知融合,以克服这一瓶颈。通过对多模态HAM10000皮肤病变数据集的系统评估,我们证明了HybridVFL明显优于标准的联邦基线,验证了在强大的、保护隐私系统中先进融合机制的关键性。

更新时间: 2025-12-11 14:41:19

领域: cs.LG

下载: http://arxiv.org/abs/2512.10701v1

How to Brake? Ethical Emergency Braking with Deep Reinforcement Learning

Connected and automated vehicles (CAVs) have the potential to enhance driving safety, for example by enabling safe vehicle following and more efficient traffic scheduling. For such future deployments, safety requirements must be addressed, the primary ones being avoidance of vehicle collisions and substantial mitigation of harm when collisions are unavoidable. However, conservative worst-case-based control strategies come at the price of reduced flexibility and may compromise overall performance. In light of this, we investigate how Deep Reinforcement Learning (DRL) can be leveraged to improve safety in multi-vehicle-following scenarios involving emergency braking. Specifically, we investigate how DRL with vehicle-to-vehicle communication can be used to ethically select an emergency braking profile in scenarios where overall, or collective, three-vehicle harm reduction or collision avoidance is sought instead of single-vehicle outcomes. As an algorithm, we provide a hybrid approach that combines DRL with a previously published method based on analytical expressions for selecting optimal constant deceleration. By combining DRL with the previous method, the proposed hybrid approach increases reliability compared to standalone DRL, while achieving superior performance in terms of overall harm reduction and collision avoidance.
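
A minimal sketch of the analytic component under assumed point-mass dynamics (the numbers, the harm proxy, and the obstacle term are illustrative, not the paper's model): grid-search the lead vehicle's constant deceleration so that collective three-vehicle harm, rather than the lead vehicle's own outcome, is minimized.

```python
import numpy as np

def impact_speed(gap, v_r, v_f, a_r, a_f, dt=0.01, T=10.0):
    """Relative speed at first contact between two braking vehicles (0 if none)."""
    x_r, x_f = 0.0, gap
    for _ in range(int(T / dt)):
        v_r, v_f = max(v_r - a_r * dt, 0.0), max(v_f - a_f * dt, 0.0)
        x_r, x_f = x_r + v_r * dt, x_f + v_f * dt
        if x_r >= x_f:
            return max(v_r - v_f, 0.0)
    return 0.0

v0, gap, obstacle = 25.0, 8.0, 40.0   # speeds (m/s), spacing (m), obstacle ahead (m)
a_rear, a_mid = 4.0, 7.0              # braking capability of the two followers (m/s^2)

def collective_harm(a_lead):          # summed impact speeds as a harm proxy
    return (impact_speed(gap, v0, v0, a_rear, a_mid)
            + impact_speed(gap, v0, v0, a_mid, a_lead)
            + np.sqrt(max(v0**2 - 2 * a_lead * obstacle, 0.0)))

best = min((collective_harm(a), a) for a in np.linspace(2.0, 9.0, 29))
print(f"min collective harm {best[0]:.2f} m/s at lead deceleration {best[1]:.2f} m/s^2")
# Braking harder protects the lead vehicle but raises rear-end impact speeds;
# the ethical profile trades these off. In the hybrid approach, DRL proposes
# profiles and this analytic selection acts as a reliable fallback.
```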

Updated: 2025-12-11 14:40:33

标题: 如何刹车?利用深度强化学习进行道德紧急制动

摘要: Connected and automated vehicles (CAVs)具有提高驾驶安全性的潜力,例如通过实现安全的车辆跟随和更有效的交通调度。对于这样的未来部署,应该解决安全要求,其中主要是避免车辆碰撞和在碰撞不可避免时减少伤害。然而,保守的最坏情况控制策略会以降低灵活性为代价,可能会损害整体性能。基于此,我们调查了如何利用深度强化学习(DRL)来改善多车辆跟随场景中的紧急制动安全性。具体来说,我们研究了如何利用车辆间通信的DRL来在情况需要整体或集体三车辆减少损害或避免碰撞时,道德上选择紧急制动配置。作为一种算法,我们提供了一种混合方法,将DRL与基于分析表达式选择最佳恒定减速度的先前发布的方法相结合。通过将DRL与先前的方法结合使用,所提出的混合方法提高了可靠性,同时在整体减少损害和避免碰撞方面实现了卓越的性能。

更新时间: 2025-12-11 14:40:33

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2512.10698v1

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Procedural memory enables large language model (LLM) agents to internalize "how-to" knowledge, theoretically reducing redundant trial-and-error. However, existing frameworks predominantly suffer from a "passive accumulation" paradigm, treating memory as a static append-only archive. To bridge the gap between static storage and dynamic reasoning, we propose $\textbf{ReMe}$ ($\textit{Remember Me, Refine Me}$), a comprehensive framework for experience-driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) $\textit{multi-faceted distillation}$, which extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers and generating comparative insights; 2) $\textit{context-adaptive reuse}$, which tailors historical insights to new contexts via scenario-aware indexing; and 3) $\textit{utility-based refinement}$, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool. Extensive experiments on BFCL-V3 and AppWorld demonstrate that ReMe establishes a new state-of-the-art in agent memory systems. Crucially, we observe a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms the larger, memoryless Qwen3-14B, suggesting that self-evolving memory provides a computation-efficient pathway for lifelong learning. We release our code and the $\texttt{reme.library}$ dataset to facilitate further research.
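
A minimal sketch of the utility-based refinement loop (hypothetical scoring rule and names, not the released reme.library code): memories gain utility when their reuse succeeds and are pruned once their utility decays below a threshold.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    insight: str
    scenario: str          # index key for scenario-aware (context-adaptive) reuse
    utility: float = 1.0

class ProceduralMemory:
    def __init__(self, prune_below=0.3, decay=0.9):
        self.pool: list[Memory] = []
        self.prune_below, self.decay = prune_below, decay

    def distill(self, insight, scenario):       # add a distilled experience
        self.pool.append(Memory(insight, scenario))

    def reuse(self, scenario):                  # retrieve insights for a context
        return [m for m in self.pool if m.scenario == scenario]

    def refine(self, memory, succeeded):        # utility update, then pruning
        memory.utility = memory.utility * self.decay + (1.0 if succeeded else 0.0)
        self.pool = [m for m in self.pool if m.utility >= self.prune_below]

mem = ProceduralMemory()
mem.distill("validate tool arguments against the schema first", scenario="tool-use")
m = mem.reuse("tool-use")[0]
mem.refine(m, succeeded=True)
print(len(mem.pool), round(m.utility, 2))       # 1 1.9
```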

Updated: 2025-12-11 14:40:01

标题: 记住我,完善我:一种面向经验驱动的智能体进化的动态程序化记忆框架

摘要: 程序性记忆使大型语言模型(LLM)代理能够内化“如何”知识,从理论上减少了冗余的试错过程。然而,现有框架主要受到“被动积累”范式的困扰,将记忆视为静态的追加式存档。为了弥合静态存储和动态推理之间的差距,我们提出了$\textbf{ReMe}$($\textit{记住我,完善我}$),这是一个基于经验驱动的代理进化的综合框架。ReMe通过三种机制创新地跨越了记忆生命周期:1)$\textit{多方面蒸馏}$,通过识别成功模式、分析失败触发器和生成比较性见解,提取细粒度的经验;2)$\textit{上下文自适应重用}$,通过情景感知索引,将历史见解量身定制到新的情境中;以及3)$\textit{基于效用的完善}$,自主地添加有效的记忆并修剪过时的记忆,以维护一个紧凑且高质量的经验池。在BFCL-V3和AppWorld上进行的大量实验表明,ReMe在代理记忆系统中建立了一个新的最先进水平。至关重要的是,我们观察到了显著的记忆扩展效应:配备ReMe的Qwen3-8B优于更大的无记忆Qwen3-14B,这表明自我进化的记忆为终身学习提供了一个计算高效的途径。我们发布了我们的代码和$\texttt{reme.library}$数据集,以促进进一步的研究。

更新时间: 2025-12-11 14:40:01

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2512.10696v1

Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning

Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
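
For reference, a minimal sketch of the group-relative advantage at the core of GRPO (generic form; the clinically grounded reward functions themselves are specific to the paper): several responses are sampled per prompt, scored, and normalized within their group.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (n_prompts, group_size) task-specific reward scores."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # group-relative advantage

# E.g. report-generation rewards for 2 prompts, 4 sampled reports each:
rewards = torch.tensor([[0.2, 0.8, 0.5, 0.5],
                        [0.1, 0.1, 0.9, 0.3]])
print(grpo_advantages(rewards))
# Positive entries up-weight responses that beat their group's average; the
# policy gradient scales token log-probabilities by these advantages.
```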

Updated: 2025-12-11 14:36:14

标题: 利用强化学习提升放射学报告生成和视觉定位

摘要: 视觉语言模型(VLMs)的最新进展在多个方面提高了胸部X射线(CXR)的解释能力。然而,许多医学VLMs仅依赖于监督微调(SFT),这种微调只优化下一个标记的预测,而不评估答案质量。相比之下,强化学习(RL)可以融入任务特定的反馈,其与明确的中间推理(“思考”)的结合已经在可验证的数学和编码任务中取得了实质性的收益。为了研究RL和思考在CXR VLM中的影响,我们对CXR数据进行大规模SFT,构建了基于Qwen3-VL的更新版RadVLM,然后进行冷启动SFT阶段,为模型配备基本的思考能力。随后,我们应用组相对策略优化(GRPO),针对报告生成和视觉定位采用具有临床依据的任务特定奖励,并在领域特定和通用领域的Qwen3-VL变体上进行匹配的RL实验(分别考察有无思考两种设置)。在这些设置中,我们发现,虽然强大的SFT对于高基准性能仍然至关重要,但RL在两个任务上都带来了额外的收益,而显式的思考似乎并没有进一步改善结果。在统一的评估管道下,经过RL优化的RadVLM模型胜过其基线对应模型,并在报告生成和定位两个方面达到了最先进的性能,突显了与临床对齐的RL是对医学VLM监督微调的强大补充。

更新时间: 2025-12-11 14:36:14

领域: cs.AI,cs.CV

下载: http://arxiv.org/abs/2512.10691v1

Rethinking Popularity Bias in Collaborative Filtering via Analytical Vector Decomposition

Popularity bias fundamentally undermines the personalization capabilities of collaborative filtering (CF) models, causing them to disproportionately recommend popular items while neglecting users' genuine preferences for niche content. While existing approaches treat this as an external confounding factor, we reveal that popularity bias is an intrinsic geometric artifact of Bayesian Pairwise Ranking (BPR) optimization in CF models. Through rigorous mathematical analysis, we prove that BPR systematically organizes item embeddings along a dominant "popularity direction" where embedding magnitudes directly correlate with interaction frequency. This geometric distortion forces user embeddings to simultaneously handle two conflicting tasks-expressing genuine preference and calibrating against global popularity-trapping them in suboptimal configurations that favor popular items regardless of individual tastes. We propose Directional Decomposition and Correction (DDC), a universally applicable framework that surgically corrects this embedding geometry through asymmetric directional updates. DDC guides positive interactions along personalized preference directions while steering negative interactions away from the global popularity direction, disentangling preference from popularity at the geometric source. Extensive experiments across multiple BPR-based architectures demonstrate that DDC significantly outperforms state-of-the-art debiasing methods, reducing training loss to less than 5% of heavily-tuned baselines while achieving superior recommendation quality and fairness. Code is available in https://github.com/LingFeng-Liu-AI/DDC.
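
The geometric claim can be checked with a few lines of NumPy (an illustrative diagnosis, not the DDC update rule): on synthetic embeddings organized along a hidden popularity axis, the interaction-weighted mean recovers that axis, and removing the component along it decouples embedding magnitude from popularity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 1000, 32
counts = rng.zipf(2.0, n_items).astype(float)        # long-tailed interaction counts
u = rng.normal(size=dim); u /= np.linalg.norm(u)     # hidden popularity direction
E = rng.normal(size=(n_items, dim)) + np.log(counts)[:, None] * u

# Estimate the popularity direction as the interaction-weighted mean embedding.
p = (counts[:, None] * E).sum(0)
p /= np.linalg.norm(p)

for name, M in [("raw", E), ("p-component removed", E - np.outer(E @ p, p))]:
    r = np.corrcoef(np.linalg.norm(M, axis=1), np.log(counts))[0, 1]
    print(f"{name}: corr(embedding norm, log popularity) = {r:.3f}")
# The raw embeddings show the magnitude-popularity coupling the paper proves
# for BPR; DDC instead corrects the geometry asymmetrically during training.
```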

Updated: 2025-12-11 14:35:13

标题: 通过分析向量分解重新思考协同过滤中的流行度偏见

摘要: 流行度偏差根本破坏了协同过滤(CF)模型的个性化能力,导致它们在推荐项目时过分偏向流行项目,而忽视用户对利基内容的真实偏好。虽然现有方法将其视为外部混淆因素,但我们揭示了流行度偏差是CF模型中贝叶斯对偶排序(BPR)优化的固有几何特征。通过严格的数学分析,我们证明BPR系统地沿着主导的“流行度方向”组织项目嵌入,其中嵌入幅度与交互频率直接相关。这种几何失真迫使用户嵌入同时处理两个冲突的任务——表达真实偏好和校准全局流行度,将它们困在有利于流行项目的次优配置中,而不考虑个体口味。我们提出了方向分解和校正(DDC),这是一个通用的框架,通过非对称方向更新对嵌入几何进行外科矫正。DDC引导正面交互沿着个性化偏好方向进行,同时将负面交互从全局流行度方向中移开,从几何源头解开偏好和流行度。在多个基于BPR的架构上进行的大量实验表明,DDC明显优于最先进的去偏方法,将训练损失降低到经过精细调整的基线的不到5%,同时实现更高的推荐质量和公平性。代码可在https://github.com/LingFeng-Liu-AI/DDC中找到。

更新时间: 2025-12-11 14:35:13

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2512.10688v1

Challenges of Evaluating LLM Safety for User Welfare

Safety evaluations of large language models (LLMs) typically focus on universal risks like dangerous capabilities or undesirable propensities. However, millions use LLMs for personal advice on high-stakes topics like finance and health, where harms are context-dependent rather than universal. While frameworks like the OECD's AI classification recognize the need to assess individual risks, user-welfare safety evaluations remain underdeveloped. We argue that developing such evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design. In this exploratory study, we evaluated advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability. First, we demonstrate that evaluators must have access to rich user context: identical LLM responses were rated significantly safer by context-blind evaluators than by those aware of user circumstances, with safety scores for high-vulnerability users dropping from safe (5/7) to somewhat unsafe (3/7). One might assume this gap could be addressed by creating realistic user prompts containing key contextual information. However, our second study challenges this: we rerun the evaluation on prompts containing context users report they would disclose, finding no significant improvement. Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, as realistic user context disclosure alone proves insufficient, particularly for vulnerable populations. By demonstrating a methodology for context-aware evaluation, this study provides both a starting point for such assessments and foundational evidence that evaluating individual welfare demands approaches distinct from existing universal-risk frameworks. We publish our code and dataset to aid future developments.

Updated: 2025-12-11 14:34:40

标题: 评估LLM安全性对用户福利的挑战

摘要: 大型语言模型(LLMs)的安全评估通常专注于危险能力或不良倾向等普遍风险。然而,数百万人在高风险主题如金融和健康上寻求LLMs的个人建议,这些伤害是依赖于上下文而不是普遍存在的。尽管像OECD的AI分类这样的框架认识到需要评估个体风险,但用户福利安全评估仍未得到充分发展。我们认为,由于在评估设计中考虑用户上下文的基本问题,开发这种评估并不容易。在这项探索性研究中,我们评估了GPT-5、Claude Sonnet 4和Gemini 2.5 Pro对金融和健康建议在不同脆弱性用户档案下的表现。首先,我们证明评估者必须能够获得丰富的用户上下文:相同的LLM回答在无视上下文的评估者评分明显更安全,而在了解用户情况的评估者手中,高脆弱性用户的安全得分从安全(5/7)降至有些不安全(3/7)。人们可能认为这种差距可以通过创建包含关键上下文信息的真实用户提示来解决。然而,我们的第二项研究挑战了这一点:我们重新对包含用户报告会透露的上下文的提示进行评估,结果没有显著改善。我们的工作建立了有效的用户福利安全评估需要评估者根据不同用户档案评估回答,因为仅仅依靠真实用户上下文的披露是不够的,特别是对于脆弱人群。通过展示一种上下文感知评估方法,这项研究为此类评估提供了一个起点,同时也提供了评估个体福利需要与现有普遍风险框架不同的方法的基础证据。我们公开我们的代码和数据集以帮助未来发展。

更新时间: 2025-12-11 14:34:40

领域: cs.AI,cs.CY

下载: http://arxiv.org/abs/2512.10687v1

Sharp Monocular View Synthesis in Less Than a Second

We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp

Updated: 2025-12-11 14:34:11

标题: 不到一秒钟内的清晰单眼视图合成

摘要: 我们提出了SHARP,一种从单个图像合成逼真视图的方法。给定一张照片,SHARP回归所描述场景的3D高斯表示的参数。通过神经网络的单次前向传递,在标准GPU上不到一秒的时间内完成。SHARP生成的3D高斯表示可以实时渲染,为附近的视图产生高分辨率的逼真图像。该表示是度量的,具有绝对尺度,支持度量相机移动。实验证明,SHARP在不同数据集上实现了强大的零样本泛化。它在多个数据集上创造了新的技术水平,将LPIPS降低了25-34%,DISTS降低了21-43%,同时将合成时间降低了三个数量级。代码和权重可在https://github.com/apple/ml-sharp找到。

更新时间: 2025-12-11 14:34:11

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2512.10685v1

Optimal transport unlocks end-to-end learning for single-molecule localization

Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores -- fluorescent molecules stained onto the observed specimen -- over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.
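
A minimal stand-in for the set-matching objective (Hungarian assignment on Euclidean costs; the paper's actual loss is an optimal-transport formulation for which this is the hard-assignment special case):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_matching_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (n, 2) and (m, 2) emitter positions; mean matched distance."""
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)     # min-cost bipartite matching
    return float(cost[rows, cols].mean())

gt = np.array([[10.0, 20.0], [55.0, 40.0], [80.0, 15.0]])
pred = gt[[2, 0, 1]] + np.random.default_rng(0).normal(0, 2.0, gt.shape)
print(f"matched localization error: {set_matching_loss(pred, gt):.2f}")
# The matching is permutation-invariant, so no NMS is needed to pair detections
# with emitters; smooth OT relaxations (e.g. Sinkhorn) make it trainable
# end-to-end.
```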

Updated: 2025-12-11 14:30:16

标题: 最优输运打开了单分子定位的端到端学习

摘要: 单分子定位显微镜(SMLM)允许在超过衍射限制的范围内重建生物相关结构,通过检测和定位个别荧光标记在观察标本上的荧光分子 - 随时间重建超分辨率图像。目前,高效的SMLM需要非重叠发射荧光分子,导致长时间采集,从而阻碍活细胞成像。最近的深度学习方法可以处理更密集的发射,但它们依赖于非最大抑制(NMS)层的变体,这些层不可微分,可能通过其局部融合策略丢弃真正的阳性。在本演示中,我们将SMLM训练目标重新制定为一个集合匹配问题,推导出一种最优输运损失,消除了推理过程中对NMS的需求,并实现了端到端训练。此外,我们提出了一个迭代神经网络,将显微镜光学系统的知识整合到我们的模型中。在合成基准和真实生物数据上的实验表明,我们的新损失函数和架构在中等和高发射体密度下均超越了现有技术水平。代码可在https://github.com/RSLLES/SHOT中找到。

更新时间: 2025-12-11 14:30:16

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2512.10683v1

Evaluating Gemini Robotics Policies in a Veo World Simulator

Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.

Updated: 2025-12-11 14:22:14

标题: 在Veo世界模拟器中评估双子机器人政策

摘要: 生成世界模型在模拟各种环境中与视觉动作策略的交互方面具有重要潜力。前沿视频模型可以以一种可扩展且通用的方式生成逼真的观察和环境交互。然而,机器人领域中对视频模型的使用主要限于分布内评估,即与用于训练策略或微调基础视频模型的场景相似的情况。在本报告中,我们证明视频模型可以用于机器人领域策略评估用例的整个范围:从评估正常性能到超出分布范围的泛化,以及探索物理和语义安全性。我们引入了一个基于前沿视频基础模型(Veo)构建的生成评估系统。该系统经过优化,支持机器人动作调节和多视图一致性,同时集成生成图像编辑和多视图完成,以在多个泛化轴上合成真实世界场景的变化。我们证明该系统保留了视频模型的基本功能,可以准确模拟已编辑以包含新颖的互动对象、新颖的视觉背景和新颖的干扰对象的场景。这种保真度使得能够准确预测不同策略在正常和超出分布条件下的相对性能,确定不同泛化轴对策略性能的相对影响,并对策略进行红队测试,以暴露违反物理或语义安全约束的行为。我们通过对八个Gemini Robotics策略检查点和一个双手操作器的五个任务进行1600多次真实世界评估来验证这些能力。

更新时间: 2025-12-11 14:22:14

领域: cs.RO,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2512.10675v1

AEBNAS: Strengthening Exit Branches in Early-Exit Networks through Hardware-Aware Neural Architecture Search

Early-exit networks are effective solutions for reducing the overall energy consumption and latency of deep learning models by adjusting computation based on the complexity of input data. By incorporating intermediate exit branches into the architecture, they provide less computation for simpler samples, which is particularly beneficial for resource-constrained devices where energy consumption is crucial. However, designing early-exit networks is a challenging and time-consuming process due to the need to balance efficiency and performance. Recent works have utilized Neural Architecture Search (NAS) to design more efficient early-exit networks, aiming to reduce average latency while improving model accuracy by determining the best positions and number of exit branches in the architecture. Another important factor affecting the efficiency and accuracy of early-exit networks is the depth and types of layers in the exit branches. In this paper, we use hardware-aware NAS to strengthen exit branches, considering both accuracy and efficiency during optimization. Our performance evaluation on the CIFAR-10, CIFAR-100, and SVHN datasets demonstrates that our proposed framework, which considers varying depths and layers for exit branches along with adaptive threshold tuning, designs early-exit networks that achieve higher accuracy with the same or lower average number of MACs compared to the state-of-the-art approaches.
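
For intuition, a minimal early-exit inference loop with per-exit confidence thresholds (illustrative values; AEBNAS searches the branch depths and layer types and tunes the thresholds adaptively rather than fixing them by hand):

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, dim=32, n_classes=10, thresholds=(0.9, 0.8)):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(3)])
        self.exits = nn.ModuleList([nn.Linear(dim, n_classes) for _ in range(3)])
        self.thresholds = thresholds           # exits 0 and 1; exit 2 is final

    def forward(self, x):                      # x: (1, dim), single sample
        for i, stage in enumerate(self.stages):
            x = stage(x)
            logits = self.exits[i](x)
            conf = logits.softmax(-1).max().item()
            if i == len(self.stages) - 1 or conf >= self.thresholds[i]:
                return logits, i               # stop once confident enough

net = EarlyExitNet()
with torch.no_grad():
    logits, exit_used = net(torch.randn(1, 32))
print(f"sample exited at branch {exit_used}")
# Easy samples leave at shallow branches, so average MACs drop while hard
# samples still reach the full-depth classifier.
```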

Updated: 2025-12-11 14:17:49

标题: AEBNAS:通过硬件感知神经架构搜索强化早期退出网络中的退出分支

摘要: 早期退出网络是通过根据输入数据的复杂性调整计算来有效减少深度学习模型的总能耗和延迟的解决方案。通过将中间退出分支纳入架构中,它们为简单样本提供更少的计算量,这对于能源受限设备尤其有益,其中能耗至关重要。然而,设计早期退出网络是一项具有挑战性且耗时的过程,因为需要在效率和性能之间取得平衡。最近的研究利用神经架构搜索(NAS)设计更高效的早期退出网络,旨在通过确定架构中最佳位置和退出分支数量来减少平均延迟,同时提高模型准确性。影响早期退出网络效率和准确性的另一个重要因素是退出分支中的深度和层类型。在本文中,我们使用硬件感知NAS来加强退出分支,考虑了在优化过程中的准确性和效率。我们在CIFAR-10、CIFAR-100和SVHN数据集上的性能评估表明,我们提出的框架,结合了不同深度和层的退出分支以及自适应阈值调整,设计出比最先进方法具有更高准确性的早期退出网络,同时具有相同或更低的平均MAC数量。

更新时间: 2025-12-11 14:17:49

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2512.10671v1

Learning by Analogy: A Causal Framework for Composition Generalization

Compositional generalization -- the ability to understand and generate novel combinations of learned concepts -- enables models to extend their capabilities beyond limited experiences. While effective, the data structures and principles that enable this crucial capability remain poorly understood. We propose that compositional generalization fundamentally requires decomposing high-level concepts into basic, low-level concepts that can be recombined across similar contexts, similar to how humans draw analogies between concepts. For example, someone who has never seen a peacock eating rice can envision this scene by relating it to their previous observations of a chicken eating rice. In this work, we formalize these intuitive processes using principles of causal modularity and minimal changes. We introduce a hierarchical data-generating process that naturally encodes different levels of concepts and their interaction mechanisms. Theoretically, we demonstrate that this approach enables compositional generalization supporting complex relations between composed concepts, advancing beyond prior work that assumes simpler interactions like additive effects. Critically, we also prove that this latent hierarchical structure is provably recoverable (identifiable) from observable data like text-image pairs, a necessary step for learning such a generative process. To validate our theory, we apply insights from our theoretical framework and achieve significant improvements on benchmark datasets.

Updated: 2025-12-11 14:16:14

标题: 学习类比:组合概括的因果框架

摘要: 组合概括——理解和生成学习概念的新组合的能力——使模型能够超越有限的经验。尽管有效,但支持这一关键能力的数据结构和原则仍不明确。我们提出,组合概括基本上需要将高级概念分解为基本的低级概念,这些概念可以在类似的背景下重新组合,类似于人类如何在概念之间进行类比。例如,一个从未见过孔雀吃米饭的人可以通过将其与以前观察到的鸡吃米饭联系起来来想象这个场景。 在这项工作中,我们使用因果模块化和最小变化的原则形式化这些直观过程。我们引入了一个自然编码不同水平概念及其交互机制的分层数据生成过程。理论上,我们证明了这种方法能够支持复杂概念之间的组合概括,超越了之前只假设简单交互如加法效应的工作。至关重要的是,我们还证明了这种潜在的分层结构可以从可观测数据如文本-图像对中可证明地恢复(可识别),这是学习这种生成过程所必需的一步。为了验证我们的理论,我们应用了我们理论框架的见解,并在基准数据集上取得了显著的改进。

更新时间: 2025-12-11 14:16:14

领域: cs.LG

下载: http://arxiv.org/abs/2512.10669v1

A Proof of Success and Reward Distribution Protocol for Multi-bridge Architecture in Cross-chain Communication

Single-bridge blockchain solutions enable cross-chain communication. However, they are associated with centralization and single-point-of-failure risks. This paper proposes Proof of Success and Reward Distribution (PSCRD), a novel multi-bridge response coordination and incentive distribution protocol designed to address the challenges. PSCRD introduces a fair reward distribution system that equitably distributes the transfer fee among participating bridges, incentivizing honest behavior and sustained commitment. The purpose is to encourage bridge participation for higher decentralization and lower single-point-of-failure risks. The mathematical analysis and simulation results validate the effectiveness of PSCRD using two key metrics: the Gini index, which demonstrates a progressive improvement in the fairness of the reward distribution as new bridge groups joined the network; and the Nakamoto coefficient, which shows a significant improvement in decentralization over time. These findings highlight that PSCRD provides a more resilient and secure cross-chain bridge system without substantially increasing user costs.
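
The two evaluation metrics are standard and easy to reproduce (the definitions below follow common usage; the PSCRD fee-splitting rule itself is not reproduced here):

```python
import numpy as np

def gini(rewards: np.ndarray) -> float:
    """0 = perfectly equal reward split across bridges, 1 = maximally unequal."""
    r = np.sort(rewards)
    n = r.size
    return float((2 * np.arange(1, n + 1) - n - 1).dot(r) / (n * r.sum()))

def nakamoto(rewards: np.ndarray) -> int:
    """Smallest number of bridges that together control > 50% of rewards."""
    shares = np.sort(rewards)[::-1] / rewards.sum()
    return int(np.searchsorted(np.cumsum(shares), 0.5) + 1)

rewards = np.array([40.0, 25.0, 15.0, 10.0, 10.0])   # per-bridge fee income
print(f"Gini index: {gini(rewards):.3f}, Nakamoto coefficient: {nakamoto(rewards)}")
# A falling Gini and a rising Nakamoto coefficient over time are exactly the
# trends the paper reports as new bridge groups join the network.
```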

Updated: 2025-12-11 14:15:36

标题: 跨链通信中多桥架构的成功证明与奖励分配协议

摘要: 单桥区块链解决方案实现了跨链通信。然而,它们存在集中化和单点故障风险。本文提出了Proof of Success and Reward Distribution(PSCRD),这是一种新颖的多桥响应协调和激励分配协议,旨在解决这些挑战。PSCRD引入了一个公平的奖励分配系统,公平地将转账费用分配给参与的桥梁,激励诚实行为和持续承诺。其目的是鼓励桥梁参与,以提高去中心化程度并降低单点故障风险。数学分析和模拟结果验证了PSCRD的有效性,使用了两个关键指标:基尼系数,表明随着新桥组加入网络,奖励分配的公平性逐渐改善;以及中本聪系数,显示随着时间的推移,去中心化程度有显著提高。这些发现凸显了PSCRD提供了一个更具韧性和安全性的跨链桥梁系统,而不会大幅增加用户成本。

更新时间: 2025-12-11 14:15:36

领域: cs.CR,cs.DC,cs.ET

下载: http://arxiv.org/abs/2512.10667v1

On the Dynamics of Multi-Agent LLM Communities Driven by Value Diversity

As Large Language Models (LLM) based multi-agent systems become increasingly prevalent, the collective behaviors, e.g., collective intelligence, of such artificial communities have drawn growing attention. This work aims to answer a fundamental question: How does diversity of values shape the collective behavior of AI communities? Using naturalistic value elicitation grounded in the prevalent Schwartz's Theory of Basic Human Values, we constructed multi-agent simulations where communities with varying numbers of agents engaged in open-ended interactions and constitution formation. The results show that value diversity enhances value stability, fosters emergent behaviors, and brings more creative principles developed by the agents themselves without external guidance. However, these effects also show diminishing returns: extreme heterogeneity induces instability. This work positions value diversity as a new axis of future AI capability, bridging AI ability and sociological studies of institutional emergence.

Updated: 2025-12-11 14:13:53

标题: 关于价值多样性推动的多智能体LLM社区动态的研究

摘要: 随着基于大型语言模型(LLM)的多智能体系统越来越普遍,这种人工社区的集体行为,例如集体智慧,引起了越来越多的关注。本研究旨在回答一个基本问题:价值观的多样性如何塑造AI社区的集体行为?利用基于Schwartz基本人类价值观理论的自然价值调查,我们构建了多智能体模拟,在这些模拟中,拥有不同数量智能体的社区进行开放性互动和组织形成。结果显示,价值观的多样性增强了价值的稳定性,促进了新兴行为,并带来了更多由智能体自己开发而不需要外部指导的创造性原则。然而,这些效果也呈现出递减的趋势:极端的异质性会导致不稳定性。这项工作将价值观的多样性定位为未来AI能力的一个新的研究领域,架起了AI能力与社会学研究中机构出现的桥梁。

更新时间: 2025-12-11 14:13:53

领域: cs.AI

下载: http://arxiv.org/abs/2512.10665v1

DCFO Additional Material

Outlier detection identifies data points that significantly deviate from the majority of the data distribution. Explaining outliers is crucial for understanding the underlying factors that contribute to their detection, validating their significance, and identifying potential biases or errors. Effective explanations provide actionable insights, facilitating preventive measures to avoid similar outliers in the future. Counterfactual explanations clarify why specific data points are classified as outliers by identifying minimal changes required to alter their prediction. Although valuable, most existing counterfactual explanation methods overlook the unique challenges posed by outlier detection, and fail to target classical, widely adopted outlier detection algorithms. Local Outlier Factor (LOF) is one of the most popular unsupervised outlier detection methods, quantifying outlierness through relative local density. Despite LOF's widespread use across diverse applications, it lacks interpretability. To address this limitation, we introduce Density-based Counterfactuals for Outliers (DCFO), a novel method specifically designed to generate counterfactual explanations for LOF. DCFO partitions the data space into regions where LOF behaves smoothly, enabling efficient gradient-based optimisation. Extensive experimental validation on 50 OpenML datasets demonstrates that DCFO consistently outperforms benchmarked competitors, offering superior proximity and validity of generated counterfactuals.

Updated: 2025-12-11 14:04:52

标题: DCFO附加材料

摘要: 异常检测识别明显偏离大多数数据分布的数据点。解释异常值对于理解造成其检测的潜在因素、验证其重要性以及识别潜在偏差或错误至关重要。有效的解释提供可操作的见解,促进预防措施以避免未来出现类似的异常值。反事实解释通过识别改变其预测所需的最小更改来澄清为什么特定数据点被分类为异常值。尽管有价值,大多数现有的反事实解释方法忽视了异常检测带来的独特挑战,并未针对经典、广泛采用的异常检测算法。局部异常因子(LOF)是最受欢迎的无监督异常检测方法之一,通过相对局部密度量化异常值。尽管LOF在各种应用中被广泛使用,但它缺乏可解释性。为了解决这一限制,我们引入了专门为LOF生成反事实解释的新方法——基于密度的反事实异常值(DCFO)。DCFO将数据空间划分为LOF表现平滑的区域,实现了高效的基于梯度的优化。对50个OpenML数据集的广泛实验证明,DCFO始终优于基准竞争对手,提供了更接近和有效性的生成反事实解释。

更新时间: 2025-12-11 14:04:52

领域: cs.LG

下载: http://arxiv.org/abs/2512.10659v1

Token Sample Complexity of Attention

As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce token-sample complexity: the rate at which attention computed on $n$ tokens converges to its infinite-token limit. We estimate finite-$n$ convergence bounds at two levels: pointwise uniform convergence of the attention map, and convergence of moments for the transformed token distribution. For compactly supported (and more generally sub-Gaussian) distributions, our first result shows that the attention map converges uniformly on a ball of radius $R$ at rate $C(R)/\sqrt{n}$, where $C(R)$ grows exponentially with $R$. For large $R$, this estimate loses practical value, and our second result addresses this issue by establishing convergence rates for the moments of the transformed distribution (the token output of the attention layer). In this case, the rate is $C'(R)/n^\beta$ with $\beta<\tfrac{1}{2}$, and $C'(R)$ depends polynomially on the size of the support of the distribution. The exponent $\beta$ depends on the attention geometry and the spectral properties of the token distribution. We also examine the regime in which the attention parameter tends to infinity and the softmax approaches a hardmax, and in this setting, we establish a logarithmic rate of convergence. Experiments on synthetic Gaussian data and real BERT models on Wikipedia text confirm our predictions.
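
The pointwise rate is easy to probe numerically. Below is a small Monte Carlo check (illustrative setup with Gaussian tokens and a fixed query; not the paper's experiments): attention over $n$ sampled tokens is compared against a large-$n$ reference, and the error shrinks at roughly $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
q = np.ones(d) / 2.0                            # fixed query vector

def attn(tokens):                               # softmax attention, values = tokens
    w = np.exp(tokens @ q)
    return (w / w.sum()) @ tokens

ref = attn(rng.normal(size=(200_000, d)))       # proxy for the infinite-token limit
for n in [100, 1_000, 10_000]:
    errs = [np.linalg.norm(attn(rng.normal(size=(n, d))) - ref) for _ in range(50)]
    print(f"n={n:>6}: mean error {np.mean(errs):.4f}")
# Each 10x increase in n cuts the error by roughly sqrt(10), consistent with
# the C(R)/sqrt(n) pointwise bound for sub-Gaussian token distributions.
```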

Updated: 2025-12-11 14:02:34

标题: 注意力的令牌样本复杂度

摘要: 随着大型语言模型中的上下文窗口不断扩大,必须对在极端序列长度下注意力的行为进行表征。我们引入了令牌-样本复杂度:计算在$n$个令牌上的注意力收敛到无限令牌极限的速率。我们在两个层面估计有限-$n$的收敛界限:注意力映射的逐点一致收敛,以及经过变换的令牌分布的矩收敛。对于紧支持(更一般地,次高斯)分布,我们的第一个结果表明,注意力映射以速率$C(R)/\sqrt{n}$在半径为$R$的球上一致收敛,其中$C(R)$随$R$呈指数增长。对于较大的$R$,这一估计失去了实际价值,我们的第二个结果解决了这个问题,通过建立转换分布的矩(注意力层的令牌输出)的收敛率。在这种情况下,速率为$C'(R)/n^\beta$,其中$\beta<\tfrac{1}{2}$,且$C'(R)$多项式地依赖于分布的支持大小。指数$\beta$取决于注意力几何和令牌分布的谱特性。我们还研究了注意力参数趋近于无穷大且softmax接近于hardmax的情况,在这种设置下,我们建立了对数收敛速率。对合成高斯数据和维基百科文本上的真实BERT模型的实验证实了我们的预测。

更新时间: 2025-12-11 14:02:34

领域: cs.LG

下载: http://arxiv.org/abs/2512.10656v1

CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models

Diffusion models can unintentionally reproduce training examples, raising privacy and copyright concerns as these systems are increasingly deployed at scale. Existing inference-time mitigation methods typically manipulate classifier-free guidance (CFG) or perturb prompt embeddings; however, they often struggle to reduce memorization without compromising alignment with the conditioning prompt. We introduce CAPTAIN, a training-free framework that mitigates memorization by directly modifying latent features during denoising. CAPTAIN first applies frequency-based noise initialization to reduce the tendency to replicate memorized patterns early in the denoising process. It then identifies the optimal denoising timesteps for feature injection and localizes memorized regions. Finally, CAPTAIN injects semantically aligned features from non-memorized reference images into localized latent regions, suppressing memorization while preserving prompt fidelity and visual quality. Our experiments show that CAPTAIN achieves substantial reductions in memorization compared to CFG-based baselines while maintaining strong alignment with the intended prompt.

Updated: 2025-12-11 14:01:47

标题: CAPTAIN:语义特征注入在文本到图像扩散模型中减轻记忆化的作用

摘要: 扩散模型可能无意中复制训练示例,随着这些系统在规模上的不断部署,引发了隐私和版权方面的担忧。现有的推理时间缓解方法通常操作分类器自由指导(CFG)或扰动提示嵌入;然而,它们往往在减少记忆化的同时保持与条件提示的对齐方面遇到困难。我们引入了CAPTAIN,这是一个无需训练的框架,通过直接在去噪过程中修改潜在特征来减少记忆化。CAPTAIN首先应用基于频率的噪声初始化来减少在去噪过程早期复制记忆化模式的倾向。然后,它确定了最佳的去噪时间步骤用于特征注入并定位记忆化区域。最后,CAPTAIN将来自非记忆化参考图像的语义对齐特征注入到定位的潜在区域,抑制记忆化同时保持提示的忠实度和视觉质量。我们的实验表明,与基于CFG的基线相比,CAPTAIN在减少记忆化方面取得了显著的进展,同时保持与预期提示的强对齐。

更新时间: 2025-12-11 14:01:47

领域: cs.AI

下载: http://arxiv.org/abs/2512.10655v1

Virtual camera detection: Catching video injection attacks in remote biometric systems

Face anti-spoofing (FAS) is a vital component of remote biometric authentication systems based on facial recognition, increasingly used across web-based applications. Among emerging threats, video injection attacks -- facilitated by technologies such as deepfakes and virtual camera software -- pose significant challenges to system integrity. While virtual camera detection (VCD) has shown potential as a countermeasure, existing literature offers limited insight into its practical implementation and evaluation. This study introduces a machine learning-based approach to VCD, with a focus on its design and validation. The model is trained on metadata collected during sessions with authentic users. Empirical results demonstrate its effectiveness in identifying video injection attempts and reducing the risk of malicious users bypassing FAS systems.

Updated: 2025-12-11 14:01:06

标题: 虚拟摄像头检测:在远程生物识别系统中捕捉视频注入攻击

摘要: 人脸反欺骗(FAS)是基于面部识别的远程生物识别认证系统的重要组成部分,在网络应用中越来越被广泛使用。在新兴威胁中,视频注入攻击 - 借助深度伪造和虚拟摄像头软件等技术实施 - 对系统完整性构成重大挑战。虚拟摄像头检测(VCD)已显示出作为一种对抗措施的潜力,然而现有文献对其实际实施和评估提供的见解有限。本研究引入了一种基于机器学习的VCD方法,重点关注其设计和验证。该模型是在与真实用户进行会话期间收集的元数据上进行训练的。实证结果表明,该方法在识别视频注入尝试并降低恶意用户绕过FAS系统的风险方面非常有效。

更新时间: 2025-12-11 14:01:06

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2512.10653v1

TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.

Updated: 2025-12-11 14:01:01

标题: TriDF:解释性DeepFake检测的感知、检测和幻觉评估

摘要: 生成建模的进展使得制造出真实的个体描绘变得越来越容易,从而给安全、通信和公众信任带来严重风险。检测此类以人为驱动的操纵需要系统不仅能够区分经过修改的内容和真实媒体,还需要提供清晰可靠的推理。在本文中,我们介绍了TriDF,这是一个可解释的DeepFake检测的全面基准。TriDF包含来自先进合成模型的高质量伪造品,涵盖了图像、视频和音频模式下的16种DeepFake类型。该基准评估了三个关键方面:感知力,衡量模型利用人工标注证据识别微观操纵痕迹的能力;检测,评估了跨不同伪造家族和生成器的分类性能;幻觉,量化了模型生成的解释的可靠性。对最先进的多模态大型语言模型进行的实验表明,准确的感知对于可靠的检测至关重要,但幻觉可能严重扰乱决策,揭示了这三个方面之间的相互依赖关系。TriDF提供了一个统一的框架,用于理解检测准确性、证据识别和解释可靠性之间的互动关系,为构建能够应对现实世界合成媒体威胁的可信系统奠定了基础。

更新时间: 2025-12-11 14:01:01

领域: cs.CV,cs.CR

下载: http://arxiv.org/abs/2512.10652v1

Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.
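
A minimal sketch of uncertainty-guided aggregation (generic proxies and shapes, not the paper's exact estimators): token-level predictive entropy selects salient tokens, whose hidden states are pooled into a response-level feature for a small reliability probe.

```python
import torch

def aggregate(hidden: torch.Tensor, logits: torch.Tensor, k: int = 4):
    """hidden: (T, d) per-token states; logits: (T, V) output logits."""
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # (T,) uncertainty
    salient = entropy.topk(k).indices                         # most uncertain tokens
    return hidden[salient].mean(0)                            # (d,) response feature

T, d, V = 20, 16, 100
feature = aggregate(torch.randn(T, d), torch.randn(T, V))
probe = torch.nn.Linear(d, 2)        # trained on labeled reliable/unreliable answers
print(probe(feature).softmax(-1))    # P(reliable), P(unreliable)
# The probe sees where the model itself was uncertain, which is what lets it
# flag confidently wrong answers that raw confidence scores miss.
```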

Updated: 2025-12-11 13:49:54

标题: 大型语言模型能否检测自身的虚构内容?在不确定性感知的语言模型中估计可靠性

摘要: 大型语言模型(LLMs)容易生成流畅但不正确的内容,即所谓的虚构(confabulation),这在多轮或代理应用中会带来越来越大的风险,因为输出可能会被重复用作上下文。在这项工作中,我们研究了上下文信息如何影响模型行为,以及LLMs是否能够识别其不可靠的响应。我们提出了一种可靠性估计方法,利用标记级别的不确定性来指导内部模型表示的聚合。具体而言,我们从输出logits计算偶然(aleatoric)与认知(epistemic)不确定性,以识别显著的标记,并将它们的隐藏状态聚合成紧凑的表示,用于响应级可靠性预测。通过对开放QA基准的受控实验,我们发现正确的上下文信息可以提高答案的准确性和模型的信心,而误导性的上下文往往会导致自信但错误的响应,揭示了不确定性和正确性之间的不一致。我们基于探测的方法捕捉了模型行为的这种变化,并改善了对多个开源LLMs不可靠输出的检测。这些结果突显了直接不确定性信号的局限性,并强调了不确定性引导探测在可靠性感知生成方面的潜力。

更新时间: 2025-12-11 13:49:54

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.08139v2

PASCAL: Precise and Efficient ANN- SNN Conversion using Spike Accumulation and Adaptive Layerwise Activation

Spiking Neural Networks (SNNs) have been put forward as an energy-efficient alternative to Artificial Neural Networks (ANNs) since they perform sparse Accumulate operations instead of the power-hungry Multiply-and-Accumulate operations. ANN-SNN conversion is a widely used method to realize deep SNNs with accuracy comparable to that of ANNs. Bu et al. (2023) recently proposed the Quantization-Clip-Floor-Shift (QCFS) activation as an alternative to ReLU to minimize the accuracy loss during ANN-SNN conversion. Nevertheless, SNN inference requires a large number of timesteps to match the accuracy of the source ANN on real-world datasets. In this work, we propose PASCAL, which performs ANN-SNN conversion in such a way that the resulting SNN is mathematically equivalent to an ANN with QCFS activation, thereby yielding similar accuracy as the source ANN with minimal inference timesteps. In addition, we propose a systematic method to configure the quantization step of QCFS activation in a layerwise manner, which effectively determines the optimal number of timesteps per layer for the converted SNN. Our results show that the ResNet-34 SNN obtained using PASCAL achieves an accuracy of $\approx$74% on ImageNet with a 64$\times$ reduction in the number of inference timesteps compared to existing approaches.
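
For reference, the QCFS activation has a one-line form (as commonly stated for Bu et al.'s method; the quantization level L and threshold lambda are the knobs PASCAL configures per layer):

```python
import torch

def qcfs(x: torch.Tensor, L: int, lam: float) -> torch.Tensor:
    """Quantization-Clip-Floor-Shift activation: an L-level staircase ReLU."""
    return (lam / L) * torch.clamp(torch.floor(x * L / lam + 0.5), 0, L)

x = torch.linspace(-1.0, 3.0, 9)
print(qcfs(x, L=4, lam=2.0))
# Each layer's quantization step L maps directly to the number of SNN
# timesteps that layer needs, which is what PASCAL tunes layerwise.
```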

Updated: 2025-12-11 13:45:50

标题: PASCAL:使用尖峰累积和自适应逐层激活的精确高效ANN-SNN转换

摘要: 脉冲神经网络(SNNs)被提出作为一种节能的替代人工神经网络(ANNs)的方法,因为它们执行稀疏的累积操作,而不是耗电的乘法和累积操作。ANN-SNN转换是一种广泛使用的方法,用于实现具有与ANNs可比的准确性的深度SNNs。最近,Bu等人提出了Quantization-Clip-Floor-Shift(QCFS)激活作为ReLU的替代方案,以在ANN-SNN转换期间最小化准确性损失。然而,SNN推断需要大量的时间步来匹配真实世界数据集的源ANN的准确性。在这项工作中,我们提出了PASCAL,它以一种方式执行ANN-SNN转换,使得生成的SNN在数学上等效于具有QCFS激活的ANN,从而产生与源ANN相似的准确性,同时最小化推断时间步数。此外,我们提出了一种系统方法,以逐层配置QCFS激活的量化步骤,有效确定转换后SNN每层的最佳时间步数。我们的结果表明,使用PASCAL获得的ResNet-34 SNN在ImageNet上实现了约74%的准确性,与现有方法相比,推断时间步数减少了64倍。

更新时间: 2025-12-11 13:45:50

领域: cs.NE,cs.AI

下载: http://arxiv.org/abs/2505.01730v2

Refinement Contrastive Learning of Cell-Gene Associations for Unsupervised Cell Type Identification

Unsupervised cell type identification is crucial for uncovering and characterizing heterogeneous populations in single cell omics studies. Although a range of clustering methods have been developed, most focus exclusively on intrinsic cellular structure and ignore the pivotal role of cell-gene associations, which limits their ability to distinguish closely related cell types. To this end, we propose a Refinement Contrastive Learning framework (scRCL) that explicitly incorporates cell-gene interactions to derive more informative representations. Specifically, we introduce two contrastive distribution alignment components that reveal reliable intrinsic cellular structures by effectively exploiting cell-cell structural relationships. Additionally, we develop a refinement module that integrates gene-correlation structure learning to enhance cell embeddings by capturing underlying cell-gene associations. This module strengthens connections between cells and their associated genes, refining the representation learning to exploiting biologically meaningful relationships. Extensive experiments on several single-cell RNA-seq and spatial transcriptomics benchmark datasets demonstrate that our method consistently outperforms state-of-the-art baselines in cell-type identification accuracy. Moreover, downstream biological analyses confirm that the recovered cell populations exhibit coherent gene-expression signatures, further validating the biological relevance of our approach. The code is available at https://github.com/THPengL/scRCL.

Updated: 2025-12-11 13:45:31

标题: 无监督细胞类型识别的细胞-基因关联对比学习的优化

摘要: 无监督的细胞类型识别对于揭示和表征单细胞组学研究中的异质群体至关重要。尽管已经发展了一系列聚类方法,但大多数方法仅专注于内在的细胞结构,忽略了细胞-基因关联的关键作用,这限制了它们区分密切相关的细胞类型的能力。为此,我们提出了一种精细对比学习框架(scRCL),明确地将细胞-基因相互作用纳入考虑,以获取更多信息丰富的表示。具体地,我们引入了两个对比分布对齐组件,通过有效地利用细胞-细胞结构关系揭示可靠的内在细胞结构。此外,我们开发了一个精炼模块,通过整合基因相关性结构学习来增强细胞嵌入,捕捉潜在的细胞-基因关联。该模块加强了细胞与其相关基因之间的联系,使表示学习更好地利用生物学上有意义的关系。在几个单细胞RNA-seq和空间转录组学的基准数据集上进行的大量实验证明,我们的方法在细胞类型识别准确性方面始终优于最先进的基线方法。此外,下游生物学分析证实,恢复的细胞群体展示了连贯的基因表达特征,进一步验证了我们方法的生物学相关性。代码可在https://github.com/THPengL/scRCL上找到。

更新时间: 2025-12-11 13:45:31

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2512.10640v1

Adaptive Intrusion Detection System Leveraging Dynamic Neural Models with Adversarial Learning for 5G/6G Networks

Intrusion Detection Systems (IDS) are critical components in safeguarding 5G/6G networks from both internal and external cyber threats. While traditional IDS approaches rely heavily on signature-based methods, they struggle to detect novel and evolving attacks. This paper presents an advanced IDS framework that leverages adversarial training and dynamic neural networks in 5G/6G networks to enhance network security by providing robust, real-time threat detection and response capabilities. Unlike conventional models, which require costly retraining to update knowledge, the proposed framework integrates incremental learning algorithms, reducing the need for frequent retraining. Adversarial training is used to fortify the IDS against poisoned data. By using fewer features and incorporating statistical properties, the system can efficiently detect potential threats. Extensive evaluations using the NSL-KDD dataset demonstrate that the proposed approach achieves an accuracy of 82.33% for multiclass classification of various network attacks while resisting dataset poisoning. This research highlights the potential of adversarially trained, dynamic neural networks for building resilient IDS solutions.

Updated: 2025-12-11 13:40:37

标题: 基于对抗学习的动态神经模型的自适应入侵检测系统在5G/6G网络中的应用

摘要: 入侵检测系统(IDS)是保护5G/6G网络免受内部和外部网络威胁的关键组件。传统的IDS方法主要依赖基于签名的方法,但往往难以检测新型和不断演变的攻击。本文提出了一种先进的IDS框架,利用对抗训练和动态神经网络在5G/6G网络中增强网络安全,提供强大的实时威胁检测和响应能力。与传统模型不同,该框架集成了增量学习算法,减少了频繁重新训练的需求。对抗训练用于加固IDS以抵御有毒数据。通过使用较少的特征和整合统计属性,系统能够高效地检测潜在威胁。使用NSL-KDD数据集进行广泛评估表明,所提出的方法在各种网络攻击的多类分类中提供了更高的82.33%的准确性,同时抵抗数据集毒化。这项研究突出了对抗训练、动态神经网络在构建弹性IDS解决方案中的潜力。

更新时间: 2025-12-11 13:40:37

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2512.10637v1

Objectives and Design Principles in Offline Payments with Central Bank Digital Currency (CBDC)

In this work, fundamental design principles for a central bank digital currency (CBDC) with offline functionality, together with corresponding countermeasures, are discussed. We identify three major objectives for any such CBDC proposal: (i) Access Control Security - protection of a user's funds against unauthorized access by other users; (ii) Security against Depositor's Misbehavior - preservation of the integrity of an environment (potentially the wallet) against misbehavior of its owner (for example, double-spending); and (iii) Privacy by Design - ensuring privacy is embedded into the system architecture. Our central conclusion is the alignment of these objectives with concrete design elements as countermeasures, where certain objectives and countermeasures exhibit no or minimal interference with each other. For example, we work out that the integrity of a user's wallet and, accordingly, the prevention of double-spending race attacks should be addressed through the adoption and integration of \textit{secure hardware} within a CBDC system.

Updated: 2025-12-11 13:39:50

标题: 中央银行数字货币(CBDC)的离线支付目标和设计原则

摘要: 在这项工作中,讨论了具有离线功能和相应对策的中央银行数字货币(CBDC)的基本设计原则。我们确定了任何此类CBDC提案的三个主要目标:(i)访问控制安全 - 保护用户资金免受其他用户的未经授权访问;(ii)防止存款人不当行为的安全性 - 保护环境(可能是钱包)的完整性,防止其所有者的不当行为(例如,双重支付);和(iii)隐私设计 - 确保隐私嵌入到系统架构中。我们的中心结论是将这些目标与具体的设计要素作为对策进行对齐,而某些目标和对策之间没有或只有最小的干扰。例如,我们得出结论认为用户钱包的完整性以及因此防止双重支付竞赛攻击应通过采用和集成\textit {安全硬件} 来解决CBDC系统内。

更新时间: 2025-12-11 13:39:50

领域: cs.CR

下载: http://arxiv.org/abs/2512.10636v1

Supporting Migration Policies with Forecasts: Illegal Border Crossings in Europe through a Mixed Approach

This paper presents a mixed-methodology to forecast illegal border crossings in Europe across five key migratory routes, with a one-year time horizon. The methodology integrates machine learning techniques with qualitative insights from migration experts. This approach aims at improving the predictive capacity of data-driven models through the inclusion of a human-assessed covariate, an innovation that addresses challenges posed by sudden shifts in migration patterns and limitations in traditional datasets. The proposed methodology responds directly to the forecasting needs outlined in the EU Pact on Migration and Asylum, supporting the Asylum and Migration Management Regulation (AMMR). It is designed to provide policy-relevant forecasts that inform strategic decisions, early warning systems, and solidarity mechanisms among EU Member States. By joining data-driven modeling with expert judgment, this work aligns with existing academic recommendations and introduces a novel operational tool tailored for EU migration governance. The methodology is tested and validated with known data to demonstrate its applicability and reliability in migration-related policy context.

Updated: 2025-12-11 13:33:25

标题: 用预测支持移民政策:欧洲非法边境越境的混合方法

摘要: 这篇论文提出了一种混合方法,用于预测欧洲五条主要移民路线上的非法边境越境情况,预测时间跨度为一年。该方法将机器学习技术与移民专家的定性见解相结合。这种方法旨在通过包含人工评估的协变量来提高数据驱动模型的预测能力,这一创新解决了移民模式突然转变和传统数据集的局限性带来的挑战。所提出的方法直接响应了欧盟移民与庇护政策宪章中提出的预测需求,支持《庇护和移民管理条例》(AMMR)。该方法旨在提供政策相关的预测,以指导战略决策、早期警告系统和欧盟成员国之间的团结机制。通过将数据驱动建模与专家判断结合起来,这项工作符合现有的学术建议,并引入了一种针对欧盟移民治理量身定制的新型操作工具。该方法经过已知数据的测试和验证,以展示其在与移民相关的政策背景中的适用性和可靠性。

更新时间: 2025-12-11 13:33:25

领域: cs.LG,cs.SI,stat.AP

下载: http://arxiv.org/abs/2512.10633v1

Unified Smart Factory Model: A model-based Approach for Integrating Industry 4.0 and Sustainability for Manufacturing Systems

This paper presents the Unified Smart Factory Model (USFM), a comprehensive framework designed to translate high-level sustainability goals into measurable factory-level indicators with a systematic information map of manufacturing activities. The manufacturing activities were modelled as a set of manufacturing, assembly and auxiliary processes using Object Process Methodology, a Model-Based Systems Engineering (MBSE) language. USFM integrates Manufacturing Process and System, Data Process, and Key Performance Indicator (KPI) Selection and Assessment in a single framework. Through a detailed case study of a Printed Circuit Board (PCB) assembly factory, the paper demonstrates how environmental sustainability KPIs can be selected, modelled, and mapped to the necessary data, highlighting energy consumption and environmental impact metrics. The model's systematic approach can reduce redundancy, minimize the risk of missing critical information, and enhance data collection. The paper concludes that the USFM bridges the gap between sustainability goals and practical implementation, providing significant benefits for industries, specifically SMEs, aiming to achieve sustainability targets.

Updated: 2025-12-11 13:30:38

标题: 统一智能工厂模型:基于模型的方法,实现工业4.0和制造系统可持续性的整合

摘要: 这篇论文介绍了统一智能工厂模型(USFM),这是一个综合框架,旨在将高级可持续发展目标转化为可衡量的工厂级指标,并具有一种制造活动的系统信息图。利用对象过程方法,将制造活动建模为一组制造、装配和辅助流程,这是一种基于模型的系统工程(MBSE)语言。USFM整合了制造过程和系统、数据处理以及关键绩效指标(KPI)选择和评估在一个单一框架中。通过对印刷电路板(PCB)装配工厂的详细案例研究,本文展示了如何选择、建模和映射环境可持续性KPI,并突出了能源消耗和环境影响度量。该模型的系统方法可以减少冗余,减少错过关键信息的风险,并增强数据收集。本文得出结论,USFM弥合了可持续发展目标和实际实施之间的差距,为特别是旨在实现可持续发展目标的中小型企业带来重大益处。

更新时间: 2025-12-11 13:30:38

领域: cs.AI

下载: http://arxiv.org/abs/2512.10631v1

GT-SNT: A Linear-Time Transformer for Large-Scale Graphs via Spiking Node Tokenization

Graph Transformers (GTs), which integrate message passing and self-attention mechanisms simultaneously, have achieved promising empirical results in graph prediction tasks. However, the design of scalable and topology-aware node tokenization has lagged behind other modalities. This gap becomes critical as the quadratic complexity of full attention renders GTs impractical on large-scale graphs. Recently, Spiking Neural Networks (SNNs), as brain-inspired models, have provided an energy-saving scheme to convert input intensity into discrete spike-based representations through event-driven spiking neurons. Inspired by these characteristics, we propose a linear-time Graph Transformer with Spiking Node Tokenization (GT-SNT) for node classification. By integrating multi-step feature propagation with SNNs, spiking node tokenization generates compact, locality-aware spike count embeddings as node tokens to avoid predefined codebooks and their utilization issues. The codebook-guided self-attention leverages these tokens to perform node-to-token attention for linear-time global context aggregation. In experiments, we compare GT-SNT with other state-of-the-art baselines on node classification datasets ranging from small to large. Experimental results show that GT-SNT achieves comparable performance on most datasets and reaches up to 130x faster inference speed compared to other GTs.
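
A minimal sketch of spiking node tokenization (illustrative integrate-and-fire dynamics, not the exact GT-SNT neuron model): multi-step feature propagation drives spiking neurons, and per-node spike counts become compact locality-aware tokens.

```python
import torch

def spike_tokens(A_hat, X, steps=4, v_th=1.0):
    """A_hat: (N, N) normalized adjacency; X: (N, d) features -> spike counts."""
    v = torch.zeros_like(X)                    # membrane potential
    counts = torch.zeros_like(X)
    h = X
    for _ in range(steps):
        h = A_hat @ h                          # one propagation step
        v = v + h                              # integrate
        spikes = (v >= v_th).float()           # fire where threshold is crossed
        v = v - spikes * v_th                  # soft reset
        counts += spikes
    return counts                              # (N, d) integer-valued node tokens

N, d = 5, 8
A = torch.rand(N, N)
A_hat = A / A.sum(1, keepdim=True)             # row-normalized adjacency
print(spike_tokens(A_hat, torch.rand(N, d)))
# The discrete counts act as a data-dependent codebook, avoiding the fixed
# codebooks (and their under-utilization issues) of other tokenized GTs.
```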

Updated: 2025-12-11 13:28:05

标题: GT-SNT:通过尖峰节点标记实现大规模图的线性时间变换器

摘要: 图形变换器(GTs)同时集成了消息传递和自注意机制,在图预测任务中取得了令人满意的实证结果。然而,可扩展且具有拓扑意识的节点标记设计落后于其他模态。随着完全注意力的二次复杂度使它们在大规模图上变得不切实际,这个差距变得至关重要。最近,作为脑启发模型的脉冲神经网络(SNNs)提供了一个节能方案,通过事件驱动的脉冲神经元将输入强度转换为离散的基于脉冲的表示。受到这些特征的启发,我们提出了一种具有脉冲节点标记的线性时间图形变换器(GT-SNT)用于节点分类。通过将多步特征传播与SNNs集成,脉冲节点标记生成紧凑的、具有局部感知的脉冲计数嵌入作为节点标记,以避免预定义的码本及其利用问题。码本引导的自注意机制利用这些标记执行节点到标记的注意机制,用于线性时间全局上下文聚合。在实验中,我们将GT-SNT与其他最先进的基线方法在从小到大的节点分类数据集上进行比较。实验结果表明,GT-SNT在大多数数据集上实现了可比较的性能,并且相比其他GTs,推理速度提高了高达130倍。

更新时间: 2025-12-11 13:28:05

领域: cs.NE,cs.LG

下载: http://arxiv.org/abs/2504.11840v2

Distributional Shrinkage I: Universal Denoisers in Multi-Dimensions

We revisit the problem of denoising from noisy measurements where only the noise level is known, not the noise distribution. In multi-dimensions, independent noise $Z$ corrupts the signal $X$, resulting in the noisy measurement $Y = X + \sigma Z$, where $\sigma \in (0, 1)$ is a known noise level. Our goal is to recover the underlying signal distribution $P_X$ from denoising $P_Y$. We propose and analyze universal denoisers that are agnostic to a wide range of signal and noise distributions. Our distributional denoisers offer order-of-magnitude improvements over the Bayes-optimal denoiser derived from Tweedie's formula, if the focus is on the entire distribution $P_X$ rather than on individual realizations of $X$. Our denoisers shrink $P_Y$ toward $P_X$ optimally, achieving $O(\sigma^4)$ and $O(\sigma^6)$ accuracy in matching generalized moments and density functions. Inspired by optimal transport theory, the proposed denoisers are optimal in approximating the Monge-Ampère equation with higher-order accuracy, and can be implemented efficiently via score matching. Let $q$ represent the density of $P_Y$; for optimal distributional denoising, we recommend replacing the Bayes-optimal denoiser, \[ \mathbf{T}^*(y) = y + \sigma^2 \nabla \log q(y), \] with denoisers exhibiting less aggressive distributional shrinkage, \[ \mathbf{T}_1(y) = y + \frac{\sigma^2}{2} \nabla \log q(y), \] \[ \mathbf{T}_2(y) = y + \frac{\sigma^2}{2} \nabla \log q(y) - \frac{\sigma^4}{8} \nabla \left( \frac{1}{2} \| \nabla \log q(y) \|^2 + \nabla \cdot \nabla \log q(y) \right) . \]
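
The three denoisers can be evaluated numerically with automatic differentiation once $\nabla \log q$ is available. Below is a small sketch on a Gaussian example where $q$ is known in closed form (in practice the score would come from score matching; the shapes and test point are arbitrary):

```python
import torch

sigma, d = 0.5, 2

def log_q(y):                                 # Y = X + sigma*Z with X, Z ~ N(0, I)
    return -0.5 * (y * y).sum(-1) / (1 + sigma**2)

def score(y0):                                # y with grad graph, and its score
    y = y0.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(log_q(y).sum(), y, create_graph=True)
    return y, g

def T_bayes(y0):
    y, g = score(y0)
    return (y + sigma**2 * g).detach()

def T1(y0):
    y, g = score(y0)
    return (y + 0.5 * sigma**2 * g).detach()

def T2(y0):
    y, g = score(y0)
    div = sum(torch.autograd.grad(g[..., i].sum(), y, create_graph=True)[0][..., i]
              for i in range(d))               # divergence of the score
    phi = 0.5 * (g * g).sum(-1) + div
    (grad_phi,) = torch.autograd.grad(phi.sum(), y)
    return (y + 0.5 * sigma**2 * g - (sigma**4 / 8) * grad_phi).detach()

y = torch.tensor([[1.0, -2.0]])
print(T_bayes(y), T1(y), T2(y))               # T1/T2 shrink roughly half as hard
```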

Updated: 2025-12-11 13:24:15

标题: 分布式收缩I:多维度中的通用去噪器

摘要: 我们重新审视了只知道噪声水平而不知道噪声分布的情况下,从嘈杂测量中去噪的问题。在多维情形中,独立噪声$Z$会损坏信号$X$,得到嘈杂测量$Y = X + \sigma Z$,其中$\sigma \in (0, 1)$是已知的噪声水平。我们的目标是通过对$P_Y$去噪来恢复基础信号分布$P_X$。我们提出并分析了对广泛的信号和噪声分布均不敏感的通用去噪器。如果关注的是整个分布$P_X$而不是$X$的个别实现,相比于由Tweedie公式导出的贝叶斯最优去噪器,我们的分布去噪器可带来数量级的改进。我们的去噪器将$P_Y$最优地收缩到$P_X$,在匹配广义矩和密度函数方面分别达到$O(\sigma^4)$和$O(\sigma^6)$的精度。受最优输运理论的启发,所提出的去噪器在以更高阶精度逼近Monge-Ampère方程方面是最优的,并且可以通过得分匹配高效实现。 设$q$为$P_Y$的密度;对于最优的分布去噪,我们建议将贝叶斯最优去噪器 \[ \mathbf{T}^*(y) = y + \sigma^2 \nabla \log q(y) \] 替换为收缩更温和的去噪器 \[ \mathbf{T}_1(y) = y + \frac{\sigma^2}{2} \nabla \log q(y), \] \[ \mathbf{T}_2(y) = y + \frac{\sigma^2}{2} \nabla \log q(y) - \frac{\sigma^4}{8} \nabla \left( \frac{1}{2} \| \nabla \log q(y) \|^2 + \nabla \cdot \nabla \log q(y) \right) . \]

更新时间: 2025-12-11 13:24:15

领域: stat.ML,cs.LG,math.ST

下载: http://arxiv.org/abs/2511.09500v2

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 40 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.

Updated: 2025-12-11 13:21:59

标题: SpatialScore:朝向空间智能的全面评估

摘要: 现有对多模态大型语言模型(MLLMs)在空间智能上的评估通常是零散的,范围有限。在这项工作中,我们旨在对现代MLLMs的空间理解能力进行全面评估,并提出了数据驱动和基于代理的互补解决方案。具体而言,我们做出了以下贡献:(i)我们引入了SpatialScore,据我们所知,这是迄今为止最全面和多样化的多模态空间智能基准。它涵盖了多种视觉数据类型、输入模态和问答格式,并包含大约5K个手动验证的样本,涵盖30个不同的任务;(ii)利用SpatialScore,我们广泛评估了40个代表性的MLLMs,揭示了持续存在的挑战和当前模型与人类水平空间智能之间的显著差距;(iii)为了提升模型能力,我们构建了SpatialCorpus,这是一个包含331K个多模态QA样本的大规模训练资源,支持在空间推理任务上进行微调,并显著提高了现有模型(例如Qwen3-VL)的性能;(iv)为了在无需训练的情况下补充这种数据驱动的路径,我们开发了SpatialAgent,这是一个配备12种专门的空间感知工具的多代理系统,支持计划执行和反应推理,从而在空间推理方面取得重大进展,而无需额外的模型训练。大量实验和深入分析证明了我们基准、语料库和代理框架的有效性。我们期望这些资源能够为推动MLLMs朝着人类水平的空间智能发展奠定坚实基础。所有数据、代码和模型将向研究社区发布。

更新时间: 2025-12-11 13:21:59

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2505.17012v2

Advancing Mathematical Research via Human-AI Interactive Theorem Proving

We investigate how large language models can be used as research tools in scientific computing while preserving mathematical rigor. We propose a human-in-the-loop workflow for interactive theorem proving and discovery with LLMs. Human experts retain control over problem formulation and admissible assumptions, while the model searches for proofs or contradictions, proposes candidate properties and theorems, and helps construct structures and parameters that satisfy explicit constraints, supported by numerical experiments and simple verification checks. Experts treat these outputs as raw material, further refine them, and organize the results into precise statements and rigorous proofs. We instantiate this workflow in a case study on the connection between manifold optimization and Grover's quantum search algorithm, where the pipeline helps identify invariant subspaces, explore Grover-compatible retractions, and obtain convergence guarantees for the retraction-based gradient method. The framework provides a practical template for integrating large language models into frontier mathematical research, enabling faster exploration of proof space and algorithm design while maintaining transparent reasoning responsibilities. Although illustrated on manifold optimization problems in quantum computing, the principles extend to other core areas of scientific computing.

Updated: 2025-12-11 13:10:50

标题: 通过人机交互式定理证明推动数学研究

摘要: 我们研究了如何在科学计算中保持数学严谨的同时,将大型语言模型用作研究工具。我们提出了一个人机协同的工作流程,用于与LLMs互动证明和发现。人类专家保留对问题表述和可接受假设的控制权,而模型则搜索证明或矛盾,提出候选属性和定理,并帮助构建满足显式约束的结构和参数,支持数值实验和简单验证检查。专家将这些输出视为原始材料,进一步完善它们,并将结果组织成精确的陈述和严谨的证明。我们在流形优化和Grover量子搜索算法之间的连接案例研究中实例化了这个工作流程,其中该流水线帮助识别不变子空间,探索与Grover兼容的缩影,并为基于缩影的梯度方法获得收敛保证。该框架为将大型语言模型整合到前沿数学研究提供了一个实用模板,使得在保持透明推理职责的同时,可以更快地探索证明空间和算法设计。虽然在量子计算中的流形优化问题中进行了说明,但这些原则适用于科学计算的其他核心领域。

更新时间: 2025-12-11 13:10:50

领域: cs.HC,cs.AI,math.OC

下载: http://arxiv.org/abs/2512.09443v2

Phythesis: Physics-Guided Evolutionary Scene Synthesis for Energy-Efficient Data Center Design via LLMs

Data center (DC) infrastructure serves as the backbone to support the escalating demand for computing capacity. Traditional design methodologies that blend human expertise with specialized simulation tools scale poorly with increasing system complexity. Recent studies adopt generative artificial intelligence to design plausible human-centric indoor layouts. However, they do not consider the underlying physics, making them unsuitable for DC design, which sets quantifiable operational objectives and strict physical constraints. To bridge the gap, we propose Phythesis, a novel framework that synergizes large language models (LLMs) and physics-guided evolutionary optimization to automate simulation-ready (SimReady) scene synthesis for energy-efficient DC design. Phythesis employs an iterative bi-level optimization architecture, where (i) the LLM-driven optimization level generates physically plausible three-dimensional layouts and self-criticizes them to refine the scene topology, and (ii) the physics-informed optimization level identifies the optimal asset parameters and selects the best asset combination. Experiments on three generation scales show that Phythesis achieves a 57.3% increase in generation success rate and an 11.5% improvement in power usage effectiveness (PUE), compared with the vanilla LLM-based solution.
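
A minimal sketch of one generation of the bi-level loop described above, assuming hypothetical llm_propose, llm_critique, optimize_assets, and simulate_pue callables in place of Phythesis's actual components:

    def evolve_generation(population, llm_propose, llm_critique,
                          optimize_assets, simulate_pue):
        scored = []
        for parent in population:
            layout = llm_propose(parent)          # LLM level: draft a 3D layout
            layout = llm_critique(layout)         # LLM level: self-criticize the topology
            layout = optimize_assets(layout)      # physics level: tune asset parameters
            scored.append((simulate_pue(layout), layout))
        scored.sort(key=lambda pair: pair[0])     # lower PUE is better
        return [layout for _, layout in scored[:len(population)]]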

Updated: 2025-12-11 13:04:44

标题: Phythesis:基于物理引导的进化场景综合,用于通过LLMs实现能效数据中心设计

摘要: 数据中心(DC)基础设施作为支持不断增长的计算需求的支柱。传统的设计方法很难随着系统复杂性的增加而扩展,因为它们结合了人类专业知识和专门的仿真工具。近期的研究采用生成式人工智能来设计合理的以人为中心的室内布局。然而,它们并未考虑基础物理学,因此不适用于设定可量化的运营目标和严格的物理约束的DC设计。为了弥合这一差距,我们提出了Phythesis,这是一个新颖的框架,它结合了大型语言模型(LLMs)和物理引导的进化优化,以自动化模拟就绪(SimReady)场景合成,用于高效的DC设计。Phythesis采用迭代的双层优化架构,其中LLM驱动的优化层生成了物理上合理的三维布局,并对其进行自我批评以优化场景拓扑,物理信息优化层确定了最佳的资产参数和选择最佳的资产组合。在三个不同规模的实验中,Phythesis相比于基于vanilla LLM的解决方案,实现了57.3%的生成成功率提高和11.5%的功耗效率(PUE)改善。

更新时间: 2025-12-11 13:04:44

领域: cs.AI,cs.NE

下载: http://arxiv.org/abs/2512.10611v1

Object-centric proto-symbolic behavioural reasoning from pixels

Autonomous intelligent agents must bridge computational challenges at disparate levels of abstraction, from the low-level spaces of sensory input and motor commands to the high-level domain of abstract reasoning and planning. A key question in designing such agents is how best to instantiate the representational space that will interface between these two levels -- ideally without requiring supervision in the form of expensive data annotations. These objectives can be efficiently achieved by representing the world in terms of objects (grounded in perception and action). In this work, we present a novel, brain-inspired, deep-learning architecture that learns from pixels to interpret, control, and reason about its environment, using object-centric representations. We show the utility of our approach through tasks in synthetic environments that require a combination of (high-level) logical reasoning and (low-level) continuous control. Results show that the agent can learn emergent conditional behavioural reasoning, such as $(A \to B) \land (\neg A \to C)$, as well as logical composition $(A \to B) \land (A \to C) \vdash A \to (B \land C)$ and XOR operations, and successfully controls its environment to satisfy objectives deduced from these logical rules. The agent can adapt online to unexpected changes in its environment and is robust to mild violations of its world model, thanks to dynamic internal desired goal generation. While the present results are limited to synthetic settings (2D and 3D activated versions of dSprites), which fall short of real-world levels of complexity, the proposed architecture shows how to manipulate grounded object representations, as a key inductive bias for unsupervised learning, to enable behavioral reasoning.
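
The quoted rules are ordinary propositional statements; a toy rendering makes the deduced goal sets explicit (this is plain symbolic logic, not the paper's learned mechanism):

    def behave(A: bool) -> set:
        # (A -> B) ^ (~A -> C): pursue B when A holds, C otherwise
        return {"B"} if A else {"C"}

    def compose(A: bool) -> set:
        # (A -> B) ^ (A -> C) entails A -> (B ^ C): both goals under A
        return {"B", "C"} if A else set()

    assert behave(True) == {"B"} and behave(False) == {"C"}
    assert compose(True) == {"B", "C"}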

Updated: 2025-12-11 12:52:59

标题: 从像素中的对象中心原型符号化行为推理

摘要: 自主智能代理必须在不同抽象级别的计算挑战之间建立桥梁,从感官输入和运动命令的低层空间到抽象推理和规划的高层领域。设计这种代理的一个关键问题是如何最好地实例化表示空间,使其在这两个级别之间进行接口 -- 理想情况下不需要昂贵的数据注释监督。通过以对象为基础(基于感知和行动)来表示世界,可以有效地实现这些目标。在这项工作中,我们提出了一种新颖的、受大脑启发的深度学习架构,它从像素中学习来解释、控制和推理其环境,使用以对象为中心的表示。我们通过合成环境中的任务展示了我们方法的实用性,这些任务需要组合(高级)逻辑推理和(低级)连续控制。结果表明,代理可以学习出现的条件行为推理,如$(A \to B) \land (\neg A \to C)$,以及逻辑组合$(A \to B) \land (A \to C) \vdash A \to (B \land C)$和异或运算,并成功控制其环境以满足从这些逻辑规则推导出的目标。代理可以在线适应环境中的意外变化,并且对其世界模型的轻微违规具有鲁棒性,这要归功于动态内部期望目标生成。尽管目前的结果仅限于合成环境(2D和3D激活版本的dSprites),这些环境还不足以达到真实世界的复杂程度,但所提出的架构展示了如何操纵基于对象的表示,作为无监督学习的一个关键归纳偏见,以实现行为推理。

更新时间: 2025-12-11 12:52:59

领域: cs.AI,cs.CV,cs.LG,cs.NE

下载: http://arxiv.org/abs/2411.17438v3

Uncertainty-Preserving QBNNs: Multi-Level Quantization of SVI-Based Bayesian Neural Networks for Image Classification

Bayesian Neural Networks (BNNs) provide principled uncertainty quantification but suffer from substantial computational and memory overhead compared to deterministic networks. While quantization techniques have successfully reduced resource requirements in standard deep learning models, their application to probabilistic models remains largely unexplored. We introduce a systematic multi-level quantization framework for Stochastic Variational Inference (SVI) based BNNs that distinguishes between three quantization strategies: Variational Parameter Quantization (VPQ), Sampled Parameter Quantization (SPQ), and Joint Quantization (JQ). Logarithmic quantization of the variance parameters and specialized activation functions that preserve the distributional structure prove essential for calibrated uncertainty estimation. Through comprehensive experiments on Dirty-MNIST, we demonstrate that BNNs can be quantized down to 4-bit precision while maintaining both classification accuracy and uncertainty disentanglement. At 4 bits, Joint Quantization achieves up to 8x memory reduction compared to floating-point implementations, with minimal degradation in epistemic and aleatoric uncertainty estimation. These results enable deployment of BNNs on resource-constrained edge devices and provide design guidelines for future analog "Bayesian Machines" operating at inherently low precision.
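
A minimal sketch of logarithmic quantization for the variational standard deviations, assuming uniform rounding in log space; the bit width, clipping range, and rounding mode are illustrative choices, not the paper's exact recipe:

    import numpy as np

    def log_quantize_sigma(sigma, bits=4, lo=1e-4, hi=1.0):
        levels = 2 ** bits
        log_sigma = np.clip(np.log(sigma), np.log(lo), np.log(hi))
        step = (np.log(hi) - np.log(lo)) / (levels - 1)
        code = np.round((log_sigma - np.log(lo)) / step)    # integer in 0..levels-1
        return np.exp(np.log(lo) + code * step)             # dequantized std-dev

    print(log_quantize_sigma(np.array([3e-4, 0.02, 0.3]), bits=4))

Quantizing in log space keeps relative resolution roughly constant across orders of magnitude, which matters because posterior variances typically span several decades.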

Updated: 2025-12-11 12:51:42

标题: 不确定性保留QBNNs:基于SVI的贝叶斯神经网络的多级量化用于图像分类

摘要: 贝叶斯神经网络(BNNs)提供了合理的不确定性量化,但与确定性网络相比,存在着相当大的计算和内存开销。虽然量化技术成功地降低了标准深度学习模型的资源要求,但它们在概率模型中的应用仍然很少被探索。我们引入了一种系统化的基于随机变分推断的BNNs的多级量化框架,区分了三种量化策略:变分参数量化(VPQ)、采样参数量化(SPQ)和联合量化(JQ)。我们的对方差参数进行对数量化的方法,以及专门设计的激活函数来保留分布结构,对于准确的不确定性估计是至关重要的。通过在Dirty-MNIST上进行全面实验,我们展示了BNNs可以被量化到4位精度,同时保持分类准确性和不确定性分离。在4位精度下,与浮点实现相比,联合量化实现了最多8倍的内存缩减,同时认知和偶然不确定性估计的退化最小。这些结果使得BNNs可以部署在资源受限的边缘设备上,并为未来以固有低精度运行的模拟“贝叶斯机器”的设计提供了指导。

更新时间: 2025-12-11 12:51:42

领域: cs.LG

下载: http://arxiv.org/abs/2512.10602v1

Multi-Objective Reward and Preference Optimization: Theory and Algorithms

This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov Decision Processes (CMDPs) under the average-cost criterion through the Average-Constrained Policy Optimization (ACPO) algorithm. ACPO integrates sensitivity analysis with trust-region updates to ensure stable constraint handling, achieving state-of-the-art empirical performance with theoretical guarantees. Constrained RL is then extended to finite-horizon settings via e-COP, the first policy optimization method for episodic CMDPs. Built on an episodic policy difference lemma, e-COP offers provable performance, simplicity, and scalability in safety-critical environments. The thesis then investigates reinforcement learning from human preferences. warmPref-PS introduces a posterior sampling strategy for linear bandits that integrates offline preference data from heterogeneous raters into online learning. Explicit modeling of rater competence yields substantial regret reduction and more efficient data collection for RLHF. The PSPL algorithm further advances preference-based RL by jointly sampling reward models and transition dynamics from pairwise trajectory comparisons, providing Bayesian simple-regret guarantees and robust empirical identification of optimal policies. The final contribution applies these methods to large-scale model alignment. A multi-objective constrained optimization view yields MOPO, an iterative algorithm with closed-form updates that scales to multi-billion-parameter language models and remains robust across alignment settings. Collectively, the thesis unifies constrained RL across average-cost, episodic, and preference-driven paradigms, delivering theoretical advances and practical tools for safe and aligned decision-making.

Updated: 2025-12-11 12:51:21

标题: 多目标奖励和偏好优化:理论与算法

摘要: 这篇论文发展了在控制、偏好学习和大型语言模型对齐方面推进受限制强化学习(RL)的理论框架和算法。第一个贡献通过平均成本准则下的平均约束策略优化(ACPO)算法解决了受限制的马尔可夫决策过程(CMDPs)。ACPO将敏感性分析与信任区域更新相结合,确保稳定的约束处理,实现了具有理论保证的最先进的实证性能。然后,通过e-COP将受限制的RL扩展到有限时间段的设置,这是用于叙事CMDPs的第一种策略优化方法。基于叙事策略差异引理,e-COP在安全关键环境中提供可证明的性能、简单性和可扩展性。接着,论文研究了基于人类偏好的强化学习。warmPref-PS为线性赌博机引入了一个后验采样策略,将来自异质评分者的离线偏好数据整合到在线学习中。对评分者能力的明确建模显著减少了遗憾,并为RLHF提供更有效的数据收集。PSPL算法通过从成对轨迹比较中同时采样奖励模型和转移动态,提供了贝叶斯简单遗憾保证和对最优策略的稳健实证识别,进一步推进了基于偏好的RL。最后的贡献将这些方法应用于大规模模型对齐。多目标受限制优化视图提供了MOPO,这是一个迭代算法,具有封闭形式的更新,可扩展到拥有数十亿参数的语言模型,并在对齐设置中保持稳健。总的来说,这篇论文统一了关于平均成本、叙事和偏好驱动范式的受限制RL,为安全和对齐决策提供了理论进展和实用工具。

更新时间: 2025-12-11 12:51:21

领域: cs.LG

下载: http://arxiv.org/abs/2512.10601v1

Authority Backdoor: A Certifiable Backdoor Mechanism for Authoring DNNs

Deep Neural Networks (DNNs), as valuable intellectual property, face unauthorized use. Existing protections, such as digital watermarking, are largely passive; they provide only post-hoc ownership verification and cannot actively prevent the illicit use of a stolen model. This work proposes a proactive protection scheme, dubbed "Authority Backdoor," which embeds access constraints directly into the model. In particular, the scheme utilizes a backdoor learning framework to intrinsically lock a model's utility, such that it performs normally only in the presence of a specific trigger (e.g., a hardware fingerprint); in the trigger's absence, the DNN's performance degrades to the point of being useless. To further enhance the security of the proposed authority scheme, certifiable robustness is integrated to prevent an adaptive attacker from removing the implanted backdoor. The resulting framework establishes a secure authority mechanism for DNNs, combining access control with certifiable robustness against adversarial attacks. Extensive experiments on diverse architectures and datasets validate the effectiveness and certifiable robustness of the proposed framework.

Updated: 2025-12-11 12:50:39

标题: 权威后门:一种用于编写DNN的可验证后门机制

摘要: 深度神经网络(DNNs)作为有价值的知识产权,面临未经授权的使用。现有的保护措施,如数字水印,主要是被动的;它们只能提供事后所有权验证,无法主动阻止盗窃模型的非法使用。本文提出了一种积极的保护方案,命名为“授权后门”,直接将访问限制嵌入到模型中。具体而言,该方案利用后门学习框架,固有地锁定模型的实用性,使其只有在特定触发器(例如硬件指纹)存在时才能正常运行。但在触发器不存在时,DNN的性能会下降到无用的程度。为了进一步增强所提出的授权方案的安全性,将可证明的强健性集成进来,以防止自适应攻击者移除植入的后门。所得到的框架建立了一个用于DNN的安全授权机制,将访问控制与针对对抗性攻击的可证明强健性相结合。对各种架构和数据集进行的广泛实验验证了所提出框架的有效性和可证明的强健性。

更新时间: 2025-12-11 12:50:39

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2512.10600v1

Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval

Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
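
The core retrieval step reduces to nearest-neighbor search in a single textual embedding space. A minimal sketch, where embed stands in for any off-the-shelf sentence encoder and captions holds one VLM-generated caption per image:

    import numpy as np

    def retrieve(query, captions, embed, k=5):
        C = np.stack([embed(c) for c in captions])      # caption database
        C /= np.linalg.norm(C, axis=1, keepdims=True)
        q = embed(query)
        q /= np.linalg.norm(q)
        scores = C @ q                                  # cosine similarity
        return np.argsort(-scores)[:k]                  # indices of top-k images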

Updated: 2025-12-11 12:43:41

标题: 超越像素:一种无需训练的遥感图像检索文本-文本框架

摘要: 遥感图像的语义检索是一项关键任务,根本上受到“语义鸿沟”的挑战,即模型的低级视觉特征与高级人类概念之间的差异。尽管大型视觉-语言模型(VLMs)为弥合这一鸿沟提供了一条有希望的途径,但现有方法往往依赖昂贵的、领域特定的训练,并且缺乏用于评估在零样本检索环境中VLM生成的文本实际效用的基准。为了填补这一研究空白,我们引入了遥感丰富文本(RSRT)数据集,这是一个新的基准,每个图像具有多个结构化标题。基于这个数据集,我们提出了一个完全无需训练的、仅使用文本的检索参考方法称为TRSLLaVA。我们的方法重新构建了跨模态检索问题,将其视为文本到文本(T2T)匹配问题,利用丰富的文本描述作为查询,在一个统一的文本嵌入空间中与VLM生成的标题数据库进行匹配。这种方法完全绕过了模型训练或微调。在RSITMD和RSICD基准上进行的实验表明,我们的无需训练方法与最先进的监督模型竞争力强。例如,在RSITMD上,我们的方法实现了42.62%的平均召回率,几乎是标准零样本CLIP基线的23.86%的两倍,并超过了几个顶级监督模型。这证实了通过结构化文本提供高质量的语义表示是遥感图像检索的一种强大且经济有效的范式。

更新时间: 2025-12-11 12:43:41

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.10596v1

Lightweight Model Attribution and Detection of Synthetic Speech via Audio Residual Fingerprints

As speech generation technologies advance, so do risks of impersonation, misinformation, and spoofing. We present a lightweight, training-free approach for detecting synthetic speech and attributing it to its source model. Our method addresses three tasks: (1) single-model attribution in an open-world setting, (2) multi-model attribution in a closed-world setting, and (3) real vs. synthetic speech classification. The core idea is simple: we compute standardized average residuals--the difference between an audio signal and its filtered version--to extract model-agnostic fingerprints that capture synthesis artifacts. Experiments across multiple synthesis systems and languages show AUROC scores above 99%, with strong reliability even when only a subset of model outputs is available. The method maintains high performance under common audio distortions, including echo and moderate background noise, while data augmentation can improve results in more challenging conditions. In addition, out-of-domain detection is performed using Mahalanobis distances to in-domain residual fingerprints, achieving an F1 score of 0.91 on unseen models, reinforcing the method's efficiency, generalizability, and suitability for digital forensics and security applications.
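
A minimal sketch of the residual idea: subtract a filtered copy of the waveform from the waveform itself, standardize, and average across clips to form a fingerprint. The moving-average filter and spectral averaging here are illustrative stand-ins for the paper's exact filtering choice:

    import numpy as np

    def residual(x, win=16):
        smooth = np.convolve(x, np.ones(win) / win, mode="same")
        return x - smooth                               # signal minus filtered version

    def fingerprint(clips, win=16, n_fft=512):
        specs = []
        for x in clips:
            r = residual(np.asarray(x, dtype=float), win)
            r = (r - r.mean()) / (r.std() + 1e-8)       # standardize the residual
            specs.append(np.abs(np.fft.rfft(r, n_fft)))
        return np.mean(specs, axis=0)                   # average residual signature

Attribution then reduces to comparing a clip's signature against per-model references, e.g., cosine similarity in the closed-world case, or Mahalanobis distance to the in-domain fingerprints for out-of-domain detection as described above.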

Updated: 2025-12-11 12:41:32

标题: 通过音频残留指纹轻量级模型归因和检测合成语音

摘要: 随着语音生成技术的进步,冒充、虚假信息和欺骗的风险也在增加。我们提出了一种轻量级、无需训练的方法,用于检测合成语音并将其归因于其源模型。我们的方法涉及三个任务:(1)在开放世界环境中进行单模型归因,(2)在封闭世界环境中进行多模型归因,和(3)真实 vs. 合成语音分类。核心思想很简单:我们计算标准化的平均残差--音频信号与其经过滤版本之间的差异--以提取捕获合成伪影的模型无关指纹。跨多个合成系统和语言的实验显示AUROC分数超过99%,即使只有模型输出的子集可用,也具有很强的可靠性。该方法在常见的音频失真情况下保持高性能,包括回声和适度的背景噪音,而数据增强可以改善在更具挑战性的条件下的结果。此外,通过使用马氏距离来检测领域外的残差指纹,实现了对未知模型的F1分数为0.91的检测,进一步强调了该方法在数字取证和安全应用中的高效性、泛化能力和适用性。

更新时间: 2025-12-11 12:41:32

领域: eess.AS,cs.CR,cs.LG

下载: http://arxiv.org/abs/2411.14013v4

THeGAU: Type-Aware Heterogeneous Graph Autoencoder and Augmentation

Heterogeneous Graph Neural Networks (HGNNs) are effective for modeling Heterogeneous Information Networks (HINs), which encode complex multi-typed entities and relations. However, HGNNs often suffer from type information loss and structural noise, limiting their representational fidelity and generalization. We propose THeGAU, a model-agnostic framework that combines a type-aware graph autoencoder with guided graph augmentation to improve node classification. THeGAU reconstructs schema-valid edges as an auxiliary task to preserve node-type semantics and introduces a decoder-driven augmentation mechanism to selectively refine noisy structures. This joint design enhances robustness, accuracy, and efficiency while significantly reducing computational overhead. Extensive experiments on three benchmark HIN datasets (IMDB, ACM, and DBLP) demonstrate that THeGAU consistently outperforms existing HGNN methods, achieving state-of-the-art performance across multiple backbones.

Updated: 2025-12-11 12:30:42

标题: THeGAU:类型感知的异质图自动编码器和增强

摘要: 异质图神经网络(HGNNs)对建模异质信息网络(HINs)有效,这些网络编码复杂的多类型实体和关系。然而,HGNNs经常遭受类型信息丢失和结构噪声的困扰,限制了它们的表示保真度和泛化能力。我们提出了THeGAU,这是一个模型无关的框架,将类型感知图自编码器与引导图增强相结合,以改善节点分类。THeGAU将模式有效边缘重建作为辅助任务,以保留节点类型语义,并引入由解码器驱动的增强机制来选择性地优化嘈杂结构。这种联合设计增强了鲁棒性、准确性和效率,同时显著降低了计算开销。对三个基准HIN数据集(IMDB、ACM和DBLP)的广泛实验表明,THeGAU始终优于现有的HGNN方法,在多个主干网络上实现了最先进的性能。

更新时间: 2025-12-11 12:30:42

领域: cs.LG

下载: http://arxiv.org/abs/2512.10589v1

Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations

Perceptual voice quality assessment plays a vital role in diagnosing and monitoring voice disorders. Traditional methods, such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and the Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS) scales, rely on expert raters and are prone to inter-rater variability, emphasizing the need for objective solutions. This study introduces the Voice Quality Assessment Network (VOQANet), a deep learning framework that employs an attention mechanism and Speech Foundation Model (SFM) embeddings to extract high-level features. To further enhance performance, we propose VOQANet+, which integrates self-supervised SFM embeddings with low-level acoustic descriptors, namely jitter, shimmer, and harmonics-to-noise ratio (HNR). Unlike previous approaches that focus solely on vowel-based phonation (PVQD-A), our models are evaluated on both vowel-level and sentence-level speech (PVQD-S) to assess generalizability. Experimental results demonstrate that sentence-based inputs yield higher accuracy, particularly at the patient level. Overall, VOQANet consistently outperforms baseline models in terms of root mean squared error (RMSE) and Pearson correlation coefficient across CAPE-V and GRBAS dimensions, with VOQANet+ achieving even greater performance gains. Additionally, VOQANet+ maintains consistent performance under noisy conditions, suggesting enhanced robustness for real-world and telehealth applications. This work highlights the value of combining SFM embeddings with low-level features for accurate and robust pathological voice assessment.

Updated: 2025-12-11 12:26:21

标题: 朝向通过低级描述符和基础模型表示结合的病理性声音的稳健评估

摘要: 感知声音质量评估在诊断和监测声音障碍中起着至关重要的作用。传统方法,如一致性听觉-感知声音评估 (CAPE-V) 和 Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS) 量表,依赖于专家评分员,并容易出现评分员之间的变异性,强调了需要客观解决方案的必要性。本研究介绍了 Voice Quality Assessment Network (VOQANet),这是一个深度学习框架,采用了注意机制和 Speech Foundation Model (SFM) 嵌入来提取高级特征。为了进一步提高性能,我们提出了 VOQANet+,它将自监督的 SFM 嵌入与低级声学描述符(即颤音、闪光和谐波噪声比)相结合。与之前专注于基于元音的发音(PVQD-A)的方法不同,我们的模型在元音级别和句子级别的语音(PVQD-S)上进行评估,以评估泛化能力。实验结果表明,基于句子的输入能够在患者级别特别提高准确性。总体而言,VOQANet 在 CAPE-V 和 GRBAS 维度上的均方根误差(RMSE)和皮尔逊相关系数方面一直优于基准模型,而 VOQANet+ 更进一步提高了性能。此外,VOQANet+ 在嘈杂环境下保持稳定的表现,表明在现实世界和远程医疗应用中具有增强的鲁棒性。这项工作突出了将 SFM 嵌入与低级特征结合以进行准确和稳健的病理性声音评估的价值。

更新时间: 2025-12-11 12:26:21

领域: cs.SD,cs.LG,eess.AS

下载: http://arxiv.org/abs/2505.21356v4

Topology-Guided Quantum GANs for Constrained Graph Generation

Quantum computing (QC) promises theoretical advantages, benefiting computational problems that are not efficiently simulatable on classical hardware. However, much of this theoretical speedup depends on the quantum circuit design solving the problem. We argue that QC literature has yet to explore more domain-specific ansatz-topologies, instead of relying on generic, one-size-fits-all architectures. In this work, we show that incorporating task-specific inductive biases -- specifically geometric priors -- into quantum circuit design can enhance the performance of hybrid Quantum Generative Adversarial Networks (QuGANs) on the task of generating geometrically constrained K4 graphs. We evaluate a portfolio of entanglement topologies and loss-function designs to assess their impact on both statistical fidelity and compliance with geometric constraints, including the Triangle and Ptolemaic inequalities. Our results show that aligning circuit topology with the underlying problem structure yields substantial benefits: the Triangle-topology QuGAN achieves the highest geometric validity among quantum models and matches the performance of classical Generative Adversarial Networks (GAN). Additionally, we showcase how specific architectural choices, such as entangling gate types, variance regularization and output-scaling govern the trade-off between geometric consistency and distributional accuracy, thus emphasizing the value of structured, task-aware quantum ansatz-topologies.
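
The geometric validity criteria are easy to state concretely: a generated K4 graph assigns a length to each of the six edges over four vertices, every triangle must satisfy the triangle inequality, and every vertex ordering must satisfy the Ptolemaic inequality. A small checker (the tolerance is an illustrative choice):

    from itertools import combinations, permutations

    def is_valid_k4(d, eps=1e-9):
        # d maps frozenset({i, j}) -> generated length of edge (i, j)
        V = [0, 1, 2, 3]
        for a, b, c in combinations(V, 3):              # triangle inequalities
            x = d[frozenset((a, b))]
            y = d[frozenset((b, c))]
            z = d[frozenset((a, c))]
            if x + y < z - eps or y + z < x - eps or x + z < y - eps:
                return False
        for a, b, c, e in permutations(V):              # Ptolemaic inequality
            if (d[frozenset((a, c))] * d[frozenset((b, e))] >
                    d[frozenset((a, b))] * d[frozenset((c, e))] +
                    d[frozenset((a, e))] * d[frozenset((b, c))] + eps):
                return False
        return True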

Updated: 2025-12-11 12:22:18

标题: 拓扑学引导的约束图生成的量子生成对抗网络

摘要: 量子计算(QC)承诺理论上的优势,有益于那些经典模拟效率低下的计算问题。然而,这种理论加速很大程度上取决于解决问题的量子电路设计。我们认为,QC文献尚未探索更多领域特定的ansatz拓扑结构,而是依赖于通用的、一刀切的架构。在这项工作中,我们展示了将任务特定的归纳偏差——特别是几何先验——纳入量子电路设计可以提高混合量子生成对抗网络(QuGANs)在生成受几何约束的K4图任务上的性能。我们评估了一系列纠缠拓扑结构和损失函数设计,以评估它们对统计保真度和与几何约束的符合度的影响,包括三角形和托勒密不等式。我们的结果显示,将电路拓扑结构与底层问题结构对齐具有显著的好处:三角形拓扑结构的QuGAN在量子模型中实现了最高的几何有效性,并与经典生成对抗网络(GAN)的性能相匹配。此外,我们展示了特定的架构选择,如纠缠门类型、方差正则化和输出缩放如何决定几何一致性和分布准确性之间的权衡,从而强调结构化、任务感知的量子ansatz拓扑结构的价值。

更新时间: 2025-12-11 12:22:18

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2512.10582v1

Is the Information Bottleneck Robust Enough? Towards Label-Noise Resistant Information Bottleneck Learning

The Information Bottleneck (IB) principle facilitates effective representation learning by preserving label-relevant information while compressing irrelevant information. However, its strong reliance on accurate labels makes it inherently vulnerable to label noise, which is prevalent in real-world scenarios, resulting in significant performance degradation and overfitting. To address this issue, we propose LaT-IB, a novel Label-Noise ResistanT Information Bottleneck method which introduces a "Minimal-Sufficient-Clean" (MSC) criterion. Instantiated as a mutual information regularizer to retain task-relevant information while discarding noise, MSC addresses standard IB's vulnerability to noisy label supervision. To achieve this, LaT-IB employs a noise-aware latent disentanglement that decomposes the latent representation into components aligned with the clean label space and the noise space. Theoretically, we first derive mutual information bounds for each component of our objective, including prediction, compression, and disentanglement, and moreover prove that optimizing it encourages representations invariant to input noise and separates clean and noisy label information. Furthermore, we design a three-phase training framework: Warmup, Knowledge Injection, and Robust Training, to progressively guide the model toward noise-resistant representations. Extensive experiments demonstrate that LaT-IB achieves superior robustness and efficiency under label noise, significantly enhancing its applicability in real-world scenarios.

Updated: 2025-12-11 12:01:20

标题: 信息瓶颈是否足够强大?走向抗标签噪声的信息瓶颈学习

摘要: 信息瓶颈(IB)原则通过保留与标签相关的信息并压缩无关信息,促进了有效的表示学习。然而,其对准确标签的强烈依赖使其天生容易受到标签噪声的影响,在现实场景中普遍存在,导致性能严重下降和过拟合。为解决这一问题,我们提出了LaT-IB,一种新颖的抗标签噪声的信息瓶颈方法,引入了“最小-充分-干净”(MSC)标准。MSC作为互信息正则化器,用于保留任务相关信息同时丢弃噪声,解决了标准IB对嘈杂标签监督的脆弱性。为实现这一目标,LaT-IB采用了噪声感知的潜在解耦,将潜在表示分解为与干净标签空间和噪声空间对齐的组件。在理论上,我们首先推导了我们的目标的每个组件的互信息界限,包括预测、压缩和解耦,此外还证明了优化它鼓励对输入噪声不变的表示,并将干净和嘈杂的标签信息分离。此外,我们设计了一个三阶段训练框架:预热、知识注入和稳健训练,逐步引导模型朝向抗噪声表示。大量实验证明,LaT-IB在标签噪声下实现了卓越的鲁棒性和效率,显著提高了在现实场景中具有标签噪声的鲁棒性和适用性。

更新时间: 2025-12-11 12:01:20

领域: cs.LG

下载: http://arxiv.org/abs/2512.10573v1

Flexible Deep Neural Networks for Partially Linear Survival Data

We propose a flexible deep neural network (DNN) framework for modeling survival data within a partially linear regression structure. The approach preserves interpretability through a parametric linear component for covariates of primary interest, while a nonparametric DNN component captures complex time-covariate interactions among nuisance variables. We refer to the method as FLEXI-Haz, a flexible hazard model with a partially linear structure. In contrast to existing DNN approaches for partially linear Cox models, FLEXI-Haz does not rely on the proportional hazards assumption. We establish theoretical guarantees: the neural network component attains minimax-optimal convergence rates based on composite Hölder classes, and the linear estimator is root-n consistent, asymptotically normal, and semiparametrically efficient. Extensive simulations and real-data analyses demonstrate that FLEXI-Haz provides accurate estimation of the linear effect, offering a principled and interpretable alternative to modern methods based on proportional hazards. Code for implementing FLEXI-Haz, as well as scripts for reproducing data analyses and simulations, is available at: https://github.com/AsafBanana/FLEXI-Haz
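
A minimal sketch of the partially linear decomposition, assuming the log-hazard splits into a linear term in the covariates of interest x plus an unrestricted network term in time t and nuisance covariates z. This is an illustrative parameterization consistent with the abstract, not the paper's exact estimator; note that no proportional-hazards form is imposed, since t enters the network jointly with z:

    import torch
    import torch.nn as nn

    class PartiallyLinearHazard(nn.Module):
        def __init__(self, dim_x, dim_z, hidden=64):
            super().__init__()
            self.beta = nn.Linear(dim_x, 1, bias=False)   # interpretable linear effect
            self.g = nn.Sequential(                       # time-covariate DNN term
                nn.Linear(dim_z + 1, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def log_hazard(self, t, x, z):
            # t: (n, 1), x: (n, dim_x), z: (n, dim_z)
            return self.beta(x) + self.g(torch.cat([t, z], dim=-1))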

Updated: 2025-12-11 11:58:42

标题: 灵活的深度神经网络用于部分线性生存数据

摘要: 我们提出了一个灵活的深度神经网络(DNN)框架,用于在部分线性回归结构中建模生存数据。该方法通过对主要兴趣协变量的参数线性部分进行保留解释性,同时非参数DNN组件捕获了干扰变量之间复杂的时间-协变量交互作用。我们将该方法称为FLEXI-Haz,一个具有部分线性结构的灵活危险模型。与现有的用于部分线性Cox模型的DNN方法相比,FLEXI-Haz不依赖于比例风险假设。我们建立了理论保证:基于复合Holder类,神经网络组件实现了极小化最优收敛速率,线性估计器是根n一致的,渐进正态的,并且是半参数有效的。广泛的模拟和实际数据分析表明,FLEXI-Haz提供了线性效应的准确估计,为基于比例风险的现代方法提供了有原则和可解释的替代方案。实施FLEXI-Haz的代码,以及用于重现数据分析和模拟的脚本,可在以下网址获得:https://github.com/AsafBanana/FLEXI-Haz

更新时间: 2025-12-11 11:58:42

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2512.10570v1

NormCode: A Semi-Formal Language for Context-Isolated AI Planning

Multi-step workflows that chain large language model (LLM) calls suffer from context pollution: as information accumulates across steps, models hallucinate, confuse intermediate outputs, and lose track of task constraints. We present NormCode, a semi-formal language for constructing plans of inferences, structured decompositions where each step operates in data isolation and receives only explicitly passed inputs, which eliminates cross-step contamination by design. NormCode enforces a strict separation between semantic operations (LLM-driven reasoning, non-deterministic) and syntactic operations (deterministic data restructuring), enabling precise cost and reliability tracing. The language exists in three isomorphic formats: .ncds for human authoring, .ncd for machine execution, and .ncn for human verification, supporting progressive formalization from sketch to production. We validate NormCode through two demonstrations: (1) a base-X addition algorithm achieving 100 percent accuracy on arbitrary-length inputs, and (2) self-hosted execution of NormCode's own five-phase compiler pipeline. The working orchestrator provides dependency-driven scheduling, SQLite-backed checkpointing, and loop management, making AI workflows auditable by design and addressing a critical need for transparency in high-stakes domains such as legal reasoning, medical decision-making, and financial analysis.
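
Context isolation is mechanically simple to picture: each step names its inputs and receives nothing else. A toy executor in that spirit (the step functions stand in for NormCode's semantic and syntactic operations; this is not the actual orchestrator):

    def run_plan(plan, inputs):
        # plan: list of (output_name, fn, input_names);
        # fn sees ONLY the values named in input_names.
        values = dict(inputs)
        for output_name, fn, input_names in plan:
            isolated = {k: values[k] for k in input_names}   # explicit data passing
            values[output_name] = fn(**isolated)             # no shared context leaks in
        return values

    plan = [
        ("digits", lambda raw: [int(c) for c in raw], ["raw"]),   # syntactic step
        ("total",  lambda digits: sum(digits),        ["digits"]),
    ]
    print(run_plan(plan, {"raw": "1943"})["total"])   # -> 17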

Updated: 2025-12-11 11:50:50

标题: NormCode:一种半正式语言,用于上下文隔离的人工智能规划

摘要: 多步骤工作流链式大型语言模型(LLM)调用遭受上下文污染:随着信息在步骤之间累积,模型会产生幻觉,混淆中间输出,并丢失任务约束的跟踪。我们提出了NormCode,一种半正式语言,用于构建推理计划,结构化分解,其中每个步骤在数据隔离中运行,并仅接收明确传递的输入,从而通过设计消除了跨步骤污染。NormCode强制执行语义操作(LLM驱动的推理,非确定性)和语法操作(确定性数据重组)之间的严格分离,实现精确的成本和可靠性跟踪。该语言以三种同构格式存在:.ncds用于人类编写,.ncd用于机器执行,.ncn用于人工验证,支持从草图到生产的逐步形式化。我们通过两个演示验证了NormCode:(1)一个基本的X加法算法,在任意长度的输入上实现100%的准确性,以及(2)NormCode自己的五阶段编译器流水线的自托管执行。工作协调器提供依赖驱动调度,SQLite支持的检查点,以及循环管理,通过设计使AI工作流程具有审计功能,并满足高风险领域(如法律推理,医疗决策和金融分析)对透明度的关键需求。

更新时间: 2025-12-11 11:50:50

领域: cs.AI

下载: http://arxiv.org/abs/2512.10563v1

Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models

In-context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remain unclear. Causal reasoning demands multi-hop composition and strict conjunctive control, and reliance on spurious lexical relations in the input can produce misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder-decoder architectures are better suited to such multi-hop conjunctive reasoning than decoder-only models. To test this, we compare fine-tuned versions of all the aforementioned architectures with zero- and few-shot ICL in both natural-language and non-natural-language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often over-focusing on irrelevant input features. In particular, decoder-only models are noticeably brittle to distributional shifts, while fine-tuned encoder and encoder-decoder models generalize more robustly across our tests, including the non-natural-language split. Both architectures are only matched or surpassed by decoder-only architectures at large scales. We conclude by noting that for cost-effective, short-horizon, robust causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable.

Updated: 2025-12-11 11:46:48

标题: 因果推理偏向编码器:关于仅解码模型的局限性

摘要: 在语境学习(ICL)支持下,最近大型语言模型(LLMs)取得了进展,尽管它在因果推理中的作用和性能仍不清楚。因果推理需要多跳组合和严格的连接控制,依赖于输入的虚假词汇关系可能会导致误导性结果。我们假设,由于它们能够将输入投影到潜在空间中,编码器和编码器解码器架构更适合进行多跳连接推理,而非仅有解码器模型。为此,我们比较了所有上述架构的经过微调的版本,在自然语言和非自然语言场景下使用零和少量ICL。我们发现,仅仅依靠ICL是不足以进行可靠的因果推理的,往往会过分关注无关的输入特征。特别是,仅有解码器的模型在分布变化时明显脆弱,而微调的编码器和编码器解码器模型可以更稳健地泛化到我们的测试中,包括非自然语言部分。在大规模情况下,这两种架构仅在某些方面与仅有解码器的架构相匹敌或超过。我们总结指出,为了成本效益高、短期内稳健的因果推理,带有有针对性微调的编码器或编码器解码器架构更为可取。

更新时间: 2025-12-11 11:46:48

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2512.10561v1

SWEnergy: An Empirical Study on Energy Efficiency in Agentic Issue Resolution Frameworks with SLMs

Context. LLM-based autonomous agents in software engineering rely on large, proprietary models, limiting local deployment. This has spurred interest in Small Language Models (SLMs), but their practical effectiveness and efficiency within complex agentic frameworks for automated issue resolution remain poorly understood. Goal. We investigate the performance, energy efficiency, and resource consumption of four leading agentic issue resolution frameworks when deliberately constrained to using SLMs. We aim to assess the viability of these systems for this task in resource-limited settings and characterize the resulting trade-offs. Method. We conduct a controlled evaluation of four leading agentic frameworks (SWE-Agent, OpenHands, Mini SWE Agent, AutoCodeRover) using two SLMs (Gemma-3 4B, Qwen-3 1.7B) on the SWE-bench Verified Mini benchmark. On fixed hardware, we measure energy, duration, token usage, and memory over 150 runs per configuration. Results. We find that framework architecture is the primary driver of energy consumption. The most energy-intensive framework, AutoCodeRover (Gemma), consumed 9.4x more energy on average than the least energy-intensive, OpenHands (Gemma). However, this energy is largely wasted. Task resolution rates were near-zero, demonstrating that current frameworks, when paired with SLMs, consume significant energy on unproductive reasoning loops. The SLM's limited reasoning was the bottleneck for success, but the framework's design was the bottleneck for efficiency. Conclusions. Current agentic frameworks, designed for powerful LLMs, fail to operate efficiently with SLMs. We find that framework architecture is the primary driver of energy consumption, but this energy is largely wasted due to the SLMs' limited reasoning. Viable low-energy solutions require shifting from passive orchestration to architectures that actively manage SLM weaknesses.

Updated: 2025-12-11 11:33:34

标题: SWEnergy:基于SLM的主动问题解决框架能效的实证研究

摘要: 背景。基于LLM的软件工程自主代理依赖于大型专有模型,限制了本地部署。这引发了对小语言模型(SLMs)的兴趣,但它们在自动问题解决复杂代理框架中的实际效果和效率仍不为人所了解。 目标。我们调查了四个领先的代理问题解决框架在故意限制使用SLMs时的性能,能源效率和资源消耗。我们的目标是评估这些系统在资源有限的环境中执行此任务的可行性,并描述产生的权衡结果。 方法。我们对四个领先的代理框架(SWE-Agent,OpenHands,Mini SWE Agent,AutoCodeRover)在SWE-bench Verified Mini基准测试中使用两个SLMs(Gemma-3 4B,Qwen-3 1.7B)进行了受控评估。在固定硬件上,我们对每种配置进行了150次运行的能源消耗,持续时间,令牌使用和内存的测量。 结果。我们发现框架架构是能源消耗的主要驱动因素。能源消耗最高的框架AutoCodeRover(Gemma)平均比能源消耗最低的OpenHands(Gemma)高出9.4倍。然而,这种能源在很大程度上是浪费的。任务解决率接近零,表明当前框架在与SLMs配对时,会消耗大量能源用于无效的推理循环。SLM的有限推理成为成功的瓶颈,但框架的设计成为效率的瓶颈。 结论。当前为强大的LLMs设计的代理框架无法有效地与SLMs一起运行。我们发现框架架构是能源消耗的主要驱动因素,但由于SLMs的有限推理,这种能源在很大程度上是浪费的。可行的低能源解决方案需要从被动编排转变为积极管理SLM弱点的架构。

更新时间: 2025-12-11 11:33:34

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2512.09543v2

LLM-Auction: Generative Auction towards LLM-Native Advertising

The rapid advancement of large language models (LLMs) necessitates novel monetization strategies, among which LLM-native advertising has emerged as a promising paradigm by naturally integrating advertisement within LLM-generated responses. However, this paradigm fundamentally shifts the auction object from discrete ad slots to the distribution over LLM outputs, posing new challenges for designing auction mechanisms. Existing mechanisms for LLM-native advertising adopt frameworks that decouple auction and generation, which either ignore externalities or require multiple LLM inferences for ad allocation, rendering them impractical for industrial scenarios. To address these challenges, we propose LLM-Auction, which to the best of our knowledge is the first learning-based generative auction mechanism that integrates auction and LLM generation for LLM-native advertising. By formulating the allocation optimization as a preference alignment problem between LLM outputs and the mechanism's objective which reflects both advertisers' expected value and user experience, we introduce Iterative Reward-Preference Optimization (IRPO) algorithm that alternately optimizes the reward model and the LLM. This approach enables the LLM to inherently model allocation externalities without any extra inference cost. We further identify the allocation monotonicity and continuity of LLM-Auction, which allows us to prove that a simple first-price payment rule exhibits favorable incentive properties. Additionally, we design an LLM-as-a-judge simulation environment to facilitate large-scale data construction and enable comprehensive quantitative evaluation of the mechanism's performance. Extensive quantitative and qualitative experiments demonstrate that LLM-Auction significantly outperforms existing baselines in allocation efficiency, while achieving the desired mechanism properties.

Updated: 2025-12-11 11:31:20

标题: LLM拍卖:面向LLM原生广告的生成式拍卖

摘要: 大型语言模型(LLMs)的快速发展需要新颖的货币化策略,其中LLM本地广告已经成为一种有前途的范式,通过自然地将广告集成到LLM生成的响应中。然而,这种范式从离散广告槽转变为LLM输出的分布,对设计拍卖机制提出了新的挑战。现有的LLM本地广告机制采用将拍卖和生成分离的框架,这要么忽略外部性,要么需要多个LLM推断来进行广告分配,使其在工业场景中不切实际。为了解决这些挑战,我们提出了LLM-Auction,据我们所知,这是第一个基于学习的生成式拍卖机制,将拍卖和LLM生成整合在一起,用于LLM本地广告。通过将分配优化问题建模为LLM输出和机制目标之间的偏好对齐问题,反映了广告商预期价值和用户体验,我们引入了交替优化奖励模型和LLM的迭代奖励-偏好优化(IRPO)算法。这种方法使LLM能够本质上建模分配外部性,而不需要额外的推断成本。我们进一步确定了LLM-Auction的分配单调性和连续性,这使我们能够证明一个简单的一价付款规则具有有利的激励特性。此外,我们设计了一个LLM作为评判者的仿真环境,以促进大规模数据构建,并实现机制性能的全面定量评估。广泛的定量和定性实验证明,LLM-Auction在分配效率上明显优于现有基线,同时实现了期望的机制属性。

更新时间: 2025-12-11 11:31:20

领域: cs.GT,cs.AI,cs.LG

下载: http://arxiv.org/abs/2512.10551v1

Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders

The Key-Value (KV) cache is the primary memory bottleneck in long-context Large Language Models, yet it is typically treated as an opaque numerical tensor. In this work, we propose STA-Attention, a framework that utilizes Top-K Sparse Autoencoders (SAEs) to decompose the KV cache into interpretable "semantic atoms." Unlike standard $L_1$-regularized SAEs, our Top-K approach eliminates shrinkage bias, preserving the precise dot-product geometry required for attention. Our analysis uncovers a fundamental Key-Value Asymmetry: while Key vectors serve as highly sparse routers dominated by a "Semantic Elbow," deep Value vectors carry dense content payloads requiring a larger budget. Based on this structure, we introduce a Dual-Budget Strategy that selectively preserves the most informative semantic components while filtering representational noise. Experiments on Yi-6B, Mistral-7B, Qwen2.5-32B, and others show that our semantic reconstructions maintain perplexity and zero-shot performance comparable to the original models, effectively bridging the gap between mechanistic interpretability and faithful attention modeling.
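
A minimal Top-K SAE over cached Key/Value vectors: only the K largest latent activations survive, so the retained coefficients suffer no $L_1$ shrinkage. Sizes are illustrative; the two budgets below echo the dual-budget idea of a sparse Key code and a denser Value code:

    import torch
    import torch.nn as nn

    class TopKSAE(nn.Module):
        def __init__(self, d_model=128, d_dict=1024, k=16):
            super().__init__()
            self.k = k
            self.enc = nn.Linear(d_model, d_dict)
            self.dec = nn.Linear(d_dict, d_model, bias=False)

        def forward(self, x):                       # x: (batch, d_model) K or V rows
            z = torch.relu(self.enc(x))
            top = torch.topk(z, self.k, dim=-1)     # K strongest semantic atoms
            code = torch.zeros_like(z).scatter_(-1, top.indices, top.values)
            return self.dec(code), code             # reconstruction + sparse code

    key_sae = TopKSAE(k=8)       # sparse "router" budget for Keys
    value_sae = TopKSAE(k=64)    # denser content budget for Values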

Updated: 2025-12-11 11:23:50

标题: 解锁地址簿:通过稀疏自动编码器解剖LLM键值缓存的稀疏语义结构

摘要: 键-值(KV)缓存是长上下文大型语言模型中的主要内存瓶颈,但通常被视为不透明的数值张量。在这项工作中,我们提出了\textbf{STA-Attention},这是一个利用Top-K稀疏自编码器(SAEs)将KV缓存分解为可解释的“语义原子”的框架。与标准的$L_1$正则化SAEs不同,我们的Top-K方法消除了收缩偏差,保留了注意力所需的精确点积几何结构。我们的分析揭示了一个基本的\textbf{键-值不对称性}:虽然键向量作为由“语义弯头”主导的高度稀疏路由器,深度值向量携带密集内容有效载荷,需要更大的预算。基于这种结构,我们引入了一种双预算策略,选择性地保留最具信息量的语义组件,同时过滤代表性噪声。在Yi-6B、Mistral-7B、Qwen2.5-32B等实验中,我们的语义重构维持了与原始模型相当的困惑度和零-shot性能,有效地弥合了机械解释性和忠实注意力建模之间的差距。

更新时间: 2025-12-11 11:23:50

领域: cs.LG

下载: http://arxiv.org/abs/2512.10547v1

LibIQ: Toward Real-Time Spectrum Classification in O-RAN dApps

The O-RAN architecture is transforming cellular networks by adopting RAN softwarization and disaggregation concepts to enable data-driven monitoring and control of the network. Such management is enabled by RICs, which facilitate near-real-time and non-real-time network control through xApps and rApps. However, they face limitations, including latency overhead in data exchange between the RAN and RIC, restricting real-time monitoring, and the inability to access user plane data due to privacy and security constraints, hindering use cases like beamforming and spectrum classification. In this paper, we leverage the dApps concept to enable real-time RF spectrum classification with LibIQ, a novel library for RF signals that facilitates efficient spectrum monitoring and signal classification by providing functionalities to read I/Q samples as time-series, create datasets and visualize time-series data through plots and spectrograms. Thanks to LibIQ, I/Q samples can be efficiently processed to detect external RF signals, which are subsequently classified using a CNN inside the library. To achieve accurate spectrum analysis, we created an extensive dataset of time-series-based I/Q samples, representing distinct signal types captured using a custom dApp running on a 5G deployment over the Colosseum network emulator and an OTA testbed. We evaluate our model by deploying LibIQ in heterogeneous scenarios with varying center frequencies, time windows, and external RF signals. In real-time analysis, the model classifies the processed I/Q samples, achieving an average accuracy of approximately 97.8% in identifying signal types across all scenarios. We pledge to release both LibIQ and the dataset created as a publicly available framework upon acceptance.
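
A minimal sketch of the I/Q handling described above: interleaved samples are read as a complex time series and converted to a spectrogram for the CNN. The int16 interleaved file format and the frame sizes are assumptions for illustration, not LibIQ's actual API:

    import numpy as np

    def read_iq(path):
        raw = np.fromfile(path, dtype=np.int16).astype(np.float32)
        return raw[0::2] + 1j * raw[1::2]           # I + jQ complex time series

    def spectrogram(iq, n_fft=256, hop=128):
        frames = [iq[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(iq) - n_fft, hop)]
        return np.abs(np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1))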

Updated: 2025-12-11 11:18:01

标题: LibIQ:面向O-RAN dApps的实时频谱分类

摘要: O-RAN架构通过采用RAN软件化和分解概念,改变了蜂窝网络,从而实现了对网络的数据驱动监控和控制。这种管理是通过RICs实现的,它们通过xApps和rApps实现了近实时和非实时的网络控制。然而,它们面临一些限制,包括在RAN和RIC之间的数据交换中的延迟开销,限制了实时监控,并且由于隐私和安全约束,无法访问用户面数据,从而阻碍了诸如波束成形和频谱分类等用例。在本文中,我们利用dApps概念,借助LibIQ实现了实时RF频谱分类,这是一个用于RF信号的新型库,通过提供读取I/Q样本作为时间序列、创建数据集并通过绘图和频谱图可视化时间序列数据的功能,促进了高效的频谱监测和信号分类。由于LibIQ,I/Q样本可以被高效地处理以检测外部RF信号,然后使用库内的CNN对其进行分类。为了实现准确的频谱分析,我们在Colosseum网络仿真器和OTA测试平台上运行的自定义dApp上创建了基于时间序列的I/Q样本的大量数据集,代表不同信号类型。我们通过在具有不同中心频率、时间窗口和外部RF信号的异构场景中部署LibIQ来评估我们的模型。在实时分析中,该模型对处理的I/Q样本进行分类,实现了在所有场景中识别信号类型的平均准确率约为97.8%。我们承诺在接受后发布LibIQ和创建的数据集作为一个公开可用的框架。

更新时间: 2025-12-11 11:18:01

领域: cs.NI,cs.AI

下载: http://arxiv.org/abs/2505.10537v3

IRG: Modular Synthetic Relational Database Generation with Complex Relational Schemas

Relational databases (RDBs) are widely used by corporations and governments to store multiple related tables. Their relational schemas pose unique challenges to synthetic data generation for privacy-preserving data sharing, e.g., for collaborative analytical and data mining tasks, as well as software testing at various scales. Relational schemas typically include a set of primary and foreign key constraints to specify the intra- and inter-table entity relations, which also imply crucial intra- and inter-table data correlations in the RDBs. Existing synthetic RDB generation approaches often focus on the relatively simple and basic parent-child relations, failing to address ubiquitous real-world complexities in relational schemas: key constraints such as composite keys, intra-table correlations such as sequential correlation, and inter-table correlations such as indirectly connected tables. In this paper, we introduce the incremental relational generator (IRG), a modular framework designed to handle these real-world challenges. In IRG, each table is generated by learning context from a depth-first traversal of relational connections to capture indirect inter-table relationships, and different parts of a table are constructed through several classical generative and predictive modules to preserve complex key constraints and data correlations. Compared to 3 prior-art algorithms across 10 real-world RDB datasets, IRG successfully handles the relational schemas and captures critical data relationships for all datasets, which prior works cannot. The generated synthetic data also demonstrates better fidelity and utility than prior works, implying greater potential as a basis for analytical tasks and data mining applications. Code is available at: https://github.com/li-jiayu-ljy/irg.

Updated: 2025-12-11 11:16:48

标题: IRG:具有复杂关系模式的模块化合成关系数据库生成

摘要: 关系数据库(RDBs)被广泛应用于企业和政府,用于存储多个相关表。它们的关系模式对于隐私保护数据共享的合成数据生成提出了独特挑战,例如用于协作分析和数据挖掘任务,以及各种规模的软件测试。关系模式通常包括一组主键和外键约束,用于指定表内和表间实体关系,这也意味着RDBs中关键的表内和表间数据相关性。现有的合成RDB生成方法通常侧重于相对简单和基本的父子关系,未能解决关系模式中的普遍现实世界复杂性,如复合键、表内相关性(如顺序相关性)和表间数据相关性(如间接连接表)。在本文中,我们介绍了增量关系生成器(IRG),这是一个设计用于处理这些现实世界挑战的模块化框架。在IRG中,通过学习深度优先遍历关系连接来生成每个表,以捕获间接的表间关系,并通过几个经典的生成和预测模块构建表的不同部分,以保留复杂的键约束和数据相关性。与10个真实世界的RDB数据集上的3种现有算法相比,IRG成功处理了关系模式并捕获了所有数据集的关键数据关系,而以前的工作则无法做到。生成的合成数据还表现出比以前的工作更好的忠实度和实用性,暗示其作为替代基础用于分析任务和数据挖掘应用的潜力更高。代码可在以下网址找到:https://github.com/li-jiayu-ljy/irg。

更新时间: 2025-12-11 11:16:48

领域: cs.DB,cs.LG

下载: http://arxiv.org/abs/2312.15187v3

Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning

Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions. We will release the model, data, and symbolic engine to support future research.

Updated: 2025-12-11 11:05:04

标题: 通过复杂度提升强化学习实现奥林匹亚级几何大型语言模型代理

摘要: 大型语言模型(LLM)代理展示了强大的数学问题解决能力,甚至可以在形式证明系统的帮助下解决国际数学奥林匹克(IMO)水平的问题。然而,由于辅助构造的启发式较弱,人工智能在几何问题解决方面仍然被专家模型(如AlphaGeometry 2)所主导,这些模型在训练和评估中都严重依赖大规模数据合成和搜索。在这项工作中,我们首次尝试构建一个几何奖牌级别的LLM代理并呈现InternGeometry。InternGeometry通过迭代提出命题和辅助构造,用符号引擎验证它们,并反思引擎的反馈以指导后续提议,克服了几何中的启发式限制。一种动态记忆机制使InternGeometry能够在每个问题上与符号引擎进行超过两百次的互动。为了进一步加速学习,我们引入了复杂性增强强化学习(CBRL),逐渐增加了在训练阶段合成问题的复杂性。建立在InternThinker-32B基础上,InternGeometry解决了50个IMO几何问题(2000-2024年)中的44个,超过了平均金牌得分(40.9),仅使用了13K个训练示例,仅为AlphaGeometry 2使用数据的0.004%,展示了LLM代理在专家级几何任务上的潜力。InternGeometry还可以为IMO问题提出新颖的辅助构造,这些构造在人类解决方案中并不存在。我们将发布模型、数据和符号引擎以支持未来研究。

更新时间: 2025-12-11 11:05:04

领域: cs.AI

下载: http://arxiv.org/abs/2512.10534v1

LLMs in Interpreting Legal Documents

This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU's AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.

Updated: 2025-12-11 11:01:35

标题: LLMs在解释法律文件中的应用

摘要: 本章探讨了大型语言模型在法律领域的应用,展示了它们通过分析可能的用例,如协助解释法规、合同和案例法,增强法律摘要、合同谈判和信息检索的清晰度,优化和增强传统法律任务的潜力。应用这种技术可能会面临一些挑战,如算法单一文化、幻觉和遵守现有法规,包括欧盟的AI法案和最近美国的倡议,以及中国新兴的方法。此外,文中还介绍了两种不同的基准测试。

更新时间: 2025-12-11 11:01:35

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.09830v2

Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning

Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations by a bidirectional cross-attention to learn contextual information for action generation, successfully overcoming the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.

Updated: 2025-12-11 10:52:27

标题: 连续视觉-语言-动作协同学习,通过语义-物理对齐进行行为克隆

摘要: 语言条件操纵通过行为克隆(BC)促进了人机交互,BC从人类示范中学习控制策略,是具有身体的人工智能的基石。克服连续动作决策中的复合错误仍然是改善BC性能的一个核心挑战。现有方法通过数据增强、表达形式或时间抽象来减轻复合错误,然而它们受到物理不连续性和语义-物理错位的影响,导致动作克隆不准确和间歇执行。本文提出了一种新颖的BC框架Continuous vision-language-action Co-Learning with Semantic-Physical Alignment(CCoL),确保时间上的一致执行和精细的语义基础。它通过连续的视觉、语言和本体感输入(例如机器人内部状态)之间的持续共同学习生成强大而平滑的动作执行轨迹。与此同时,我们通过双向交叉注意力将语言语义锚定到视觉运动表示中,学习动作生成的上下文信息,成功地克服了语义-物理错位问题。大量实验表明,CCoL在三个模拟套件中实现了平均8.0%的相对改进,双手示范的双手插入任务相对增益高达19.2%。在一个7自由度机器人上的真实世界测试进一步证实了CCoL在未见过和嘈杂的物体状态下的泛化能力。

更新时间: 2025-12-11 10:52:27

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2511.14396v2

Mode-Seeking for Inverse Problems with Diffusion Models

A pre-trained unconditional diffusion model, combined with posterior sampling or maximum a posteriori (MAP) estimation techniques, can solve arbitrary inverse problems without task-specific training or fine-tuning. However, existing posterior sampling and MAP estimation methods often rely on modeling approximations and can be computationally demanding. In this work, we propose the variational mode-seeking loss (VML), which, when minimized during each reverse diffusion step, guides the generated sample towards the MAP estimate. VML arises from a novel perspective of minimizing the Kullback-Leibler (KL) divergence between the diffusion posterior $p(\mathbf{x}_0|\mathbf{x}_t)$ and the measurement posterior $p(\mathbf{x}_0|\mathbf{y})$, where $\mathbf{y}$ denotes the measurement. Importantly, for linear inverse problems, VML can be analytically derived and need not be approximated. Based on further theoretical insights, we propose VML-MAP, an empirically effective algorithm for solving inverse problems, and validate its efficacy over existing methods in both performance and computational time, through extensive experiments on diverse image-restoration tasks across multiple datasets.
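
To see why this KL admits a closed form for linear problems, one can expand the measurement posterior with Bayes' rule. The following is a sketch in the abstract's notation; the Gaussian likelihood is the usual linear-inverse-problem assumption $\mathbf{y} = \mathbf{A}\mathbf{x}_0 + \text{noise}$, not a detail stated in the abstract itself:

    \mathrm{VML}(\mathbf{x}_t)
      = \mathrm{KL}\!\left( p(\mathbf{x}_0 \mid \mathbf{x}_t) \,\|\, p(\mathbf{x}_0 \mid \mathbf{y}) \right)
      = \mathbb{E}_{p(\mathbf{x}_0 \mid \mathbf{x}_t)}\!\left[
          \log p(\mathbf{x}_0 \mid \mathbf{x}_t)
          - \log p(\mathbf{x}_0)
          - \log p(\mathbf{y} \mid \mathbf{x}_0) \right] + \log p(\mathbf{y})

where $p(\mathbf{y} \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{y}; \mathbf{A}\mathbf{x}_0, \sigma^2 \mathbf{I})$, so the likelihood term is an explicit quadratic in $\mathbf{x}_0$ and the constant $\log p(\mathbf{y})$ drops out of the minimization.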

Updated: 2025-12-11 10:51:34

标题: 使用扩散模型进行逆问题的模式寻找

摘要: 一个预先训练的无条件扩散模型,结合后验采样或最大后验(MAP)估计技术,可以解决任意逆问题,无需特定任务的训练或微调。然而,现有的后验采样和MAP估计方法通常依赖于建模近似,并且可能需要大量计算。在这项工作中,我们提出了变分模式寻找损失(VML),在每个反向扩散步骤中最小化该损失,可以将生成的样本引导到MAP估计。VML源于一种新颖的视角,即最小化扩散后验$p(\mathbf{x}_0|\mathbf{x}_t)$与测量后验$p(\mathbf{x}_0|\mathbf{y})$之间的Kullback-Leibler(KL)散度,其中$\mathbf{y}$表示测量。重要的是,对于线性逆问题,VML可以进行解析推导,无需近似。基于进一步的理论洞察,我们提出了VML-MAP,这是一个在解决逆问题方面具有实证效果的算法,并通过对多个数据集上不同图像恢复任务的广泛实验,验证了其在性能和计算时间上的有效性,超过了现有方法。

更新时间: 2025-12-11 10:51:34

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2512.10524v1

Disentangled and Distilled Encoder for Out-of-Distribution Reasoning with Rademacher Guarantees

Recently, the disentangled latent space of a variational autoencoder (VAE) has been used to reason about multi-label out-of-distribution (OOD) test samples that are derived from different distributions than training samples. Disentangled latent space means having one-to-many maps between latent dimensions and generative factors or important characteristics of an image. This paper proposes a disentangled distilled encoder (DDE) framework to decrease the OOD reasoner size for deployment on resource-constrained devices while preserving disentanglement. DDE formalizes student-teacher distillation for model compression as a constrained optimization problem while preserving disentanglement with disentanglement constraints. Theoretical guarantees for disentanglement during distillation based on Rademacher complexity are established. The approach is evaluated empirically by deploying the compressed model on an NVIDIA

Updated: 2025-12-11 10:47:38

标题: 解耦蒸馏编码器:具有Rademacher保证的分布外推理

摘要: 最近,变分自动编码器(VAE)的解耦潜在空间已被用于推理多标签分布外(OOD)测试样本,这些样本来自与训练样本不同的分布。解耦的潜在空间意味着潜在维度与图像的生成因子或重要特征之间存在一对多的映射。本文提出了一种解耦蒸馏编码器(DDE)框架,在保持解耦性的同时减小OOD推理器的规模,以便部署在资源受限的设备上。DDE将用于模型压缩的师生蒸馏形式化为带有解耦约束的约束优化问题,从而在蒸馏过程中保持解耦性。基于Rademacher复杂度,建立了蒸馏过程中解耦性的理论保证。该方法通过在NVIDIA上部署压缩模型进行了实证评估。

更新时间: 2025-12-11 10:47:38

领域: cs.LG

下载: http://arxiv.org/abs/2512.10522v1

Beyond Log-Concavity and Score Regularity: Improved Convergence Bounds for Score-Based Generative Models in W2-distance

Score-based Generative Models (SGMs) aim to sample from a target distribution by learning score functions using samples perturbed by Gaussian noise. Existing convergence bounds for SGMs in the W2-distance rely on stringent assumptions about the data distribution. In this work, we present a novel framework for analyzing W2-convergence in SGMs, significantly relaxing traditional assumptions such as log-concavity and score regularity. Leveraging the regularization properties of the Ornstein--Uhlenbeck (OU) process, we show that weak log-concavity of the data distribution evolves into log-concavity over time. This transition is rigorously quantified through a PDE-based analysis of the Hamilton--Jacobi--Bellman equation governing the log-density of the forward process. Moreover, we establish that the drift of the time-reversed OU process alternates between contractive and non-contractive regimes, reflecting the dynamics of concavity. Our approach circumvents the need for stringent regularity conditions on the score function and its estimators, relying instead on milder, more practical assumptions. We demonstrate the wide applicability of this framework through explicit computations on Gaussian mixture models, illustrating its versatility and potential for broader classes of data distributions.

Updated: 2025-12-11 10:42:29

标题: 超越对数凹性和分数正则性:改进W2距离下基于分数的生成模型的收敛界限

摘要: 得分基于生成模型(SGMs)旨在通过学习使用受高斯噪声扰动的样本的得分函数来从目标分布中进行采样。现有的针对W2距离中SGMs的收敛性界限依赖于对数据分布的严格假设。在本工作中,我们提出了一个新颖的框架,用于分析SGMs中的W2收敛性,显著放宽了传统假设,如对数凹性和得分正则性。利用Ornstein-Uhlenbeck(OU)过程的正则化特性,我们表明数据分布的弱对数凹性随时间演变为对数凹性。通过对控制前向过程的对数密度的Hamilton-Jacobi-Bellman方程进行基于PDE的分析,这种转变得到了严格的量化。此外,我们建立了时间反转的OU过程的漂移在收缩和非收缩状态之间交替变化,反映了凹性的动态。我们的方法避免了对得分函数及其估计量进行严格的正则性条件,而是依赖于更温和、更实用的假设。通过对高斯混合模型进行明确计算,我们展示了这一框架的广泛适用性,展示了其灵活性和适用于更广泛数据分布类别的潜力。

更新时间: 2025-12-11 10:42:29

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2501.02298v5

Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning

Offline-to-Online Reinforcement Learning (O2O RL) faces a critical dilemma in balancing the use of a fixed offline dataset with newly collected online experiences. Standard methods, often relying on a fixed data-mixing ratio, struggle to manage the trade-off between early learning stability and asymptotic performance. To overcome this, we introduce the Adaptive Replay Buffer (ARB), a novel approach that dynamically prioritizes data sampling based on a lightweight metric we call 'on-policyness'. Unlike prior methods that rely on complex learning procedures or fixed ratios, ARB is designed to be learning-free and simple to implement, seamlessly integrating into existing O2O RL algorithms. It assesses how closely collected trajectories align with the current policy's behavior and assigns a proportional sampling weight to each transition within that trajectory. This strategy effectively leverages offline data for initial stability while progressively focusing learning on the most relevant, high-rewarding online experiences. Our extensive experiments on D4RL benchmarks demonstrate that ARB consistently mitigates early performance degradation and significantly improves the final performance of various O2O RL algorithms, highlighting the importance of an adaptive, behavior-aware replay buffer design.
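
A sketch of a behavior-aware replay weight in the spirit of 'on-policyness': each trajectory is scored by how likely the current policy is to reproduce its actions, and every transition inherits a proportional sampling weight. The exact metric in ARB may differ; this is an illustrative stand-in:

    import numpy as np

    def transition_weights(trajectories, policy_logprob, temp=1.0):
        # trajectories: list of trajectories, each a list of (state, action) pairs;
        # policy_logprob(s, a) is the current policy's log-likelihood of action a.
        scores = np.array([
            np.mean([policy_logprob(s, a) for s, a in traj]) / temp
            for traj in trajectories])
        w = np.exp(scores - scores.max())            # softmax over trajectories
        w /= w.sum()
        # every transition inherits its trajectory's weight
        return [wi for wi, traj in zip(w, trajectories) for _ in traj]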

Updated: 2025-12-11 10:30:04

标题: 自适应重播缓冲区用于离线到在线强化学习

摘要: 离线到在线强化学习(O2O RL)面临着一个关键困境,即如何平衡使用固定的离线数据集和新收集的在线经验。标准方法通常依赖于固定的数据混合比率,很难在早期学习稳定性和渐近性能之间取得平衡。为了克服这一困境,我们引入了自适应回放缓冲区(ARB),这是一种基于我们称之为“在线策略性”的轻量级度量标准动态地优先考虑数据采样的新方法。与依赖复杂学习程序或固定比率的先前方法不同,ARB旨在无需学习并且易于实现,可以无缝地集成到现有的O2O RL算法中。它评估了收集的轨迹与当前策略行为的契合程度,并为每个轨迹中的每个转换分配一个比例采样权重。这一策略有效地利用离线数据实现了初始稳定性,同时逐渐将学习重点放在最相关且高奖励的在线经验上。我们在D4RL基准测试上进行的大量实验证明,ARB能够持续减轻早期性能下降,并显著提高各种O2O RL算法的最终性能,突显了自适应、行为感知回放缓冲区设计的重要性。

更新时间: 2025-12-11 10:30:04

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.10510v1

Hyperspectral Image Data Reduction for Endmember Extraction

Endmember extraction from hyperspectral images aims to identify the spectral signatures of materials present in a scene. Recent studies have shown that self-dictionary methods can achieve high extraction accuracy; however, their high computational cost limits their applicability to large-scale hyperspectral images. Although several approaches have been proposed to mitigate this issue, it remains a major challenge. Motivated by this situation, this paper pursues a data reduction approach. Assuming that the hyperspectral image follows the linear mixing model with the pure-pixel assumption, we develop a data reduction technique that removes pixels that do not contain endmembers. We analyze the theoretical properties of this reduction step and show that it preserves pixels that lie close to the endmembers. Building on this result, we propose a data-reduced self-dictionary method that integrates the data reduction with a self-dictionary method based on a linear programming formulation. Numerical experiments demonstrate that the proposed method can substantially reduce the computational time of the original self-dictionary method without sacrificing endmember extraction accuracy.
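
One way to picture such a reduction step (an illustrative screening rule, not the paper's exact criterion): under the pure-pixel assumption the endmembers are vertices of the convex hull of the pixel cloud, and a vertex maximizes some linear functional, so pixels that are never near-extremal along many random directions are unlikely to be endmembers and can be dropped before the LP-based self-dictionary step:

    import numpy as np

    def screen_pixels(X, n_dirs=500, top=5, seed=0):
        # X: (n_pixels, n_bands), one spectrum per row
        rng = np.random.default_rng(seed)
        keep = set()
        for _ in range(n_dirs):
            direction = rng.standard_normal(X.shape[1])
            proj = X @ direction
            keep.update(np.argsort(proj)[-top:].tolist())   # extremal pixels survive
        return np.array(sorted(keep))                       # candidate endmember pixels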

Updated: 2025-12-11 10:27:40

标题: 高光谱图像数据降维用于端元提取

摘要: 高光谱图像的端元提取旨在识别场景中存在的材料的光谱特征。最近的研究表明,自字典方法可以实现较高的提取准确性; 但是,它们的高计算成本限制了它们对大规模高光谱图像的适用性。尽管已经提出了几种方法来缓解这个问题,但它仍然是一个重大挑战。受到这种情况的启发,本文采用了一种数据减少的方法。假设高光谱图像遵循线性混合模型并具有纯像素假设,我们开发了一种数据减少技术,用于移除不包含端元的像素。我们分析了这种减少步骤的理论特性,并展示了它保留了接近端元的像素。基于这一结果,我们提出了一种数据减少的自字典方法,将数据减少与基于线性规划公式的自字典方法相结合。数值实验表明,所提出的方法可以显著减少原始自字典方法的计算时间,而不损害端元提取的准确性。

更新时间: 2025-12-11 10:27:40

领域: eess.IV,cs.LG,eess.SP

下载: http://arxiv.org/abs/2512.10506v1

V-VAE: A Variational Auto Encoding Framework Towards Fine-Grained Control over Human-Like Chat

With the continued proliferation of Large Language Model (LLM) based chatbots, there is a growing demand for generating responses that are not only linguistically fluent but also consistently aligned with persona-specific traits in conversations. However, existing role-play and persona-based chat approaches rely heavily on static role descriptions, a coarse-grained signal space, and low-quality synthetic data, which fail to capture dynamic fine-grained details in human-like chat. Human-like chat requires modeling subtle latent traits, such as emotional tone, situational awareness, and evolving personality, which are difficult to predefine and cannot be easily learned from synthetic or distillation-based data. To address these limitations, we propose a Verbal Variational Auto-Encoding (V-VAE) framework, containing a variational auto-encoding module and a fine-grained control space that dynamically adapts dialogue behaviour based on fine-grained, interpretable latent variables across talking style, interaction patterns, and personal attributes. We also construct a high-quality dataset, HumanChatData, and a benchmark, HumanChatBench, to address the scarcity of high-quality data in the human-like domain. Experiments show that LLMs based on V-VAE consistently outperform standard baselines on HumanChatBench and DialogBench, which further demonstrates the effectiveness of V-VAE and HumanChatData.

Updated: 2025-12-11 10:22:55

标题: V-VAE:一种变分自动编码框架,实现对类人聊天的精细控制

摘要: 随着基于大型语言模型(LLM)的聊天机器人的不断增加,人们越来越需要生成不仅在语言流畅性上表现出色,而且在对话中也与特定人物特征一致的响应。然而,现有的角色扮演和基于人物的聊天方法主要依赖于静态角色描述、粗粒度信号空间和低质量的合成数据,这些方法无法捕捉类似于人类对话中的动态细节。人类对话需要对微妙的潜在特征进行建模,如情感色调、情景意识和不断变化的个性,这些特征难以预定义,也无法轻松从合成或蒸馏数据中学习。为了解决这些限制,我们提出了一种语言变分自动编码(V-VAE)框架,包含一个变分自动编码模块和细粒度控制空间,根据细粒度、可解释的潜在变量在话语风格、互动模式和个人属性上动态调整对话行为。我们还构建了一个高质量的数据集HumanChatData,并使用HumanChatBench进行基准测试,以解决人类对话领域高质量数据的稀缺问题。实验证明,基于V-VAE的LLM在HumanChatBench和DialogBench上始终优于标准基线,进一步证明了V-VAE和HumanChatData的有效性。

更新时间: 2025-12-11 10:22:55

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2506.01524v2

Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation

Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interface for PCG tools, off-the-shelf models often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications. Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters and refines configurations to progressively align with human design preferences. We validate this approach on the generation of various 3D maps, establishing a new benchmark for instruction-following in PCG. Experiments demonstrate that our approach outperforms single-agent baselines, producing diverse and structurally valid environments from natural language descriptions. These results demonstrate that off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools. By shifting the burden from model training to architectural reasoning, our method offers a scalable framework for mastering complex software without task-specific fine-tuning.
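
The Actor/Critic loop is straightforward to express; the sketch below assumes a hypothetical llm() chat-completion callable and a JSON parameter schema, neither of which is specified by the paper:

    import json

    def configure_pcg(llm, instruction, schema, rounds=3):
        """llm: any callable prompt -> text (hypothetical interface).
        schema: dict describing the PCG tool's parameters."""
        # Actor: propose an initial configuration from the user request.
        params = json.loads(llm(
            f"User wants: {instruction}\nPCG parameter schema: {json.dumps(schema)}\n"
            "Return a JSON object assigning a value to every parameter."))
        for _ in range(rounds):
            # Critic: compare the proposal against the stated intent.
            critique = llm(
                f"User wants: {instruction}\nProposed parameters: {json.dumps(params)}\n"
                "List concrete mismatches with the request, or reply exactly OK.")
            if critique.strip() == "OK":
                break
            # Actor: revise the configuration using the critique.
            params = json.loads(llm(
                f"Revise {json.dumps(params)} to address: {critique}\nReturn JSON only."))
        return params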

Updated: 2025-12-11 10:22:02

标题: 无需训练的LLM代理生成3D地图:一种用于程序内容生成的双代理架构

摘要: 程序内容生成(PCG)提供了可扩展的方法,用于通过算法创建复杂且可定制的世界。然而,控制这些流程需要精确配置不透明的技术参数。我们提出了一种无需训练的架构,利用LLM代理进行零射击PCG参数配置。尽管大型语言模型(LLMs)承诺为PCG工具提供自然语言界面,但现成的模型通常无法弥合抽象用户指令和严格参数规格之间的语义差距。我们的系统配对了一个演员代理和一个评论代理,实现了一个迭代工作流程,其中系统自主地对工具参数进行推理,并不断优化配置,以逐渐与人类设计偏好相一致。我们在生成各种3D地图方面验证了这种方法,为PCG中的指令遵循建立了一个新的基准。实验表明,我们的方法优于单一代理基线,能够从自然语言描述中产生多样且结构有效的环境。这些结果表明,现成的LLMs可以有效地重新用作任意PCG工具的通用代理。通过将负担从模型训练转移到架构推理,我们的方法为掌握复杂软件提供了可扩展的框架,而无需特定任务的微调。

更新时间: 2025-12-11 10:22:02

领域: cs.AI

下载: http://arxiv.org/abs/2512.10501v1

Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections

We address key challenges in Dataset Aggregation (DAgger) for real-world contact-rich manipulation: how to collect informative human correction data and how to effectively update policies with this new data. We introduce Compliant Residual DAgger (CR-DAgger), which contains two novel components: 1) a Compliant Intervention Interface that leverages compliance control, allowing humans to provide gentle, accurate delta action corrections without interrupting the ongoing robot policy execution; and 2) a Compliant Residual Policy formulation that learns from human corrections while incorporating force feedback and force control. Our system significantly enhances performance on precise contact-rich manipulation tasks using minimal correction data, improving base policy success rates by over 50% on two challenging tasks (book flipping and belt assembly) while outperforming both retraining-from-scratch and finetuning approaches. Through extensive real-world experiments, we provide practical guidance for implementing effective DAgger in real-world robot learning tasks. Result videos are available at: https://compliant-residual-dagger.github.io/
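
As a rough illustration of the residual-policy idea (names, dimensions, and the bounded-correction design are assumptions, not the paper's architecture):

    import torch
    import torch.nn as nn

    class CompliantResidualPolicy(nn.Module):
        """Predicts a small delta action from observation plus force feedback;
        the executed command is base action + residual."""
        def __init__(self, obs_dim, force_dim, act_dim, hidden=128, max_delta=0.05):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + force_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim), nn.Tanh())
            self.max_delta = max_delta  # keeps corrections gentle, as in the interface

        def forward(self, obs, force, base_action):
            x = torch.cat([obs, force, base_action], dim=-1)
            return base_action + self.max_delta * self.net(x)  # bounded correction

Training would fit the residual network to the human delta corrections collected through the compliant interface, while the base policy stays frozen.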

Updated: 2025-12-11 10:15:43

标题: 《顺应性残余DAgger:通过人类纠正改进真实世界中接触丰富的操作》

摘要: 我们解决了实际接触丰富操作中数据集聚合(DAgger)面临的关键挑战:如何收集信息丰富的人类校正数据以及如何有效地利用这些新数据更新策略。我们引入了一种名为CR-DAgger的合规残差DAgger,它包含两个新颖的组件:1)一种合规干预界面,利用合规控制,允许人类提供温和、准确的增量动作校正,而不会中断正在进行的机器人策略执行;和2)一种合规残差策略制定,从人类校正中学习,并结合力反馈和力控制。我们的系统通过使用最少的校正数据显著提升了精确的接触丰富操作任务的性能,在两个具有挑战性的任务(翻书和皮带组装)上,将基础策略成功率提高了50\%以上,同时优于从头开始重新训练和微调方法。通过大量的实际实验,我们提供了实施实际机器人学习任务中有效DAgger的实用指导。相关结果视频可在以下网址查看:https://compliant-residual-dagger.github.io/

更新时间: 2025-12-11 10:15:43

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2506.16685v4

UACER: An Uncertainty-Aware Critic Ensemble Framework for Robust Adversarial Reinforcement Learning

Robust adversarial reinforcement learning has emerged as an effective paradigm for training agents to handle uncertain disturbance in real environments, with critical applications in sequential decision-making domains such as autonomous driving and robotic control. Within this paradigm, agent training is typically formulated as a zero-sum Markov game between a protagonist and an adversary to enhance policy robustness. However, the trainable nature of the adversary inevitably induces non-stationarity in the learning dynamics, leading to exacerbated training instability and convergence difficulties, particularly in high-dimensional complex environments. In this paper, we propose a novel approach, Uncertainty-Aware Critic Ensemble for robust adversarial Reinforcement learning (UACER), which consists of two strategies: 1) Diversified critic ensemble: a diverse set of K critic networks is used in parallel, in place of the conventional single-critic architecture, to stabilize Q-value estimation, reducing variance and enhancing robustness. 2) Time-varying Decay Uncertainty (TDU) mechanism: advancing beyond simple linear combinations, we develop a variance-derived Q-value aggregation strategy that explicitly incorporates epistemic uncertainty to dynamically regulate the exploration-exploitation trade-off while simultaneously stabilizing the training process. Comprehensive experiments across several MuJoCo control problems validate the superior effectiveness of UACER, outperforming state-of-the-art methods in terms of overall performance, stability, and efficiency.
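
One plausible reading of the variance-derived aggregation with time-varying decay, sketched for a single state-action pair (the linear decay schedule is an assumption):

    import numpy as np

    def tdu_q_value(q_estimates, step, total_steps, beta0=1.0):
        """q_estimates: array of shape (K,) from K critics for one (s, a).
        The spread across critics serves as an epistemic-uncertainty proxy;
        its weight decays over training, encouraging pessimism early and
        near-plain averaging late."""
        beta = beta0 * (1.0 - step / total_steps)   # assumed linear decay
        return q_estimates.mean() - beta * q_estimates.std()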

Updated: 2025-12-11 10:14:13

标题: UACER:一种用于稳健对抗强化学习的不确定性感知评论家集成框架

摘要: Robust adversarial reinforcement learning has emerged as an effective method for training agents to deal with uncertain disturbances in real-world environments, particularly in areas such as autonomous driving and robotic control. However, the trainable nature of the adversary can lead to non-stationarity in learning dynamics, causing training instability and convergence issues, especially in complex environments. In this paper, we introduce a novel approach called Uncertainty-Aware Critic Ensemble for robust adversarial Reinforcement learning (UACER). This approach utilizes a diverse set of critic networks in parallel to stabilize Q-value estimation and incorporates a Time-varying Decay Uncertainty (TDU) mechanism to dynamically regulate exploration-exploitation trade-off. Experiments conducted on various MuJoCo control problems demonstrate the superior effectiveness of UACER compared to existing methods in terms of performance, stability, and efficiency.

更新时间: 2025-12-11 10:14:13

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.10492v1

LLM-Assisted AHP for Explainable Cyber Range Evaluation

Cyber Ranges (CRs) have emerged as prominent platforms for cybersecurity training and education, especially for Critical Infrastructure (CI) sectors that face rising cyber threats. One way to address these threats is through hands-on exercises that bridge IT and OT domains to improve defensive readiness. However, consistently evaluating whether a CR platform is suitable and effective remains a challenge. This paper proposes an evaluation framework for CRs, emphasizing mission-critical settings by using a multi-criteria decision-making approach. We define a set of evaluation criteria that capture technical fidelity, training and assessment capabilities, scalability, usability, and other relevant factors. To weight and aggregate these criteria, we employ the Analytic Hierarchy Process (AHP), supported by a simulated panel of multidisciplinary experts implemented through a Large Language Model (LLM). This LLM-assisted expert reasoning enables consistent and reproducible pairwise comparisons across criteria without requiring direct expert convening. The framework outputs quantitative scores that facilitate objective comparison of CR platforms and highlight areas for improvement. Overall, this work lays the foundation for a standardized and explainable evaluation methodology to guide both providers and end-users of CRs.
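
For reference, the AHP step itself is compact: given a pairwise-comparison matrix (whether elicited from humans or from LLM-simulated experts), weights follow from the row geometric means, and a consistency ratio flags contradictory judgments. A minimal sketch:

    import numpy as np

    RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41}

    def ahp_weights(P):
        """P: n x n positive reciprocal pairwise-comparison matrix
        (P[i, j] = how much more important criterion i is than j)."""
        P = np.asarray(P, dtype=float)
        n = P.shape[0]
        w = np.prod(P, axis=1) ** (1.0 / n)     # row geometric means
        w /= w.sum()
        lam_max = (P @ w / w).mean()            # principal-eigenvalue estimate
        ci = (lam_max - n) / (n - 1)            # consistency index
        cr = ci / RI[n]                         # consistency ratio (< 0.1 is acceptable)
        return w, cr

    # Illustrative example: fidelity judged twice as important as usability, etc.
    w, cr = ahp_weights([[1, 2, 4], [1/2, 1, 2], [1/4, 1/2, 1]])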

Updated: 2025-12-11 10:07:15

标题: LLM辅助的AHP用于可解释的网络安全范围评估

摘要: 网络范围(CRs)已成为网络安全培训和教育的重要平台,尤其是面临不断增加的网络威胁的关键基础设施(CI)部门。解决这些威胁的一种方法是通过实践演习,将IT和OT领域联系起来,以提高防御准备。然而,持续评估CR平台是否合适和有效仍然是一个挑战。本文提出了一个CR评估框架,强调使用多标准决策方法的任务关键设置。我们定义了一组评估标准,涵盖技术忠实度、培训和评估能力、可扩展性、易用性和其他相关因素。为了对这些标准进行加权和聚合,我们采用层次分析过程(AHP),通过一个通过大型语言模型(LLM)实现的模拟跨学科专家小组来支持。这种LLM辅助专家推理使得可以在不需要直接召集专家的情况下进行一致和可重复的两两比较。该框架的输出是定量分数,有助于客观比较CR平台,并突出改进的方向。总的来说,这项工作为指导CR的提供者和最终用户奠定了一个标准化和可解释的评估方法学的基础。

更新时间: 2025-12-11 10:07:15

领域: cs.CR

下载: http://arxiv.org/abs/2512.10487v1

Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models

Existing world models for autonomous driving struggle with long-horizon generation and generalization to challenging scenarios. In this work, we develop a model using simple design choices, and without additional supervision or sensors, such as maps, depth, or multiple cameras. We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete-token models have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle to individual design choices and more powerful than the model built on discrete tokens. Code, models and qualitative results are publicly available at https://lmb-freiburg.github.io/orbis.github.io/.

Updated: 2025-12-11 10:05:50

标题: Orbis:克服驾驶世界模型中长期预测的挑战

摘要: 现有的自动驾驶世界模型在长期规划和对具有挑战性场景的泛化方面存在困难。在这项工作中,我们开发了一个模型,采用简单的设计选择,并且没有额外的监督或传感器,如地图、深度或多个摄像头。我们展示了我们的模型在性能方面达到了最先进的水平,尽管它只有4.69亿个参数,并且是在280小时的视频数据上训练的。它在转弯操作和城市交通等困难场景中特别突出。我们测试了离散标记模型可能比基于流匹配的连续模型具有优势。为此,我们建立了一个混合标记器,既兼容这两种方法,又允许进行并行比较。我们的研究得出结论,支持连续自回归模型,该模型在个别设计选择上不太脆弱,比基于离散标记构建的模型更强大。代码、模型和定性结果可在https://lmb-freiburg.github.io/orbis.github.io/上公开获取。

更新时间: 2025-12-11 10:05:50

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.13162v2

From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection

Vulnerability detection methods based on deep learning (DL) have shown strong performance on benchmark datasets, yet their real-world effectiveness remains underexplored. Recent work suggests that both graph neural network (GNN)-based and transformer-based models, including large language models (LLMs), yield promising results when evaluated on curated benchmark datasets. These datasets are typically characterized by consistent data distributions and heuristic or partially noisy labels. In this study, we systematically evaluate two representative DL models, ReVeal and LineVul, across four representative datasets: Juliet, Devign, BigVul, and ICVul. Each model is trained independently on each respective dataset, and their code representations are analyzed using t-SNE to uncover vulnerability-related patterns. To assess realistic applicability, we deploy these models along with four pretrained LLMs (Claude 3.5 Sonnet, GPT-o3-mini, GPT-4o, and GPT-5) on a curated dataset, VentiVul, comprising 20 recently (May 2025) fixed vulnerabilities from the Linux kernel. Our experiments reveal that current models struggle to distinguish vulnerable from non-vulnerable code in representation space and generalize poorly across datasets with differing distributions. When evaluated on VentiVul, our newly constructed time-wise out-of-distribution dataset, performance drops sharply, with most models failing to detect vulnerabilities reliably. These results expose a persistent gap between academic benchmarks and real-world deployment, emphasizing the value of our deployment-oriented evaluation framework and the need for more robust code representations and higher-quality datasets.

Updated: 2025-12-11 10:04:54

标题: 从实验室到现实:深度学习模型和LLMs在漏洞检测中的实际评估

摘要: 基于深度学习(DL)的漏洞检测方法在基准数据集上表现出色,但它们在现实世界中的有效性尚未得到充分探讨。最近的研究表明,基于图神经网络(GNN)和基于变压器的模型,包括大型语言模型(LLMs),在经过精心筛选的基准数据集上表现出有希望的结果。这些数据集通常具有一致的数据分布和启发式或部分嘈杂的标签。在这项研究中,我们系统地评估了两个代表性的DL模型-ReVeal和LineVul-在四个代表性数据集上:Juliet、Devign、BigVul和ICVul。每个模型在各自的数据集上独立训练,并使用t-SNE分析它们的代码表示,以揭示与漏洞相关的模式。为了评估实际应用性,我们在一个经过精心筛选的数据集VentiVul上部署了这些模型以及四个预训练的LLM,Claude 3.5 Sonnet、GPT-o3-mini、GPT-4o和GPT-5,该数据集包含来自Linux内核的20个最近(2025年5月)修复的漏洞。我们的实验结果显示,当前的模型在表示空间中很难区分易受攻击的代码和非易受攻击的代码,并且在具有不同分布的数据集之间泛化能力较差。在VentiVul上评估时,我们新构建的基于时间的分布外数据集,性能急剧下降,大多数模型无法可靠地检测漏洞。这些结果揭示了学术基准和实际部署之间的持续差距,强调了我们面向部署的评估框架的价值,以及对更强大的代码表示和更高质量数据集的需求。

更新时间: 2025-12-11 10:04:54

领域: cs.CR,cs.LG,cs.SE

下载: http://arxiv.org/abs/2512.10485v1

A Generation Framework with Strict Constraints for Crystal Materials Design

The design of crystal materials plays a critical role in areas such as new energy development, biomedical engineering, and semiconductors. Recent advances in data-driven methods have enabled the generation of diverse crystal structures. However, most existing approaches still rely on random sampling without strict constraints, requiring multiple post-processing steps to identify stable candidates with the desired physical and chemical properties. In this work, we present a new constrained generation framework that takes multiple constraints as input and enables the generation of crystal structures with specific chemical compositions and properties. In this framework, intermediate constraints, such as symmetry information and composition ratio, are generated by a constraint generator based on large language models (LLMs), which considers the target properties. These constraints are then used by a subsequent crystal structure generator to ensure that the structure generation process is under control. Our method generates crystal structures with a probability of meeting the target properties that is more than twice that of existing approaches. Furthermore, nearly 100% of the generated crystals strictly adhere to the predefined chemical composition, eliminating supply-chain risks during production.

Updated: 2025-12-11 09:59:41

标题: 一个具有严格约束的晶体材料设计的世代框架

摘要: 晶体材料的设计在新能源开发、生物医学工程和半导体等领域起着至关重要的作用。最近数据驱动方法的进展使得能够生成多样化的晶体结构。然而,大多数现有方法仍依赖于没有严格约束的随机取样,需要多个后处理步骤来识别具有所需物理和化学性质的稳定候选体。在这项工作中,我们提出了一个新的受限生成框架,将多个约束作为输入,并能够生成具有特定化学和物性的晶体结构。在这个框架中,基于大型语言模型(LLMs)的约束生成器生成中间约束,如对称信息和组成比,考虑目标性质。这些约束然后由后续的晶体结构生成器使用,以确保结构生成过程受到控制。我们的方法生成的晶体结构符合目标性质的概率是现有方法的两倍以上。此外,近乎100%的生成晶体严格遵守预定义的化学组成,消除了生产过程中供应链风险。

更新时间: 2025-12-11 09:59:41

领域: cs.AI,cond-mat.mtrl-sci

下载: http://arxiv.org/abs/2411.08464v3

DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction

Incomplete node features are ubiquitous in real-world scenarios, e.g., the attributes of web users may be partly private, which causes the performance of Graph Neural Networks (GNNs) to decline significantly. Feature propagation (FP) is a well-known method that performs well for imputation of missing node features on graphs, but it still has the following three issues: 1) it struggles with graphs that are not fully connected, 2) imputed features face the over-smoothing problem, and 3) FP is tailored for transductive tasks, overlooking the feature distribution shift in inductive tasks. To address these challenges, we introduce DDFI, a Diverse and Distribution-aware Missing Feature Imputation method that combines feature propagation with a graph-based Masked AutoEncoder (MAE) in a nontrivial manner. It first designs a simple yet effective algorithm, namely Co-Label Linking (CLL), that randomly connects nodes in the training set with the same label to enhance the performance on graphs with numerous connected components. Then we develop a novel two-step representation generation process at the inference stage. Specifically, instead of directly using FP-imputed features as input during inference, DDFI further reconstructs the features through the whole MAE to reduce feature distribution shift in the inductive tasks and enhance the diversity of node features. Meanwhile, since existing feature imputation methods for graphs are evaluated only by manually masking features to simulate missing-data scenarios, we collect a new dataset called Sailing from the records of voyages that contains naturally missing features to help better evaluate the effectiveness. Extensive experiments conducted on six public datasets and Sailing show that DDFI outperforms the state-of-the-art methods under both transductive and inductive settings.
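
The Co-Label Linking step is simple enough to sketch directly; the per-node link budget below is an assumed knob, not the paper's exact sampling scheme:

    import numpy as np

    def co_label_linking(labels, train_idx, links_per_node=1, rng=None):
        """Return extra edges connecting random training nodes that share a
        label, so FP can reach otherwise isolated connected components."""
        rng = rng or np.random.default_rng()
        by_label, extra_edges = {}, []
        for i in train_idx:
            by_label.setdefault(labels[i], []).append(i)
        for nodes in by_label.values():
            for i in nodes:
                picks = rng.choice(nodes, size=min(links_per_node, len(nodes)),
                                   replace=False)
                for j in picks:
                    if i != j:
                        extra_edges.append((i, j))
        return extra_edges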

Updated: 2025-12-11 09:53:17

标题: DDFI:通过两步重建实现多样性和分布感知的缺失特征插值

摘要: 现实世界中不完整的节点特征是普遍存在的,例如,Web用户的属性可能部分私密,这会导致图神经网络(GNNs)的性能显着下降。特征传播(FP)是一种众所周知的方法,对图上缺失节点特征的插补效果很好,但仍然存在以下三个问题:1)它在非完全连接的图上表现不佳,2)插补特征面临过度平滑问题,3)FP专为传导任务而设计,忽视感知任务中特征分布的变化。为了解决这些挑战,我们引入了DDFI,一种多样性和分布感知的缺失特征插补方法,以一种非常规的方式将特征传播与基于图的掩码自动编码器(MAE)相结合。它首先设计了一种简单但有效的算法,即共标签连接(CLL),随机连接训练集中具有相同标签的节点,以增强在具有众多连接组件的图上的性能。然后,在推理阶段开发了一种新颖的两步表示生成过程。具体来说,DDFI在推理过程中不直接使用FP插补的特征作为输入,而是通过整个MAE进一步重构特征,以减少感知任务中特征分布的变化,并增强节点特征的多样性。同时,由于现有的图特征插补方法仅通过手动遮罩特征来模拟缺失场景进行评估,我们从航行记录中收集了一个名为Sailing的新数据集,其中包含自然缺失的特征,以帮助更好地评估有效性。在六个公共数据集和Sailing上进行的大量实验表明,DDFI在传导和感知设置下均优于最先进的方法。

更新时间: 2025-12-11 09:53:17

领域: cs.LG,cs.SI

下载: http://arxiv.org/abs/2512.06356v2

Stealth and Evasion in Rogue AP Attacks: An Analysis of Modern Detection and Bypass Techniques

Wireless networks act as the backbone of modern digital connectivity, making them a primary target for cyber adversaries. Rogue Access Point attacks, specifically the Evil Twin variant, enable attackers to clone legitimate wireless network identifiers to deceive users into connecting. Once a connection is established, the adversary can intercept traffic and harvest sensitive credentials. While modern defensive architectures often employ Network Intrusion Detection Systems (NIDS) to identify malicious activity, the effectiveness of these systems against Layer 2 wireless threats remains a subject of critical inquiry. This project aimed to design a stealth-capable Rogue AP and evaluate its detectability against Suricata, an open-source NIDS/IPS. The methodology initially focused on a hardware-based deployment using Raspberry Pi platforms but transitioned to a virtualized environment due to severe system compatibility issues. Using Wifipumpkin3, the research team successfully deployed a captive portal that harvested user credentials from connected devices. However, the Suricata NIDS failed to flag the attack, highlighting a significant blind spot in traditional intrusion detection regarding wireless management frame attacks. This paper details the construction of the attack, the evasion techniques employed, and the limitations of current NIDS solutions in detecting localized wireless threats.

Updated: 2025-12-11 09:45:48

标题: 流氓AP攻击中的隐匿和逃避:现代检测和绕过技术分析

摘要: 无线网络是现代数字连接的支柱,因此成为网络对手的主要目标。Rogue Access Point 攻击,特别是Evil Twin变种,使攻击者能够克隆合法无线网络标识符,欺骗用户连接。一旦建立连接,对手就可以拦截流量并窃取敏感凭证。虽然现代防御架构通常使用网络入侵检测系统(NIDS)来识别恶意活动,但这些系统对于第2层无线威胁的有效性仍然是一个重要课题。本项目旨在设计一个隐蔽的Rogue AP,并评估其对Suricata(一个开源NIDS/IPS)的可检测性。方法论最初集中在使用树莓派平台进行硬件部署,但由于严重的系统兼容性问题,转变为虚拟化环境。利用Wifipumpkin3,研究团队成功部署了一个捕获门户,从连接设备中窃取用户凭证。然而,Suricata NIDS未能标记这次攻击,突显出传统入侵检测在检测无线管理帧攻击方面存在重大盲点。本文详细介绍了攻击的构建、使用的规避技术以及目前NIDS解决方案在检测局部无线威胁方面的局限性。

更新时间: 2025-12-11 09:45:48

领域: cs.CR

下载: http://arxiv.org/abs/2512.10470v1

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundation models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.
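
How FM representations enter a text prompt is the crux; one naive, training-free injection scheme (purely illustrative; the paper's mapping may well differ) is to serialize the embedding into the few-shot examples:

    import numpy as np

    def verbalize(z, digits=2):
        # Turn an FM embedding into a short token sequence the LLM can read.
        return "[" + " ".join(f"{v:.{digits}f}" for v in z) + "]"

    def icrl_prompt(support, query_repr, task):
        """support: list of (embedding, label) pairs; query_repr: embedding
        of the item to classify. Builds a few-shot prompt where text inputs
        are replaced by verbalized FM representations."""
        lines = [f"Task: {task}. Each item is a molecular FM representation."]
        for z, y in support:
            lines.append(f"Representation: {verbalize(z)} -> Label: {y}")
        lines.append(f"Representation: {verbalize(query_repr)} -> Label:")
        return "\n".join(lines)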

Updated: 2025-12-11 09:40:22

标题: 机器学习模型能否在无需训练的情况下推理非文本模态?以上下文表示学习为例研究

摘要: 大型语言模型(LLMs)的显着性能可以通过测试时计算进行增强,这依赖于外部工具甚至其他深度学习模型。然而,现有的将非文本模态表示集成到LLMs中的方法通常需要额外昂贵的监督训练,限制了对新领域和模态的即时适应。在这项工作中,我们探讨了在无需培训的情况下将非文本基础模型(FMs)的表示集成到基于文本的LLMs中的可行性。我们提出了上下文表示学习(ICRL)作为一个概念验证,允许LLMs通过少量样本学习自适应地利用非文本模态表示。与传统的上下文学习不同,传统的上下文学习将文本-标签对合并起来,ICRL将文本输入替换为FM表示,使LLM能够进行多模态推理而无需微调。我们在分子领域的一系列任务中评估了ICRL,研究了三个核心研究问题:(i)如何在无需培训的情况下将FM表示映射到LLMs中,(ii)什么因素影响ICRL的性能,以及(iii)ICRL的有效性背后的机制是什么。据我们所知,ICRL是第一个无需培训的框架,用于将非文本模态表示集成到基于文本的LLMs中,为适应性强、多模态概括提供了一个有前途的方向。

更新时间: 2025-12-11 09:40:22

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2509.17552v3

Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models

Text-to-image (T2I) models today are capable of producing photorealistic, instruction-following images, yet they still frequently fail on prompts that require implicit world knowledge. Existing evaluation protocols either emphasize compositional alignment or rely on single-round VQA-based scoring, leaving critical dimensions such as knowledge grounding, multi-physics interactions, and auditable evidence substantially undertested. To address these limitations, we introduce PicWorld, the first comprehensive benchmark that assesses the grasp of implicit world knowledge and physical causal reasoning of T2I models. This benchmark consists of 1,100 prompts across three core categories. To facilitate fine-grained evaluation, we propose PW-Agent, an evidence-grounded multi-agent evaluator that hierarchically assesses images on their physical realism and logical consistency by decomposing prompts into verifiable visual evidence. We conduct a thorough analysis of 17 mainstream T2I models on PicWorld, illustrating that they universally exhibit a fundamental limitation in their capacity for implicit world knowledge and physical causal reasoning to varying degrees. The findings highlight the need for reasoning-aware, knowledge-integrative architectures in future T2I systems.

Updated: 2025-12-11 09:39:01

标题: 超越语言和像素:生成模型中隐含世界知识推理的基准测试

摘要: 今天的文本到图像(T2I)模型能够生成逼真的、遵循指令的图像,但它们仍然经常在需要隐含世界知识的提示上失败。现有的评估协议要么强调组合对齐,要么依赖于单轮VQA评分,留下了知识基础、多物理相互作用和可审计证据等关键维度大量未经测试。为了解决这些限制,我们引入了PicWorld,这是第一个全面评估T2I模型隐含世界知识和物理因果推理掌握能力的基准。该基准包括三个核心类别的1,100个提示。为了便于细致评估,我们提出了PW-Agent,这是一个基于证据的多代理评估器,通过将提示分解为可验证的视觉证据,层次化地评估图像的物理现实性和逻辑一致性。我们对17个主流T2I模型在PicWorld上进行了深入分析,结果显示它们普遍在隐含世界知识和物理因果推理方面存在不同程度的基本限制。这些发现凸显了未来T2I系统需要具有推理意识、知识整合结构的需求。

更新时间: 2025-12-11 09:39:01

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.18271v3

T-SKM-Net: Trainable Neural Network Framework for Linear Constraint Satisfaction via Sampling Kaczmarz-Motzkin Method

Neural network constraint satisfaction is crucial for safety-critical applications such as power system optimization, robotic path planning, and autonomous driving. However, existing constraint satisfaction methods face efficiency-applicability trade-offs, with hard constraint methods suffering from either high computational complexity or restrictive assumptions on constraint structures. The Sampling Kaczmarz-Motzkin (SKM) method is a randomized iterative algorithm for solving large-scale linear inequality systems with favorable convergence properties, but its argmax operations introduce non-differentiability, posing challenges for neural network applications. This work proposes the Trainable Sampling Kaczmarz-Motzkin Network (T-SKM-Net) framework and, for the first time, systematically integrates SKM-type methods into neural network constraint satisfaction. The framework transforms mixed constraint problems into pure inequality problems through null space transformation, employs SKM for iterative solving, and maps solutions back to the original constraint space, efficiently handling both equality and inequality constraints. We provide theoretical proof of post-processing effectiveness in expectation and end-to-end trainability guarantees based on unbiased gradient estimators, demonstrating that despite non-differentiable operations, the framework supports standard backpropagation. On the DCOPF case118 benchmark, our method achieves 4.27 ms/item GPU serial forward inference with a 0.0025% max optimality gap in post-processing mode and 5.25 ms/item with a 0.0008% max optimality gap in joint training mode, delivering over a 25$\times$ speedup compared to the pandapower solver while maintaining zero constraint violations under the given tolerance.
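
For context, the underlying SKM iteration for a feasibility problem $Ax \le b$ is short: sample a block of rows and project toward the most violated sampled constraint. A minimal NumPy sketch (the relaxation parameter and sample-size defaults are arbitrary choices, not the paper's):

    import numpy as np

    def skm(A, b, x0, beta=50, lam=1.0, iters=10_000, tol=1e-8, rng=None):
        """Sampling Kaczmarz-Motzkin for A x <= b."""
        rng = rng or np.random.default_rng()
        x = np.asarray(x0, dtype=float).copy()
        m = A.shape[0]
        for _ in range(iters):
            rows = rng.choice(m, size=min(beta, m), replace=False)
            viol = A[rows] @ x - b[rows]          # positive entries are violations
            k = rows[np.argmax(viol)]             # Motzkin step: worst sampled row
            v = max(A[k] @ x - b[k], 0.0)
            if v <= tol:
                if np.all(A @ x - b <= tol):      # fully feasible: done
                    return x
                continue
            x -= lam * v / (A[k] @ A[k]) * A[k]   # relaxed projection onto row k
        return x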

Updated: 2025-12-11 09:35:13

标题: T-SKM-Net:通过采样Kaczmarz-Motzkin方法进行线性约束满足的可训练神经网络框架

摘要: 神经网络约束满足对于诸如电力系统优化、机器人路径规划和自动驾驶等安全关键应用至关重要。然而,现有的约束满足方法面临效率和适用性的折衷,硬约束方法要么计算复杂度高,要么对约束结构做出了限制性假设。采样卡兹玛兹金(SKM)方法是一种用于解决具有良好收敛特性的大规模线性不等式系统的随机迭代算法,但其argmax操作引入了不可微性,在神经网络应用中存在挑战。本研究提出了可训练采样卡兹玛兹金网络(T-SKM-Net)框架,并首次系统地将SKM类型方法集成到神经网络约束满足中。该框架通过零空间变换将混合约束问题转化为纯不等式问题,在迭代求解中采用SKM,并将解映射回原始约束空间,有效处理等式和不等式约束。我们提供了期望中后处理效果的理论证明,并基于无偏梯度估计器提供端到端可训练性保证,表明尽管存在不可微操作,该框架支持标准反向传播。在DCOPF case118基准测试中,我们的方法在后处理模式下实现了4.27ms/item的GPU串行前向推断,最优性差异为0.0025%,在联合训练模式下为5.25ms/item,最优性差异为0.0008%,与pandapower求解器相比,实现了超过25倍的加速,同时在给定容差下保持零约束违反。

更新时间: 2025-12-11 09:35:13

领域: cs.LG,cs.AI,math.OC

下载: http://arxiv.org/abs/2512.10461v1

Examining the Metrics for Document-Level Claim Extraction in Czech and Slovak

Document-level claim extraction remains an open challenge in the field of fact-checking, and subsequently, methods for evaluating extracted claims have received limited attention. In this work, we explore approaches to aligning two sets of claims pertaining to the same source document and computing their similarity through an alignment score. We investigate techniques to identify the best possible alignment and evaluation method between claim sets, with the aim of providing a reliable evaluation framework. Our approach enables comparison between model-extracted and human-annotated claim sets, serving as a metric for assessing the extraction performance of models and also as a possible measure of inter-annotator agreement. We conduct experiments on a newly collected dataset of claims extracted from comments under Czech and Slovak news articles, domains that pose additional challenges due to the informal language, strong local context, and subtleties of these closely related languages. The results draw attention to the limitations of current evaluation approaches when applied to document-level claim extraction and highlight the need for more advanced methods, ones able to correctly capture semantic similarity and evaluate essential claim properties such as atomicity, checkworthiness, and decontextualization.
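
A common way to realize such an alignment score is optimal one-to-one matching over a pairwise similarity matrix; the sketch below uses the Hungarian algorithm, with the threshold and the unmatched-claim penalty as assumed design choices rather than the paper's:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def alignment_score(sim, threshold=0.5):
        """sim: |set A| x |set B| matrix of pairwise claim similarities
        (e.g., cosine similarity of sentence embeddings). Finds the optimal
        one-to-one alignment, keeps matches above the threshold, and divides
        by the larger set size so unmatched claims act as a penalty."""
        rows, cols = linear_sum_assignment(-sim)     # maximize total similarity
        matched = [sim[i, j] for i, j in zip(rows, cols) if sim[i, j] >= threshold]
        return sum(matched) / max(sim.shape)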

Updated: 2025-12-11 09:34:43

标题: 审查捷克语和斯洛伐克语文档级索赔提取的度量标准

摘要: 文档级别的索赔提取仍然是事实核查领域面临的一个挑战,因此,对提取的索赔进行评估的方法受到了有限的关注。在这项工作中,我们探讨了将与同一源文档相关的两组索赔进行对齐并通过对齐分数计算它们的相似性的方法。我们研究了识别索赔集之间最佳对齐和评估方法的技术,旨在提供一个可靠的评估框架。我们的方法使模型提取和人工注释的索赔集之间的比较成为可能,作为评估模型提取性能的度量标准,同时也可以作为评估者间协议的可能度量。我们对新收集的数据集进行实验-这些数据集是从捷克和斯洛伐克新闻文章的评论中提取的-这些领域由于非正式语言、强烈的地方背景和这些密切相关语言的微妙之处而带来了额外的挑战。结果引起了当前评估方法在应用于文档级别索赔提取时的限制的注意,并强调了更先进方法的需求-这些方法能够正确捕捉语义相似性并评估基本索赔属性,如原子性、值得检查性和去上下文化。

更新时间: 2025-12-11 09:34:43

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2511.14566v2

Hybrid Physics-ML Model for Forward Osmosis Flux with Complete Uncertainty Quantification

Forward Osmosis (FO) is a promising low-energy membrane separation technology, but challenges in accurately modelling its water flux ($J_w$) persist due to complex internal mass transfer phenomena. Traditional mechanistic models struggle with empirical parameter variability, while purely data-driven models lack physical consistency and rigorous uncertainty quantification (UQ). This study introduces a novel Robust Hybrid Physics-ML framework employing Gaussian Process Regression (GPR) for highly accurate, uncertainty-aware $J_w$ prediction. The core innovation lies in training the GPR on the residual error between the detailed, non-linear FO physical model prediction ($J_w^{\mathrm{physical}}$) and the experimental water flux ($J_w^{\mathrm{actual}}$). Crucially, we implement a full UQ methodology by decomposing the total predictive variance ($\sigma^2_{\mathrm{total}}$) into model uncertainty (epistemic, from GPR's posterior variance) and input uncertainty (aleatoric, analytically propagated via the Delta method for multi-variate correlated inputs). Leveraging the inherent strength of GPR in low-data regimes, the model, trained on a meagre 120 data points, achieved a state-of-the-art Mean Absolute Percentage Error (MAPE) of 0.26% and an $R^2$ of 0.999 on the independent test data, validating a truly robust and reliable surrogate model for advanced FO process optimization and digital twin development.
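
The residual-learning core is easy to sketch with scikit-learn (the kernel choice is illustrative; the Delta-method propagation of input variance is omitted here and would add an aleatoric term on top of the GPR's posterior std):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

    def fit_residual_gpr(X, jw_actual, physics_model):
        """Train a GPR on the residual between measured flux and the
        mechanistic prediction; at inference, J_w = physics + GPR correction,
        and the GPR posterior std supplies the epistemic uncertainty."""
        residual = jw_actual - physics_model(X)
        kernel = ConstantKernel() * RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel()
        gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, residual)

        def predict(Xq):
            corr, std_epistemic = gpr.predict(Xq, return_std=True)
            return physics_model(Xq) + corr, std_epistemic
        return predict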

Updated: 2025-12-11 09:27:44

标题: 混合物理-机器学习模型用于前向渗透通量的完全不确定性量化

摘要: 正渗透(FO)是一种具有前景的低能膜分离技术,但由于复杂的内部质量传递现象,对其水通量(Jw)进行准确建模仍然存在挑战。传统的机械模型在处理经验参数变异性方面存在困难,而纯数据驱动的模型缺乏物理一致性和严格的不确定性量化(UQ)。本研究介绍了一种新颖的坚固的混合物理-机器学习框架,采用高精度、具有不确定性意识的高斯过程回归(GPR)进行Jw预测。核心创新在于在详细的、非线性FO物理模型预测(Jw_physical)与实验水通量(Jw_actual)之间的残差误差上训练GPR。至关重要的是,我们通过将总预测方差(sigma2_total)分解为模型不确定性(认识性,来自GPR的后验方差)和输入不确定性(aleatoric,通过Delta方法在多变量相关输入上进行分析传播)来实施完整的UQ方法。利用GPR在低数据情况下的固有优势,该模型在仅有120个数据点的情况下,实现了独立测试数据的0.26%的最先进的平均绝对百分比误差(MAPE)和0.999的R2,验证了一个真正坚固可靠的替代模型,用于先进FO工艺优化和数字孪生开发。

更新时间: 2025-12-11 09:27:44

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2512.10457v1

Sublinear Variational Optimization of Gaussian Mixture Models with Millions to Billions of Parameters

Gaussian Mixture Models (GMMs) range among the most frequently used models in machine learning. However, training large, general GMMs becomes computationally prohibitive for datasets that have many data points $N$ of high dimensionality $D$. For GMMs with arbitrary covariances, we here derive a highly efficient variational approximation, which is then integrated with mixtures of factor analyzers (MFAs). For GMMs with $C$ components, our proposed algorithm substantially reduces runtime complexity from $\mathcal{O}(NCD^2)$ per iteration to a complexity scaling linearly with $D$ and sublinearly with $NC$. In numerical experiments, we first validate that the complexity reduction results in a sublinear scaling for the entire GMM optimization process. Second, we show on large-scale benchmarks that the sublinear algorithm results in speed-ups of an order of magnitude compared to the state of the art. Third, as a proof of concept, we train GMMs with over 10 billion parameters on about 100 million images, observing training times of less than nine hours on a single state-of-the-art CPU. Fourth and finally, we demonstrate the effectiveness of large-scale GMMs on the task of zero-shot image denoising, where sublinear training results in state-of-the-art denoising times while competitive denoising performance is maintained.

Updated: 2025-12-11 09:27:19

标题: 高斯混合模型参数数量达到百万至十亿时的次线性变分优化

摘要: 高斯混合模型(GMMs)是机器学习中最常用的模型之一。然而,对于具有许多高维数据点$N$的数据集来说,训练大型通用GMMs在计算上是不可行的。对于具有任意协方差的GMMs,我们提出了一种高效的变分逼近方法,并将其与因子分析器混合(MFAs)集成在一起。对于具有$C$个分量的GMMs,我们提出的算法将每次迭代的运行时间复杂度从$\mathcal{O}(NCD^2)$大幅降低到与$D$线性相关且与$NC$次线性相关。在数值实验中,我们首先验证了复杂度降低导致整个GMM优化过程的次线性缩放。其次,我们在大规模基准测试中展示,次线性算法与最先进技术相比可以实现数量级的加速。第三,作为概念验证,我们最终在大约1亿张图像上训练具有超过100亿参数的GMMs,在单个最先进的CPU上观察到不到9小时的训练时间。最后,我们展示了大规模GMMs在零样本图像去噪任务中的有效性,其中次线性训练导致最先进的去噪时间,同时保持竞争性的去噪性能。

更新时间: 2025-12-11 09:27:19

领域: stat.ML,cs.CV,cs.LG

下载: http://arxiv.org/abs/2501.12299v2

GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control

Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and analyze two main GRPO issues: (i) the token-level penalization, where valuable tokens shared across different responses receive contradictory feedback signals, leading to conflicting gradient updates that can reduce their likelihood; and (ii) the policy collapse, where negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, destabilizing training process. To address these issues we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which prevents conflicting gradients on valuable tokens by skipping negative updates while amplifying positive ones and filters out completions whose entropy exceeds a provable threshold, to prevent policy collapse. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, as validated through multiple experiments on GSM8K, MATH, AIME 2024, AIME 2025 and AMC 2023.
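
A simplified reading of the two mechanisms, expressed as per-token loss weights over a group of sampled completions (identifying "shared tokens" by vocabulary membership, as below, is a coarse assumption; the paper's criterion may be position- or context-specific):

    import numpy as np

    def gtpo_token_weights(groups, entropy, entropy_max):
        """groups: list of (tokens, advantage) completions for one prompt;
        entropy: per-completion entropy. Completions above the entropy
        threshold are dropped (policy-collapse guard); tokens of negatively
        rewarded completions that also occur in a positively rewarded one
        get weight 0, skipping the conflicting negative update."""
        kept = [(t, a) for (t, a), h in zip(groups, entropy) if h <= entropy_max]
        positive_tokens = set()
        for tokens, adv in kept:
            if adv > 0:
                positive_tokens.update(tokens)
        weights = []
        for tokens, adv in kept:
            if adv > 0:
                weights.append(np.ones(len(tokens)))
            else:
                weights.append(np.array([0.0 if t in positive_tokens else 1.0
                                         for t in tokens]))
        return kept, weights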

Updated: 2025-12-11 09:23:26

标题: GTPO:通过梯度和熵控制稳定群体相对策略优化

摘要: Group Relative Policy Optimization(GRPO)是一种用于大型语言模型对齐的有前途的基于策略的方法,然而其性能通常受到训练不稳定性和次优收敛的限制。在本文中,我们确定并分析了两个主要的GRPO问题:(i)令牌级别的惩罚,其中在不同响应之间共享的有价值的令牌接收到矛盾的反馈信号,导致冲突的梯度更新,可能降低它们的可能性;和(ii)策略崩溃,其中受到负面奖励的完成可能惩罚自信的响应,并将模型决策转向不太可能的令牌,使训练过程不稳定。为了解决这些问题,我们引入了GTPO(基于轨迹的群体相对策略优化),通过跳过负面更新并放大正面更新来防止有价值的令牌上的冲突梯度,并过滤出熵超过可证明阈值的完成,以防止策略崩溃。与GRPO不同,GTPO不依赖于KL散度正则化,消除了训练过程中对参考模型的需求,同时仍保证了更大的训练稳定性和改进的性能,通过对GSM8K、MATH、AIME 2024、AIME 2025和AMC 2023进行多次实验证实。

更新时间: 2025-12-11 09:23:26

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2508.03772v5

Towards Personalized Deep Research: Benchmarks and Evaluations

Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing benchmarks primarily evaluate DRAs on generic quality metrics and overlook personalization, a critical dimension for individual users. Moreover, existing evaluations mostly rely on close-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench (PDR-Bench), the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures Personalization Alignment, Content Quality, and Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.

Updated: 2025-12-11 09:21:18

标题: 走向个性化深度研究:基准和评估

摘要: 深度研究代理人(DRAs)可以自主进行复杂的调查并生成综合报告,展示了强大的现实潜力。然而,现有的基准主要通过评估DRAs的通用质量指标,忽略了个性化,这是个人用户的关键维度。然而,现有的评估主要依赖于封闭式基准,而开放式深度研究基准很少,通常忽视个性化场景。为了弥补这一差距,我们引入了个性化深度研究基准(PDR-Bench),这是用于评估DRAs中个性化的第一个基准。它将10个领域中的50个不同研究任务与25个结构化人物属性和动态现实世界背景相结合的真实用户档案配对,产生250个真实用户任务查询。为了评估系统性能,我们提出了PQR评估框架,同时衡量了个性化对齐、内容质量和事实可靠性。我们在一系列系统上的实验突显了在处理个性化深度研究方面的当前能力和限制。这项工作为开发和评估下一代真正个性化的AI研究助理奠定了严格的基础。

更新时间: 2025-12-11 09:21:18

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2509.25106v2

Metacognitive Sensitivity for Test-Time Dynamic Model Selection

A key aspect of human cognition is metacognition - the ability to assess one's own knowledge and judgment reliability. While deep learning models can express confidence in their predictions, they often suffer from poor calibration, a cognitive bias where expressed confidence does not reflect true competence. Do models truly know what they know? Drawing from human cognitive science, we propose a new framework for evaluating and leveraging AI metacognition. We introduce meta-d', a psychologically-grounded measure of metacognitive sensitivity, to characterise how reliably a model's confidence predicts its own accuracy. We then use this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection, learning which of several expert models to trust for a given task. Our experiments across multiple datasets and deep learning model combinations (including CNNs and VLMs) demonstrate that this metacognitive approach improves joint-inference accuracy over constituent models. This work provides a novel behavioural account of AI models, recasting ensemble selection as a problem of evaluating both short-term signals (confidence prediction scores) and medium-term traits (metacognitive sensitivity).
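
The arbiter itself can be as simple as a UCB bandit over expert models; the sketch below tracks empirical accuracy per model and leaves the meta-d' signal as extra context one could condition on (how it is integrated is not spelled out here):

    import numpy as np

    class ModelArbiter:
        """UCB bandit over expert models; each arm's reward is whether the
        chosen model answered correctly."""
        def __init__(self, n_models, c=1.0):
            self.n = np.zeros(n_models)   # pulls per model
            self.s = np.zeros(n_models)   # successes per model
            self.c = c

        def select(self):
            t = self.n.sum() + 1
            mean = self.s / np.maximum(self.n, 1)
            bonus = self.c * np.sqrt(np.log(t) / np.maximum(self.n, 1))
            ucb = np.where(self.n > 0, mean + bonus, np.inf)  # try each arm once
            return int(np.argmax(ucb))

        def update(self, arm, correct):
            self.n[arm] += 1
            self.s[arm] += float(correct)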

Updated: 2025-12-11 09:15:05

标题: 测试时间动态模型选择的元认知敏感性

摘要: 一个关键方面的人类认知是元认知——评估自己的知识和判断可靠性的能力。虽然深度学习模型可以表达对其预测的信心,但它们经常受到较差的校准的影响,即表达的信心并不反映真实的能力。模型是否真正知道自己知道什么?借鉴人类认知科学,我们提出了一个评估和利用AI元认知的新框架。我们引入了meta-d',一个基于心理学的元认知敏感度度量,来描述一个模型的信心如何可靠地预测自己的准确性。然后,我们使用这个动态敏感度分数作为一种基于赌徒算法的仲裁者的上下文,该算法在测试时进行模型选择,学习对于特定任务应该信任哪个专家模型。我们在多个数据集和深度学习模型组合(包括CNNs和VLMs)上进行的实验表明,这种元认知方法提高了整体推理准确性。这项工作提供了对AI模型的新颖行为描述,将集成选择重新构想为评估短期信号(信心预测分数)和中期特征(元认知敏感度)的问题。

更新时间: 2025-12-11 09:15:05

领域: cs.LG

下载: http://arxiv.org/abs/2512.10451v1

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the "Lazy Reviewer" hypothesis) and the formal institutional deployment of AI-powered assessment systems by conferences like AAAI and Stanford's Agents4Science. This study investigates the robustness of these "LLM-as-a-Judge" systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping "Reject" decisions to "Accept," for which we develop a novel evaluation metric which we term WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 Language Models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like "Maximum Mark Magyk" successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.

Updated: 2025-12-11 09:13:36

标题: 当拒绝变成接受:量化基于LLM的科学审稿人对间接提示注入的脆弱性

摘要: 科学同行评审的格局正在迅速发展,与此同时,大型语言模型(LLMs)的整合也在加速。这种转变受到两个并行趋势的推动:广泛采用LLMs来管理工作量的审稿人(“懒惰审稿人”假说)和会议如AAAI和斯坦福的Agents4Science等正式机构部署AI驱动的评估系统。本研究调查了这些“LLM作为评判者”系统(包括非法和合法)对恶意PDF操作的稳健性。与一般性越狱不同,我们关注一种明确的动机:将“拒绝”决定翻转为“接受”,为此我们开发了一种新的评估指标,称为WAVS(加权恶意脆弱性评分)。我们收集了一个包含200篇科学论文的数据集,并针对此任务调整了15种领域特定的攻击策略,对这些策略在13个语言模型上进行评估,包括GPT-5、Claude Haiku和DeepSeek。我们的结果表明,像“最大标记魔法”这样的混淆策略成功地操纵了分数,即使在大规模模型中也实现了令人担忧的决策翻转率。我们将发布完整的数据集和注入框架,以促进更多关于这一主题的研究。

更新时间: 2025-12-11 09:13:36

领域: cs.AI,cs.CL,cs.CR

下载: http://arxiv.org/abs/2512.10449v1

Maximum Risk Minimization with Random Forests

We consider a regression setting where observations are collected in different environments modeled by different data distributions. The field of out-of-distribution (OOD) generalization aims to design methods that generalize better to test environments whose distributions differ from those observed during training. One line of work proposes minimizing the maximum risk across environments, a principle that we refer to as MaxRM (Maximum Risk Minimization). In this work, we introduce variants of random forests based on the principle of MaxRM. We provide computationally efficient algorithms and prove statistical consistency for our primary method. Our proposed method can be used with each of the following three risks: the mean squared error, the negative reward (which relates to the explained variance), and the regret (which quantifies the excess risk relative to the best predictor). For MaxRM with regret as the risk, we prove a novel out-of-sample guarantee over unseen test distributions. Finally, we evaluate the proposed methods on both simulated and real-world data.

Updated: 2025-12-11 09:10:52

标题: 随机森林最大风险最小化

摘要: 我们考虑一个回归设置,观测结果在不同环境中收集,由不同数据分布建模。超出分布(OOD)泛化领域旨在设计能够更好地泛化到与训练时观察到的分布不同的测试环境的方法。其中一种方法是提出最小化跨环境的最大风险,这一原则被称为MaxRM(最大风险最小化)。在这项工作中,我们基于MaxRM原则介绍了随机森林的变种。我们提供了计算效率高的算法,并证明了我们主要方法的统计一致性。我们提出的方法可以与以下三种风险一起使用:均方误差、负奖励(与解释方差有关)和遗憾(量化相对于最佳预测器的过度风险)。对于将遗憾作为风险的MaxRM,我们证明了在未见过的测试分布上的新颖样本外保证。最后,我们在模拟数据和真实数据上评估了提出的方法。

更新时间: 2025-12-11 09:10:52

领域: stat.ML,cs.AI,cs.LG,stat.ME

下载: http://arxiv.org/abs/2512.10445v1

Clustered Federated Learning with Hierarchical Knowledge Distillation

Clustered Federated Learning (CFL) has emerged as a powerful approach for addressing data heterogeneity and ensuring privacy in large distributed IoT environments. By clustering clients and training cluster-specific models, CFL enables personalized models tailored to groups of heterogeneous clients. However, conventional CFL approaches suffer from fragmented learning, training an independent global model for each cluster and failing to take advantage of collective cross-cluster insights. This paper advocates a shift to hierarchical CFL, allowing bi-level aggregation to train cluster-specific models at the edge and a unified global model at the cloud. This shift improves training efficiency yet might introduce communication challenges. To this end, we propose CFLHKD, a novel personalization scheme for integrating hierarchical cluster knowledge into CFL. Built upon multi-teacher knowledge distillation, CFLHKD enables inter-cluster knowledge sharing while preserving cluster-specific personalization. CFLHKD adopts a bi-level aggregation to bridge the gap between local and global learning. Extensive evaluations on standard benchmark datasets demonstrate that CFLHKD outperforms representative baselines in cluster-specific and global model accuracy and achieves a performance improvement of 3.32-7.57%.
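
The multi-teacher distillation term has a standard form; a generic PyTorch sketch (uniform teacher weighting and the temperature are assumptions, not the paper's exact scheme):

    import torch
    import torch.nn.functional as F

    def cflhkd_loss(student_logits, teacher_logits_list, labels, T=2.0, alpha=0.5):
        """Cross-entropy on local labels plus an averaged KL term distilling
        from each cluster teacher."""
        ce = F.cross_entropy(student_logits, labels)
        kd = sum(F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                          F.softmax(t / T, dim=-1),
                          reduction="batchmean") * (T * T)
                 for t in teacher_logits_list) / len(teacher_logits_list)
        return (1 - alpha) * ce + alpha * kd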

Updated: 2025-12-11 09:08:35

标题: 具有层次知识蒸馏的集群化联邦学习

摘要: 集群联邦学习(CFL)已经成为处理数据异质性和确保隐私的强大方法,特别是在大规模分布式物联网环境中。通过对客户进行聚类并训练特定于集群的模型,CFL实现了针对异质客户群体的个性化模型。然而,传统的CFL方法存在着为每个集群训练独立全局模型的碎片化学习问题,并且未能利用集群见解的集体优势。本文提倡转向分层CFL,允许双层聚合,在边缘训练特定于集群的模型,并在云端训练统一的全局模型。这种转变提高了训练效率,但可能引入通信挑战。为此,我们提出了CFLHKD,一种将分层集群知识整合到CFL中的新型个性化方案。基于多教师知识蒸馏,CFLHKD实现了集群之间的知识共享,同时保留了集群特定的个性化。CFLHKD采用双层聚合来弥合本地和全局学习之间的差距。对标准基准数据集的广泛评估表明,CFLHKD在集群特定和全局模型准确性方面优于代表性基线,并实现了3.32-7.57\%的性能提升。

更新时间: 2025-12-11 09:08:35

领域: cs.DC,cs.AI,cs.LG

下载: http://arxiv.org/abs/2512.10443v1

Enhanced Spatial Clustering of Single-Molecule Localizations with Graph Neural Networks

Single-molecule localization microscopy generates point clouds corresponding to fluorophore localizations. Spatial cluster identification and analysis of these point clouds are crucial for extracting insights about molecular organization. However, this task becomes challenging in the presence of localization noise, high point density, or complex biological structures. Here, we introduce MIRO (Multifunctional Integration through Relational Optimization), an algorithm that uses recurrent graph neural networks to transform the point clouds in order to improve clustering efficiency when applying conventional clustering techniques. We show that MIRO supports simultaneous processing of clusters of different shapes and at multiple scales, demonstrating improved performance across varied datasets. Our comprehensive evaluation demonstrates MIRO's transformative potential for single-molecule localization applications, showcasing its capability to revolutionize cluster analysis and provide accurate, reliable details of molecular architecture. In addition, MIRO's robust clustering capabilities hold promise for applications in various fields such as neuroscience, for the analysis of neural connectivity patterns, and environmental science, for studying spatial distributions of ecological data.

Updated: 2025-12-11 08:59:03

标题: 用图神经网络增强单分子定位的空间聚类

摘要: 单分子定位显微镜生成与荧光物质定位相对应的点云。对这些点云进行空间聚类识别和分析对于提取有关分子组织的见解至关重要。然而,在定位噪音、高点密度或复杂生物结构存在的情况下,这项任务变得具有挑战性。在这里,我们介绍了MIRO(多功能集成通过关系优化)算法,该算法利用循环图神经网络来转换点云,以提高应用传统聚类技术时的聚类效率。我们展示MIRO支持同时处理不同形状和多个尺度的聚类,并展示在各种数据集上表现出改进的性能。我们的全面评估展示了MIRO在单分子定位应用中的变革潜力,展示了其革新聚类分析并提供有关分子体系结构的准确、可靠细节的能力。此外,MIRO强大的聚类能力为在各个领域的应用提供了希望,如神经科学,用于分析神经连接模式,以及环境科学,用于研究生态数据的空间分布。

更新时间: 2025-12-11 08:59:03

领域: cs.LG,physics.bio-ph,physics.data-an,q-bio.QM

下载: http://arxiv.org/abs/2412.00173v2

An M-Health Algorithmic Approach to Identify and Assess Physiotherapy Exercises in Real Time

This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.
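
The sequence-matching core is a Levenshtein-style dynamic program in which the usual 0/1 substitution cost is replaced by a pose-similarity cost; a minimal sketch (the exact cost model is an assumption):

    def sequence_deviation(predicted, prescribed, pose_cost):
        """Edit distance between the recognized pose sequence and the
        prescribed one; pose_cost(a, b) in [0, 1] penalizes near-miss poses
        softly instead of the classic 0/1 substitution cost."""
        n, m = len(predicted), len(prescribed)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i
        for j in range(1, m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i][j] = min(d[i - 1][j] + 1,                      # deletion
                              d[i][j - 1] + 1,                      # insertion
                              d[i - 1][j - 1] + pose_cost(predicted[i - 1],
                                                          prescribed[j - 1]))
        return d[n][m]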

Updated: 2025-12-11 08:56:03

标题: 一种用于实时识别和评估理疗锻炼的M-Health算法方法

摘要: 这项工作提出了一个高效的算法框架,用于实时识别、分类和评估人体物理治疗练习,利用移动设备。所提出的方法将动态运动解释为一系列静态姿势,这些姿势是通过使用姿势估计神经网络从摄像机输入中估计出来的。提取的身体关键点被转换为基于三角函数角度的特征,并使用轻量级监督模型进行分类,以生成帧级姿势预测和准确性分数。为了识别完整的运动练习并检测偏离规定模式的情况,我们采用基于修改后的Levenshtein距离算法的动态规划方案,实现了强大的序列匹配和准确性定位。该系统完全在客户端上运行,确保可扩展性和实时性能。实验评估证明了该方法的有效性,并强调了其在远程物理治疗监督和移动健康应用中的适用性。

更新时间: 2025-12-11 08:56:03

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.10437v1

Generalized Kernelized Bandits: A Novel Self-Normalized Bernstein-Like Dimension-Free Inequality and Regret Bounds

We study the regret minimization problem in the novel setting of generalized kernelized bandits (GKBs), where we optimize an unknown function $f^*$ belonging to a reproducing kernel Hilbert space (RKHS), having access to samples generated by an exponential family (EF) reward model whose mean is a non-linear function $\mu(f^*)$. This setting extends both kernelized bandits (KBs) and generalized linear bandits (GLBs), providing a unified view of both settings. We propose an optimistic regret minimization algorithm, GKB-UCB, and we explain why existing self-normalized concentration inequalities used for KBs and GLBs do not allow for tight regret guarantees. For this reason, we devise a novel self-normalized Bernstein-like dimension-free inequality that applies to a Hilbert space of functions with bounded norm, representing a contribution of independent interest. Based on it, we analyze GKB-UCB, deriving a regret bound of order $\widetilde{O}(\gamma_T \sqrt{T/\kappa_*})$, where $T$ is the learning horizon, $\gamma_T$ the maximal information gain, and $\kappa_*$ a term characterizing the magnitude of the expected reward non-linearity. Our result is tight in its dependence on $T$, $\gamma_T$, and $\kappa_*$ for both KBs and GLBs. Finally, we present a tractable version of GKB-UCB, Trac-GKB-UCB, which attains similar regret guarantees, and we discuss its time and space complexity.

Updated: 2025-12-11 08:54:23

标题: 广义核化赌博机:一种新颖的自标准化伯恩斯坦样无维不等式和遗憾界限

摘要: 我们研究了在广义核化赌博机(GKBs)的新颖设置中的遗憾最小化问题,其中我们优化一个属于再生核希尔伯特空间(RKHS)的未知函数$f^*$,并且能够访问由指数族(EF)奖励模型生成的样本,其均值是一个非线性函数$μ(f^*)$。这个设置扩展了核化赌博机(KBs)和广义线性赌博机(GLBs),提供了对这两种设置的统一视角。我们提出了一种乐观的遗憾最小化算法,GKB-UCB,并解释为什么现有用于KBs和GLBs的自标准化集中不等式无法提供紧密的遗憾保证。因此,我们设计了一种新颖的自标准化类似Bernstein不等式,适用于具有有界范数的函数希尔伯特空间,代表了一个独立有趣的贡献。基于此,我们分析了GKB-UCB,推导出一个与$T$的阶数$\widetilde{O}( γ_T \sqrt{T/κ_*})$的遗憾界,其中$T$是学习视界,$γ_T$是最大信息增益,$κ_*$是表征期望奖励非线性程度的项。我们的结果对于KBs和GLBs在对$T$、$γ_T$和$κ_*$的依赖上是紧密的。最后,我们提出了一个可行版本的GKB-UCB,Trac-GKB-UCB,它具有类似的遗憾保证,我们讨论了它的时间和空间复杂度。

更新时间: 2025-12-11 08:54:23

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2508.01681v2

Machine Learning for Quantifier Selection in cvc5

In this work we considerably improve state-of-the-art SMT solving on first-order quantified problems through efficient machine-learning guidance of quantifier selection. Quantifiers represent a significant challenge for SMT and are technically a source of undecidability. In our approach, we train an efficient machine learning model that informs the solver which quantifiers should be instantiated and which should not. Each quantifier may be instantiated multiple times, and the set of active quantifiers changes as the solving progresses. Therefore, we invoke the ML predictor many times during the whole run of the solver. To make this efficient, we use fast ML models based on gradient boosting decision trees. We integrate our approach into the state-of-the-art cvc5 SMT solver and show a considerable increase in the system's holdout-set performance after training it on a large set of first-order problems collected from the Mizar Mathematical Library.
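
In outline, the predictor is an ordinary gradient-boosted classifier over per-quantifier features, queried repeatedly as solving proceeds; the sketch below uses scikit-learn as a stand-in, and the features named are illustrative rather than cvc5's actual ones:

    from sklearn.ensemble import GradientBoostingClassifier

    def train_quantifier_selector(X_train, y_train):
        """X_train: one feature vector per (quantifier, solver-state) pair,
        e.g. term depth, symbol counts, instantiation history; y_train: 1 if
        instantiating that quantifier proved useful, else 0."""
        clf = GradientBoostingClassifier(n_estimators=200, max_depth=4)
        clf.fit(X_train, y_train)
        return clf

    def should_instantiate(clf, features, threshold=0.5):
        # Invoked many times during a run: keep only promising quantifiers.
        return clf.predict_proba([features])[0, 1] >= threshold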

Updated: 2025-12-11 08:53:43

标题: 机器学习在cvc5中的量词选择

摘要: 在这项工作中,我们通过高效的机器学习指导量词选择,显著改进了一阶量化问题的最新SMT求解方法。量词对SMT构成了重大挑战,并且在技术上是不可判定性的根源。在我们的方法中,我们训练了一个高效的机器学习模型,指导求解器哪些量词应该被实例化,哪些不应该。每个量词可能被多次实例化,而活跃量词的集合会随着求解的进展而变化。因此,在求解器的整个运行过程中,我们多次调用ML预测器。为了使这个过程高效,我们使用基于梯度提升决策树的快速机器学习模型。我们将我们的方法集成到最新的cvc5 SMT求解器中,并展示了在Mizar数学库中收集的大量一阶问题上训练后,系统保留集性能显著提高。

更新时间: 2025-12-11 08:53:43

领域: cs.AI,cs.LG,cs.LO

下载: http://arxiv.org/abs/2408.14338v2

SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation

With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs yields better RAG performance, but processing rich documents remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach that enhances both textual and visual Retrieval-Augmented Generation (RAG) systems that work with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering contiguous components. We trained the SCAN model by fine-tuning object detection models on an annotated dataset. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.4 points and visual RAG performance by up to 10.4 points, outperforming conventional approaches and even commercial document processing solutions.

Updated: 2025-12-11 08:51:09

标题: 扫描:用于文本和视觉检索增强生成的语义文档布局分析

摘要: 随着大型语言模型(LLMs)和视觉语言模型(VLMs)的日益采用,用于检索增强生成(RAG)和视觉RAG等应用的丰富文档分析技术引起了广泛关注。最近的研究表明,使用VLMs可以提高RAG的性能,但处理丰富文档仍然是一个挑战,因为单个页面包含大量信息。在本文中,我们提出了一种名为SCAN(SemantiC Document Layout ANalysis)的新方法,可以增强文本和视觉RAG系统,这些系统可以处理视觉丰富的文档。这是一种适用于VLM的方法,可以识别具有适当语义粒度的文档组件,平衡上下文保留和处理效率。SCAN使用粗粒度语义方法,将文档划分为覆盖连续组件的连贯区域。我们通过在带有注释的数据集上微调对象检测模型来训练SCAN模型。我们在英语和日语数据集上的实验结果表明,应用SCAN可以将端到端文本RAG性能提高最多9.4个点,并将视觉RAG性能提高最多10.4个点,优于传统方法甚至商业文档处理解决方案。

更新时间: 2025-12-11 08:51:09

领域: cs.AI

下载: http://arxiv.org/abs/2505.14381v2

Targeted Data Protection for Diffusion Model by Matching Training Trajectory

Recent advancements in diffusion models have made fine-tuning text-to-image models for personalization increasingly accessible, but have also raised significant concerns regarding unauthorized data usage and privacy infringement. Current protection methods are limited to passively degrading image quality, failing to achieve stable control. While Targeted Data Protection (TDP) offers a promising paradigm for active redirection toward user-specified target concepts, existing TDP attempts suffer from poor controllability due to snapshot-matching approaches that fail to account for complete learning dynamics. We introduce TAFAP (Trajectory Alignment via Fine-tuning with Adversarial Perturbations), the first method to successfully achieve effective TDP by controlling the entire training trajectory. Unlike snapshot-based methods whose protective influence is easily diluted as training progresses, TAFAP employs trajectory-matching inspired by dataset distillation to enforce persistent, verifiable transformations throughout fine-tuning. We validate our method through extensive experiments, demonstrating the first successful targeted transformation in diffusion models with simultaneous control over both identity and visual patterns. TAFAP significantly outperforms existing TDP attempts, achieving robust redirection toward target concepts while maintaining high image quality. This work enables verifiable safeguards and provides a new framework for controlling and tracing alterations in diffusion model outputs.

Updated: 2025-12-11 08:47:41

标题: 匹配训练轨迹实现扩散模型的定向数据保护

摘要: 最近在扩散模型方面取得的进展使得对文本到图像模型进行个性化微调变得越来越容易,但也引发了关于未经授权数据使用和侵犯隐私的重大关注。目前的保护方法仅限于 passively 降低图像质量,无法实现稳定的控制。虽然目标数据保护(TDP)提供了一种向用户指定目标概念进行积极重定向的有前途的范式,但现有的 TDP 尝试由于采用快照匹配方法导致控制能力不足,无法考虑完整的学习动态。我们介绍了 TAFAP(通过对抗性扰动微调实现轨迹对齐),这是第一种成功实现有效 TDP 的方法,通过控制整个训练轨迹。与基于快照的方法不同,随着训练的进行,其保护作用容易被稀释,TAFAP 利用受数据集提炼启发的轨迹匹配,强制在整个微调过程中实施持续可验证的转换。我们通过大量实验验证了我们的方法,展示了在扩散模型中首次成功实现有目标的转换,同时控制身份和视觉模式。TAFAP 明显优于现有的 TDP 尝试,实现了强大的向目标概念的重定向,同时保持高图像质量。这项工作实现了可验证的保护措施,并为控制和跟踪扩散模型输出的变化提供了一个新框架。

更新时间: 2025-12-11 08:47:41

领域: cs.AI

下载: http://arxiv.org/abs/2512.10433v1

CogMCTS: A Novel Cognitive-Guided Monte Carlo Tree Search Framework for Iterative Heuristic Evolution with Large Language Models

Automatic Heuristic Design (AHD) is an effective framework for solving complex optimization problems. The development of large language models (LLMs) enables the automated generation of heuristics. Existing LLM-based evolutionary methods rely on population strategies and are prone to local optima. Integrating LLMs with Monte Carlo Tree Search (MCTS) improves the trade-off between exploration and exploitation, but multi-round cognitive integration remains limited and search diversity is constrained. To overcome these limitations, this paper proposes a novel cognitive-guided MCTS framework (CogMCTS). CogMCTS tightly integrates the cognitive guidance mechanism of LLMs with MCTS to achieve efficient automated heuristic optimization. The framework employs multi-round cognitive feedback to incorporate historical experience, node information, and negative outcomes, dynamically improving heuristic generation. Dual-track node expansion combined with elite heuristic management balances the exploration of diverse heuristics and the exploitation of high-quality experience. In addition, strategic mutation modifies the heuristic forms and parameters to further enhance the diversity of the solution and the overall optimization performance. The experimental results indicate that CogMCTS outperforms existing LLM-based AHD methods in stability, efficiency, and solution quality.

Updated: 2025-12-11 08:46:55

标题: CogMCTS:一种新颖的认知引导的蒙特卡洛树搜索框架,用于具有大型语言模型的迭代启发式演化

摘要: 自动启发式设计(AHD)是解决复杂优化问题的有效框架。大型语言模型(LLMs)的发展使得自动生成启发式成为可能。现有基于LLM的进化方法依赖于种群策略,并容易陷入局部最优解。将LLMs与蒙特卡洛树搜索(MCTS)结合可以改善探索和开发之间的平衡,但多轮认知整合仍受限制,搜索多样性受到限制。为了克服这些局限性,本文提出了一种新颖的认知引导MCTS框架(CogMCTS)。CogMCTS将LLMs的认知引导机制与MCTS紧密结合,实现高效的自动启发式优化。该框架利用多轮认知反馈来整合历史经验、节点信息和负面结果,动态改进启发式生成。双轨节点扩展结合精英启发式管理平衡了多样启发式的探索和高质量经验的开发。此外,战略变异修改启发式形式和参数以进一步提高解决方案的多样性和整体优化性能。实验结果表明,CogMCTS在稳定性、效率和解决方案质量方面优于现有基于LLM的AHD方法。

更新时间: 2025-12-11 08:46:55

领域: cs.AI

下载: http://arxiv.org/abs/2512.08609v2

Executable Epistemology: The Structured Cognitive Loop as an Architecture of Intentional Understanding

Large language models exhibit intelligence without genuine epistemic understanding, exposing a key gap: the absence of epistemic architecture. This paper introduces the Structured Cognitive Loop (SCL) as an executable epistemological framework for emergent intelligence. Unlike traditional AI research asking "what is intelligence?" (ontological), SCL asks "under what conditions does cognition emerge?" (epistemological). Grounded in philosophy of mind and cognitive phenomenology, SCL bridges conceptual philosophy and implementable cognition. Drawing on process philosophy, enactive cognition, and extended mind theory, we define intelligence not as a property but as a performed process -- a continuous loop of judgment, memory, control, action, and regulation. SCL makes three contributions. First, it operationalizes philosophical insights into computationally interpretable structures, enabling "executable epistemology" -- philosophy as structural experiment. Second, it shows that functional separation within cognitive architecture yields more coherent and interpretable behavior than monolithic prompt based systems, supported by agent evaluations. Third, it redefines intelligence: not representational accuracy but the capacity to reconstruct its own epistemic state through intentional understanding. This framework impacts philosophy of mind, epistemology, and AI. For philosophy, it allows theories of cognition to be enacted and tested. For AI, it grounds behavior in epistemic structure rather than statistical regularity. For epistemology, it frames knowledge not as truth possession but as continuous reconstruction within a phenomenologically coherent loop. We situate SCL within debates on cognitive phenomenology, emergence, normativity, and intentionality, arguing that real progress requires not larger models but architectures that realize cognitive principles structurally.

Updated: 2025-12-11 08:40:26

标题: 可执行认识论:结构化认知循环作为有意义理解的架构

摘要: 大型语言模型展示出智能,但缺乏真正的认识理解,暴露了一个关键差距:缺乏认知结构。本文介绍了结构化认知闭环(SCL)作为一种可执行的认识论框架,用于新兴智能。与传统人工智能研究问“智能是什么?”(本体论)不同,SCL问“在什么条件下认知会出现?”(认识论)。基于心灵哲学和认知现象学,SCL连接了概念哲学和实施认知。借鉴过程哲学、行动认知和扩展心理理论,我们将智能定义为一种执行过程而不是一种属性 -- 判断、记忆、控制、行动和调节的连续循环。SCL做出了三点贡献。首先,它将哲学洞见操作化为可计算的结构,实现了“可执行的认识论” -- 哲学作为结构性实验。其次,它表明认知结构内的功能分离比单一提示系统产生更一致和可解释的行为,得到了代理评估的支持。第三,它重新定义了智能:不是表征准确性,而是通过有意义的理解重建自身认识状态的能力。这个框架影响了心灵哲学、认识论和人工智能。对于哲学来说,它使认知理论得以实施和测试。对于人工智能来说,它将行为基于认知结构而不是统计规律。对于认识论来说,它将知识框定为不是真理的拥有,而是在一个现象学上连贯的循环中持续重建。我们将SCL置于有关认知现象学、出现、规范性和有意向性的争论中,认为真正的进步不是需要更大的模型,而是需要实现认知原则的结构化架构。

更新时间: 2025-12-11 08:40:26

领域: cs.AI

下载: http://arxiv.org/abs/2510.15952v4

Representation of the structure of graphs by sequences of instructions

The representation of graphs is commonly based on the adjacency matrix concept. This formulation is the foundation of most algebraic and computational approaches to graph processing. The advent of deep learning language models offers a wide range of powerful computational models that are specialized in the processing of text. However, current procedures to represent graphs are not amenable to processing by these models. In this work, a new method to represent graphs is proposed. It represents the adjacency matrix of a graph by a string of simple instructions. The instructions build the adjacency matrix step by step. The transformation is reversible, i.e. given a graph the string can be produced and vice versa. The proposed representation is compact and it maintains the local structural patterns of the graph. Therefore, it is envisaged that it could be useful to boost the processing of graphs by deep learning models. A tentative computational experiment is reported, with favorable results.
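
Since the abstract specifies only that the instruction string builds the adjacency matrix step by step and that the mapping is reversible, here is a minimal sketch of one such encoding. The two-token vocabulary ("N" adds a node, "Ek" connects the newest node to earlier node k) is our own illustrative choice, not the paper's instruction set.

    def encode(adj):
        """Turn a symmetric 0/1 adjacency matrix into an instruction string."""
        tokens = []
        for i in range(len(adj)):
            tokens.append("N")                 # introduce node i
            for j in range(i):                 # only look back: keeps it local
                if adj[i][j]:
                    tokens.append(f"E{j}")     # connect node i to earlier node j
        return " ".join(tokens)

    def decode(program):
        """Rebuild the adjacency matrix step by step from the instructions."""
        adj, current = [], -1
        for tok in program.split():
            if tok == "N":
                current += 1
                for row in adj:
                    row.append(0)
                adj.append([0] * (current + 1))
            else:                              # "Ek": edge between current and k
                k = int(tok[1:])
                adj[current][k] = adj[k][current] = 1
        return adj

    triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    prog = encode(triangle)                    # "N N E0 N E0 E1"
    assert decode(prog) == triangle            # the mapping is reversible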

Updated: 2025-12-11 08:40:06

标题: 用指令序列表示图的结构

摘要: 图的表示通常基于邻接矩阵概念。这种公式是大多数代数和计算方法处理图形的基础。深度学习语言模型的出现提供了一系列强大的计算模型,专门用于处理文本。然而,目前用于表示图形的程序不适合这些模型的处理。在这项工作中,提出了一种表示图形的新方法。它通过一系列简单指令来表示图的邻接矩阵。这些指令逐步构建邻接矩阵。转换是可逆的,即给定一个图形,可以生成字符串,反之亦然。所提出的表示是紧凑的,并保持图的局部结构模式。因此,预计这种表示方式可以提高深度学习模型处理图形的效率。报告了一项初步的计算实验,结果良好。

更新时间: 2025-12-11 08:40:06

领域: cs.AI

下载: http://arxiv.org/abs/2512.10429v1

The Operator Origins of Neural Scaling Laws: A Generalized Spectral Transport Dynamics of Deep Learning

Modern deep networks operate in a rough, finite-regularity regime where Jacobian-induced operators exhibit heavy-tailed spectra and strong basis drift. In this work, we derive a unified operator-theoretic description of neural training dynamics directly from gradient descent. Starting from the exact evolution $\dot e_t = -M(t)e_t$ in function space, we apply Kato perturbation theory to obtain a rigorous system of coupled mode ODEs and show that, after coarse-graining, these dynamics converge to a spectral transport--dissipation PDE \[ \partial_t g + \partial_λ(v g) = -λg + S, \] where $v$ captures eigenbasis drift and $S$ encodes nonlocal spectral coupling. We prove that neural training preserves functional regularity, forcing the drift to take an asymptotic power-law form $v(λ,t)\sim -c(t)λ^b$. In the weak-coupling regime -- naturally induced by spectral locality and SGD noise -- the PDE admits self-similar solutions with a resolution frontier, polynomial amplitude growth, and power-law dissipation. This structure yields explicit scaling-law exponents, explains the geometry of double descent, and shows that the effective training time satisfies $τ(t)=t^αL(t)$ for slowly varying $L$. Finally, we show that NTK training and feature learning arise as two limits of the same PDE: $v\equiv 0$ recovers lazy dynamics, while $v\neq 0$ produces representation drift. Our results provide a unified spectral framework connecting operator geometry, optimization dynamics, and the universal scaling behavior of modern deep networks.

Updated: 2025-12-11 08:38:46

标题: 神经缩放定律的算子起源:深度学习的广义谱传输动力学

摘要: 现代深度网络在一个粗糙的有限正则性范围内运行,其中雅可比引起的算子表现出重尾谱和强基漂移。在这项工作中,我们直接从梯度下降中推导出神经训练动力学的统一算子理论描述。从精确的函数空间演化$\dot e_t = -M(t)e_t$开始,我们应用加藤摄动理论得到一组严格耦合模式ODE系统,并展示,在粗粒化后,这些动态收敛到一个谱传输-耗散PDE \[ \partial_t g + \partial_λ(v g) = -λg + S, \] 其中$v$捕捉特征基漂移,$S$编码非局部谱耦合。 我们证明神经训练保持函数正则性,迫使漂移采取渐近幂律形式$v(λ,t)\sim -c(t)λ^b$。在弱耦合区域——由谱局部性和SGD噪声自然诱导——PDE允许具有分辨率边界、多项式振幅增长和幂律耗散的自相似解。这种结构提供了显式的标度定律指数,解释了双下降(double descent)的几何结构,并展示了有效训练时间满足$τ(t)=t^αL(t)$,其中$L$缓慢变化。 最后,我们展示NTK训练和特征学习出现为同一PDE的两个极限:$v\equiv 0$恢复懒惰动态,而$v\neq 0$产生表示漂移。我们的结果提供了一个统一的谱框架,连接了算子几何、优化动力学和现代深度网络的通用标度行为。

更新时间: 2025-12-11 08:38:46

领域: cs.LG

下载: http://arxiv.org/abs/2512.10427v1

Differential Privacy for Secure Machine Learning in Healthcare IoT-Cloud Systems

Healthcare has become exceptionally sophisticated, as wearables and connected medical devices are revolutionising remote patient monitoring, emergency response, medication management, diagnosis, and predictive and prescriptive analytics. Internet of Things and Cloud computing integrated systems (IoT-Cloud) facilitate sensing, automation, and processing for these healthcare applications. While real-time response is crucial for alleviating patient emergencies, protecting patient privacy is extremely important in data-driven healthcare. In this paper, we propose a multi-layer IoT, Edge and Cloud architecture to enhance the speed of response for emergency healthcare by distributing tasks based on response criticality and permanence of storage. Privacy of patient data is assured by proposing a Differential Privacy framework across several machine learning models such as K-means, Logistic Regression, Random Forest and Naive Bayes. We establish a comprehensive threat model identifying three adversary classes and evaluate Laplace, Gaussian, and hybrid noise mechanisms across varying privacy budgets, with supervised algorithms achieving up to 86% accuracy. The proposed hybrid Laplace-Gaussian noise mechanism with adaptive budget allocation provides a balanced approach, offering moderate tails and better privacy-utility trade-offs for both low and high dimension datasets. At the practical threshold of $\varepsilon = 5.0$, supervised algorithms achieve 82-84% accuracy while reducing attribute inference attacks by up to 18% and data reconstruction correlation by 70%. Blockchain security further ensures trusted communication through time-stamping, traceability, and immutability for analytics applications. Edge computing demonstrates 8$\times$ latency reduction for emergency scenarios, validating the hierarchical architecture for time-critical operations.
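
To make the mechanism comparison concrete, here is a minimal sketch of Laplace, Gaussian, and a naive Laplace-Gaussian hybrid for a sensitivity-1 statistic. The even epsilon split is a placeholder for the paper's adaptive budget allocation, and the classic Gaussian calibration used here is only valid for eps < 1, so the demo runs at a total eps of 1 rather than the paper's threshold of 5.

    import numpy as np

    def laplace_mech(value, sensitivity, eps, rng):
        # (eps, 0)-DP: heavier tails, pure DP guarantee.
        return value + rng.laplace(scale=sensitivity / eps)

    def gaussian_mech(value, sensitivity, eps, delta, rng):
        # Classic (eps, delta)-DP calibration, valid for eps in (0, 1).
        sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        return value + rng.normal(scale=sigma)

    def hybrid_mech(value, sensitivity, eps, delta, rng):
        # Sequential composition: Laplace at eps/2 plus Gaussian at (eps/2, delta)
        # yields an (eps, delta)-DP release overall. A real system would allocate
        # the budget adaptively rather than splitting it evenly.
        return gaussian_mech(
            laplace_mech(value, sensitivity, eps / 2.0, rng),
            sensitivity, eps / 2.0, delta, rng)

    rng = np.random.default_rng(0)
    true_mean = 0.7          # e.g. a bounded, rescaled vital-sign statistic
    print(hybrid_mech(true_mean, sensitivity=1.0, eps=1.0, delta=1e-5, rng=rng))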

Updated: 2025-12-11 08:37:37

标题: 医疗物联网-云系统中的安全机器学习差分隐私

摘要: 医疗保健变得异常复杂,穿戴设备和连接的医疗设备正在改变远程患者监测、急救响应、药物管理、诊断以及预测和处方分析。物联网和云计算集成系统(IoT-Cloud)促进了这些医疗应用的感知、自动化和处理。虽然实时响应对于缓解患者紧急情况至关重要,但在数据驱动的医疗保健中保护患者隐私至关重要。在本文中,我们提出了一个多层次的物联网、边缘和云架构,通过根据响应紧迫性和存储持久性来分配任务,提高应急医疗保健的响应速度。通过提出跨多个机器学习模型(如K-means、Logistic Regression、Random Forest和Naive Bayes)的差分隐私框架,保证了患者数据的隐私。我们建立了一个全面的威胁模型,确定了三个对手类别,并评估了在不同隐私预算下,Laplace、Gaussian和混合噪声机制,监督算法可以实现高达86%的准确性。提出的混合Laplace-Gaussian噪声机制具有自适应预算分配,提供了一种平衡的方法,为低维和高维数据集提供了适度的尾部和更好的隐私效用权衡。在隐私预算为 $\varepsilon = 5.0$ 的实际阈值下,监督算法实现了82-84%的准确性,同时将属性推断攻击减少了高达18%,数据重构相关性减少了70%。区块链安全进一步确保了经过时间戳、可追溯性和不可变性的受信任通信,用于分析应用。边缘计算在紧急情况下展示了8倍的延迟减少,验证了用于时间关键操作的分层架构。

更新时间: 2025-12-11 08:37:37

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2512.10426v1

Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers

Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available at https://github.com/meaningful96/CoopRAG.

Updated: 2025-12-11 08:35:17

标题: 合作检索增强生成用于问答:通过对比层的互信息交换和排名

摘要: 由于大型语言模型(LLMs)有生成事实不准确输出的倾向,检索增强生成(RAG)已经引起了重要关注,作为减轻仅使用LLMs这一弊端的关键手段。然而,现有的简单和多跳问题回答(QA)的RAG方法仍然容易出现不正确的检索和幻觉。为了解决这些限制,我们提出了CoopRAG,这是一个新颖的RAG框架,用于问答任务,在这个框架中,一个检索器和一个LLM通过交换信息知识共同合作工作,检索器模型的早期和后期层次共同合作,以准确对与给定查询相关的检索文档进行排名。在这个框架中,我们(i)将一个问题展开成子问题和一个推理链,在其中不确定的位置被屏蔽,(ii)检索与问题相关的文档,增加了子问题和推理链,(iii)通过对比检索器的层次对文档进行重新排名,(iv)通过LLM填充掩码位置重构推理链。我们的实验表明,CoopRAG在三个多跳QA数据集以及一个简单QA数据集的检索和QA表现方面始终优于最先进的QA方法。我们的代码可供使用。

更新时间: 2025-12-11 08:35:17

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.10422v1

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

Updated: 2025-12-11 08:31:33

标题: 超越分类准确性:神经-医学基准和对更深层推理基准的需求

摘要: 最近在视觉-语言模型(VLMs)方面取得了显著的进展,在标准医学基准测试中取得了出色的表现,然而它们真正的临床推理能力仍然不清楚。现有数据集主要强调分类准确性,导致了一种评估幻觉,即模型看起来熟练,但仍然在高风险诊断推理方面失败。我们介绍了神经-MedBench,这是一个紧凑但推理密集的基准测试,专门设计用于探索神经学中多模式临床推理的极限。神经-MedBench整合了多序列MRI扫描,结构化的电子健康记录和临床笔记,并包含三个核心任务系列:不同诊断,病变识别和理由生成。为了确保可靠的评估,我们开发了一个混合评分管道,结合了基于LLM的评分员,临床验证和语义相似性指标。通过对最先进的VLMs进行系统评估,包括GPT-4o、Claude-4和MedGemma,我们观察到与传统数据集相比,性能急剧下降。错误分析显示,推理失败而不是感知错误主导了模型的缺陷。我们的研究结果突出了两轴评估框架的必要性:面向广度的大型数据集用于统计概括,以及面向深度的紧凑基准测试,如神经-MedBench,用于推理的忠实度。我们将神经-MedBench发布在https://neuromedbench.github.io/,作为一个开放和可扩展的诊断测试平台,指导未来基准测试的扩展,并实现临床可信AI的严格而经济有效的评估。

更新时间: 2025-12-11 08:31:33

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.22258v2

Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.

Updated: 2025-12-11 08:29:27

标题: 超越端点:基于路径的向量化越野网络提取推理

摘要: 深度学习已经在城市环境中推进了向量化道路提取,但越野环境仍然是未被充分探索且具有挑战性的。一个重要的领域差距导致先进模型在野外地形中失败,这是由于两个关键问题:缺乏大规模向量化数据集和主流方法中的结构弱点。像SAM-Road这样的模型采用了一个以节点为中心的范式,在稀疏端点处推理,使其对遮挡和模糊的路口在越野场景中脆弱,导致了拓扑错误。这项工作以两种互补的方式解决了这些限制。首先,我们发布了WildRoad,一个高效构建的全球越野道路网络数据集,配备了专门定制的交互式标注工具,用于道路网络标注。其次,我们介绍了MaGRoad(Mask-aware Geodesic Road network extractor),这是一个以路径为中心的框架,沿着候选路径聚合多尺度视觉证据,以稳健地推断连接性。大量实验表明,MaGRoad在我们具有挑战性的WildRoad基准测试中取得了最先进的性能,同时很好地泛化到城市数据集。一个简化的流水线还使推断速度提高了约2.5倍,提高了实际应用性。总的来说,数据集和以路径为中心的范式为在野外地图上绘制道路提供了更加坚实的基础。

更新时间: 2025-12-11 08:29:27

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.10416v1

How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

The use of Large Language Models (LLMs) as automatic judges for code evaluation is becoming increasingly prevalent in academic environments. But their reliability can be compromised by students who may employ adversarial prompting strategies in order to induce misgrading and secure undeserved academic advantages. In this paper, we present the first large-scale study of jailbreaking LLM-based automated code evaluators in academic context. Our contributions are: (i) We systematically adapt 20+ jailbreaking strategies for jailbreaking AI code evaluators in the academic context, defining a new class of attacks termed academic jailbreaking. (ii) We release a poisoned dataset of 25K adversarial student submissions, specifically designed for the academic code-evaluation setting, sourced from diverse real-world coursework and paired with rubrics and human-graded references, and (iii) In order to capture the multidimensional impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evaluate the academic jailbreaking attacks using six LLMs. We find that these models exhibit significant vulnerability, particularly to persuasive and role-play-based attacks (up to 97% JSR). Our adversarial dataset and benchmark suite lay the groundwork for next-generation robust LLM-based evaluators in academic code assessment.

Updated: 2025-12-11 08:28:33

标题: 如何愚弄您的AI助教:关于LLM代码评估中学术越狱的系统研究

摘要: 大型语言模型(LLMs)作为代码评估的自动评判工具在学术环境中越来越普遍。但是它们的可靠性可能会受到学生的影响,学生可能会采用敌对提示策略,以诱使错误评分并获得不当的学术优势。在本文中,我们首次进行了关于在学术环境中越狱LLM自动代码评估器的大规模研究。我们的贡献包括:(i)我们系统地适应了20多种越狱策略,用于在学术环境中越狱人工智能代码评估器,定义了一种新的攻击类别,称为学术越狱。(ii)我们发布了一个包含25K恶意学生提交的毒化数据集,专门设计用于学术代码评估设置,从各种真实课程作业中获取,并配有评分标准和人工评分参考,(iii)为了捕捉学术越狱的多维影响,我们系统地适应和定义了三个越狱指标(越狱成功率、分数膨胀和有害性)。(iv)我们使用六种LLMs全面评估了学术越狱攻击。我们发现这些模型存在显著的漏洞,特别是对于有说服力和角色扮演为基础的攻击(高达97%的JSR)。我们的敌对数据集和基准套件为下一代强大的基于LLM的学术代码评估器奠定了基础。

更新时间: 2025-12-11 08:28:33

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2512.10415v1

Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention

Recently, reinforcement learning (RL) has become a common choice in enhancing the reasoning capabilities of vision-language models (VLMs). Considering existing RL-based finetuning methods, entropy intervention turns out to be an effective way to benefit exploratory ability, thereby improving policy performance. Notably, most existing studies intervene in entropy by simply controlling the update of specific tokens during policy optimization of RL. They ignore the entropy intervention during the RL sampling that can boost the performance of GRPO by improving the diversity of responses. In this paper, we propose Selective-adversarial Entropy Intervention, namely SaEI, which enhances policy entropy by distorting the visual input with the token-selective adversarial objective coming from the entropy of sampled responses. Specifically, we first propose entropy-guided adversarial sampling (EgAS) that formulates the entropy of sampled responses as an adversarial objective. Then, the corresponding adversarial gradient can be used to attack the visual input for producing adversarial samples, allowing the policy model to explore a larger answer space during RL sampling. Then, we propose token-selective entropy computation (TsEC) to maximize the effectiveness of adversarial attack in EgAS without distorting factual knowledge within VLMs. Extensive experiments on both in-domain and out-of-domain datasets show that our proposed method can greatly improve policy exploration via entropy intervention, to boost reasoning capabilities. Code will be released once the paper is accepted.

Updated: 2025-12-11 08:27:02

标题: 用选择性对抗熵干预提升基于强化学习的视觉推理

摘要: 最近,强化学习(RL)已经成为提升视觉语言模型(VLMs)推理能力的常见选择。考虑到现有基于RL的微调方法,熵干预被证明是一种有效的方法,可以提高探索能力,从而改善策略性能。值得注意的是,大多数现有研究在RL的策略优化过程中通过简单控制特定标记的更新来干预熵。他们忽略了在RL采样过程中干预熵的作用,这可以通过提高响应的多样性来提高GRPO的性能。在本文中,我们提出了选择性对抗性熵干预,即SaEI,通过使用来自采样响应熵的标记选择对抗性目标来扭曲视觉输入,从而增强策略熵。具体来说,我们首先提出了熵引导的对抗采样(EgAS),将采样响应的熵形式化为对抗目标。然后,对应的对抗梯度可用于攻击视觉输入,生成对抗样本,使策略模型能够在RL采样过程中探索更大的答案空间。接着,我们提出了标记选择性熵计算(TsEC),以最大化EgAS中对抗攻击的有效性,而不扭曲VLMs中的事实知识。对领域内和领域外数据集的广泛实验表明,我们提出的方法可以通过熵干预大大改善策略探索,从而提升推理能力。一旦论文被接受,代码将会发布。

更新时间: 2025-12-11 08:27:02

领域: cs.AI

下载: http://arxiv.org/abs/2512.10414v1

BioMedGPT-Mol: Multi-task Learning for Molecular Understanding and Generation

Molecules play a crucial role in biomedical research and discovery, particularly in the field of small molecule drug development. Given the rapid advancements in large language models, especially the recent emergence of reasoning models, it is natural to explore how a general-purpose language model can be efficiently adapted for molecular science applications. In this work, we introduce BioMedGPT-Mol, a molecular language model designed to support molecular understanding and generation tasks. By curating and unifying existing public instruction datasets, we have assembled a large-scale, comprehensive, and high-quality training dataset. The model is then fine-tuned through a meticulously designed multi-task learning framework. On a consolidated benchmark derived from LlaSMol, TOMG-Bench, and MuMOInstruct, BioMedGPT-Mol achieves remarkable performance. Our experimental results demonstrate that a general-purpose reasoning model can be effectively and efficiently post-trained into a professional molecular language model through a well-structured multi-task curriculum. Leveraging these capabilities, we further apply the model to multi-step retrosynthetic planning, achieving state-of-the-art performance on RetroBench and demonstrating its superior efficacy as an end-to-end retrosynthetic planner. We anticipate that our approach can be extended to other biomedical scientific domains.

Updated: 2025-12-11 08:24:32

标题: BioMedGPT-Mol:分子理解与生成的多任务学习

摘要: 分子在生物医学研究和发现中起着至关重要的作用,尤其是在小分子药物开发领域。鉴于大型语言模型的快速进展,尤其是最近出现的推理模型,自然而然地探索了如何有效地将通用语言模型应用于分子科学应用。在这项工作中,我们介绍了 BioMedGPT-Mol,这是一个旨在支持分子理解和生成任务的分子语言模型。通过整理和统一现有的公共指导数据集,我们已经组建了一个大规模、全面且高质量的训练数据集。然后,通过精心设计的多任务学习框架对模型进行微调。在从 LlaSMol、TOMG-Bench 和 MuMOInstruct 汇总的基准测试中,BioMedGPT-Mol 实现了卓越的性能。我们的实验结果表明,通过一个结构良好的多任务课程,通用推理模型可以有效且高效地被后训练成为专业的分子语言模型。利用这些能力,我们进一步将该模型应用于多步逆合成规划,在 RetroBench 上达到了最新性能,并展示了其作为端到端逆合成规划器的卓越有效性。我们预计我们的方法可以扩展到其他生物医学科学领域。

更新时间: 2025-12-11 08:24:32

领域: cs.AI

下载: http://arxiv.org/abs/2512.04629v2

Sliding Window Attention Adaptation

The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
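
Recipe details aside, the core objects here are just attention masks. Below is a minimal sketch, assuming a standard boolean mask convention (True = may attend), of plain causal full attention, a causal sliding window, the window with preserved sink tokens (recipe 2), and a per-layer FA/SWA interleaving (recipe 3); it is illustrative and not the repository's code.

    import torch

    def swa_mask(seq_len, window, n_sinks=0):
        # True = may attend. Causal window of size `window`, plus an
        # always-visible prefix of `n_sinks` "sink" tokens.
        i = torch.arange(seq_len).unsqueeze(1)   # query positions
        j = torch.arange(seq_len).unsqueeze(0)   # key positions
        causal = j <= i
        in_window = (i - j) < window
        is_sink = j < n_sinks
        return causal & (in_window | is_sink)

    full_mask = swa_mask(8, window=8)                 # reduces to plain causal FA
    swa_only = swa_mask(8, window=3)                  # naive SWA everywhere
    swa_sink = swa_mask(8, window=3, n_sinks=1)       # SWA + preserved sink token

    # Interleaving FA and SWA layers is then just a per-layer choice:
    layer_masks = [full_mask if k % 4 == 0 else swa_sink for k in range(12)]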

Updated: 2025-12-11 08:21:24

标题: 滑动窗口注意力自适应

摘要: 基于Transformer的大型语言模型(LLMs)中的自注意机制随着输入长度呈二次增长,使得长上下文推理变得昂贵。滑动窗口注意力(SWA)将这一成本降低到线性复杂度,但是在推理时简单启用完全SWA对于使用全注意力(FA)预训练的模型会导致严重的长上下文性能下降,这是由于训练-推理不匹配引起的。这让我们思考:FA预训练的LLMs是否可以在不进行预训练的情况下很好地适应SWA?我们通过提出Sliding Window Attention Adaptation(SWAA)进行研究,这是一组实用的配方,结合了五种方法以更好地适应:(1)仅在预填充期间应用SWA;(2)保留“sink”标记;(3)交错FA/SWA层;(4)思维链(CoT);和(5)微调。我们的实验表明,SWA适应是可行的但并非简单的过程:没有单一方法足够,然而特定的协同组合有效地恢复了原始的长上下文性能。我们进一步分析了不同SWAA配置的性能效率权衡,并为不同场景提供了推荐的配方。我们的代码可以在https://github.com/yuyijiong/sliding-window-attention-adaptation上找到。

更新时间: 2025-12-11 08:21:24

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.10411v1

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

Updated: 2025-12-11 08:18:10

标题: Video4Spatial:基于上下文引导的视频生成实现视觉空间智能

摘要: 我们仅使用视觉数据,研究视频生成模型是否能够表现出视觉空间智能,这是人类认知的核心能力。为此,我们提出了Video4Spatial,一个框架表明仅基于视频场景上下文的视频扩散模型可以执行复杂的空间任务。我们在两个任务上进行验证:场景导航-遵循摄像机姿态指令并与场景的3D几何保持一致,以及对象定位-需要语义定位、遵循指令和规划。这两个任务仅使用视频输入,没有深度或姿势等辅助模态。通过框架和数据策划中简单而有效的设计选择,Video4Spatial展示了从视频上下文中对空间的强大理解能力:它以端到端的方式规划导航并定位目标对象,遵循摄像机姿态指令同时保持空间一致性,并且对长上下文和域外环境具有泛化能力。综上所述,这些结果推动了视频生成模型朝向通用视觉空间推理的方向发展。

更新时间: 2025-12-11 08:18:10

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.03040v2

Supervised Learning of Random Neural Architectures Structured by Latent Random Fields on Compact Boundaryless Multiply-Connected Manifolds

This paper introduces a new probabilistic framework for supervised learning in neural systems. It is designed to model complex, uncertain systems whose random outputs are strongly non-Gaussian given deterministic inputs. The architecture itself is a random object stochastically generated by a latent anisotropic Gaussian random field defined on a compact, boundaryless, multiply-connected manifold. The goal is to establish a novel conceptual and mathematical framework in which neural architectures are realizations of a geometry-aware, field-driven generative process. Both the neural topology and synaptic weights emerge jointly from a latent random field. A reduced-order parameterization governs the spatial intensity of an inhomogeneous Poisson process on the manifold, from which neuron locations are sampled. Input and output neurons are identified via extremal evaluations of the latent field, while connectivity is established through geodesic proximity and local field affinity. Synaptic weights are conditionally sampled from the field realization, inducing stochastic output responses even for deterministic inputs. To ensure scalability, the architecture is sparsified via percentile-based diffusion masking, yielding geometry-aware sparse connectivity without ad hoc structural assumptions. Supervised learning is formulated as inference on the generative hyperparameters of the latent field, using a negative log-likelihood loss estimated through Monte Carlo sampling from single-observation-per-input datasets. The paper initiates a mathematical analysis of the model, establishing foundational properties such as well-posedness, measurability, and a preliminary analysis of the expressive variability of the induced stochastic mappings, which support its internal coherence and lay the groundwork for a broader theory of geometry-driven stochastic learning.

Updated: 2025-12-11 08:17:12

标题: 在紧凑的无边界多连通流形上由潜在随机场结构化的随机神经结构的监督学习

摘要: 这篇论文介绍了一个新的概率框架,用于神经系统中的监督学习。该框架旨在对复杂、不确定的系统进行建模,这些系统的随机输出在确定性输入给定的情况下强烈非高斯。该体系结构本身是由定义在紧凑的、无边界的、多重连接流形上的隐各向异性高斯随机场随机生成的随机对象。目标是建立一个新颖的概念和数学框架,其中神经结构是几何感知、场驱动生成过程的实现。神经拓扑和突触权重均联合从隐随机场中产生。通过对流形上的不均匀泊松过程的空间强度进行降阶参数化,从中抽样神经元位置。通过对隐场的极端评估确定输入和输出神经元,而连接性则通过测地距离和局部场亲和性建立。突触权重是从场实现中有条件地抽样得到的,即使对于确定性输入也会引发随机输出响应。为了确保可扩展性,通过基于百分位的扩散屏蔽将体系结构稀疏化,实现几何感知的稀疏连接而无需任意假设。监督学习被制定为对隐场的生成超参数进行推断,通过单输入数据集的蒙特卡洛抽样估计负对数似然损失。该论文开始对模型进行数学分析,建立基本性质,比如良定性、可测性,以及对诱导的随机映射的表达变异性的初步分析,这些支持其内部一致性并为几何驱动的随机学习理论奠定基础。

更新时间: 2025-12-11 08:17:12

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2512.10407v1

LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9x speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis. Our code is available at :https://github.com/UnicomAI/LeMiCa
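
To make the graph formulation concrete, the sketch below computes a minimax path over a toy denoising schedule under a fixed budget of cache refreshes. The quadratic skip-error model is a stand-in, and LeMiCa proper optimizes lexicographically (tie-breaking on the second-largest edge error and so on) rather than bounding only the single worst edge as done here.

    def minimax_path_with_budget(edges, n_steps, budget):
        # dp[k][v] = smallest achievable worst-edge error reaching step v using
        # exactly k cache refreshes (edges); parents recover the schedule.
        INF = float("inf")
        dp = [[INF] * (n_steps + 1) for _ in range(budget + 1)]
        parent = {}
        dp[0][0] = 0.0
        for k in range(budget):
            for u in range(n_steps + 1):
                if dp[k][u] == INF:
                    continue
                for v in range(u + 1, n_steps + 1):
                    cand = max(dp[k][u], edges[(u, v)])  # path cost = worst edge
                    if cand < dp[k + 1][v]:
                        dp[k + 1][v] = cand
                        parent[(k + 1, v)] = u
        path, k, v = [n_steps], budget, n_steps      # walk parents back
        while k > 0:
            v = parent[(k, v)]
            path.append(v)
            k -= 1
        return dp[budget][n_steps], path[::-1]

    # Toy 10-step schedule: skipping farther ahead reuses a staler cache, so the
    # (stand-in) error grows quadratically with the skip length.
    edges = {(i, j): 0.1 * (j - i) ** 2 for i in range(10) for j in range(i + 1, 10)}
    print(minimax_path_with_budget(edges, n_steps=9, budget=3))
    # -> roughly (0.9, [0, 3, 6, 9]): even skips minimize the worst-case error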

Updated: 2025-12-11 08:10:13

标题: LeMiCa:用于高效扩散式视频生成的词典最小最大路径缓存

摘要: 我们提出了LeMiCa,这是一个无需训练且高效的扩散视频生成加速框架。现有的缓存策略主要侧重于减少局部启发式错误,但它们常常忽视全局错误的积累,导致加速和原始视频之间的内容明显降级。为了解决这个问题,我们将缓存调度构建为一个带有错误加权边的有向图,并引入了一个词典最小最大路径优化策略,明确地限制了最坏情况路径错误。这种方法显著提高了生成帧之间全局内容和风格的一致性。对多个文本到视频基准的广泛实验表明,LeMiCa在推理速度和生成质量上实现了双重改进。值得注意的是,我们的方法在Latte模型上实现了2.9倍的加速,并在Open-Sora上达到了0.05的LPIPS分数,优于先前的缓存技术。重要的是,这些收益伴随着最小的感知质量降低,使LeMiCa成为加速基于扩散的视频生成的强大和可推广范例。我们相信这种方法可以作为未来研究高效可靠视频合成的坚实基础。我们的代码可在以下链接找到: https://github.com/UnicomAI/LeMiCa

更新时间: 2025-12-11 08:10:13

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.00090v3

The Eminence in Shadow: Exploiting Feature Boundary Ambiguity for Robust Backdoor Attacks

Deep neural networks (DNNs) underpin critical applications yet remain vulnerable to backdoor attacks, typically reliant on heuristic brute-force methods. Despite significant empirical advancements in backdoor research, the lack of rigorous theoretical analysis limits understanding of underlying mechanisms, constraining attack predictability and adaptability. Therefore, we provide a theoretical analysis targeting backdoor attacks, focusing on how sparse decision boundaries enable disproportionate model manipulation. Based on this finding, we derive a closed-form, ambiguous boundary region, wherein negligible relabeled samples induce substantial misclassification. Influence function analysis further quantifies significant parameter shifts caused by these margin samples, with minimal impact on clean accuracy, formally grounding why such low poison rates suffice for efficacious attacks. Leveraging these insights, we propose Eminence, an explainable and robust black-box backdoor framework with provable theoretical guarantees and inherent stealth properties. Eminence optimizes a universal, visually subtle trigger that strategically exploits vulnerable decision boundaries and effectively achieves robust misclassification with exceptionally low poison rates (< 0.1%, compared to SOTA methods typically requiring > 1%). Comprehensive experiments validate our theoretical discussions and demonstrate the effectiveness of Eminence, confirming an exponential relationship between margin poisoning and adversarial boundary manipulation. Eminence maintains > 90% attack success rate, exhibits negligible clean-accuracy loss, and demonstrates high transferability across diverse models, datasets and scenarios.

Updated: 2025-12-11 08:09:07

标题: 《暗影中的卓越:利用特征边界模糊性进行强大的后门攻击》

摘要: 深度神经网络(DNNs)支持关键应用,但仍然容易受到后门攻击的威胁,通常依赖于启发式蛮力方法。尽管后门研究取得了显著的实证进展,但缺乏严格的理论分析限制了对潜在机制的理解,限制了攻击的可预测性和适应性。因此,我们提供了一个针对后门攻击的理论分析,重点关注稀疏决策边界如何实现对模型的不成比例的操纵。基于这一发现,我们推导出一个封闭的、模糊的边界区域,其中可忽略的重新标记样本会导致显著的误分类。影响函数分析进一步量化了这些边界样本引起的显著参数偏移,对干净准确度影响微乎其微,正式证明了为什么这种低毒害率足以实现有效的攻击。利用这些见解,我们提出了Eminence,一个可解释且强大的黑盒后门框架,具有可证明的理论保证和固有的隐蔽特性。Eminence优化了一个通用的、视觉上微妙的触发器,策略性地利用脆弱的决策边界,有效地实现了具有极低毒害率的强大误分类(<0.1%,而SOTA方法通常需要>1%)。全面的实验证实了我们的理论讨论,并证明了Eminence的有效性,确认了边界毒害和对抗性边界操纵之间的指数关系。Eminence保持了>90%的攻击成功率,表现出微不足道的净准确度损失,并展示了在不同模型、数据集和场景之间的高可转移性。

更新时间: 2025-12-11 08:09:07

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.10402v1

Diffusion differentiable resampling

This paper is concerned with differentiable resampling in the context of sequential Monte Carlo (e.g., particle filtering). We propose a new informative resampling method that is instantly pathwise differentiable, based on an ensemble score diffusion model. We prove that our diffusion resampling method provides a consistent estimate to the resampling distribution, and we show by experiments that it outperforms the state-of-the-art differentiable resampling methods when used for stochastic filtering and parameter estimation.

Updated: 2025-12-11 08:08:55

标题: 扩散可微重采样

摘要: 本文关注于在顺序蒙特卡洛(例如,粒子滤波)中的可微重采样。我们提出了一种基于集合评分扩散模型的新型信息重采样方法,该方法即时路径可微。我们证明了我们的扩散重采样方法提供了对重采样分布的一致估计,并通过实验证明,在用于随机滤波和参数估计时,它优于当前最先进的可微重采样方法。

更新时间: 2025-12-11 08:08:55

领域: stat.ML,cs.LG,math.ST

下载: http://arxiv.org/abs/2512.10401v1

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.

Updated: 2025-12-11 08:05:58

标题: 孔子编码代理:一个在工业规模上开源的人工智能软件工程师

摘要: 现实世界中的AI软件工程需要编码代理,这些代理可以在大型代码库上进行推理,跨越和在长时间会话中保持持久记忆,并在测试时强大地协调复杂的工具链。现有的开源编码代理提供了透明度,但在推动到这些工业规模的工作负载时往往表现不佳,而专有编码代理提供了强大的实用性能,但可扩展性、解释性和可控性有限。我们提出了孔子代码代理(CCA),这是一个可以在工业规模上运行的开源AI软件工程师。CCA建立在孔子SDK之上,这是一个围绕三个互补视角设计的开源代理开发平台:代理体验(AX)、用户体验(UX)和开发者体验(DX)。该SDK引入了一个统一的编排器,具有用于长文本推理的分层工作记忆,一个持久的记事系统,用于跨会话的持续学习,以及一个用于强大工具使用的模块化扩展模块。此外,一个元代理通过构建-测试-改进循环自动合成、评估和改进代理配置,使其能够在新任务、环境和工具堆栈上快速开发代理。通过这些机制在孔子SDK上实例化,CCA在现实世界的软件工程任务上表现出色。在SWE-Bench-Pro上,CCA实现了54.3%的Resolve@1性能,大大提高了先前编码代理的性能。孔子SDK和CCA共同为AI代理提供了透明、可扩展和可重现的基础,弥合了研究原型和生产级系统之间的差距,并支持在工业规模上进行代理开发和部署。

更新时间: 2025-12-11 08:05:58

领域: cs.CL,cs.AI,cs.LG,cs.SE

下载: http://arxiv.org/abs/2512.10398v1

Toward Intelligent and Secure Cloud: Large Language Model Empowered Proactive Defense

The rapid evolution of cloud computing technologies and the increasing number of cloud applications have provided numerous benefits in our daily lives. However, the diversity and complexity of different components pose a significant challenge to cloud security, especially when dealing with sophisticated and advanced cyberattacks such as Denial of Service (DoS). Recent advancements in the large language models (LLMs) offer promising solutions for security intelligence. By exploiting the powerful capabilities in language understanding, data analysis, task inference, action planning, and code generation, we present LLM-PD, a novel defense architecture that proactively mitigates various DoS threats in cloud networks. LLM-PD can efficiently make decisions through comprehensive data analysis and sequential reasoning, as well as dynamically create and deploy actionable defense mechanisms. Furthermore, it can flexibly self-evolve based on experience learned from previous interactions and adapt to new attack scenarios without additional training. Our case study on three distinct DoS attacks demonstrates its remarkable ability in terms of defense effectiveness and efficiency when compared with other existing methods.

Updated: 2025-12-11 08:02:12

标题: 走向智能安全的云:大型语言模型增强主动防御

摘要: 云计算技术的快速发展和云应用数量的增加为我们的日常生活提供了许多好处。然而,不同组件的多样性和复杂性给云安全带来了重大挑战,特别是在处理诸如拒绝服务(DoS)等复杂和高级网络攻击时。最近大型语言模型(LLMs)的进展为安全情报提供了有希望的解决方案。通过利用语言理解、数据分析、任务推断、行动规划和代码生成的强大能力,我们提出了LLM-PD,一种新颖的防御架构,可以主动缓解云网络中各种DoS威胁。LLM-PD可以通过全面的数据分析和序贯推理有效地做出决策,动态地创建和部署可操作的防御机制。此外,它可以灵活地根据从先前交互中学到的经验自我进化,并适应新的攻击场景而无需额外培训。我们对三种不同的DoS攻击进行的案例研究显示,在与其他现有方法相比,它在防御效果和效率方面具有显著的能力。

更新时间: 2025-12-11 08:02:12

领域: cs.CR,cs.AI,cs.NI

下载: http://arxiv.org/abs/2412.21051v4

Thinking Ahead: Foresight Intelligence in MLLMs and World Models

In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.

Updated: 2025-12-11 08:02:01

标题: 提前思考:多模态大语言模型与世界模型中的先见智能

摘要: 在这项工作中,我们将预测智能定义为能够预测和解释未来事件的能力-这是自动驾驶等应用所必需的能力,但目前现有研究大多忽视了这一点。为了弥补这一差距,我们介绍了FSU-QA,这是一个新的专门设计用来引发和评估预测智能的视觉问答(VQA)数据集。利用FSU-QA,我们进行了第一次对最先进的视觉语言模型(VLMs)在面向预测的任务下的全面研究,发现当前模型仍然在推理未来情况方面存在困难。除了作为一个基准,FSU-QA还通过衡量生成的预测的语义连贯性来评估世界模型,通过VLMs增加这些输出的性能提升来量化。我们的实验进一步证明,FSU-QA可以有效地增强预测推理:即使是在FSU-QA上微调的小型VLMs也能明显超越更大、更先进的模型。综合这些发现,可以将FSU-QA定位为开发能够真正预测和理解未来事件的下一代模型的基础。

更新时间: 2025-12-11 08:02:01

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.18735v2

RoboNeuron: A Modular Framework Linking Foundation Models and ROS for Embodied AI

Current embodied AI systems face severe engineering impediments, primarily characterized by poor cross-scenario adaptability, rigid inter-module coupling, and fragmented inference acceleration. To overcome these limitations, we propose RoboNeuron, a universal deployment framework for embodied intelligence. RoboNeuron is the first framework to deeply integrate the cognitive capabilities of Large Language Models (LLMs) and Vision-Language-Action (VLA) models with the real-time execution backbone of the Robot Operating System (ROS). We utilize the Model Context Protocol (MCP) as a semantic bridge, enabling the LLM to dynamically orchestrate underlying robotic tools. The framework establishes a highly modular architecture that strictly decouples sensing, reasoning, and control by leveraging ROS's unified communication interfaces. Crucially, we introduce an automated tool to translate ROS messages into callable MCP functions, significantly streamlining development. RoboNeuron significantly enhances cross-scenario adaptability and component flexibility, while establishing a systematic platform for horizontal performance benchmarking, laying a robust foundation for scalable real-world embodied applications.

Updated: 2025-12-11 07:58:19

标题: RoboNeuron:一个模块化框架,用于连接基础模型和ROS以实现具身智能

摘要: 目前的具身人工智能系统面临严重的工程障碍,主要表现为跨场景适应性差、模块间耦合僵化、推理加速碎片化。为了克服这些限制,我们提出了RoboNeuron,一个通用的具身智能部署框架。RoboNeuron是第一个深度整合大型语言模型(LLMs)和视觉-语言-行动(VLA)模型与机器人操作系统(ROS)实时执行骨干的框架。我们利用模型上下文协议(MCP)作为语义桥梁,使LLM能够动态编排底层机器人工具。该框架建立了一个高度模块化的架构,通过利用ROS的统一通信接口严格解耦感知、推理和控制。关键是,我们引入了一个自动化工具,将ROS消息转换为可调用的MCP函数,极大地简化了开发流程。RoboNeuron显著增强了跨场景适应性和组件灵活性,同时建立了一个系统化平台,用于水平性能基准测试,为可伸缩的实际世界具身应用奠定了坚实基础。

更新时间: 2025-12-11 07:58:19

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2512.10394v1

Cross-modal Retrieval Models for Stripped Binary Analysis

LLM-agent based binary code analysis has demonstrated significant potential across a wide range of software security scenarios, including vulnerability detection, malware analysis, etc. In agent workflows, however, retrieving the positive function from thousands of stripped binary functions based on a user query remains under-studied and challenging, as the absence of symbolic information distinguishes it from source code retrieval. In this paper, we introduce BinSeek, the first two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance of binary code and natural language descriptions; furthermore, BinSeek-Reranker learns to carefully judge the relevance of the candidate code to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training construction, also deriving a domain benchmark for future research. Our evaluation results show that BinSeek achieved state-of-the-art performance, surpassing same-scale models by 31.42% in Rec@3 and 27.17% in MRR@3, as well as leading advanced general-purpose models that have 16 times more parameters.
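
A minimal sketch of the two-stage control flow described above, with the trained BinSeek-Embedding and BinSeek-Reranker models replaced by runnable stand-ins: a cheap dot-product pass over precomputed embeddings for recall, then a slower per-candidate scoring pass for precision. All names and data here are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(10_000, 128))        # stand-in function embeddings
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

    def embed_query(text):
        # Stand-in for the embedding model: deterministic per query text.
        q_rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = q_rng.normal(size=128)
        return v / np.linalg.norm(v)

    def rerank_score(query, idx):
        # Stand-in for the reranker, which in the real system re-reads the
        # query and candidate code together (with added context) and is slower
        # but more precise than a single dot product.
        return float(corpus[idx] @ embed_query(query))

    def search(query, k_coarse=100, k_final=3):
        q = embed_query(query)
        coarse = np.argsort(corpus @ q)[::-1][:k_coarse]   # stage 1: cheap recall
        ranked = sorted(coarse, key=lambda i: rerank_score(query, i), reverse=True)
        return ranked[:k_final]                            # stage 2: precise top-k

    print(search("parse a DNS packet header"))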

Updated: 2025-12-11 07:58:10

标题: 面向剥离二进制分析的跨模态检索模型

摘要: LLM-基于代理的二进制代码分析已经在广泛的软件安全场景中展示出了显著的潜力,包括漏洞检测、恶意软件分析等。然而,在代理工作流程中,基于用户查询从成千上万个剥离的二进制函数中检索出积极的结果仍然是一个未被研究和具有挑战性的问题,因为缺乏符号信息使其与源代码检索有所不同。在本文中,我们介绍了BinSeek,这是第一个用于剥离的二进制代码分析的两阶段跨模态检索框架。它由两个模型组成:BinSeekEmbedding在大规模数据集上训练,以学习二进制代码和自然语言描述的语义相关性,此外,BinSeek-Reranker学会了仔细判断候选代码与描述的相关性,并进行上下文增强。为此,我们建立了一个基于LLM的数据合成流水线,以自动化训练构建,并为未来研究提供了一个领域基准。我们的评估结果表明,BinSeek实现了最先进的性能,Rec@3超过了相同规模模型31.42%,MRR@3超过了27.17%,并且领先于具有16倍更大参数的先进通用模型。

更新时间: 2025-12-11 07:58:10

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2512.10393v1

Fitting magnetization data using continued fraction of straight lines

Magnetization of a ferromagnetic substance in response to an externally applied magnetic field increases with the strength of the field. This is because at the microscopic level, magnetic moments in certain regions or domains of the substance increasingly align with the applied field, while the amount of misaligned domains decreases. The alignment of such magnetic domains with an applied magnetic field forms the physical basis for the nonlinearity of magnetization. In this paper, the nonlinear function is approximated as a combination of continued fraction of straight lines. The resulting fit is used to interpret the nonlinear behavior in both growing and shrinking magnetic domains. The continued fraction of straight lines used here is an algebraic expression which can be used to estimate parameters using nonlinear regression.
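
A minimal sketch of the fitting procedure, assuming a depth-2 continued fraction of straight lines; the paper's exact depth and parameterization may differ, and the data below are synthetic (generated from the model itself plus noise so the regression has a consistent ground truth).

    import numpy as np
    from scipy.optimize import curve_fit

    def cfsl(h, a0, b0, a1, b1, a2, b2):
        # Depth-2 continued fraction of straight lines:
        #   M(H) = a0*H + b0 + 1 / (a1*H + b1 + 1 / (a2*H + b2))
        return a0 * h + b0 + 1.0 / (a1 * h + b1 + 1.0 / (a2 * h + b2))

    rng = np.random.default_rng(0)
    h = np.linspace(0.1, 5.0, 200)
    true = (0.05, 1.0, 2.0, 0.5, 1.0, 1.0)
    m = cfsl(h, *true) + 0.005 * rng.normal(size=h.size)

    p0 = (0.0, 0.8, 1.5, 0.4, 0.8, 0.8)        # rough initial guess
    params, _ = curve_fit(cfsl, h, m, p0=p0, maxfev=20000)
    print(np.round(params, 3))                 # recovers values close to `true`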

Updated: 2025-12-11 07:57:17

标题: 使用直线的连分数拟合磁化数据

摘要: 在外加磁场作用下,铁磁物质的磁化随着磁场强度的增加而增加。这是因为在微观层面,物质的某些区域或磁域中的磁矩越来越多地与外加磁场对齐,而不对齐的磁域的数量减少。这种磁性磁域与外加磁场的对齐形成了磁化非线性的物理基础。在本文中,非线性函数被近似为直线的连分式组合。得到的拟合结果用于解释在增长和缩小的磁性磁域中的非线性行为。在这里使用的直线连分式是一个代数表达式,可以用于使用非线性回归来估计参数。

更新时间: 2025-12-11 07:57:17

领域: cs.LG,cond-mat.mtrl-sci,physics.class-ph

下载: http://arxiv.org/abs/2512.10390v1

Learning (Approximately) Equivariant Networks via Constrained Optimization

Equivariant neural networks are designed to respect symmetries through their architecture, boosting generalization and sample efficiency when those symmetries are present in the data distribution. Real-world data, however, often departs from perfect symmetry because of noise, structural variation, measurement bias, or other symmetry-breaking effects. Strictly equivariant models may struggle to fit the data, while unconstrained models lack a principled way to leverage partial symmetries. Even when the data is fully symmetric, enforcing equivariance can hurt training by limiting the model to a restricted region of the parameter space. Guided by homotopy principles, where an optimization problem is solved by gradually transforming a simpler problem into a complex one, we introduce Adaptive Constrained Equivariance (ACE), a constrained optimization approach that starts with a flexible, non-equivariant model and gradually reduces its deviation from equivariance. This gradual tightening smooths training early on and settles the model at a data-driven equilibrium, balancing between equivariance and non-equivariance. Across multiple architectures and tasks, our method consistently improves performance metrics, sample efficiency, and robustness to input perturbations compared with strictly equivariant models and heuristic equivariance relaxations.
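
A minimal sketch of the homotopy idea under stated assumptions: the symmetry group is a horizontal feature flip, the model is a toy MLP, and the constrained optimization is reduced to a penalty whose weight ramps up from zero, so training starts unconstrained and settles near equivariance. The augmented-Lagrangian machinery of the actual method is not reproduced here.

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                                torch.nn.Linear(32, 8))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    def flip(x):
        # The group action g: reverse the feature order (horizontal flip).
        return torch.flip(x, dims=[-1])

    for step in range(2000):
        x = torch.randn(64, 8)
        y = 0.9 * flip(x) + 0.1 * torch.randn(64, 8)  # nearly flip-equivariant task
        task_loss = ((model(x) - y) ** 2).mean()
        # Equivariance gap ||f(g x) - g f(x)||^2, driven toward zero over training.
        equiv_gap = ((model(flip(x)) - flip(model(x))) ** 2).mean()
        lam = 10.0 * min(1.0, step / 1000)            # ramp: loose early, tight late
        loss = task_loss + lam * equiv_gap
        opt.zero_grad()
        loss.backward()
        opt.step()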

Updated: 2025-12-11 07:51:20

标题: 通过约束优化学习(近似)等变网络

摘要: 等变神经网络通过其架构设计来尊重对称性,当数据分布中存在这些对称性时,可以提高泛化能力和样本效率。然而,现实世界的数据通常会因为噪音、结构变化、测量偏差或其他破坏对称性的影响而偏离完美对称性。严格的等变模型可能难以拟合数据,而不受限制的模型缺乏一种原则性的方法来利用部分对称性。即使数据是完全对称的,强制执行等变性也可能通过限制模型到参数空间的受限区域而损害训练。受同伦原理的启发,其中一个优化问题通过逐渐将一个简单问题转化为一个复杂问题来解决,我们引入了自适应约束等变性(ACE),这是一种受约束的优化方法,从一个灵活的非等变模型开始,逐渐减少其与等变性的偏差。这种逐渐收紧的方法在早期平滑训练,并将模型稳定在一个数据驱动的平衡点,平衡等变性和非等变性之间。在多种架构和任务中,我们的方法与严格的等变模型和启发式等变性放宽相比,始终提高性能指标、样本效率和对输入扰动的稳健性。

更新时间: 2025-12-11 07:51:20

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2505.13631v2

The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation

Conventional Sequential Recommender Systems (SRS) typically assign unique Hash IDs (HID) to construct item embeddings. These HID embeddings effectively learn collaborative information from historical user-item interactions, making them vulnerable to situations where most items are rarely consumed (the long-tail problem). Recent methods that incorporate auxiliary information often suffer from noisy collaborative sharing caused by co-occurrence signals or semantic homogeneity caused by flat dense embeddings. Semantic IDs (SIDs), with their capability of code sharing and multi-granular semantic modeling, provide a promising alternative. However, the collaborative overwhelming phenomenon hinders the further development of SID-based methods. The quantization mechanisms commonly compromise the uniqueness of identifiers required for modeling head items, creating a performance seesaw between head and tail items. To address this dilemma, we propose H2Rec, a novel framework that harmonizes the SID and HID. Specifically, we devise a dual-branch modeling architecture that enables the model to capture the multi-granular semantics within SID while preserving the unique collaborative identity of HID. Furthermore, we introduce a dual-level alignment strategy that bridges the two representations, facilitating knowledge transfer and supporting robust preference modeling. Extensive experiments on three real-world datasets show that H2Rec effectively balances recommendation quality for both head and tail items while surpassing the existing baselines. The implementation code can be found online at https://github.com/ziwliu8/H2Rec.

Updated: 2025-12-11 07:50:53

标题: 两个世界的精华:协调语义和哈希ID用于顺序推荐

摘要: 传统的顺序推荐系统(SRS)通常为构建物品嵌入分配唯一的哈希ID(HID)。这些HID嵌入有效地从历史用户-物品交互中学习协同信息,使它们容易受到大多数物品很少被消费的情况的影响(长尾问题)。最近的一些方法结合了辅助信息,但往往受到由共现信号引起的噪声协同共享或由平坦密集嵌入引起的语义同质性的困扰。语义ID(SIDs)具有代码共享和多粒度语义建模的能力,为其提供了一种有希望的替代方案。然而,协同压倒性现象阻碍了基于SID的方法的进一步发展。量化机制通常会损害建模头部物品所需的标识符的独特性,从而在头部和尾部物品之间创建绩效跷跷板。为了解决这一困境,我们提出了一种新颖的框架H2Rec,该框架协调了SID和HID。具体来说,我们设计了一个双分支建模架构,使模型能够捕捉SID内部的多粒度语义,同时保留HID的独特协同身份。此外,我们引入了一个双级别对齐策略,桥接了两种表示,促进了知识传递并支持稳健的偏好建模。对三个真实世界数据集的广泛实验表明,H2Rec在平衡头部和尾部物品的推荐质量方面表现出色,超过了现有基线。实现代码可以在 https://github.com/ziwliu8/H2Rec 在线找到。

更新时间: 2025-12-11 07:50:53

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2512.10388v1

Beyond Lux thresholds: a systematic pipeline for classifying biologically relevant light contexts from wearable data

Background: Wearable spectrometers enable field quantification of biologically relevant light, yet reproducible pipelines for contextual classification remain under-specified. Objective: To establish and validate a subject-wise evaluated, reproducible pipeline and actionable design rules for classifying natural vs. artificial light from wearable spectral data. Methods: We analysed ActLumus recordings from 26 participants, each monitored for at least 7 days at 10-second sampling, paired with daily exposure diaries. The pipeline fixes the sequence: domain selection, log-base-10 transform, L2 normalisation excluding total intensity (to avoid brightness shortcuts), hour-level medoid aggregation, sine/cosine hour encoding, and MLP classifier, evaluated under participant-wise cross-validation. Results: The proposed sequence consistently achieved high performance on the primary task, with representative configurations reaching AUC = 0.938 (accuracy 88%) for natural vs. artificial classification on the held-out subject split. In contrast, indoor vs. outdoor classification remained at feasibility level due to spectral overlap and class imbalance (best AUC approximately 0.75; majority-class collapse without contextual sensors). Threshold baselines were insufficient on our data, supporting the need for spectral-temporal modelling beyond illuminance cut-offs. Conclusions: We provide a reproducible, auditable baseline pipeline and design rules for contextual light classification under subject-wise generalisation. All code, configuration files, and derived artefacts will be openly archived (GitHub + Zenodo DOI) to support reuse and benchmarking.
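
A minimal sketch of the fixed pipeline order on synthetic data, assuming scikit-learn: log-base-10 transform, L2 normalisation of the spectral shape (so total intensity is excluded), sine/cosine hour encoding, an MLP classifier, and participant-wise cross-validation via GroupKFold. The hour-level medoid aggregation step and all ActLumus specifics are omitted; channel names and labels are placeholders.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import GroupKFold, cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import Normalizer

    rng = np.random.default_rng(0)
    n = 600
    df = pd.DataFrame({
        "subject": rng.integers(0, 26, n),      # participant id, for subject-wise CV
        "hour": rng.integers(0, 24, n),
        "ch_blue": rng.gamma(2.0, 50.0, n),     # placeholder spectral channels
        "ch_green": rng.gamma(2.0, 60.0, n),
        "ch_red": rng.gamma(2.0, 40.0, n),
    })
    # Placeholder label: "natural" when the blue share of the spectrum is high.
    y = (df["ch_blue"] / df[["ch_blue", "ch_green", "ch_red"]].sum(axis=1) > 0.33)

    # log10 transform, then L2 normalisation of the spectral shape (normalising
    # away total intensity avoids the brightness shortcut), then cyclic hours.
    spectral = np.log10(df[["ch_blue", "ch_green", "ch_red"]] + 1.0)
    hour = df["hour"].to_numpy()
    X = np.column_stack([
        Normalizer(norm="l2").fit_transform(spectral),
        np.sin(2 * np.pi * hour / 24),
        np.cos(2 * np.pi * hour / 24),
    ])

    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    scores = cross_val_score(clf, X, y.astype(int), cv=GroupKFold(n_splits=5),
                             groups=df["subject"], scoring="roc_auc")
    print(scores.mean())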

Updated: 2025-12-11 07:50:25

标题: 超越 Lux 阈值:一个系统的流水线,用于从可穿戴数据中分类生物相关光环境

摘要: 背景:可穿戴光谱仪可以实现对生物相关光照的现场量化,但用于情境分类的可重复管道仍未明确定义。 目标:建立并验证一个按受试者划分评估的可重复管道以及可操作的设计规则,用于从可穿戴光谱数据中区分自然光与人造光。 方法:我们分析了来自26名参与者的ActLumus记录,每名参与者以10秒采样间隔至少被监测7天,并配有每日暴露日记。该管道固定了以下顺序:领域选择,以10为底的对数变换,L2归一化(不包括总强度,以避免亮度捷径),小时级medoid聚合,正弦/余弦小时编码,以及MLP分类器,并在参与者级别进行交叉验证评估。 结果:所提出的顺序在主要任务上始终取得高性能,代表性配置在按受试者保留的划分上对自然光与人造光分类达到AUC = 0.938(准确度88%)。相比之下,由于光谱重叠和类别不平衡,室内与室外分类仍处于可行性水平(最佳AUC约为0.75;在没有情境传感器的情况下出现多数类坍塌)。在我们的数据上,阈值基线不足,支持超越照度截止值的光谱-时间建模的必要性。 结论:我们提供了一个可重复、可审计的基线管道和设计规则,用于按受试者泛化的情境光照分类。所有代码、配置文件和衍生工件将被公开存档(GitHub + Zenodo DOI),以支持重用和基准测试。

更新时间: 2025-12-11 07:50:25

领域: q-bio.QM,cs.LG

下载: http://arxiv.org/abs/2512.06181v2

Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies

Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives, data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.

Updated: 2025-12-11 07:48:34

标题: 朝着利用大型视觉语言模型进行细粒度识别:基准和优化策略

摘要: 大规模视觉语言模型(LVLMs)取得了显著进展,实现了复杂的视觉-语言交互和对话应用。然而,现有的基准主要集中在推理任务上,往往忽视了对细粒度识别的重要性,这对实际应用场景至关重要。为了填补这一空白,我们引入了细粒度识别开放世界(FROW)基准,旨在详细评估具有GPT-4o的LVLMs的性能。基于此,我们提出了一种新颖的优化策略,从数据构建和训练过程两个角度来改善LVLMs的性能。我们的数据集包括马赛克数据,将多个短答案响应结合在一起,以及开放世界数据,使用GPT-4o从现实世界的问题和答案生成,为LVLMs中的细粒度识别创建了一个全面的评估框架。实验证明,马赛克数据将类别识别准确率提高了1%,开放世界数据将FROW基准准确率提高了10%-20%,内容准确率提高了6%-12%。同时,将细粒度数据纳入预训练阶段可以将模型的类别识别准确率提高多达10%。该基准将在https://github.com/pc-inno/FROW上提供。

更新时间: 2025-12-11 07:48:34

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.10384v1

Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction

Graph diffusion models have made significant progress in learning structured graph data and have demonstrated strong potential for predictive tasks. Existing approaches typically embed node, edge, and graph-level features into a unified latent space, modeling prediction tasks including classification and regression as a form of conditional generation. However, due to the non-Euclidean nature of graph data, features of different curvatures are entangled in the same latent space without releasing their geometric potential. To address this issue, we aim to construct an ideal Riemannian diffusion model to capture distinct manifold signatures of complex graph data and learn their distribution. This goal faces two challenges: numerical instability caused by exponential mapping during the encoding process and manifold deviation during diffusion generation. To address these challenges, we propose GeoMancer: a novel Riemannian graph diffusion framework for both generation and prediction tasks. To mitigate numerical instability, we replace exponential mapping with an isometric-invariant Riemannian gyrokernel approach and decouple multi-level features onto their respective task-specific manifolds to learn optimal representations. To address manifold deviation, we introduce a manifold-constrained diffusion method and a self-guided strategy for unconditional generation, ensuring that the generated data remains aligned with the manifold signature. Extensive experiments validate the effectiveness of our approach, demonstrating superior performance across a variety of tasks.

Updated: 2025-12-11 07:48:20

标题: 走向统一的几何理解:利用黎曼扩散框架进行图生成和预测

摘要: 图扩散模型在学习结构化图数据方面取得了显著进展,并展现了对预测任务的强大潜力。现有方法通常将节点、边和图级特征嵌入统一的潜在空间,将分类和回归等预测任务建模为一种条件生成形式。然而,由于图数据的非欧几里得性质,不同曲率的特征在同一潜在空间中纠缠在一起,未释放其几何潜力。为解决这一问题,我们旨在构建一个理想的黎曼扩散模型,以捕捉复杂图数据的不同流形特征并学习它们的分布。这一目标面临两个挑战:编码过程中指数映射引起的数值不稳定性和扩散生成过程中的流形偏差。为解决这些挑战,我们提出了GeoMancer:一个新颖的黎曼图扩散框架,用于生成和预测任务。为了减轻数值不稳定性,我们用同构不变的黎曼陀螺核方法替换指数映射,并将多级特征解耦到各自的任务特定流形上以学习最佳表示。为了解决流形偏差,我们引入了一种受限于流形的扩散方法和一种自我引导策略用于无条件生成,确保生成的数据与流形特征保持一致。大量实验证实了我们方法的有效性,展示了在各种任务中卓越的性能。

更新时间: 2025-12-11 07:48:20

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.04522v2

Neural personal sound zones with flexible bright zone control

A personal sound zone (PSZ) reproduction system, which attempts to create distinct virtual acoustic scenes for different listeners at their respective positions within the same spatial area using one loudspeaker array, is a fundamental technology for virtual reality applications. For practical applications, the reconstruction targets must be measured on the same fixed receiver array used to record the local room impulse responses (RIRs) from the loudspeaker array to the control points in each PSZ, which makes the system inconvenient and costly for real-world use. In this paper, a 3D convolutional neural network (CNN) designed for PSZ reproduction with flexible control microphone grid and alternative reproduction target is presented, utilizing the virtual target scene as inputs and the PSZ pre-filters as output. Experimental results of the proposed method are compared with the traditional method, demonstrating that the proposed method is able to handle varied reproduction targets on flexible control point grid using only one training session. Furthermore, the proposed method also demonstrates the capability to learn global spatial information from sparse sampling points distributed in PSZs.

Updated: 2025-12-11 07:41:15

标题: 具有灵活明亮区域控制的神经个人声音区域

摘要: 个人声音区(PSZ)再现系统试图使用一个扬声器阵列在同一空间区域内为不同听众在其各自位置创建独特的虚拟声学场景,这是虚拟现实应用中的基本技术。对于实际应用,重建目标必须在用于记录从扬声器阵列到每个PSZ控制点的本地房间冲激响应(RIRs)的相同固定接收器阵列上进行测量,这使得系统在实际使用中不便且昂贵。本文提出了一种专为PSZ再现设计的具有灵活控制麦克风网格和替代再现目标的三维卷积神经网络(CNN),利用虚拟目标场景作为输入和PSZ预滤波器作为输出。所提出方法的实验结果与传统方法进行了比较,表明所提出的方法能够在灵活的控制点网格上处理各种再现目标,仅使用一个训练会话。此外,所提出的方法还展示了从分布在PSZ中的稀疏采样点学习全局空间信息的能力。

更新时间: 2025-12-11 07:41:15

领域: cs.SD,cs.AI

下载: http://arxiv.org/abs/2512.10375v1

D2M: A Decentralized, Privacy-Preserving, Incentive-Compatible Data Marketplace for Collaborative Learning

The rising demand for collaborative machine learning and data analytics calls for secure and decentralized data sharing frameworks that balance privacy, trust, and incentives. Existing approaches, including federated learning (FL) and blockchain-based data markets, fall short: FL often depends on trusted aggregators and lacks Byzantine robustness, while blockchain frameworks struggle with computation-intensive training and incentive integration. We present D2M, a decentralized data marketplace that unifies federated learning, blockchain arbitration, and economic incentives into a single framework for privacy-preserving data sharing. D2M enables data buyers to submit bid-based requests via blockchain smart contracts, which manage auctions, escrow, and dispute resolution. Computationally intensive training is delegated to CoNE (Compute Network for Execution), an off-chain distributed execution layer. To safeguard against adversarial behavior, D2M integrates a modified YODA protocol with exponentially growing execution sets for resilient consensus, and introduces Corrected OSMD to mitigate malicious or low-quality contributions from sellers. All protocols are incentive-compatible, and our game-theoretic analysis establishes honesty as the dominant strategy. We implement D2M on Ethereum and evaluate it over benchmark datasets -- MNIST, Fashion-MNIST, and CIFAR-10 -- under varying adversarial settings. D2M achieves up to 99% accuracy on MNIST and 90% on Fashion-MNIST, with less than 3% degradation up to 30% Byzantine nodes, and 56% accuracy on CIFAR-10 despite its complexity. Our results show that D2M ensures privacy, maintains robustness under adversarial conditions, and scales efficiently with the number of participants, making it a practical foundation for real-world decentralized data sharing.

Updated: 2025-12-11 07:38:05

标题: D2M:面向协作学习的去中心化、保护隐私、激励兼容的数据市场

摘要: 随着协作机器学习和数据分析需求的增加,对安全和分散化数据共享框架的需求也在增加,这些框架需要平衡隐私、信任和激励。现有方法,包括联邦学习(FL)和基于区块链的数据市场,存在不足:FL通常依赖于可信的聚合器,缺乏拜占庭容错性,而区块链框架则面临计算密集型训练和激励集成的挑战。 我们提出了D2M,这是一个统一的分散式数据市场,将联邦学习、区块链仲裁和经济激励融合到一个隐私保护数据共享框架中。D2M使数据买家可以通过区块链智能合约提交基于出价的请求,智能合约管理拍卖、托管和争议解决。计算密集型训练被委托给CoNE(Compute Network for Execution),这是一个离链分布式执行层。为了防范敌对行为,D2M集成了一个修改过的YODA协议,通过指数增长的执行集实现强大的共识,并引入了修正的OSMD以减轻卖家的恶意或低质量贡献。所有协议都是激励兼容的,我们的博弈论分析确定了诚实作为主导策略。 我们在以太坊上实现了D2M,并在不同敌对环境下对MNIST、Fashion-MNIST和CIFAR-10等基准数据集进行了评估。在MNIST上,D2M的准确率达到了99%,在Fashion-MNIST上达到了90%,在多达30%的拜占庭节点下准确率下降不到3%,尽管CIFAR-10复杂,准确率仍达到56%。我们的结果表明,D2M确保了隐私,在敌对条件下保持了稳健性,并且随着参与者数量的增加能够高效扩展,使其成为实际的基础,用于现实世界的分散式数据共享。

更新时间: 2025-12-11 07:38:05

领域: cs.CR,cs.AI,cs.DC,cs.LG

下载: http://arxiv.org/abs/2512.10372v1

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

The rapid development of mobile GUI agents has stimulated growing research interest in long-horizon task automation. However, building agents for these tasks faces a critical bottleneck: the reliance on ever-expanding interaction history incurs substantial context overhead. Existing context management and compression techniques often fail to preserve vital semantic information, leading to degraded task performance. We propose AgentProg, a program-guided approach for agent context management that reframes the interaction history as a program with variables and control flow. By organizing information according to the structure of program, this structure provides a principled mechanism to determine which information should be retained and which can be discarded. We further integrate a global belief state mechanism inspired by Belief MDP framework to handle partial observability and adapt to unexpected environmental changes. Experiments on AndroidWorld and our extended long-horizon task suite demonstrate that AgentProg has achieved the state-of-the-art success rates on these benchmarks. More importantly, it maintains robust performance on long-horizon tasks while baseline methods experience catastrophic degradation. Our system is open-sourced at https://github.com/MobileLLM/AgentProg.
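As a rough illustration of reframing history as a program, the sketch below uses hypothetical field names (the abstract does not specify the representation); the point is that named variables persist while the raw trace can be compressed or discarded:

from dataclasses import dataclass, field

@dataclass
class Step:
    action: str            # e.g. "tap('Display')"
    observation: str       # salient outcome, not the raw screen dump

@dataclass
class ProgramContext:
    goal: str
    variables: dict = field(default_factory=dict)    # retained named state
    trace: list = field(default_factory=list)        # compressible history

ctx = ProgramContext(goal="enable dark mode")
ctx.variables["current_app"] = "Settings"            # survives compression
ctx.trace.append(Step("tap('Display')", "Display settings opened"))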

Updated: 2025-12-11 07:37:38

标题: AgentProg:通过程序引导的上下文管理,赋予长期展望的GUI代理程序更强大的能力

摘要: 移动GUI代理的快速发展刺激了对长期任务自动化日益增长的研究兴趣。然而,为这些任务构建代理面临一个关键瓶颈:依赖不断扩大的交互历史会带来实质性的上下文开销。现有的上下文管理和压缩技术通常无法保留重要的语义信息,导致任务性能下降。我们提出AgentProg,一种程序引导的代理上下文管理方法,将交互历史重新构建为具有变量和控制流的程序。通过根据程序结构组织信息,这种结构提供了一个原则性机制来确定哪些信息应该保留,哪些可以丢弃。我们进一步集成了受信念MDP框架启发的全局信念状态机制,以处理部分可观察性并适应意外的环境变化。在AndroidWorld和我们扩展的长期任务套件上的实验表明,AgentProg在这些基准测试中取得了最先进的成功率。更重要的是,它在长期任务上保持了稳健的性能,而基线方法经历了灾难性的退化。我们的系统在https://github.com/MobileLLM/AgentProg上开源。

更新时间: 2025-12-11 07:37:38

领域: cs.AI

下载: http://arxiv.org/abs/2512.10371v1

LLM-Empowered Representation Learning for Emerging Item Recommendation

In this work, we tackle the challenge of recommending emerging items, whose interactions gradually accumulate over time. Existing methods often overlook this dynamic process, typically assuming that emerging items have few or even no historical interactions. Such an assumption oversimplifies the problem, as a good model must preserve the uniqueness of emerging items while leveraging their shared patterns with established ones. To address this challenge, we propose EmerFlow, a novel LLM-empowered representation learning framework that generates distinctive embeddings for emerging items. It first enriches the raw features of emerging items through LLM reasoning, then aligns these representations with the embedding space of the existing recommendation model. Finally, new interactions are incorporated through meta-learning to refine the embeddings. This enables EmerFlow to learn expressive embeddings for emerging items from only limited interactions. Extensive experiments across diverse domains, including movies and pharmaceuticals, show that EmerFlow consistently outperforms existing methods.

Updated: 2025-12-11 07:36:44

标题: LLM增强的代表性学习用于新兴物品推荐

摘要: 在这项工作中,我们致力于解决推荐新兴项目的挑战,这些项目的互动会随着时间逐渐累积。现有方法通常忽视这一动态过程,通常假设新兴项目有很少甚至没有历史互动。这种假设过于简化了问题,因为一个好的模型必须保留新兴项目的独特性,同时利用它们与已建立项目的共享模式。为了解决这一挑战,我们提出了EmerFlow,一个新颖的LLM增强表示学习框架,为新兴项目生成独特的嵌入。它首先通过LLM推理丰富新兴项目的原始特征,然后将这些表示与现有推荐模型的嵌入空间对齐。最后,通过元学习将新的互动整合进来,以细化嵌入。这使得EmerFlow能够从有限的互动中学习出对新兴项目具有表现力的嵌入。在包括电影和制药在内的各个领域进行了大量实验,结果显示EmerFlow始终优于现有方法。

更新时间: 2025-12-11 07:36:44

领域: cs.AI

下载: http://arxiv.org/abs/2512.10370v1

Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches

Traditional ASR metrics like WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors like phoneme repetitions and imprecise consonants, yet the meaning remains clear to human listeners. We identify two key challenges: (1) Existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. To address this, we propose a novel metric integrating Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity. Our ASR evaluation metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods and emphasizing the need to prioritize intelligibility over error-based measures.
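A minimal sketch of this kind of composite score, with illustrative weights and a crude character-level stand-in for phonetic similarity (the paper's actual components, models, and weighting are not reproduced here):

from difflib import SequenceMatcher

def phonetic_similarity(ref: str, hyp: str) -> float:
    # Crude proxy: character-level similarity between transcripts; a real
    # implementation would compare phoneme sequences.
    return SequenceMatcher(None, ref.lower(), hyp.lower()).ratio()

def intelligibility(nli: float, semantic: float, phonetic: float,
                    weights=(0.4, 0.4, 0.2)) -> float:
    # Weighted blend of three scores in [0, 1]; the weights are assumptions.
    return weights[0] * nli + weights[1] * semantic + weights[2] * phonetic

ref, hyp = "the patient needs water", "the p-patient need water"
print(intelligibility(nli=0.92, semantic=0.88,
                      phonetic=phonetic_similarity(ref, hyp)))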

Updated: 2025-12-11 07:36:11

标题: 将ASR评估与人类和LLM判断相吻合:使用音标、语义和NLI方法的可懂度度量

摘要: 传统的ASR指标如WER和CER未能捕捉到口语清晰度,特别是对于说话困难和声音嘶哑的语音,语义对齐比确切单词匹配更重要。ASR系统在处理这些语音类型时往往会产生错误,比如音素重复和不准确的辅音,但是对人类听众来说意思仍然清晰。我们确定了两个关键挑战:(1)现有的指标无法充分反映口语清晰度,(2)尽管LLMs可以改进ASR输出,但它们在纠正说话困难语音的ASR转录方面的有效性尚未得到充分探讨。为了解决这个问题,我们提出了一种新颖的指标,集成了自然语言推理(NLI)分数、语义相似性和语音相似性。我们的ASR评估指标在Speech Accessibility Project数据上与人类判断的相关性达到0.890,超过了传统方法,并强调了需要优先考虑口语清晰度而不是基于错误的度量。

更新时间: 2025-12-11 07:36:11

领域: cs.LG

下载: http://arxiv.org/abs/2506.16528v2

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics, such as accuracy and precision, fail to appropriately capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across the network. The framework defines a Parameter Trust Update mechanism to refine parameter reliability during training and an Inference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a foundation for evaluating model reliability across the AI lifecycle.
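For readers unfamiliar with Subjective Logic, an opinion is a tuple (belief, disbelief, uncertainty, base rate) with $b + d + u = 1$, whose projected probability is $E = b + a \cdot u$. The sketch below shows only this standard representation; how PaTAS maps activations onto opinions and propagates them is the paper's contribution and is not reproduced:

from dataclasses import dataclass

@dataclass
class Opinion:
    b: float        # belief
    d: float        # disbelief
    u: float        # uncertainty
    a: float = 0.5  # base rate (prior)

    def expected(self) -> float:
        # Projected probability of a subjective-logic opinion.
        assert abs(self.b + self.d + self.u - 1.0) < 1e-9
        return self.b + self.a * self.u

print(Opinion(b=0.7, d=0.1, u=0.2).expected())  # 0.80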

Updated: 2025-12-11 07:35:44

标题: PaTAS: 一种使用主观逻辑在神经网络中传播信任的框架

摘要: 信任度已经成为在安全关键应用中部署人工智能系统的关键要求。传统的评估指标,如准确度和精度,未能恰当地捕捉不确定性或模型预测的可靠性,特别是在敌对或降级条件下。本文介绍了并行信任评估系统(PaTAS),这是一个利用主观逻辑(SL)对神经网络建模和传播信任的框架。PaTAS通过信任节点和信任函数与标准神经计算并行运作,传播网络中的输入、参数和激活信任。该框架定义了一个参数信任更新机制,在训练过程中改进参数可靠性,并引入了推理路径信任评估(IPTA)方法,在推理时计算特定实例的信任。对真实世界和敌对数据集的实验表明,PaTAS产生可解释、对称和收敛的信任估计,补充准确性并揭示在受污染、偏见或不确定数据情景下的可靠性差距。结果显示,PaTAS有效区分良性和敌对输入,并识别模型信心与实际可靠性脱离的情况。通过在神经结构中实现透明和可量化的信任推理,PaTAS为在整个人工智能生命周期中评估模型可靠性提供了基础。

更新时间: 2025-12-11 07:35:44

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2511.20586v3

GPG: Generalized Policy Gradient Theorem for Transformer-based Policies

We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both standard Policy Gradient Theorem and GRPO emerge as special cases within our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.
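For reference, the standard Policy Gradient Theorem that the abstract recovers as a special case is $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\big]$; how GPG generalizes this expectation to Transformer-based policies (and recovers GRPO) is the paper's contribution and is not restated here.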

Updated: 2025-12-11 07:30:33

标题: GPG: 基于Transformer策略的广义策略梯度定理

摘要: 我们提出了广义策略梯度(GPG)定理,专门为基于Transformer的策略设计。值得注意的是,我们证明标准策略梯度定理和GRPO都可以在我们的GPG框架中作为特例出现。此外,我们探讨了其在训练大型语言模型(LLM)中的实际应用,为有效的策略优化提供了新的见解。

更新时间: 2025-12-11 07:30:33

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2512.10365v1

Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to a structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
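One plausible reading of "entropy-scaled" crop sizing is sketched below: diffuse attention (high entropy) yields a larger crop to keep more context, peaked attention a tighter one. The mapping and bounds are our assumptions, not the paper's formula:

import numpy as np

def crop_fraction(attn: np.ndarray, min_frac=0.2, max_frac=0.8) -> float:
    # Normalized attention entropy in [0, 1] interpolates the crop size.
    p = attn.flatten() / attn.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return min_frac + (max_frac - min_frac) * entropy / np.log(p.size)

attn = np.random.rand(24, 24)   # stand-in cross-attention map
print(crop_fraction(attn))      # fraction of the image side to crop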

Updated: 2025-12-11 07:22:54

标题: 视觉漏斗:解决多模态大型语言模型中的语境盲点

摘要: 多模态大型语言模型(MLLMs)展示了令人印象深刻的推理能力,但往往无法感知细粒度的视觉细节,限制了它们在需要精度的任务中的适用性。虽然裁剪图像的显著区域的方法提供了部分解决方案,但我们确定了它们引入的一个关键限制:“上下文盲目”。即使所有必要的视觉信息都存在时,这种失败是由于高保真度细节(来自裁剪)与更广泛的全局上下文(来自原始图像)之间的结构断开导致的。我们认为,这种限制不是由于信息“数量”的缺乏,而是由于模型输入中“结构多样性”的缺乏。为了解决这个问题,我们提出了Visual Funnel,这是一种无需训练的两步方法。Visual Funnel首先执行上下文定位,以在单个前向传递中识别感兴趣的区域。然后通过动态确定基于注意熵的裁剪尺寸并优化裁剪中心,构建一个熵缩放组合,以保留从焦点细节到更广泛周围环境的分层上下文。通过大量实验证明,Visual Funnel明显优于朴素的单个裁剪和非结构化的多裁剪基线。我们的结果进一步验证,简单地添加更多的非结构化裁剪提供了有限甚至有害的好处,从而证实我们组合的分层结构对解决上下文盲目至关重要。

更新时间: 2025-12-11 07:22:54

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.10362v1

Bit of a Close Talker: A Practical Guide to Serverless Cloud Co-Location Attacks

Serverless computing has revolutionized cloud computing by offering an efficient and cost-effective way for users to develop and deploy applications without managing infrastructure details. However, serverless cloud users remain vulnerable to various types of attacks, including micro-architectural side-channel attacks. These attacks typically rely on the physical co-location of victim and attacker instances, and attackers will need to exploit cloud schedulers to achieve co-location with victims. Therefore, it is crucial to study vulnerabilities in serverless cloud schedulers and assess the security of different serverless scheduling algorithms. This study addresses the gap in understanding and constructing co-location attacks in serverless clouds. We present a comprehensive methodology to uncover exploitable features in serverless scheduling algorithms and devise strategies for constructing co-location attacks through normal user interfaces. In our experiments, we successfully reveal exploitable vulnerabilities and achieve instance co-location on prevalent open-source infrastructures and Microsoft Azure Functions. We also present a mitigation strategy to defend against co-location attacks in serverless clouds. Our work highlights critical areas for security enhancements in current cloud schedulers, offering insights to fortify serverless computing environments against potential co-location attacks.

Updated: 2025-12-11 07:22:07

标题: 一个近距离接触者:无服务器云共存攻击实用指南

摘要: 无服务器计算已经通过为用户提供一种高效和具有成本效益的方式来开发和部署应用程序,而无需管理基础架构细节,从而改变了云计算。然而,无服务器云用户仍然容易受到各种类型的攻击,包括微体系结构侧信道攻击。这些攻击通常依赖于受害者和攻击者实例的物理共存,攻击者将需要利用云调度程序来实现与受害者的共存。因此,研究无服务器云调度程序的漏洞并评估不同无服务器调度算法的安全性至关重要。本研究填补了对无服务器云中共存攻击的理解和构建的空白。我们提出了一种全面的方法论,以揭示无服务器调度算法中的可利用特性,并通过正常用户界面制定构建共存攻击的策略。在我们的实验中,我们成功地揭示了可利用的漏洞,并在流行的开源基础设施和Microsoft Azure Functions上实现了实例共存。我们还提出了一种缓解策略,以抵御无服务器云中的共存攻击。我们的工作突出了当前云调度程序中需要加强安全性的关键领域,为加强无服务器计算环境抵御潜在共存攻击提供了见解。

更新时间: 2025-12-11 07:22:07

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2512.10361v1

Better Prevent than Tackle: Valuing Defense in Soccer Based on Graph Neural Networks

Evaluating defensive performance in soccer remains challenging, as effective defending is often expressed not through visible on-ball actions such as interceptions and tackles, but through preventing dangerous opportunities before they arise. Existing approaches have largely focused on valuing on-ball actions, leaving much of defenders' true impact unmeasured. To address this gap, we propose DEFCON (DEFensive CONtribution evaluator), a comprehensive framework that quantifies player-level defensive contributions for every attacking situation in soccer. Leveraging Graph Attention Networks, DEFCON estimates the success probability and expected value of each attacking option, along with each defender's responsibility for stopping it. These components yield an Expected Possession Value (EPV) for the attacking team before and after each action, and DEFCON assigns positive or negative credits to defenders according to whether they reduced or increased the opponent's EPV. Trained on 2023-24 and evaluated on 2024-25 Eredivisie event and tracking data, DEFCON's aggregated player credits exhibit strong positive correlations with market valuations. Finally, we showcase several practical applications, including in-game timelines of defensive contributions, spatial analyses across pitch zones, and pairwise summaries of attacker-defender interactions.
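The credit-assignment step, as described, reduces to a simple rule: each defender's credit is their estimated responsibility times the change in the opponent's EPV. The sketch below uses made-up responsibility weights and EPV values:

def defensive_credits(epv_before, epv_after, responsibility):
    # Positive delta means the attack's expected value was reduced.
    delta = epv_before - epv_after
    return {name: share * delta for name, share in responsibility.items()}

resp = {"defender_3": 0.6, "defender_5": 0.3, "defender_8": 0.1}
print(defensive_credits(epv_before=0.12, epv_after=0.05, responsibility=resp))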

Updated: 2025-12-11 07:12:23

标题: 比解决更好的是预防:基于图神经网络对足球防守的价值进行评估

摘要: 在足球中评估防守表现仍然具有挑战性,因为有效的防守通常不是通过拦截和抢断等明显的球上动作表现出来,而是通过在危险机会出现之前预防它们。现有方法主要集中在评估球上动作的价值,使得大部分防守球员真实影响力无法被测量。为了填补这一空白,我们提出了DEFCON(DEFensive CONtribution evaluator),这是一个全面的框架,用于量化足球中每个进攻情况下球员级别的防守贡献。利用图注意力网络,DEFCON估计每个进攻选项的成功概率和预期价值,以及每个防守球员对阻止它的责任。这些组件产生了攻击团队在每个行动之前和之后的预期控球价值(EPV),DEFCON根据防守球员是否降低或增加对手的EPV来分配正面或负面的信用。在2023-24赛季训练并在2024-25赛季埃雷迪维西赛事和跟踪数据上评估后,DEFCON的总体球员信用与市场估值呈现强烈正相关。最后,我们展示了几个实际应用,包括防守贡献的比赛时间线,跨球场区域的空间分析,以及进攻者和防守者互动的成对摘要。

更新时间: 2025-12-11 07:12:23

领域: cs.LG,cs.MA

下载: http://arxiv.org/abs/2512.10355v1

RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis

Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and the dual transformer decoder that employs guidelines as queries to steer cross-modal fusion, aligning the models with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess the interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD's generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.

Updated: 2025-12-11 07:09:18

标题: RAD: 朝着值得信赖的检索增强多模式临床诊断

摘要: 临床诊断是一门高度专业化的学科,需要领域专业知识和严格遵守严格的指南。虽然当前由人工智能驱动的医学研究主要集中在知识图或自然文本预训练范式上,以整合医学知识,但这些方法主要依赖于模型参数中隐含编码的知识,忽略了多样化下游任务所需的任务特定知识。为了解决这一局限性,我们提出了检索增强诊断(RAD),这是一个新颖的框架,能够直接在下游任务中将外部知识显式地注入多模态模型中。具体而言,RAD通过三个关键机制运作:从多个医学来源中检索和细化以疾病为中心的知识,利用增强对比损失约束多模态特征与指南知识之间的潜在距离,以及使用指南作为查询的双变压器解码器,引导跨模态融合,使模型与临床诊断工作流程对齐,从指南获取到特征提取和决策制定。此外,鉴于多模态诊断模型的解释性缺乏定量评估,我们引入了一套标准来从图像和文本角度评估解释性。在不同解剖学的四个数据集上进行的广泛评估显示了RAD的泛化能力,实现了最先进的性能。此外,RAD使模型能够更精确地集中在异常区域和关键指标上,确保基于证据的可信诊断。我们的代码可在https://github.com/tdlhl/RAD 上找到。

更新时间: 2025-12-11 07:09:18

领域: cs.LG

下载: http://arxiv.org/abs/2509.19980v2

Dynamics of Agentic Loops in Large Language Models: A Geometric Theory of Trajectories

Agentic systems built on large language models operate through recursive feedback loops, where each output becomes the next input. Yet the geometric behavior of these agentic loops (whether they converge, diverge, or exhibit more complex dynamics) remains poorly understood. This paper introduces a geometric framework for analyzing agentic trajectories in semantic embedding space, treating iterative transformations as discrete dynamical systems. We distinguish the artifact space, where linguistic transformations occur, from the embedding space, where geometric measurements are performed. Because cosine similarity is biased by embedding anisotropy, we introduce an isotonic calibration that eliminates systematic bias and aligns similarities with human semantic judgments while preserving high local stability. This enables rigorous measurement of trajectories, clusters and attractors. Through controlled experiments on singular agentic loops, we identify two fundamental regimes. A contractive rewriting loop converges toward a stable attractor with decreasing dispersion, while an exploratory summarize and negate loop produces unbounded divergence with no cluster formation. These regimes display qualitatively distinct geometric signatures of contraction and expansion. Our results show that prompt design directly governs the dynamical regime of an agentic loop, enabling systematic control of convergence, divergence and trajectory structure in iterative LLM transformations.

Updated: 2025-12-11 07:06:14

标题: 大型语言模型中主动环路的动态:轨迹的几何理论

摘要: 建立在大型语言模型上的主体系统通过递归反馈循环运作,其中每个输出都成为下一个输入。然而,这些主体循环的几何行为(无论它们是收敛、发散还是表现出更复杂的动态)仍然知之甚少。本文介绍了一个用于分析语义嵌入空间中主体轨迹的几何框架,将迭代变换视为离散动态系统。 我们区分了艺术品空间,即语言变换发生的地方,与嵌入空间,即进行几何测量的地方。由于余弦相似性受嵌入各向异性的影响,我们引入了一种等温校准,消除了系统偏差,并将相似性与人类语义判断保持高局部稳定性。这使得能够严格测量轨迹、聚类和吸引子。 通过对特定主体循环的控制实验,我们确定了两种基本模式。一种收缩重写循环向稳定吸引子收敛,分散程度降低,而一种探索性总结和否定循环产生无界发散,没有聚类形成。这些模式展示了收缩和扩展的定性不同的几何特征。 我们的结果表明,提示设计直接决定了主体循环的动态模式,使得能够系统地控制迭代LLM变换中的收敛、发散和轨迹结构。

更新时间: 2025-12-11 07:06:14

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.10350v1

REMISVFU: Vertical Federated Unlearning via Representation Misdirection for Intermediate Output Feature

Data-protection regulations such as the GDPR grant every participant in a federated system a right to be forgotten. Federated unlearning has therefore emerged as a research frontier, aiming to remove a specific party's contribution from the learned model while preserving the utility of the remaining parties. However, most unlearning techniques focus on Horizontal Federated Learning (HFL), where data are partitioned by samples. In contrast, Vertical Federated Learning (VFL) allows organizations that possess complementary feature spaces to train a joint model without sharing raw data. The resulting feature-partitioned architecture renders HFL-oriented unlearning methods ineffective. In this paper, we propose REMISVFU, a plug-and-play representation misdirection framework that enables fast, client-level unlearning in splitVFL systems. When a deletion request arrives, the forgetting party collapses its encoder output to a randomly sampled anchor on the unit sphere, severing the statistical link between its features and the global model. To maintain utility for the remaining parties, the server jointly optimizes a retention loss and a forgetting loss, aligning their gradients via orthogonal projection to eliminate destructive interference. Evaluations on public benchmarks show that REMISVFU suppresses back-door attack success to the natural class-prior level and sacrifices only about 2.5 percentage points of clean accuracy, outperforming state-of-the-art baselines.
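The gradient-alignment idea can be sketched with the common projection recipe below: when the retention and forgetting gradients conflict, drop the conflicting component before combining. Whether REMISVFU uses exactly this rule is our assumption; the abstract specifies only that gradients are aligned via orthogonal projection:

import torch

def combine_gradients(g_retain: torch.Tensor, g_forget: torch.Tensor):
    # If the two objectives interfere destructively (negative dot product),
    # remove the forgetting gradient's component along the retention gradient.
    if torch.dot(g_retain, g_forget) < 0:
        g_forget = g_forget - (torch.dot(g_forget, g_retain)
                               / g_retain.norm() ** 2) * g_retain
    return g_retain + g_forget

g_r = torch.tensor([1.0, 0.0])
g_f = torch.tensor([-0.5, 1.0])          # conflicts with g_r
print(combine_gradients(g_r, g_f))       # tensor([1., 1.])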

Updated: 2025-12-11 07:05:36

标题: REMISVFU:通过表示误导进行中间输出特征的垂直联邦去学习

摘要: 数据保护法规,如《通用数据保护条例》(GDPR),授予联合系统中的每个参与者被遗忘的权利。因此,联合遗忘已经成为研究的前沿,旨在从学习的模型中删除特定方的贡献,同时保留其余方的效用。然而,大多数遗忘技术侧重于水平联合学习(HFL),其中数据按样本进行分区。相比之下,垂直联合学习(VFL)允许拥有互补特征空间的组织训练一个联合模型,而无需共享原始数据。由此产生的特征分区架构使得以HFL为导向的遗忘方法无效。在本文中,我们提出了REMISVFU,这是一个即插即用的表示误导框架,可以在splitVFL系统中实现快速、客户端级的遗忘。当删除请求到达时,忘记方将其编码器输出折叠到单位球上的随机采样锚点上,切断其特征与全局模型之间的统计联系。为了保持其余方的效用,服务器联合优化保留损失和遗忘损失,通过正交投影来对齐它们的梯度,消除破坏性干扰。对公共基准进行的评估显示,REMISVFU将后门攻击成功抑制到自然类先验水平,并且仅损失约2.5%的准确度,胜过最先进的基线模型。

更新时间: 2025-12-11 07:05:36

领域: cs.AI

下载: http://arxiv.org/abs/2512.10348v1

Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.

Updated: 2025-12-11 07:03:44

标题: 三思而后行:受世界模型启发的自动驾驶车辆多模态基础

摘要: 解释自然语言命令以定位目标对象对自动驾驶(AD)至关重要。现有的自动驾驶车辆(AVs)的视觉基础(VG)方法通常在处理模糊、依赖上下文的指令时遇到困难,因为它们缺乏对3D空间关系和预期场景演变的推理。基于世界模型的原则,我们提出了ThinkDeeper,这是一个在做出基础决策之前推理未来空间状态的框架。其核心是一个空间感知世界模型(SA-WM),通过将当前场景提炼成一个命令感知的潜在状态并展开一系列未来潜在状态,提供前瞻性线索以消除歧义。此外,一个超图引导的解码器随后会层次性地融合这些状态和多模态输入,捕捉高阶空间依赖关系以实现稳健的定位。此外,我们提出了DrivePilot,一个在AD中的多源VG数据集,其中包含通过检索增强生成(RAG)和思维链(CoT)提示的LLM管道生成的语义注释。在六个基准测试上进行了广泛评估,ThinkDeeper在Talk2Car排行榜上排名第一,并在DrivePilot、MoCAD和RefCOCO / + / g基准测试上超越了最先进的基线。值得注意的是,它在具有挑战性场景(长文本、多代理、模糊)中表现出强大的鲁棒性和高效性,即使在仅使用50%的数据进行训练时,仍然保持着优越的性能。

更新时间: 2025-12-11 07:03:44

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.03454v2

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
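For concreteness, the simplest member of the gradient-based visual attack family reads as below (an FGSM-style one-step perturbation; the paper's multimodal and black-box attacks are more involved, and `model` and `loss_fn` here are hypothetical placeholders):

import torch

def fgsm_perturb(image: torch.Tensor, model, loss_fn, eps: float = 4 / 255):
    # One-step sign-of-gradient perturbation of the visual input.
    image = image.detach().clone().requires_grad_(True)
    loss_fn(model(image)).backward()
    return (image + eps * image.grad.sign()).clamp(0.0, 1.0).detach()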

Updated: 2025-12-11 06:59:01

标题: 当对齐失败:对视觉-语言-动作模型进行多模态对抗攻击

摘要: 视觉-语言-行动模型(VLAs)最近在具体环境中取得了显著进展,使得机器人能够通过统一的多模态理解来感知、推理和行动。尽管这些系统具有令人印象深刻的能力,但它们的对抗鲁棒性在现实多模态和黑盒条件下仍然很少被探究。现有研究主要集中在单模态扰动上,并忽视了跨模态不对齐对具体推理和决策产生根本影响。在本文中,我们介绍了VLA-Fool,这是对在白盒和黑盒设置下的具体VLA模型进行多模态对抗鲁棒性全面研究。VLA-Fool统一了三个级别的多模态对抗攻击:(1)文本扰动通过基于梯度和基于提示的操作,(2)视觉扰动通过补丁和噪声失真,以及(3)跨模态不对齐攻击,有意破坏感知和指示之间的语义对应关系。我们进一步将VLA感知空间纳入语言提示中,开发了第一个自动制作和语义引导的提示框架。在使用经过微调的OpenVLA模型进行的LIBERO基准测试中,实验证明,即使是轻微的多模态扰动也可能导致显著的行为偏差,展示了具体多模态对齐的脆弱性。

更新时间: 2025-12-11 06:59:01

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2511.16203v3

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their possible output of unsafe contents, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first most comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.

Updated: 2025-12-11 06:53:44

标题: OutSafe-Bench:大型语言模型中多模态攻击性内容检测的基准测试

摘要: 鉴于多模态大型语言模型(MLLMs)日益被整合到日常工具和智能代理中,人们对其可能输出不安全内容的担忧日益加剧,范围涵盖有毒语言、偏见形象、隐私侵犯和有害错误信息。目前的安全基准在模态覆盖和性能评估方面仍然非常有限,通常忽视了广泛的内容安全领域。在这项工作中,我们介绍了OutSafe-Bench,这是第一个专为多模态时代设计的最全面的内容安全评估测试套件。OutSafe-Bench包括一个大规模数据集,涵盖了四种模态,包括超过18,000个双语(中文和英文)文本提示、4,500个图像、450个音频剪辑和450个视频,所有这些内容都在九个关键内容风险类别下进行了系统注释。除了数据集,我们还引入了一种多维交叉风险评分(MCRS),这是一种新颖的度量标准,旨在对不同类别之间的重叠和相关内容风险进行建模和评估。为了确保公平且健壮的评估,我们提出了FairScore,这是一个可解释的自动化多审阅者加权聚合框架。FairScore选择表现最佳的模型作为自适应陪审团,从而减轻了单一模型判断的偏见,并增强了整体评估的可靠性。我们对九种最先进的MLLM进行的评估显示出持续且重大的安全漏洞,强调了在MLLM中加强健全保障的紧迫需求。

更新时间: 2025-12-11 06:53:44

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2511.10287v3

Learning Generalizable Shape Completion with SIM(3) Equivariance

3D shape completion methods typically assume scans are pre-aligned to a canonical frame. This leaks pose and scale cues that networks may exploit to memorize absolute positions rather than inferring intrinsic geometry. When such alignment is absent in real data, performance collapses. We argue that robust generalization demands architectural equivariance to the similarity group, SIM(3), so the model remains agnostic to pose and scale. Following this principle, we introduce the first SIM(3)-equivariant shape completion network, whose modular layers successively canonicalize features, reason over similarity-invariant geometry, and restore the original frame. Under a de-biased evaluation protocol that removes the hidden cues, our model outperforms both equivariant and augmentation baselines on the PCN benchmark. It also sets new cross-domain records on real driving and indoor scans, lowering minimal matching distance on KITTI by 17% and Chamfer distance $\ell_1$ on OmniObject3D by 14%. Perhaps surprisingly, our model under the stricter protocol still outperforms competitors under their biased settings. These results establish full SIM(3) equivariance as an effective route to truly generalizable shape completion. Project page: https://sime-completion.github.io.
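The translation and scale parts of SIM(3) canonicalization are easy to sketch: subtract the centroid and normalize by mean radius, so absolute pose and scale cues vanish. Handling rotation equivariantly is the hard, architecture-specific part and is omitted here:

import numpy as np

def canonicalize(points: np.ndarray):
    # Remove translation and scale; rotation handling is omitted here.
    centroid = points.mean(axis=0)
    centered = points - centroid
    scale = np.linalg.norm(centered, axis=1).mean()
    return centered / scale, (centroid, scale)   # frame kept to restore output

pts = np.random.rand(1024, 3) * 5.0 + 2.0        # arbitrary pose and scale
canon, frame = canonicalize(pts)
print(np.linalg.norm(canon, axis=1).mean())      # ~1.0: scale cue removed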

Updated: 2025-12-11 06:52:51

标题: 学习具有SIM(3)等变性的可泛化形状完成

摘要: 3D形状补全方法通常假设扫描已经预先对齐到一个规范框架。这会泄露姿势和尺度线索,网络可能会利用这些线索来记忆绝对位置,而不是推断内在几何。当真实数据中缺少这种对齐时,性能会下降。我们认为,强大的泛化性需要对相似性群SIM(3)具有架构等变性,因此模型保持对姿势和尺度的不可知性。遵循这一原则,我们引入了第一个SIM(3)-等变形状完成网络,其模块化层逐步使特征规范化,对相似性不变的几何进行推理,并恢复原始框架。在一个去除隐藏线索的去偏评估协议下,我们的模型在PCN基准上表现优于等变和数据增强基线。它还在真实驾驶和室内扫描的跨域记录上取得了新的成绩,将在KITTI上的最小匹配距离降低了17%,在OmniObject3D上的Chamfer距离$\ell1$降低了14%。或许令人惊讶的是,在更严格的协议下,我们的模型仍然优于在其偏见设置下的竞争对手。这些结果确立了完全的SIM(3)等变性作为真正通用形状完成的有效途径。项目页面:https://sime-completion.github.io。

更新时间: 2025-12-11 06:52:51

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.26631v3

A Privacy-Preserving Cloud Architecture for Distributed Machine Learning at Scale

Distributed machine learning systems require strong privacy guarantees, verifiable compliance, and scalable deployment across heterogeneous and multi-cloud environments. This work introduces a cloud-native privacy-preserving architecture that integrates federated learning, differential privacy, zero-knowledge compliance proofs, and adaptive governance powered by reinforcement learning. The framework supports secure model training and inference without centralizing sensitive data, while enabling cryptographically verifiable policy enforcement across institutions and cloud platforms. A full prototype deployed across hybrid Kubernetes clusters demonstrates reduced membership-inference risk, consistent enforcement of formal privacy budgets, and stable model performance under differential privacy. Experimental evaluation across multi-institution workloads shows that the architecture maintains utility with minimal overhead while providing continuous, risk-aware governance. The proposed framework establishes a practical foundation for deploying trustworthy and compliant distributed machine learning systems at scale.
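One ingredient named above, enforcing a differential privacy budget, is commonly realized with the DP-SGD recipe sketched below (per-example gradient clipping plus Gaussian noise); the clip norm and noise multiplier are illustrative assumptions, not the paper's settings:

import torch

def privatize(grads: list, clip: float = 1.0, sigma: float = 1.2):
    # Clip each per-example gradient to norm <= clip, average, add noise.
    clipped = []
    for g in grads:
        scale = min(1.0, clip / (g.norm() + 1e-12))
        clipped.append(g * scale)
    avg = torch.stack(clipped).mean(dim=0)
    return avg + torch.randn_like(avg) * sigma * clip / len(grads)

grads = [torch.randn(10) for _ in range(32)]  # stand-in per-example gradients
print(privatize(grads).shape)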

Updated: 2025-12-11 06:46:46

标题: 一个用于大规模分布式机器学习的隐私保护云架构

摘要: 分布式机器学习系统需要强有力的隐私保障、可验证的合规性,并且可以在异构和多云环境中进行可扩展部署。本研究介绍了一种云原生隐私保护架构,该架构集成了联邦学习、差分隐私、零知识合规性证明,并且由强化学习驱动的自适应治理。该框架支持安全的模型训练和推断,而不会将敏感数据集中,同时实现了在机构和云平台间进行加密可验证的策略执行。在混合Kubernetes集群上部署的完整原型展示了降低成员推理风险、一致执行形式隐私预算以及在差分隐私下稳定模型性能。跨多机构工作负载的实验评估显示,该架构在提供持续、风险感知的治理的同时,保持了最小的开销。提出的框架为规模部署可信赖和合规的分布式机器学习系统奠定了实践基础。

更新时间: 2025-12-11 06:46:46

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2512.10341v1

Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks

Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Page: https://wm-research.github.io/Dream4Drive/ GitHub Link: https://github.com/wm-research/Dream4Drive

Updated: 2025-12-11 06:46:07

标题: 重新思考将驾驶世界模型作为感知任务的合成数据生成器

摘要: 最近在驾驶世界模型方面取得的进展使得可以可控生成高质量的RGB视频或多模态视频。现有方法主要关注与生成质量和可控性相关的指标。然而,它们经常忽视下游感知任务的评估,这对自动驾驶的性能非常关键。现有方法通常利用先在合成数据上进行预训练,然后在真实数据上进行微调的训练策略,导致所需的训练轮次是基准(仅使用真实数据)的两倍。当我们在基准中将训练轮次加倍时,合成数据的好处变得微不足道。为了全面展示合成数据的好处,我们引入了Dream4Drive,这是一个专为增强下游感知任务而设计的新颖合成数据生成框架。Dream4Drive首先将输入视频分解为几个3D感知的引导地图,然后将3D资产渲染到这些引导地图上。最后,驾驶世界模型被微调以生成编辑过的、多视角逼真的视频,这些视频可用于训练下游感知模型。Dream4Drive实现了在规模上生成多视角边缘案例的前所未有的灵活性,显著提升了自动驾驶中的边缘案例感知。为促进未来研究,我们还提供了一个名为DriveObj3D的大规模3D资产数据集,涵盖了驾驶场景中的典型类别,并支持多样化的3D感知视频编辑。我们进行了全面的实验,展示了Dream4Drive在各种训练轮次下有效提升下游感知模型性能的能力。页面链接:https://wm-research.github.io/Dream4Drive/ GitHub链接:https://github.com/wm-research/Dream4Drive

更新时间: 2025-12-11 06:46:07

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2510.19195v3

On the Collapse of Generative Paths: A Criterion and Correction for Diffusion Steering

Inference-time steering enables pretrained diffusion/flow models to be adapted to new tasks without retraining. A widely used approach is the ratio-of-densities method, which defines a time-indexed target path by reweighting probability-density trajectories from multiple models with positive, or in some cases, negative exponents. This construction, however, harbors a critical and previously unformalized failure mode: Marginal Path Collapse, where intermediate densities become non-normalizable even though endpoints remain valid. Collapse arises systematically when composing heterogeneous models trained on different noise schedules or datasets, including a common setting in molecular design where de-novo, conformer, and pocket-conditioned models must be combined for tasks such as flexible-pose scaffold decoration. We provide a novel and complete solution for the problem. First, we derive a simple path existence criterion that predicts exactly when collapse occurs from noise schedules and exponents alone. Second, we introduce Adaptive path Correction with Exponents (ACE), which extends Feynman-Kac steering to time-varying exponents and guarantees a valid probability path. On a synthetic 2D benchmark and on flexible-pose scaffold decoration, ACE eliminates collapse and enables high-guidance compositional generation, improving distributional and docking metrics over constant-exponent baselines and even specialized task-specific scaffold decoration models. Our work turns ratio-of-densities steering with heterogeneous experts from an unstable heuristic into a reliable tool for controllable generation.
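The construction at issue can be written down directly: given component marginals $p_t^{(i)}$ and exponents $\gamma_i$ (positive or negative), the ratio-of-densities path is $\tilde p_t(x) \propto \prod_i p_t^{(i)}(x)^{\gamma_i}$. The collapse described above is precisely that this product can fail to be normalizable at intermediate $t$ even when it is proper at the endpoints; the paper's criterion predicts this from the noise schedules and exponents alone.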

Updated: 2025-12-11 06:44:08

标题: 关于生成路径崩溃:扩散导向的标准和修正

摘要: 推理时间导向使预训练扩散/流模型能够适应新任务而无需重新训练。一种广泛使用的方法是密度比方法,它通过使用正指数或在某些情况下使用负指数对来自多个模型的概率密度轨迹进行重新加权,从而定义了一个时间索引的目标路径。然而,这种构建存在一个关键的、以前没有形式化的失败模式:边际路径坍塌,即使端点仍然有效,中间密度也变得不可归一化。当组合在不同的噪声时间表或数据集上训练的异质模型时,包括在分子设计中的常见设置,其中必须将全新的、构象和口袋条件模型结合在一起以执行灵活姿态支架装饰等任务时,坍塌会系统地出现。我们提出了这个问题的一种新颖且完整的解决方案。首先,我们推导出一个简单的路径存在准则,仅从噪声时间表和指数就可以准确预测坍塌发生的时机。其次,我们引入了带指数的自适应路径校正(ACE),它将 Feynman-Kac 导向扩展到时变指数,并保证一个有效的概率路径。在一个合成的2D基准测试和灵活姿态支架装饰上,ACE消除了坍塌,并实现了高引导的组合生成,改进了分布和对接指标,超过了恒定指数基线甚至专门针对特定任务的支架装饰模型。我们的工作将具有不稳定性启发式的异质专家的密度比导向转变为一个可靠的可控生成工具。

更新时间: 2025-12-11 06:44:08

领域: cs.AI

下载: http://arxiv.org/abs/2512.10339v1

Noisy Spiking Actor Network for Exploration

As a general method for exploration in deep reinforcement learning (RL), NoisyNet can produce problem-specific exploration strategies. Spiking neural networks (SNNs), due to their binary firing mechanism, have strong robustness to noise, making it difficult to realize efficient exploration with local disturbances. To solve this exploration problem, we propose a noisy spiking actor network (NoisySAN) that introduces time-correlated noise during charging and transmission. Moreover, a noise reduction method is proposed to find a stable policy for the agent. Extensive experimental results demonstrate that our method outperforms the state-of-the-art performance on a wide range of continuous control tasks from OpenAI gym.
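Time-correlated noise of the kind described can be realized with a simple AR(1) process, sketched below; whether NoisySAN uses exactly this process, and the values of rho and sigma, are our assumptions:

import torch

def correlated_noise(steps: int, shape, rho: float = 0.9, sigma: float = 0.1):
    # AR(1): n_t = rho * n_{t-1} + sigma * sqrt(1 - rho^2) * eps_t, which is
    # correlated across timesteps with (asymptotically) constant variance.
    n = torch.zeros(shape)
    out = []
    for _ in range(steps):
        n = rho * n + sigma * (1.0 - rho ** 2) ** 0.5 * torch.randn(shape)
        out.append(n)
    return torch.stack(out)

noise = correlated_noise(steps=16, shape=(128,))  # one draw per neuron per step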

Updated: 2025-12-11 06:42:28

标题: 噪声尖峰演员网络用于探索

摘要: 作为深度强化学习(RL)中探索的一般方法,NoisyNet可以生成特定于问题的探索策略。由于脉冲神经网络(SNN)具有二进制发射机制,对噪声具有较强的鲁棒性,因此难以在局部扰动中实现高效的探索。为了解决这一探索问题,我们提出了一种引入时间相关噪声的有噪脉冲演员网络(NoisySAN)。此外,提出了一种噪声减少方法来为代理找到稳定的策略。大量实验结果表明,我们的方法在从OpenAI gym中的各种连续控制任务中表现优于最先进的性能。

更新时间: 2025-12-11 06:42:28

领域: cs.LG,cs.NE

下载: http://arxiv.org/abs/2403.04162v2

Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment

Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives (those directly tied to the prompt's alignment error) and untargeted negatives (tokens unrelated to the prompt but present in the generated image) can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at https://github.com/wiarae/NPC.
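The mechanism NPC automates is ordinary negative prompting in a diffusion pipeline, as in the sketch below; the checkpoint name and the hand-written negative prompt are placeholders, and NPC itself discovers the negative text automatically:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # requires a GPU

image = pipe(
    prompt="a red cube on top of a blue sphere",
    negative_prompt="extra cubes, floating objects",  # content to suppress
).images[0]
image.save("corrected.png")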

Updated: 2025-12-11 06:42:25

标题: 引导生成什么不生成:文本-图像对齐的自动负向提示

摘要: 尽管在文本到图像生成方面取得了重大进展,但实现精确的文本-图像对齐仍然具有挑战性,特别是对于具有丰富构成结构或想象元素的提示。为了解决这个问题,我们引入了一种名为Negative Prompting for Image Correction(NPC)的自动化流程,通过识别并应用抑制意外内容的负面提示来改善对齐。我们首先通过分析交叉注意力模式来解释为什么针对性负面(直接与提示的对齐错误相关的)和非针对性负面(与提示无关但出现在生成的图像中的标记)都可以增强对齐。为了发现有用的负面内容,NPC使用验证器-字幕生成器-提议者框架生成候选提示,并根据显著的文本空间分数对其进行排名,从而能够在不需要额外图像合成的情况下有效选择。在GenEval++和Imagine-Bench上,NPC优于强基线,分别在GenEval++上达到0.571比0.371,同时在Imagine-Bench上取得最佳综合表现。通过引导生成什么不应该生成,NPC提供了一条有原则的、完全自动化的路径,可以在扩散模型中实现更强的文本-图像对齐。代码已发布在https://github.com/wiarae/NPC。

更新时间: 2025-12-11 06:42:25

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.07702v2

Multilingual VLM Training: Adapting an English-Trained VLM to French

Artificial intelligence has made great progress in recent years, particularly in the development of Vision-Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non-English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we explore and compare different methods in terms of their performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.
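Of the three strategies compared, LoRA finetuning is the easiest to sketch; below is a minimal example with the peft library, using a small text-only model as a stand-in (the paper adapts a VLM, and the rank and target modules here are assumptions):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small text-only stand-in
config = LoraConfig(r=16, lora_alpha=32, target_modules=["c_attn"],
                    lora_dropout=0.05)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the low-rank adapters train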

Updated: 2025-12-11 06:38:51

标题: 多语种VLM培训:将接受英语培训的VLM适应法语

摘要: 人工智能在近年来取得了巨大进展,特别是在视觉-语言模型(VLMs)的发展方面,这些模型能够理解视觉和文本数据。然而,这些进展在很大程度上仍然局限于英语,降低了非英语使用者的可访问性。将这些能力扩展到更广泛的语言是至关重要的。本文探讨了将基于英语训练的VLM适应到不同语言的挑战。为此,我们将探讨并比较不同方法的性能和计算成本。我们考虑了基于翻译的流程、LoRA微调以及将视觉适应和语言适应分开的两阶段微调策略。为了评估这些方法,我们使用了一组标准的多模态基准数据集翻译成目标语言,并由本地专家进行手动评估。结果显示,数据集翻译仍然是多语言VLM性能的一个主要瓶颈,数据质量限制了训练和评估的有效性。这些发现表明,未来的努力应该集中在本地语言数据集收集和改进翻译策略上。

更新时间: 2025-12-11 06:38:51

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.10336v1

SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination

Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10 percentage-point improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.
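Steering along an SAE feature amounts to adding a scaled decoder direction to a hidden activation, as sketched below; the SAE weights, feature index, and strength are stand-ins, and selecting which features count as "visual understanding" is the paper's probing procedure:

import torch

d_model, d_sae = 64, 512
W_dec = torch.randn(d_sae, d_model)                  # stand-in decoder weights
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)     # unit feature directions

def steer(hidden: torch.Tensor, feature_idx: int, alpha: float = 4.0):
    # Add a scaled SAE decoder direction to the activation.
    return hidden + alpha * W_dec[feature_idx]

h = torch.randn(d_model)
h_steered = steer(h, feature_idx=42)  # push along one "visual" feature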

Updated: 2025-12-11 06:37:50

标题: SAVE:用于减轻物体幻觉的稀疏自编码器驱动的视觉信息增强

摘要: 虽然多模态大型语言模型(MLLMs)已经取得了显著进展,但它们仍然容易受到由语言先验和视觉信息丢失引起的对象幻觉的影响。为了解决这个问题,我们提出了SAVE(稀疏自动编码器驱动的视觉信息增强)框架,通过引导模型沿着稀疏自动编码器(SAE)潜在特征来减轻幻觉。一个二进制对象存在的问答探针识别了最能表明模型视觉信息处理的SAE特征,称为视觉理解特征。沿着这些识别出的特征引导模型可以加强基于视觉的理解,并有效减少幻觉。通过其简单的设计,SAVE在标准基准测试中表现优于最先进的无需训练的方法,在CHAIR_S上取得了10个百分点的改进,并在POPE和MMHal-Bench上取得了持续的增益。对多个模型和层次的广泛评估证实了我们方法的鲁棒性和通用性。进一步的分析显示,沿着视觉理解特征引导可以抑制不确定对象令牌的生成,并增加对图像令牌的关注,从而减轻幻觉。代码发布在https://github.com/wiarae/SAVE。

更新时间: 2025-12-11 06:37:50

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.07730v2

JITServe: SLO-aware LLM Serving with Imprecise Request Information

The integration of Large Language Models (LLMs) into applications ranging from interactive chatbots to multi-agent systems has introduced a wide spectrum of service-level objectives (SLOs) for responsiveness. These include latency-sensitive requests emphasizing per-token latency in streaming chat, deadline-sensitive requests requiring rapid full responses to trigger external tools, and compound requests with evolving dependencies across multiple LLM calls. Despite, or perhaps because of, this workload diversity and unpredictable request information (e.g., response lengths and dependencies), existing request schedulers have focused on aggregate performance and are unable to ensure application-level SLO needs. This paper presents JITServe, the first SLO-aware LLM serving system designed to maximize service goodput (e.g., the number of tokens meeting request SLOs) across diverse workloads. JITServe novelly schedules requests using imprecise request information and gradually relaxes this conservatism by refining request information estimates as generation progresses. It applies a grouped margin goodput maximization algorithm to allocate just enough serving bandwidth to satisfy each request's SLO just-in-time (JIT), maximizing residual capacity for others, while deciding the composition of requests in a batch to maximize efficiency and goodput with provable guarantees. Our evaluation across diverse realistic workloads, including chat, deep research, and agentic pipelines, shows that JITServe improves service goodput by 1.4x-6.3x, alternatively achieving 28.5%-83.2% resource savings, compared to state-of-the-art designs.
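The just-in-time intuition can be sketched as follows: give each request the minimal decode rate that still meets its deadline, and admit requests in order of required rate until capacity is exhausted. This toy rule is our simplification; the paper's grouped margin goodput maximization and its estimators are not reproduced:

def admit(requests, capacity_tokens_per_s):
    # requests: list of (id, est_remaining_tokens, seconds_to_deadline)
    need = [(rid, rem / max(t, 1e-6)) for rid, rem, t in requests]
    batch, used = [], 0.0
    for rid, rate in sorted(need, key=lambda x: x[1]):  # cheapest SLOs first
        if used + rate <= capacity_tokens_per_s:
            batch.append(rid)
            used += rate
    return batch

print(admit([("a", 200, 2.0), ("b", 500, 1.0), ("c", 100, 5.0)], 400.0))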

Updated: 2025-12-11 06:24:21

标题: JITServe:具有不精确请求信息的SLO感知LLM服务

摘要: 将大型语言模型(LLMs)集成到从交互式聊天机器人到多智能体系统的应用程序中,引入了一系列服务级别目标(SLOs)以确保响应性。这些目标包括对延迟敏感的请求,强调流式聊天中每个标记的延迟,对截止日期敏感的请求,需要快速完整响应以触发外部工具,并且具有跨多个LLM调用的演进依赖的复合请求。尽管(或许正因为如此)存在工作负载的多样性和不可预测的请求信息(例如,响应长度和依赖关系),现有的请求调度程序主要关注聚合性能,无法确保应用级别的SLO需求。 本文介绍了JITServe,这是第一个面向SLO的LLM服务系统,旨在通过不同的工作负载最大化服务吞吐量(例如,符合请求SLO的标记数量)。JITServe通过使用不精确的请求信息来安排请求,并随着生成的进行逐渐放松这种保守性,通过细化请求信息估计。它应用了一种分组边界吞吐量最大化算法,为每个请求分配足够的服务带宽,以及时满足每个请求的SLO,同时最大化其他请求的剩余容量,同时决定批处理中请求的组成,以最大化效率和吞吐量,并提供可证明的保证。我们在各种现实工作负载(包括聊天、深度研究和智能管道)上进行的评估显示,与最先进的设计相比,JITServe将服务吞吐量提高了1.4倍至6.3倍,交替实现了28.5%至83.2%的资源节约。

更新时间: 2025-12-11 06:24:21

领域: cs.DC,cs.LG,eess.SY

下载: http://arxiv.org/abs/2504.20068v2

Bidirectional Representations Augmented Autoregressive Biological Sequence Generation

Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non-Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non-autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross-decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine-species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at https://github.com/BEAM-Labs/denovo.
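The cross-decoder attention module, as described, has the AR decoder's states query the NAR decoder's bidirectional features; a minimal sketch with stand-in dimensions:

import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
ar_hidden = torch.randn(2, 10, 256)   # AR decoder states act as queries
nar_feats = torch.randn(2, 32, 256)   # bidirectional NAR features
fused, _ = attn(ar_hidden, nar_feats, nar_feats)  # keys/values from NAR side
print(fused.shape)                    # torch.Size([2, 10, 256])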

Updated: 2025-12-11 06:21:38

标题: 双向表示增强的自回归生物序列生成

摘要: 自回归(AR)模型在序列生成中很常见,但在许多生物学任务中存在限制,比如全新肽段测序和蛋白建模,因为它们是单向的,不能捕捉到关键的全局双向标记依赖关系。非自回归(NAR)模型提供了整体的、双向的表示,但在生成连贯性和可扩展性方面面临挑战。为了突破这一局限,我们提出了一个混合框架,通过动态地整合非自回归机制中丰富的上下文信息来增强AR生成。我们的方法将一个共享的输入编码器与两个解码器相结合:一个学习潜在的双向生物特征的非自回归解码器,以及一个利用这些双向特征合成生物序列的自回归解码器。一个新颖的跨解码器注意力模块使得自回归解码器能够迭代地查询和整合这些双向特征,从而丰富其预测。通过一个定制的训练策略,包括重要性退火以平衡目标和交叉解码器梯度阻塞以稳定、集中学习,实现了这种协同效应。对一个具有挑战性的九种物种基准的全新肽段测序进行评估表明,我们的模型明显超越了AR和NAR的基线。它独特地将AR的稳定性与NAR的上下文意识相融合,为各种下游数据提供了稳健、优越的性能。这项研究推进了生物序列建模技术,并为增强复杂序列生成的自回归模型提供了一种新颖的架构范式。源代码可在https://github.com/BEAM-Labs/denovo中找到。

更新时间: 2025-12-11 06:21:38

领域: cs.LG

下载: http://arxiv.org/abs/2510.08169v3

Residual subspace evolution strategies for nonlinear inverse problems

Nonlinear inverse problems often feature noisy, non-differentiable, or expensive residual evaluations that make Jacobian-based solvers unreliable. Popular derivative-free optimizers such as natural evolution strategies (NES) or Powell's NEWUOA still assume smoothness or expend many evaluations to maintain stability. Ensemble Kalman inversion (EKI) relies on empirical covariances that require preconditioning and scale poorly with residual dimension. We introduce residual subspace evolution strategies (RSES), a derivative-free solver that samples Gaussian probes around the current iterate, builds a residual-only surrogate from their differences, and recombines the probes through a least-squares solve yielding an optimal update without forming Jacobians or covariances. Each iteration costs $k+1$ residual evaluations, where $k \ll n$ for $n$-dimensional problems, with $O(k^3)$ linear algebra overhead. Benchmarks on calibration, regression, and deconvolution problems demonstrate consistent misfit reduction in both deterministic and stochastic settings. RSES matches or surpasses xNES and NEWUOA while staying competitive with EKI under matched evaluation budgets, particularly when smoothness or covariance assumptions fail.
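The abstract is concrete enough to sketch one iteration: sample k Gaussian probes, difference their residuals against the current one, and solve a small least-squares problem to recombine the probes. The toy problem, probe scale, and iteration count below are illustrative:

import numpy as np

def rses_step(residual, x, k=5, sigma=0.1):
    r0 = residual(x)                       # residual at current iterate
    D = sigma * np.random.randn(len(x), k) # k Gaussian probe directions
    R = np.stack([residual(x + D[:, i]) - r0 for i in range(k)], axis=1)
    c, *_ = np.linalg.lstsq(R, -r0, rcond=None)  # residual-only surrogate
    return x + D @ c                       # recombined probe update

# Toy nonlinear inverse problem: recover x from r(x) = x**2 - y.
y = np.array([4.0, 9.0])
x = np.array([1.0, 1.0])
for _ in range(50):
    x = rses_step(lambda z: z ** 2 - y, x)
print(x)  # approaches [2, 3] from this positive start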

Updated: 2025-12-11 06:20:13

标题: 非线性反问题的残余子空间演化策略

摘要: 非线性反问题通常具有噪声、不可微分或昂贵的残差评估,这使得基于雅可比矩阵的求解器不可靠。流行的无导数优化器(如自然进化策略(NES)或Powell的NEWUOA)仍然假设平滑性或需要进行许多评估以维持稳定性。集合卡尔曼反演(EKI)依赖于需要预处理并随着残差维度的增加而缩放的经验协方差。 我们介绍了残差子空间进化策略(RSES),这是一种无导数求解器,它围绕当前迭代点对高斯探测器进行采样,从它们的差异构建仅残差的代理,并通过最小二乘求解重新组合这些探测器,从而产生一个最优更新,而无需形成雅可比矩阵或协方差。每次迭代需要$k+1$个残差评估,其中$k \ll n$适用于$n$维问题,线性代数开销为$O(k^3)$。 在校准、回归和解卷积问题上的基准测试显示,在确定性和随机设置中,RSES可以一致降低误差,而且在匹配的评估预算下,RSES与xNES和NEWUOA相匹配甚至超过它们,在平滑性或协方差假设失败时,与EKI保持竞争力。

更新时间: 2025-12-11 06:20:13

领域: math.OC,cs.LG

下载: http://arxiv.org/abs/2512.10325v1

User-Feedback-Driven Continual Adaptation for Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) requires agents to navigate complex environments by following natural-language instructions. General Scene Adaptation for VLN (GSA-VLN) shifts the focus from zero-shot generalization to continual, environment-specific adaptation, narrowing the gap between static benchmarks and real-world deployment. However, current GSA-VLN frameworks exclude user feedback, relying solely on unsupervised adaptation from repeated environmental exposure. In practice, user feedback offers natural and valuable supervision that can significantly enhance adaptation quality. We introduce a user-feedback-driven adaptation framework that extends GSA-VLN by systematically integrating human interactions into continual learning. Our approach converts user feedback (navigation instructions and corrective signals) into high-quality, environment-aligned training data, enabling efficient and realistic adaptation. A memory-bank warm-start mechanism further reuses previously acquired environmental knowledge, mitigating cold-start degradation and ensuring stable redeployment. Experiments on the GSA-R2R benchmark show that our method consistently surpasses strong baselines such as GR-DUET, improving navigation success and path efficiency. The memory-bank warm start stabilizes early navigation and reduces performance drops after updates. Results under both continual and hybrid adaptation settings confirm the robustness and generality of our framework, demonstrating sustained improvement across diverse deployment conditions.

Updated: 2025-12-11 06:11:45

标题: 用户反馈驱动的视觉与语言导航的持续适应

摘要: 视觉与语言导航(VLN)要求代理根据自然语言指令在复杂环境中导航。用于VLN的普适场景适应(GSA-VLN)将焦点从零样本泛化转向连续的、特定于环境的适应,缩小了静态基准和实际部署之间的差距。然而,当前的GSA-VLN框架排除了用户反馈,仅依赖于从重复环境暴露中的无监督适应。在实践中,用户反馈提供了自然且有价值的监督,可以显著提升适应质量。我们引入了一个用户反馈驱动的适应框架,通过将人类交互系统地整合到连续学习中,扩展了GSA-VLN。我们的方法将用户反馈(导航指令和校正信号)转换成高质量、与环境对齐的训练数据,实现了高效和现实的适应。一个内存库热启动机制进一步重用先前获得的环境知识,缓解了冷启动退化,并确保了稳定的重新部署。在GSA-R2R基准上的实验证明,我们的方法始终优于强基线,如GR-DUET,提高了导航成功率和路径效率。内存库热启动稳定了早期的导航,并减少了更新后的性能下降。在连续和混合适应设置下的结果证实了我们框架的稳健性和普适性,展示了在不同部署条件下持续改进。

更新时间: 2025-12-11 06:11:45

领域: cs.AI

下载: http://arxiv.org/abs/2512.10322v1

Translating Informal Proofs into Formal Proofs Using a Chain of States

We address the problem of translating informal mathematical proofs expressed in natural language into formal proofs in Lean4 under a constrained computational budget. Our approach is grounded in two key insights. First, informal proofs tend to proceed via a sequence of logical transitions - often implications or equivalences - without explicitly specifying intermediate results or auxiliary lemmas. In contrast, formal systems like Lean require an explicit representation of each proof state and the tactics that connect them. Second, each informal reasoning step can be viewed as an abstract transformation between proof states, but identifying the corresponding formal tactics often requires nontrivial domain knowledge and precise control over proof context. To bridge this gap, we propose a two stage framework. Rather than generating formal tactics directly, we first extract a Chain of States (CoS), a sequence of intermediate formal proof states aligned with the logical structure of the informal argument. We then generate tactics to transition between adjacent states in the CoS, thereby constructing the full formal proof. This intermediate representation significantly reduces the complexity of tactic generation and improves alignment with informal reasoning patterns. We build dedicated datasets and benchmarks for training and evaluation, and introduce an interactive framework to support tactic generation from formal states. Empirical results show that our method substantially outperforms existing baselines, achieving higher proof success rates.
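The two-stage pipeline can be paraphrased in a few lines, with `llm` as a hypothetical completion function standing in for the paper's trained models and all Lean interaction omitted:

def translate(informal_proof: str, theorem: str, llm) -> list:
    # Stage 1: extract a chain of intermediate Lean4 proof states aligned
    # with the informal argument's logical steps.
    states = llm(f"List the Lean4 proof states for:\n{theorem}\n"
                 f"{informal_proof}").split("\n")
    tactics = []
    # Stage 2: generate a tactic for each adjacent pair of states.
    for before, after in zip(states, states[1:]):
        tactics.append(llm(f"Give a Lean4 tactic taking\n{before}\nto\n{after}"))
    return tactics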

Updated: 2025-12-11 06:08:34

标题: 将非正式证明转化为形式证明:使用状态链

摘要: 我们研究了将用自然语言表达的非正式数学证明转化为在有限计算预算下在Lean4中的形式证明的问题。我们的方法基于两个关键见解。首先,非正式证明往往通过一系列逻辑转换进行(通常是蕴含或等价关系),而不明确指定中间结果或辅助引理。相比之下,像Lean这样的形式系统需要明确表示每个证明状态及连接它们的策略。其次,每个非正式推理步骤可以被视为证明状态之间的抽象转换,但识别相应的形式策略通常需要非平凡的领域知识和对证明上下文的精确控制。为了弥合这一差距,我们提出了一个两阶段框架。我们没有直接生成形式策略,而是首先提取一个状态链(CoS),这是一个与非正式论证的逻辑结构对齐的中间形式证明状态序列。然后我们生成策略来在CoS中的相邻状态之间进行转换,从而构建完整的形式证明。这种中间表示显著降低了策略生成的复杂性,并提高了与非正式推理模式的对齐度。我们构建了专门的数据集和基准用于训练和评估,并引入了一个交互式框架来支持从形式状态生成策略。实证结果表明我们的方法明显优于现有基线,实现了更高的证明成功率。

更新时间: 2025-12-11 06:08:34

领域: cs.LO,cs.AI

下载: http://arxiv.org/abs/2512.10317v1

EpiPlanAgent: Agentic Automated Epidemic Response Planning

Epidemic response planning is essential yet traditionally reliant on labor-intensive manual methods. This study aimed to design and evaluate EpiPlanAgent, an agent-based system using large language models (LLMs) to automate the generation and validation of digital emergency response plans. The multi-agent framework integrated task decomposition, knowledge grounding, and simulation modules. Public health professionals tested the system using real-world outbreak scenarios in a controlled evaluation. Results demonstrated that EpiPlanAgent significantly improved the completeness and guideline alignment of plans while drastically reducing development time compared to manual workflows. Expert evaluation confirmed high consistency between AI-generated and human-authored content. User feedback indicated strong perceived utility. In conclusion, EpiPlanAgent provides an effective, scalable solution for intelligent epidemic response planning, demonstrating the potential of agentic AI to transform public health preparedness.

Updated: 2025-12-11 06:03:17

标题: EpiPlanAgent:主动自动化疫情应对规划

摘要: 流行病应对计划至关重要,但传统上依赖于繁重的手工方法。本研究旨在设计和评估EpiPlanAgent,一个使用大型语言模型(LLMs)自动化生成和验证数字紧急应对计划的基于代理的系统。这个多代理框架集成了任务分解、知识基础和模拟模块。公共卫生专业人员在受控评估中使用真实世界的爆发场景测试了系统。结果表明,与手工工作流相比,EpiPlanAgent显著提高了计划的完整性和指南对齐性,同时大幅缩短了开发时间。专家评估确认了AI生成和人工编写内容之间高度一致性。用户反馈表明了强烈的感知实用性。总之,EpiPlanAgent提供了一个有效、可扩展的智能流行病应对计划解决方案,展示了代理AI改变公共卫生应急准备的潜力。

更新时间: 2025-12-11 06:03:17

领域: cs.AI,cs.CY

下载: http://arxiv.org/abs/2512.10313v1

High-Dimensional Data Processing: Benchmarking Machine Learning and Deep Learning Architectures in Local and Distributed Environments

This document reports the sequence of practices and methodologies implemented during the Big Data course. It details the workflow beginning with the processing of the Epsilon dataset through group and individual strategies, followed by text analysis and classification with RestMex and movie feature analysis with IMDb. Finally, it describes the technical implementation of a distributed computing cluster with Apache Spark on Linux using Scala.

Updated: 2025-12-11 06:02:13

标题: 高维数据处理:在本地和分布式环境中对机器学习和深度学习架构进行基准测试

摘要: 本文报告了在大数据课程中实施的一系列实践和方法。它详细描述了从通过小组和个人策略处理Epsilon数据集开始的工作流程,随后是使用RestMex进行的文本分析和分类,以及基于IMDb的电影特征分析。最后,它描述了在Linux上使用Scala搭建Apache Spark分布式计算集群的技术实现。

更新时间: 2025-12-11 06:02:13

领域: cs.DC,cs.AI

下载: http://arxiv.org/abs/2512.10312v1

TranSimHub: A Unified Air-Ground Simulation Platform for Multi-Modal Perception and Decision-Making

Air-ground collaborative intelligence is becoming a key approach for next-generation urban intelligent transportation management, where aerial and ground systems work together on perception, communication, and decision-making. However, the lack of a unified multi-modal simulation environment has limited progress in studying cross-domain perception, coordination under communication constraints, and joint decision optimization. To address this gap, we present TranSimHub, a unified simulation platform for air-ground collaborative intelligence. TranSimHub offers synchronized multi-view rendering across RGB, depth, and semantic segmentation modalities, ensuring consistent perception between aerial and ground viewpoints. It also supports information exchange between the two domains and includes a causal scene editor that enables controllable scenario creation and counterfactual analysis under diverse conditions such as different weather, emergency events, and dynamic obstacles. We release TranSimHub as an open-source platform that supports end-to-end research on perception, fusion, and control across realistic air and ground traffic scenes. Our code is available at https://github.com/Traffic-Alpha/TransSimHub.

Updated: 2025-12-11 05:55:53

标题: TranSimHub:一个统一的空中地面模拟平台,用于多模态感知和决策

摘要: 空地协同智能正成为下一代城市智能交通管理的关键方法,其中空中和地面系统在感知、通信和决策方面协同工作。然而,缺乏统一的多模态模拟环境限制了研究跨领域感知、在通信约束下的协调以及联合决策优化的进展。为了填补这一空白,我们提出了TranSimHub,一个用于空地协同智能的统一模拟平台。TranSimHub提供了跨RGB、深度和语义分割模态的同步多视图渲染,确保了空中和地面视点之间的一致感知。它还支持两个领域之间的信息交换,并包括一个因果场景编辑器,可以在不同天气、紧急事件和动态障碍物等多种条件下进行可控的场景创建和反事实分析。我们将TranSimHub作为一个开源平台发布,支持对现实空中和地面交通场景中的感知、融合和控制进行端到端研究。我们的代码可以在https://github.com/Traffic-Alpha/TransSimHub 上找到。

更新时间: 2025-12-11 05:55:53

领域: eess.SY,cs.LG,cs.MA

下载: http://arxiv.org/abs/2510.15365v2

Tracking large chemical reaction networks and rare events by neural networks

Chemical reaction networks are widely used to model stochastic dynamics in chemical kinetics, systems biology and epidemiology. Solving the chemical master equation that governs these systems poses a significant challenge due to the large state space growing exponentially with system size. The development of autoregressive neural networks offers a flexible framework for this problem; however, its efficiency is limited, especially for high-dimensional systems and in scenarios with rare events. Here, we push the frontier of the neural-network approach by exploiting faster optimizations such as natural gradient descent and the time-dependent variational principle, achieving a 5- to 22-fold speedup, and by leveraging enhanced-sampling strategies to capture rare events. We demonstrate reduced computational cost and higher accuracy over the previous neural-network method on challenging reaction networks, including the mitogen-activated protein kinase (MAPK) cascade network, the largest biological network handled to date by approaches that solve the chemical master equation. We further apply the approach to spatially extended reaction-diffusion systems on two-dimensional lattices, namely the Schlögl model with rare events, going beyond the recent tensor-network approach that handles one-dimensional lattices. The present approach thus enables efficient modeling of chemical reaction networks in general.
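
For reference, the chemical master equation being solved has the standard textbook form below, where x is the vector of molecule counts, nu_r the stoichiometric change of reaction r, and a_r its propensity; this formulation is included for context and is not specific to the paper.

    \frac{\partial P(\mathbf{x}, t)}{\partial t}
      \;=\; \sum_{r} \Big[ a_r(\mathbf{x} - \boldsymbol{\nu}_r)\, P(\mathbf{x} - \boldsymbol{\nu}_r, t)
      \;-\; a_r(\mathbf{x})\, P(\mathbf{x}, t) \Big]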

Updated: 2025-12-11 05:55:44

标题: 使用神经网络跟踪大型化学反应网络和稀有事件

摘要: 化学反应网络被广泛用于模拟化学动力学、系统生物学和流行病学中的随机动态。解决控制这些系统的化学主方程的问题由于随着系统规模的增大,状态空间呈指数增长而面临重大挑战。自回归神经网络的发展为这一问题提供了一个灵活的框架;然而,其效率受到限制,特别是对于高维系统和罕见事件的情况。在这里,我们通过利用更快的优化方法,如自然梯度下降和时间依赖变分原理,实现了5到22倍的加速,并利用增强采样策略捕捉罕见事件。我们证明了在具有挑战性的反应网络中,包括迄今为止由解决化学主方程的先前方法处理的最大生物网络——有丝分裂原活化蛋白激酶(MAPK)级联网络,我们的方法比先前的神经网络方法具有更低的计算成本和更高的准确性。我们进一步将这种方法应用于空间扩展的反应扩散系统,Schlögl模型中的罕见事件,在二维点阵上,超出了最近处理一维点阵的张量网络方法。因此,目前的方法使得一般化学反应网络的有效建模成为可能。

更新时间: 2025-12-11 05:55:44

领域: q-bio.MN,cs.LG,physics.bio-ph

下载: http://arxiv.org/abs/2512.10309v1

Optimizing Drivers' Discount Order Acceptance Strategies: A Policy-Improved Deep Deterministic Policy Gradient Framework

The rapid expansion of platform integration has emerged as an effective solution to mitigate market fragmentation by consolidating multiple ride-hailing platforms into a single application. To address heterogeneous passenger preferences, third-party integrators provide Discount Express service delivered by express drivers at lower trip fares. For the individual platform, encouraging broader participation of drivers in Discount Express services has the potential to expand the accessible demand pool and improve matching efficiency, but often at the cost of reduced profit margins. This study aims to dynamically manage drivers' acceptance of Discount Express from the perspective of an individual platform. The lack of historical data under the new business model necessitates online learning. However, early-stage exploration through trial and error can be costly in practice, highlighting the need for reliable early-stage performance in real-world deployment. To address these challenges, this study formulates the decision regarding the proportion of drivers accepting discount orders as a continuous control task. In response to the high stochasticity and the opaque matching mechanisms employed by third-party integrator, we propose an innovative policy-improved deep deterministic policy gradient (pi-DDPG) framework. The proposed framework incorporates a refiner module to boost policy performance during the early training phase. A customized simulator based on a real-world dataset is developed to validate the effectiveness of the proposed pi-DDPG. Numerical experiments demonstrate that pi-DDPG achieves superior learning efficiency and significantly reduces early-stage training losses, enhancing its applicability to practical ride-hailing scenarios.
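
Since the abstract does not spell out the update rule, the sketch below shows a standard DDPG step, the base algorithm that pi-DDPG extends; the paper's refiner module, which boosts the policy during early training, is omitted here, so treat this as background rather than the proposed method.

    import torch
    import torch.nn.functional as F

    def ddpg_step(actor, critic, target_actor, target_critic,
                  batch, actor_opt, critic_opt, gamma=0.99, tau=0.005):
        s, a, r, s2, done = batch  # tensors sampled from a replay buffer

        # Critic: regress Q(s, a) onto the bootstrapped one-step target.
        with torch.no_grad():
            q_target = r + gamma * (1 - done) * target_critic(s2, target_actor(s2))
        critic_loss = F.mse_loss(critic(s, a), q_target)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Actor: ascend the critic's estimate of Q(s, pi(s)).
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Polyak-average the target networks for stability.
        for net, tgt in ((actor, target_actor), (critic, target_critic)):
            for p, tp in zip(net.parameters(), tgt.parameters()):
                tp.data.mul_(1 - tau).add_(tau * p.data)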

Updated: 2025-12-11 05:55:14

标题: 优化司机的折扣订单接受策略:一种政策改进的深度确定性策略梯度框架

摘要: 平台整合的迅速扩张已成为缓解市场碎片化的有效解决方案,通过将多个网约车平台整合为单一应用程序。为了满足异质乘客偏好,第三方集成商提供由快递司机提供的 Discount Express 服务,以更低的行程费用。对于单个平台来说,鼓励更多的司机参与 Discount Express 服务有扩大可访问需求池和提高匹配效率的潜力,但往往会以降低利润率为代价。本研究旨在从单个平台的角度动态管理司机对 Discount Express 的接受度。在新商业模式下缺乏历史数据,需要进行在线学习。然而,在实践中,通过试错的早期探索可能成本高昂,突显了在真实部署中需要可靠的早期性能。为了解决这些挑战,本研究将司机接受折扣订单比例的决策形式化为连续控制任务。针对第三方集成商采用的高随机性和不透明的匹配机制,我们提出了一种创新的策略改进深度确定性策略梯度(pi-DDPG)框架。所提出的框架包括一个优化模块,用于在早期训练阶段提升策略性能。基于真实数据集开发了一个定制的模拟器,用于验证所提出的 pi-DDPG 的有效性。数值实验表明,pi-DDPG 实现了优越的学习效率,并显著减少了早期阶段的培训损失,增强了其在实际网约车场景中的适用性。

更新时间: 2025-12-11 05:55:14

领域: cs.LG

下载: http://arxiv.org/abs/2507.11865v2

An Interpretable AI Tool for SAVR vs TAVR in Low to Intermediate Risk Patients with Severe Aortic Stenosis

Background. Treatment selection for low to intermediate risk patients with severe aortic stenosis between surgical (SAVR) and transcatheter (TAVR) aortic valve replacement remains variable in clinical practice, driven by patient heterogeneity and institutional preferences. While existing models predict postprocedural risk, there is a lack of interpretable, individualized treatment recommendations that directly optimize long-term outcomes. Methods. We introduce an interpretable prescriptive framework that integrates prognostic matching, counterfactual outcome modeling, and an Optimal Policy Tree (OPT) to recommend the treatment minimizing expected 5-year mortality. Using data from Hartford Hospital and St. Vincent's Hospital, we emulate randomization via prognostic matching and sample weighting and estimate counterfactual mortality under both SAVR and TAVR. The policy model, trained on these counterfactual predictions, partitions patients into clinically coherent subgroups and prescribes the treatment associated with lower estimated risk. Findings. If the OPT prescriptions are applied, counterfactual evaluation showed an estimated reduction in 5-year mortality of 20.3% in Hartford and 13.8% in St. Vincent's relative to real-life prescriptions, showing promising generalizability to unseen data from a different institution. The learned decision boundaries aligned with real-world outcomes and clinical observations. Interpretation. Our interpretable prescriptive framework is, to the best of our knowledge, the first to provide transparent, data-driven recommendations for TAVR versus SAVR that improve estimated long-term outcomes both in an internal and external cohort, while remaining clinically grounded and contributing toward a more systematic and evidence-based approach to precision medicine in structural heart disease.
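
A minimal sketch of the counterfactual step, in the spirit of a T-learner (the scikit-learn models below and the direct argmin prescription are simplifying assumptions; the study additionally fits an Optimal Policy Tree on such counterfactual predictions for interpretability):

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    def fit_counterfactual_models(X, treatment, died_5y):
        """Fit one 5-year mortality model per treatment arm."""
        m_savr = GradientBoostingClassifier().fit(X[treatment == "SAVR"],
                                                  died_5y[treatment == "SAVR"])
        m_tavr = GradientBoostingClassifier().fit(X[treatment == "TAVR"],
                                                  died_5y[treatment == "TAVR"])
        return m_savr, m_tavr

    def prescribe(m_savr, m_tavr, X_new):
        """Recommend the arm with the lower estimated 5-year mortality."""
        risk_savr = m_savr.predict_proba(X_new)[:, 1]
        risk_tavr = m_tavr.predict_proba(X_new)[:, 1]
        return np.where(risk_savr <= risk_tavr, "SAVR", "TAVR")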

Updated: 2025-12-11 05:54:22

标题: 一种可解释的人工智能工具:用于重度主动脉瓣狭窄低中风险患者的SAVR与TAVR比较

摘要: 背景。在临床实践中,对于低到中度风险严重主动脉瓣狭窄患者的治疗选择在外科主动脉瓣置换术(SAVR)和经导管主动脉瓣置换术(TAVR)之间仍存在差异,这取决于患者异质性和医疗机构的偏好。虽然现有模型可以预测术后风险,但缺乏可解释的、个性化的治疗建议,直接优化长期结果。 方法。我们引入了一个可解释的处方框架,将预后匹配、反事实结果建模和最佳策略树(OPT)相结合,以推荐最小化预期5年死亡率的治疗方案。利用哈特福德医院和圣文森特医院的数据,通过预后匹配和样本加权来模拟随机化,并估计SAVR和TAVR下的反事实死亡率。在这些反事实预测上训练的策略模型,将患者分成临床上连贯的亚组,并推荐与较低估计风险相关的治疗。 发现。如果应用OPT处方,反事实评估显示,在哈特福德和圣文森特相对于实际处方,预计5年死亡率分别减少20.3%和13.8%,显示出对来自不同机构的未见数据具有良好的概括性。学习到的决策边界与现实世界的结果和临床观察一致。 解释。据我们所知,我们的可解释的处方框架是首个提供透明、数据驱动的TAVR与SAVR建议,可以改善内部和外部队列中估计长期结果,同时保持临床基础,并促进结构性心脏疾病精准医学更系统和基于证据的方法。

更新时间: 2025-12-11 05:54:22

领域: cs.LG

下载: http://arxiv.org/abs/2512.10308v1

InfoCom: Kilobyte-Scale Communication-Efficient Collaborative Perception with Information Bottleneck

Precise environmental perception is critical for the reliability of autonomous driving systems. While collaborative perception mitigates the limitations of single-agent perception through information sharing, it encounters a fundamental communication-performance trade-off. Existing communication-efficient approaches typically assume MB-level data transmission per collaboration, which may fail due to practical network constraints. To address these issues, we propose InfoCom, an information-aware framework establishing the pioneering theoretical foundation for communication-efficient collaborative perception via extended Information Bottleneck principles. Departing from mainstream feature manipulation, InfoCom introduces a novel information purification paradigm that theoretically optimizes the extraction of minimal sufficient task-critical information under Information Bottleneck constraints. Its core innovations include: i) An Information-Aware Encoding condensing features into minimal messages while preserving perception-relevant information; ii) A Sparse Mask Generation identifying spatial cues with negligible communication cost; and iii) A Multi-Scale Decoding that progressively recovers perceptual information through mask-guided mechanisms rather than simple feature reconstruction. Comprehensive experiments across multiple datasets demonstrate that InfoCom achieves near-lossless perception while reducing communication overhead from megabyte to kilobyte-scale, representing 440-fold and 90-fold reductions per agent compared to Where2comm and ERMVP, respectively.
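
For reference, the classical Information Bottleneck objective that InfoCom extends reads as follows, with X the raw observation, Z the transmitted message, Y the perception target, and beta trading compression against task relevance; this is the textbook form, not the paper's extended variant.

    \min_{p(z \mid x)} \;\; I(X; Z) \;-\; \beta \, I(Z; Y)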

Updated: 2025-12-11 05:51:02

标题: InfoCom:使用信息瓶颈的千字节级通信高效协作感知

摘要: 精确的环境感知对于自动驾驶系统的可靠性至关重要。虽然协作感知通过信息共享减轻了单一代理的感知限制,但它遇到了基本的通信性能权衡。现有的通信高效方法通常假设每次协作传输MB级数据,但由于实际网络约束可能会失败。为了解决这些问题,我们提出了InfoCom,这是一个信息感知框架,通过扩展信息瓶颈原理建立了通信高效的协作感知的开创性理论基础。InfoCom摒弃了主流的特征处理方式,引入了一种新颖的信息净化范式,理论上优化了在信息瓶颈约束下提取最小充分的任务关键信息。其核心创新包括:i)信息感知编码将特征压缩成最小信息,同时保留感知相关信息;ii)稀疏掩码生成识别具有可忽略通信成本的空间线索;和iii)多尺度解码,通过掩码引导机制逐步恢复感知信息,而不是简单的特征重建。跨多个数据集的综合实验表明,InfoCom实现了接近无损的感知,同时将通信开销从兆字节降低到千字节规模,分别与Where2comm和ERMVP相比,每个代理的降低幅度分别为440倍和90倍。

更新时间: 2025-12-11 05:51:02

领域: cs.AI

下载: http://arxiv.org/abs/2512.10305v1

Trustworthy Orchestration Artificial Intelligence by the Ten Criteria with Control-Plane Governance

As Artificial Intelligence (AI) systems increasingly assume consequential decision-making roles, a widening gap has emerged between technical capabilities and institutional accountability. Ethical guidance alone is insufficient to counter this challenge; it demands architectures that embed governance into the execution fabric of the ecosystem. This paper presents the Ten Criteria for Trustworthy Orchestration AI, a comprehensive assurance framework that integrates human input, semantic coherence, and audit and provenance integrity into a unified Control-Plane architecture. Unlike conventional agentic AI initiatives that focus primarily on AI-to-AI coordination, the proposed framework provides an umbrella of governance over all AI components, their consumers, and human participants. Taking inspiration from international standards and Australia's National Framework for AI Assurance initiative, this work demonstrates that trustworthiness can be systematically engineered into AI systems, ensuring the execution fabric remains verifiable, transparent, reproducible, and under meaningful human control.

Updated: 2025-12-11 05:49:26

标题: 基于十项标准与控制平面治理的可信编排人工智能

摘要: 随着人工智能(AI)系统越来越多地承担重要决策角色,技术能力和机构问责之间出现了日益扩大的差距。仅靠道德指导是不足以应对这一挑战的;这需要将治理嵌入到生态系统执行结构中的架构。本文提出了值得信赖的编排AI的十项标准,这是一个综合保证框架,将人类输入、语义的一致性、审计和出处完整性整合到一个统一的控制平面架构中。与传统的以AI-AI协调为主要焦点的智能体AI计划不同,所提出的框架为整个AI组件、其消费者和人类参与者提供了一个治理的保护伞。通过借鉴国际标准和澳大利亚AI保障框架倡议,这项工作表明,可信度可以通过工程化的方式系统地纳入AI系统中,确保执行结构保持可验证、透明、可复现,并在有意义的人类掌控之下。

更新时间: 2025-12-11 05:49:26

领域: cs.AI,cs.ET

下载: http://arxiv.org/abs/2512.10304v1

Investigating The Functional Roles of Attention Heads in Vision Language Models: Evidence for Reasoning Modules

Despite excelling on multimodal benchmarks, vision-language models (VLMs) largely remain a black box. In this paper, we propose a novel interpretability framework to systematically analyze the internal mechanisms of VLMs, focusing on the functional roles of attention heads in multimodal reasoning. To this end, we introduce CogVision, a dataset that decomposes complex multimodal questions into step-by-step subquestions designed to simulate human reasoning through a chain-of-thought paradigm, with each subquestion associated with specific receptive or cognitive functions such as high-level visual reception and inference. Using a probing-based methodology, we identify attention heads that specialize in these functions and characterize them as functional heads. Our analysis across diverse VLM families reveals that these functional heads are universally sparse, vary in number and distribution across functions, and mediate interactions and hierarchical organization. Furthermore, intervention experiments demonstrate their critical role in multimodal reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. These findings provide new insights into the cognitive organization of VLMs and suggest promising directions for designing models with more human-aligned perceptual and reasoning abilities.

Updated: 2025-12-11 05:42:53

标题: 研究视觉语言模型中注意力头的功能角色:推理模块的证据

摘要: 尽管在多模态基准测试中表现出色,但视觉语言模型(VLMs)在很大程度上仍然是一个黑盒子。在本文中,我们提出了一个新颖的可解释性框架,系统地分析VLMs的内部机制,重点关注多模态推理中注意力头的功能角色。为此,我们引入了CogVision,一个将复杂的多模态问题分解为逐步子问题的数据集,旨在通过一种思维链范式模拟人类推理,每个子问题与特定的接收或认知功能相关联,如高级视觉接收和推理。使用基于探测的方法,我们识别出专门从事这些功能的注意力头,并将它们描述为功能性头。我们在不同VLM家族中的分析显示,这些功能性头普遍稀疏,数量和分布在功能之间有所变化,并调节交互和层次组织。此外,干预实验证明了它们在多模态推理中的关键作用:移除功能性头会导致性能下降,而强调它们则提高准确性。这些发现为VLMs的认知组织提供了新的见解,并为设计具有更符合人类感知和推理能力的模型提供了有希望的方向。

更新时间: 2025-12-11 05:42:53

领域: cs.AI

下载: http://arxiv.org/abs/2512.10300v1

Meta-learning three-factor plasticity rules for structured credit assignment with sparse feedback

Biological neural networks learn complex behaviors from sparse, delayed feedback using local synaptic plasticity, yet the mechanisms enabling structured credit assignment remain elusive. In contrast, artificial recurrent networks solving similar tasks typically rely on biologically implausible global learning rules or hand-crafted local updates. The space of local plasticity rules capable of supporting learning from delayed reinforcement remains largely unexplored. Here, we present a meta-learning framework that discovers local learning rules for structured credit assignment in recurrent networks trained with sparse feedback. Our approach interleaves local neo-Hebbian-like updates during task execution with an outer loop that optimizes plasticity parameters via tangent-propagation through learning. The resulting three-factor learning rules enable long-timescale credit assignment using only local information and delayed rewards, offering new insights into biologically grounded mechanisms for learning in recurrent circuits.
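
The generic shape of a three-factor rule with an eligibility trace is easy to state; the NumPy sketch below is schematic, and the gate and trace parameterizations actually meta-learned in the paper are not specified in the abstract.

    import numpy as np

    def three_factor_update(w, e, pre, post, reward, eta=1e-3, decay=0.9):
        """Hebbian co-activity accumulates in a decaying eligibility trace `e`;
        a delayed scalar reward later converts the trace into a weight change."""
        e = decay * e + np.outer(post, pre)  # factors 1 and 2: pre/post-synaptic activity
        w = w + eta * reward * e             # factor 3: neuromodulatory reward signal
        return w, e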

Updated: 2025-12-11 05:38:49

标题: 元学习三因素可塑性规则用于稀疏反馈下的结构化信用分配

摘要: 生物神经网络通过局部突触可塑性从稀疏的延迟反馈中学习复杂行为,然而支持结构化信用分配的机制仍然是神秘的。相反,解决类似任务的人工递归网络通常依赖于生物不可信的全局学习规则或手工制作的局部更新。能够支持从延迟强化中学习的局部可塑性规则空间仍然大部分未被探索。在这里,我们提出了一个元学习框架,发现了适用于稀疏反馈训练的递归网络中结构化信用分配的局部学习规则。我们的方法在任务执行过程中交替进行局部新赫布式更新,同时通过“切线传播学习”优化可塑性参数的外部循环。由此产生的三因素学习规则仅使用本地信息和延迟奖励就能实现长时间尺度的信用分配,为递归电路中的学习提供了新的生物基础机制洞见。

更新时间: 2025-12-11 05:38:49

领域: q-bio.NC,cond-mat.dis-nn,cs.LG,physics.bio-ph

下载: http://arxiv.org/abs/2512.09366v2

Human or AI? Comparing Design Thinking Assessments by Teaching Assistants and Bots

As design thinking education grows in secondary and tertiary contexts, educators face the challenge of evaluating creative artefacts that combine visual and textual elements. Traditional rubric-based assessment is laborious, time-consuming, and inconsistent due to reliance on Teaching Assistants (TA) in large, multi-section cohorts. This paper presents an exploratory study investigating the reliability and perceived accuracy of AI-assisted assessment compared to TA-assisted assessment in evaluating student posters in design thinking education. Two activities were conducted with 33 Ministry of Education (MOE) Singapore school teachers to (1) compare AI-generated scores with TA grading across three key dimensions: empathy and user understanding, identification of pain points and opportunities, and visual communication, and (2) examine teacher preferences for AI-assigned, TA-assigned, and hybrid scores. Results showed low statistical agreement between instructor and AI scores for empathy and pain points, with slightly higher alignment for visual communication. Teachers preferred TA-assigned scores in six of ten samples. Qualitative feedback highlighted the potential of AI for formative feedback, consistency, and student self-reflection, but raised concerns about its limitations in capturing contextual nuance and creative insight. The study underscores the need for hybrid assessment models that integrate computational efficiency with human insights. This research contributes to the evolving conversation on responsible AI adoption in creative disciplines, emphasizing the balance between automation and human judgment for scalable and pedagogically sound assessment.

Updated: 2025-12-11 05:38:34

标题: 人类还是人工智能?通过助教和机器人比较设计思维评估

摘要: 随着设计思维教育在中学和高等教育领域的发展,教育工作者面临着评估结合视觉和文本元素的创意作品的挑战。传统的基于评分表的评估方式繁琐、耗时且不一致,因为依赖于大规模、多部分班的助教。本文介绍了一项探索性研究,研究了AI辅助评估与助教辅助评估在评估设计思维教育中学生海报时的可靠性和感知准确性。通过与新加坡教育部(MOE)的33名学校教师进行两项活动,旨在(1)比较AI生成的分数与助教评分在共情和用户理解、痛点和机会的识别以及视觉传达三个关键维度上的一致性,以及(2)检查教师对AI分配的分数、助教分配的分数和混合分数的偏好。结果显示,教师和AI分数在共情和痛点方面的统计一致性较低,而在视觉传达方面略高。教师在十个样本中有六个偏好助教分配的分数。定性反馈突显了AI在形成反馈、一致性和学生自我反思方面的潜力,但也提出了对其在捕捉情境细微差别和创意洞察方面的限制。该研究强调了需要将计算效率与人类洞察力相结合的混合评估模型。这项研究对创意学科中负责任的AI采用进行了贡献,强调了在可扩展且教学合理的评估中自动化和人类判断之间的平衡。

更新时间: 2025-12-11 05:38:34

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2510.16069v2

FLARE: A Wireless Side-Channel Fingerprinting Attack on Federated Learning

Federated Learning (FL) enables collaborative model training across distributed devices while safeguarding data and user privacy. However, FL remains susceptible to privacy threats that can compromise data via direct means. That said, indirectly compromising the confidentiality of the FL model architecture (e.g., a convolutional neural network (CNN) or a recurrent neural network (RNN)) on a client device by an outsider remains unexplored. If leaked, this information can enable next-level attacks tailored to the architecture. This paper proposes a novel side-channel fingerprinting attack, leveraging flow-level and packet-level statistics of encrypted wireless traffic from an FL client to infer its deep learning model architecture. We name it FLARE, a fingerprinting framework based on FL Architecture REconnaissance. Evaluation across various CNN and RNN variants (including pre-trained and custom models trained over IEEE 802.11 Wi-Fi) shows that FLARE achieves over 98% F1-score in closed-world and up to 91% in open-world scenarios. These results reveal that CNN and RNN models leak distinguishable traffic patterns, enabling architecture fingerprinting even under realistic FL settings with hardware, software, and data heterogeneity. To our knowledge, this is the first work to fingerprint FL model architectures by sniffing encrypted wireless traffic, exposing a critical side-channel vulnerability in current FL systems.
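
At its core, the attack is supervised classification over side-channel traffic statistics; the toy sketch below uses scikit-learn, and the feature set and classifier choice are assumptions for illustration, not FLARE's exact pipeline.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def flow_features(pkt_sizes, pkt_times):
        """Illustrative flow/packet-level statistics for one sniffed FL training round."""
        iat = np.diff(pkt_times)  # inter-arrival times
        return [np.sum(pkt_sizes), np.mean(pkt_sizes), np.std(pkt_sizes),
                len(pkt_sizes), np.mean(iat), np.std(iat)]

    # X: one feature row per training round; y: architecture label ("cnn", "rnn", ...)
    clf = RandomForestClassifier(n_estimators=200)
    # clf.fit(X_train, y_train); predictions = clf.predict(X_test)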

Updated: 2025-12-11 05:32:34

标题: FLARE:一种针对联邦学习的无线侧信道指纹攻击

摘要: 联邦学习(FL)可以在分布式设备间进行合作模型训练,同时保护数据和用户隐私。然而,FL仍然容易受到可能通过直接手段泄露数据的隐私威胁。尽管如此,通过外部人员间接地危害FL模型架构(例如卷积神经网络(CNN)或循环神经网络(RNN))的机密性在客户端设备上仍未被探索。如果泄露,这些信息可以使攻击者根据架构进行下一级别的攻击。本文提出了一种新颖的侧信道指纹攻击,利用来自FL客户端的加密无线流量的流级和数据包级统计信息来推断其深度学习模型架构。我们将其命名为FLARE,这是一个基于FL架构侦察的指纹框架。对各种CNN和RNN变体进行评估,包括在IEEE 802.11 Wi-Fi上训练的预训练和自定义模型,结果显示FLARE在封闭世界中达到了超过98%的F1得分,在开放世界情景中最高达到91%。这些结果表明CNN和RNN模型泄露出可区分的流量模式,即使在具有硬件、软件和数据异构性的真实FL设置中,也能实现架构指纹识别。据我们所知,这是首个通过嗅探加密无线流量来指纹FL模型架构的研究,揭示了当前FL系统中的关键侧信道漏洞。

更新时间: 2025-12-11 05:32:34

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2512.10296v1

Balanced Online Class-Incremental Learning via Dual Classifiers

Online class-incremental learning (OCIL) focuses on gradually learning new classes (called plasticity) from a stream of data in a single pass, while concurrently preserving knowledge of previously learned classes (called stability). The primary challenge in OCIL lies in maintaining a good balance between the knowledge of old and new classes within the continually updated model. Most existing methods rely on explicit knowledge interaction through experience replay, and often employ exclusive training separation to address bias problems. Nevertheless, it remains a major challenge to achieve a well-balanced learner, as these methods often exhibit either reduced plasticity or limited stability due to difficulties in continually integrating knowledge in the OCIL setting. In this paper, we propose a novel replay-based method, called Balanced Inclusive Separation for Online iNcremental learning (BISON), which can achieve both high plasticity and stability, thus ensuring more balanced performance in OCIL. Our BISON method proposes an inclusive training separation strategy using dual classifiers so that knowledge from both old and new classes can effectively be integrated into the model, while introducing implicit approaches for transferring knowledge across the two classifiers. Extensive experimental evaluations over three widely-used OCIL benchmark datasets demonstrate the superiority of BISON, showing more balanced yet better performance compared to state-of-the-art replay-based OCIL methods.

Updated: 2025-12-11 05:18:44

标题: 基于双分类器的平衡在线课堂增量学习

摘要: 在线课程增量学习(OCIL)专注于在单次传递中逐渐学习新类(称为可塑性)的同时,同时保留先前学习类(称为稳定性)的知识。 OCIL中的主要挑战在于在不断更新的模型中保持对旧类和新类知识之间的良好平衡。大多数现有方法依赖于通过经验重播进行的显式知识交互,并经常采用独立训练分离来解决偏见问题。然而,要实现一个良好平衡的学习者仍然是一个巨大的挑战,因为这些方法往往由于在OCIL设置中不断整合知识而表现出要么降低的可塑性,要么有限的稳定性。在本文中,我们提出了一种新颖的基于重播的方法,称为在线增量学习的平衡包容分离(BISON),该方法可以同时实现高可塑性和稳定性,从而确保在OCIL中更平衡的表现。我们的BISON方法提出了一种包容性训练分离策略,使用双分类器,以便来自旧类和新类的知识可以有效地集成到模型中,同时引入了用于在两个分类器之间传递知识的隐式方法。对三个广泛使用的OCIL基准数据集进行的广泛实验评估显示了BISON的优越性,与最先进的基于重播的OCIL方法相比,表现出更平衡且更好的性能。

更新时间: 2025-12-11 05:18:44

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2504.20566v2

Variational analysis of determinantal varieties

Determinantal varieties (the sets of bounded-rank matrices or tensors) have attracted growing interest in low-rank optimization. The tangent cone to low-rank sets is widely studied and underpins a range of geometric methods. The second-order geometry, which encodes curvature information, is more intricate. In this work, we develop a unified framework to derive explicit formulas for both first- and second-order tangent sets to various low-rank sets, including low-rank matrices, tensors, symmetric matrices, and positive semidefinite matrices. The framework also accommodates the intersection of a low-rank set and another set satisfying mild assumptions, thereby yielding a tangent intersection rule. Through the lens of tangent sets, we establish a necessary and sufficient condition under which a nonsmooth problem and its smooth parameterization share equivalent second-order stationary points. Moreover, we exploit tangent sets to characterize optimality conditions for low-rank optimization and prove that verifying second-order optimality is NP-hard. In a separate line of analysis, we investigate variational geometry of the graph of the normal cone to matrix varieties, deriving the explicit Bouligand tangent cone, Fréchet and Mordukhovich normal cones to the graph. These results are further applied to develop optimality conditions for low-rank bilevel programs.
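
As a concrete instance of the first-order objects involved, the Bouligand tangent cone to the variety of matrices of rank at most r, at a point X of rank s, admits the classical description below, where P_{U^\perp} and P_{V^\perp} project onto the orthogonal complements of the column and row spaces of X; this is background from the low-rank literature, not one of the paper's new results.

    T_{\mathcal{R}_{\le r}}(X) \;=\; \Big\{ \xi \;:\; \operatorname{rank}\big( P_{U^{\perp}} \, \xi \, P_{V^{\perp}} \big) \,\le\, r - s \Big\}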

Updated: 2025-12-11 05:13:53

标题: 行列式簇的变分分析

摘要: 行列式簇,即有界秩矩阵或张量的集合,在低秩优化中引起了越来越多的关注。低秩集合的切锥已被广泛研究,并支撑着一系列几何方法。编码曲率信息的二阶几何则更为复杂。在这项工作中,我们建立了一个统一框架,为各类低秩集合(包括低秩矩阵、张量、对称矩阵和半正定矩阵)推导一阶和二阶切集的显式公式。该框架还适用于低秩集合与另一个满足温和假设的集合的交集,从而得到切集的交集法则。通过切集的视角,我们建立了非光滑问题与其光滑参数化共享等价二阶稳定点的充要条件。此外,我们利用切集刻画低秩优化的最优性条件,并证明验证二阶最优性是NP难的。在另一条分析思路中,我们研究了矩阵簇法锥图像的变分几何,推导出该图像的显式Bouligand切锥、Fréchet法锥和Mordukhovich法锥。这些结果进一步被应用于建立低秩双层规划的最优性条件。

更新时间: 2025-12-11 05:13:53

领域: math.OC,cs.AI,cs.LG

下载: http://arxiv.org/abs/2511.22613v2

RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models

In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Molecular graph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severe data scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including both regression and classification tasks. To better understand and improve fine-tuning techniques under these conditions, we classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning. We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings. This extensive evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, ROFT-MOL. This approach combines the strengths of simple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types while maintaining the ease of use inherent in post-hoc weight interpolation.
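
The post-hoc weight interpolation ingredient is simple enough to state directly; the sketch below operates on PyTorch state dicts, while the mixing coefficient and the weight-ensemble component of ROFT-MOL are design details not shown here.

    def interpolate_weights(pre_sd, ft_sd, lam=0.5):
        """Post-hoc interpolation, per parameter tensor:
        theta = lam * theta_finetuned + (1 - lam) * theta_pretrained."""
        return {k: lam * ft_sd[k] + (1 - lam) * pre_sd[k] for k in ft_sd}

    # model.load_state_dict(interpolate_weights(pretrained.state_dict(),
    #                                           finetuned.state_dict(), lam=0.7))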

Updated: 2025-12-11 05:11:29

标题: RoFt-Mol:使用分子图基础模型进行鲁棒微调的基准测试

摘要: 在基础模型时代,微调预训练模型以适用特定下游任务变得至关重要。这推动了对强大的微调方法的需求,以应对模型过拟合和稀疏标记等挑战。分子图基础模型(MGFMs)面临独特困难,使微调变得复杂。这些模型受限于较小的预训练数据集,并对下游任务更加严重的数据稀缺情况,这两者都需要增强模型的泛化能力。此外,MGFMs必须适应多样的目标,包括回归和分类任务。为了更好地理解和改进在这些条件下的微调技术,我们将八种微调方法归类为三种机制:基于权重、基于表示和部分微调。我们在各种标签设置中对这些方法在下游回归和分类任务上进行了基准测试,包括监督和自监督预训练模型。这种广泛的评估提供了宝贵的见解,并为设计一种精细的强大微调方法ROFT-MOL提供了信息。这种方法将简单的后期权重插值的优势与更复杂的权重整合微调方法相结合,提供了在两种任务类型上都提高性能的效果,同时保持了后期权重插值固有的易用性。

更新时间: 2025-12-11 05:11:29

领域: cs.LG,physics.chem-ph

下载: http://arxiv.org/abs/2509.00614v3

Enforcing hidden physics in physics-informed neural networks

Physics-informed neural networks (PINNs) represent a new paradigm for solving partial differential equations (PDEs) by integrating physical laws into the learning process of neural networks. However, ensuring that such frameworks fully reflect the physical structure embedded in the governing equations remains an open challenge, particularly for maintaining robustness across diverse scientific problems. In this work, we address this issue by introducing a simple, generalized, yet robust irreversibility-regularized strategy that enforces hidden physical laws as soft constraints during training, thereby recovering the missing physics associated with irreversible processes in the conventional PINN. This approach ensures that the learned solutions consistently respect the intrinsic one-way nature of irreversible physical processes. Across a wide range of benchmarks spanning traveling wave propagation, steady combustion, ice melting, corrosion evolution, and crack growth, we observe substantial performance improvements over the conventional PINN, demonstrating that our regularization scheme reduces predictive errors by more than an order of magnitude, while requiring only minimal modification to existing PINN frameworks.
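
One common way to impose such a hidden law softly is to penalize violations of monotonicity in time; the PyTorch sketch below does this for a scalar field phi(x, t) that should never decrease, and it is a generic illustration rather than the paper's exact regularizer.

    import torch

    def irreversibility_penalty(model, x, t):
        """Softly enforce d(phi)/dt >= 0 by penalizing negative time-derivatives."""
        t = t.clone().requires_grad_(True)
        phi = model(x, t)
        dphi_dt = torch.autograd.grad(phi.sum(), t, create_graph=True)[0]
        return torch.relu(-dphi_dt).pow(2).mean()

    # total_loss = pde_residual + boundary_terms + lam * irreversibility_penalty(model, x, t)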

Updated: 2025-12-11 05:08:35

标题: 强化物理学隐含性在物理信息神经网络中

摘要: 物理信息神经网络(PINNs)代表了一种通过将物理定律整合到神经网络的学习过程中来解决偏微分方程(PDEs)的新范式。然而,确保这种框架完全反映嵌入在控制方程中的物理结构仍然是一个开放的挑战,特别是对于在不同科学问题中保持稳健性。在这项工作中,我们通过引入一种简单、广义但稳健的不可逆性正则化策略来解决这个问题,该策略在训练过程中将隐藏的物理定律作为软约束来执行,从而恢复传统PINN中与不可逆过程相关的缺失物理。这种方法确保学习到的解决方案始终尊重不可逆物理过程的固有单向性质。在横跨传播行波、稳定燃烧、冰融化、腐蚀演变和裂纹扩展等广泛基准中,我们观察到与传统PINN相比,我们的正则化方案显著提高了性能,表明我们的正则化方案将预测错误降低了一个数量级以上,同时对现有PINN框架只需要进行最少的修改。

更新时间: 2025-12-11 05:08:35

领域: cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2511.14348v2

$\mathrm{D}^\mathrm{3}$-Predictor: Noise-Free Deterministic Diffusion for Dense Prediction

Although diffusion models with strong visual priors have emerged as powerful dense prediction backbones, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction, which requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce $\mathrm{D}^\mathrm{3}$-Predictor, a noise-free deterministic framework built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs to leverage diffusion priors, $\mathrm{D}^\mathrm{3}$-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and aggregates their heterogeneous priors, in a self-supervised manner, into a single, clean, and complete geometric prior. Meanwhile, we utilize task-specific supervision to seamlessly adapt this noise-free prior to dense prediction tasks. Extensive experiments on various dense prediction tasks demonstrate that $\mathrm{D}^\mathrm{3}$-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. In addition, it requires less than half the training data used previously and performs inference efficiently in a single step. Our code, data, and checkpoints are publicly available at https://x-gengroup.github.io/HomePage_D3-Predictor/.

Updated: 2025-12-11 05:07:55

标题: $\mathrm{D}^\mathrm{3}$-预测器:用于密集预测的无噪声确定性扩散

摘要: 尽管具有强大视觉先验的扩散模型已经成为强大的密集预测骨干,但它们忽视了一个核心限制:扩散采样核心的随机噪声与需要从图像到几何形态进行确定性映射的密集预测本质上不一致。在本文中,我们展示了这种随机噪声如何破坏了细粒度空间线索,推动模型朝向特定时间步的噪声目标,从而破坏了有意义的几何结构映射。为了解决这个问题,我们引入了D^3-Predictor,这是一个无噪声确定性框架,通过重新构建一个预训练的扩散模型而没有随机噪声。D^3-Predictor不依赖于噪声输入来利用扩散先验,而是将预训练的扩散网络视为一组时间步相关的视觉专家,并通过自监督方式将它们的异质先验汇总为一个单一、干净且完整的几何先验。同时,我们利用任务特定的监督来无缝地将这个无噪声先验调整到密集预测任务中。对各种密集预测任务的广泛实验表明,D^3-Predictor在不同场景下实现了竞争性或最先进的性能。此外,它所需的训练数据量不到以前的一半,并且可以在一步内高效地执行推理。我们的代码、数据和检查点可在https://x-gengroup.github.io/HomePage_D3-Predictor/ 上公开获取。

更新时间: 2025-12-11 05:07:55

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2512.07062v2

A Kernel-based Resource-efficient Neural Surrogate for Multi-fidelity Prediction of Aerodynamic Field

Surrogate models provide fast alternatives to costly aerodynamic simulations and are extremely useful in design and optimization applications. This study proposes the use of a recent kernel-based neural surrogate, KHRONOS. In this work, we blend sparse high-fidelity (HF) data with low-fidelity (LF) information to predict aerodynamic fields under varying constraints in computational resources. Unlike traditional approaches, KHRONOS is built upon variational principles, interpolation theory, and tensor decomposition. These elements provide a mathematical basis for heavy pruning compared to dense neural networks. Using the AirfRANS dataset as a high-fidelity benchmark and NeuralFoil to generate low-fidelity counterparts, this work compares the performance of KHRONOS with three contemporary model architectures: a multilayer perceptron (MLP), a graph neural network (GNN), and a physics-informed neural network (PINN). We consider varying levels of high-fidelity data availability (0%, 10%, and 30%) and increasingly complex geometry parameterizations. These are used to predict the surface pressure coefficient distribution over the airfoil. Results indicate that, whilst all models eventually achieve comparable predictive accuracy, KHRONOS excels in resource-constrained conditions. In this domain, KHRONOS consistently requires orders of magnitude fewer trainable parameters and delivers much faster training and inference than contemporary dense neural networks at comparable accuracy. These findings highlight the potential of KHRONOS and similar architectures to balance accuracy and efficiency in multi-fidelity aerodynamic field prediction.

Updated: 2025-12-11 05:05:10

标题: 基于核的资源高效的神经代理模型用于多保真度气动场预测

摘要: 代理模型为昂贵的气动仿真提供了快速替代方案,在设计和优化应用中极为有用。本研究提出使用一种新近的基于核的神经代理模型KHRONOS。在这项工作中,我们将稀疏的高保真(HF)数据与低保真(LF)信息相融合,以在不同计算资源约束下预测气动场。与传统方法不同,KHRONOS建立在变分原理、插值理论和张量分解之上。与稠密神经网络相比,这些要素为大幅剪枝提供了数学基础。本工作以AirfRANS数据集作为高保真基准,并使用NeuralFoil生成低保真数据,将KHRONOS与三种当代模型架构进行比较:多层感知机(MLP)、图神经网络(GNN)和物理信息神经网络(PINN)。我们考虑了不同水平的高保真数据可用性(0%、10%和30%)以及日益复杂的几何参数化,用于预测翼型表面压力系数分布。结果表明,虽然所有模型最终都能达到相当的预测精度,但KHRONOS在资源受限条件下表现突出。在该情形下,KHRONOS所需的可训练参数始终少几个数量级,并且在精度相当的情况下,其训练和推理速度远快于当代稠密神经网络。这些发现突显了KHRONOS及类似架构在多保真气动场预测中平衡精度与效率的潜力。

更新时间: 2025-12-11 05:05:10

领域: cs.LG,physics.flu-dyn

下载: http://arxiv.org/abs/2512.10287v1

MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

We introduce MotionEdit, a novel dataset for motion-centric image editing: the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
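
A schematic of the motion-alignment reward idea follows; estimate_flow is a hypothetical stand-in for any off-the-shelf optical-flow estimator, and the exact reward shaping used in MotionNFT may differ.

    import numpy as np

    def motion_alignment_reward(src, edited, gt_edit, estimate_flow):
        """Score how well the motion implied by the model's edit matches the
        ground-truth motion, via cosine similarity of the two flow fields."""
        f_pred = estimate_flow(src, edited)  # (H, W, 2): source -> model edit
        f_gt = estimate_flow(src, gt_edit)   # (H, W, 2): source -> ground truth
        num = float(np.sum(f_pred * f_gt))
        den = np.linalg.norm(f_pred) * np.linalg.norm(f_gt) + 1e-8
        return num / den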

Updated: 2025-12-11 04:53:58

标题: MotionEdit:基于动作的图像编辑的基准测试和学习

摘要: 我们介绍了MotionEdit,这是一个新颖的以动作为中心的图像编辑数据集,旨在修改主体动作和互动,同时保持身份、结构和物理可信度。与现有的图像编辑数据集不同,这些数据集侧重于静态外观变化或仅包含稀疏、低质量的动作编辑,MotionEdit提供了高保真度的图像对,展示了从连续视频中提取和验证的逼真动作变换。这个新任务不仅在科学上具有挑战性,而且在实际中具有重要意义,为下游应用如帧控制视频合成和动画提供动力。 为了评估模型在这一新任务上的表现,我们引入了MotionEdit-Bench,一个基准测试,挑战模型进行以动作为中心的编辑,并用生成、判别和基于偏好的指标来衡量模型的性能。基准测试结果显示,对于现有最先进的基于扩散的编辑模型来说,动作编辑仍然具有很高的挑战性。为了解决这一差距,我们提出了MotionNFT(Motion-guided Negative-aware Fine Tuning),一个后训练框架,根据输入图像和模型编辑图像之间的动作流与地面真实动作匹配程度来计算动作对齐奖励,引导模型朝向准确的动作变换。在FLUX.1 Kontext和Qwen-Image-Edit上进行的大量实验表明,MotionNFT在不牺牲一般编辑能力的情况下,持续提高了基础模型在动作编辑任务上的编辑质量和动作保真度,证明了其有效性。

更新时间: 2025-12-11 04:53:58

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2512.10284v1

Semantic Trajectory Generation for Goal-Oriented Spacecraft Rendezvous

Reliable real-time trajectory generation is essential for future autonomous spacecraft. While recent progress in nonconvex guidance and control is paving the way for onboard autonomous trajectory optimization, these methods still rely on extensive expert input (e.g., waypoints, constraints, mission timelines, etc.), which limits the operational scalability in real rendezvous missions. This paper introduces SAGES (Semantic Autonomous Guidance Engine for Space), a trajectory-generation framework that translates natural-language commands into spacecraft trajectories that reflect high-level intent while respecting nonconvex constraints. Experiments in two settings (fault-tolerant proximity operations with continuous-time constraint enforcement and a free-flying robotic platform) demonstrate that SAGES reliably produces trajectories aligned with human commands, achieving over 90% semantic-behavioral consistency across diverse behavior modes. Ultimately, this work marks an initial step toward language-conditioned, constraint-aware spacecraft trajectory generation, enabling operators to interactively guide both safety and behavior through intuitive natural-language commands with reduced expert burden.

Updated: 2025-12-11 04:52:52

标题: 面向目标导向的航天器交会的语义轨迹生成

摘要: 可靠的实时轨迹生成对于未来的自主航天器至关重要。尽管最近在非凸导航和控制方面取得了进展,为机载自主轨迹优化铺平了道路,但这些方法仍然依赖于广泛的专家输入(例如航点、约束条件、任务时间表等),这限制了在真实交会任务中的操作可扩展性。本文介绍了SAGES(用于空间的语义自主引导引擎),一种将自然语言命令转化为反映高层意图的航天器轨迹的轨迹生成框架,同时尊重非凸约束。在两个环境中的实验(具有连续时间约束执行的容错近距操作和自由飞行机器人平台)证明了SAGES可靠地生成与人类命令一致的轨迹,在各种行为模式下实现超过90%的语义行为一致性。最终,这项工作标志着朝着以语言为条件、具有约束意识的航天器轨迹生成迈出了初步的一步,使操作人员能够通过直观的自然语言命令交互地引导安全和行为,减轻专家负担。

更新时间: 2025-12-11 04:52:52

领域: cs.RO,cs.AI,math.OC

下载: http://arxiv.org/abs/2512.09111v2

Neuronal Attention Circuit (NAC) for Representation Learning

Attention improves representation learning over RNNs, but its discrete nature limits continuous-time (CT) modeling. We introduce the Neuronal Attention Circuit (NAC), a novel, biologically plausible CT-Attention mechanism that reformulates attention logits computation as the solution to a linear first-order ODE with nonlinear interlinked gates, derived by repurposing the wiring mechanism of C. elegans Neuronal Circuit Policies (NCPs). NAC replaces dense projections with sparse sensory gates for key-query projections and a sparse backbone network with two heads for computing content-target and learnable time-constant gates, enabling efficient adaptive dynamics. NAC supports three attention logit computation modes: (i) explicit Euler integration, (ii) an exact closed-form solution, and (iii) a steady-state approximation. To reduce memory intensity, we implemented a sparse Top-K pairwise concatenation scheme that selectively curates key-query interactions. We provide rigorous theoretical guarantees, including state stability, bounded approximation errors, and universal approximation. Empirically, we evaluated NAC in diverse domains, including irregular time-series classification, lane-keeping for autonomous vehicles, and industrial prognostics. We observed that NAC matches or outperforms competing baselines in accuracy and occupies an intermediate position in runtime and memory efficiency compared with several CT baselines.
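
The three modes can be read as three evaluations of one relaxation ODE; assuming gated dynamics dz/dt = (c - z) / tau for logits z, content target c, and time constants tau (a concrete parameterization the abstract does not spell out), they look like:

    import numpy as np

    def nac_logits(z0, c, tau, dt=0.1, mode="closed_form"):
        """Attention logits governed by dz/dt = (c - z) / tau."""
        if mode == "euler":          # (i) one explicit Euler step
            return z0 + dt * (c - z0) / tau
        if mode == "closed_form":    # (ii) exact solution at time dt
            return c + (z0 - c) * np.exp(-dt / tau)
        if mode == "steady_state":   # (iii) t -> infinity limit
            return c
        raise ValueError(mode)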

Updated: 2025-12-11 04:49:44

标题: 神经元注意力电路(NAC)用于表示学习

摘要: 注意力可以改善相较于循环神经网络(RNNs)的表示学习,但其离散性限制了连续时间(CT)建模。我们引入了神经元注意力电路(NAC),这是一种新颖的、生物学上合理的连续时间注意力机制,它将注意力logits的计算重新表述为一个带有非线性相互关联门的线性一阶ODE的解,这些门源自对C. elegans神经回路策略(NCPs)布线机制的重新利用。NAC用稀疏感觉门替代键-查询投影中的密集投影,并采用带有两个头的稀疏骨干网络来计算内容-目标门和可学习时间常数门,从而实现了高效的自适应动态。NAC支持三种注意力logits计算模式:(i)显式欧拉积分,(ii)精确的闭式解,(iii)稳态近似。为了降低内存开销,我们实现了一种稀疏的Top-K成对连接方案,有选择地筛选键-查询交互。我们提供了严格的理论保证,包括状态稳定性、有界逼近误差和通用逼近。在实证方面,我们在不同领域评估了NAC,包括不规则时间序列分类、自动驾驶车辆的车道保持和工业预测。我们观察到,NAC在准确性方面与竞争基线相当或更好,并且与若干连续时间基线相比,在运行时间和内存效率方面处于中间位置。

更新时间: 2025-12-11 04:49:44

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2512.10282v1

RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging

Regression Mean (RegMean), an approach that formulates model merging as a linear regression problem, aims to find the optimal weights for each linear layer in the merge model by minimizing the discrepancy in predictions between the merge and candidate models. RegMean provides a precise closed-form solution for the merging problem; therefore, it offers explainability and computational efficiency. However, RegMean merges each linear layer independently, overlooking how the features and information in the earlier layers propagate through the layers and influence the final prediction in the merge model. In this paper, we introduce RegMean++, a simple yet effective alternative to RegMean, that explicitly incorporates both intra- and cross-layer dependencies between merge models' layers into RegMean's objective. By accounting for these dependencies, RegMean++ better captures the behaviors of the merge model. Extensive experiments demonstrate that RegMean++ consistently outperforms RegMean across diverse settings, including in-domain (ID) and out-of-domain (OOD) generalization, sequential merging, large-scale tasks, and robustness under several types of distribution shifts. Furthermore, RegMean++ achieves competitive or state-of-the-art performance compared to various recent advanced model merging methods. Our code is available at https://github.com/nthehai01/RegMean-plusplus.
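
For context, RegMean's per-layer closed form merges the linear weights W_i using the Gram matrices G_i = X_i^T X_i of each model's layer inputs, i.e. W* = (sum_i G_i)^{-1} sum_i G_i W_i; a NumPy rendering follows (the small ridge term is a common numerical safeguard, not part of the formula):

    import numpy as np

    def regmean_merge(grams, weights, eps=1e-6):
        """Closed-form merge of one linear layer: W* = (sum G_i)^(-1) sum G_i W_i."""
        G_sum = sum(grams)
        GW_sum = sum(G @ W for G, W in zip(grams, weights))
        d = G_sum.shape[0]
        return np.linalg.solve(G_sum + eps * np.eye(d), GW_sum)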

Updated: 2025-12-11 04:49:06

标题: RegMean++:增强回归均值的效果和泛化性,用于模型合并

摘要: 回归均值(RegMean)是一种将模型合并视为线性回归问题的方法,旨在通过最小化合并模型和候选模型之间预测差异来找到合并模型中每个线性层的最佳权重。RegMean为合并问题提供了精确的闭式解决方案;因此,它提供了可解释性和计算效率。然而,RegMean独立合并每个线性层,忽视了早期层中的特征和信息如何传播并影响合并模型的最终预测。在本文中,我们介绍了RegMean++,这是RegMean的一个简单而有效的替代方案,它明确地将合并模型层之间的内部和跨层依赖性纳入RegMean的目标中。通过考虑这些依赖性,RegMean++更好地捕捉了合并模型的行为。大量实验证明,RegMean++在各种设置下始终优于RegMean,包括领域内(ID)和领域外(OOD)泛化、顺序合并、大规模任务以及在几种分布转移下的稳健性。此外,与各种最近先进的模型合并方法相比,RegMean++实现了竞争性或最新技术性能。我们的代码可在https://github.com/nthehai01/RegMean-plusplus上找到。

更新时间: 2025-12-11 04:49:06

领域: cs.LG

下载: http://arxiv.org/abs/2508.03121v2

Graph Neural Network Based Adaptive Threat Detection for Cloud Identity and Access Management Logs

The rapid expansion of cloud infrastructures and distributed identity systems has significantly increased the complexity and attack surface of modern enterprises. Traditional rule-based or signature-driven detection systems are often inadequate for identifying novel or evolving threats within Identity and Access Management (IAM) logs, where anomalous behavior may appear statistically benign but contextually malicious. This paper presents a Graph Neural Network-based adaptive threat detection framework designed to learn latent user-resource interaction patterns from IAM audit trails in real time. By modeling IAM logs as heterogeneous dynamic graphs, the proposed system captures temporal, relational, and contextual dependencies across entities such as users, roles, sessions, and access actions. The model incorporates attention-based aggregation and graph embedding updates to enable continual adaptation to changing cloud environments. Experimental evaluation on synthesized and real-world IAM datasets demonstrates that the proposed method achieves higher detection precision and recall than baseline LSTM and GCN classifiers, while maintaining scalability across multi-tenant cloud environments. The framework's adaptability enables proactive mitigation of insider threats, privilege escalation, and lateral movement attacks, contributing to the foundation of AI-driven zero-trust access analytics. This work bridges the gap between graph-based machine learning and operational cloud security intelligence.
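
For reference, attention-based neighbor aggregation of the kind described here typically follows the standard graph-attention update below (the generic GAT formulation, not necessarily the paper's exact variant), where W projects node features h_i and the vector a scores each edge:

    \alpha_{ij} \;=\; \operatorname{softmax}_{j}\!\Big( \operatorname{LeakyReLU}\big( \mathbf{a}^{\top} [\, \mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j \,] \big) \Big),
    \qquad
    \mathbf{h}_i' \;=\; \sigma\!\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, \mathbf{W}\mathbf{h}_j \Big)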

Updated: 2025-12-11 04:44:02

标题: 基于图神经网络的云身份和访问管理日志自适应威胁检测

摘要: 云基础设施和分布式身份系统的快速扩展显著增加了现代企业的复杂性和攻击面。传统的基于规则或签名的检测系统通常无法识别身份和访问管理日志中的新颖或不断演变的威胁,其中异常行为可能在统计上看似无害,但在上下文中是恶意的。本文提出了一种基于图神经网络的自适应威胁检测框架,旨在实时从IAM审计日志中学习潜在用户资源交互模式。通过将IAM日志建模为异构动态图,所提出的系统捕捉了用户、角色、会话和访问操作等实体之间的时间、关系和上下文依赖关系。该模型结合了基于注意力的聚合和图嵌入更新,以实现对不断变化的云环境的持续适应。对合成和真实世界的IAM数据集进行的实验评估表明,所提出的方法比基准LSTM和GCN分类器实现了更高的检测精度和召回率,同时在多租户云环境中保持了可扩展性。该框架的可适应性使得能够积极应对内部威胁、权限提升和横向移动攻击,为AI驱动的零信任访问分析奠定了基础。这项工作弥合了基于图的机器学习和运营云安全情报之间的差距。

更新时间: 2025-12-11 04:44:02

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2512.10280v1

Computing Evolutionarily Stable Strategies in Imperfect-Information Games

We present an algorithm for computing evolutionarily stable strategies (ESSs) in symmetric perfect-recall extensive-form games of imperfect information. Our main algorithm is for two-player games, and we describe how it can be extended to multiplayer games. The algorithm is sound and computes all ESSs in nondegenerate games and a subset of them in degenerate games which contain an infinite continuum of symmetric Nash equilibria. The algorithm is anytime and can be stopped early to find one or more ESSs. We experiment on an imperfect-information cancer signaling game as well as random games to demonstrate scalability.
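
For reference, the condition being certified is the standard one: with u(sigma, tau) the payoff of playing sigma against tau, a symmetric strategy sigma is an ESS if and only if, for every mutant tau != sigma,

    u(\sigma, \sigma) > u(\tau, \sigma)
    \quad \text{or} \quad
    \big(\, u(\sigma, \sigma) = u(\tau, \sigma) \ \text{ and } \ u(\sigma, \tau) > u(\tau, \tau) \,\big).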

Updated: 2025-12-11 04:38:55

标题: 在信息不完全的游戏中计算进化稳定策略

摘要: 我们提出了一种算法,用于计算不完全信息对称完美回忆广泛形式博弈中的进化稳定策略(ESSs)。我们的主要算法适用于两人游戏,并描述了如何将其扩展到多人游戏。该算法是可靠的,并在非退化游戏中计算所有ESSs,在包含无限对称纳什均衡的退化游戏中计算其中的一个子集。该算法随时可行,并可提前停止以找到一个或多个ESSs。我们在不完全信息的癌症信号传导游戏以及随机游戏上进行实验,以展示其可扩展性。

更新时间: 2025-12-11 04:38:55

领域: cs.GT,cs.AI,cs.MA,econ.TH,q-bio.PE

下载: http://arxiv.org/abs/2512.10279v1

Geometric Regularity in Deterministic Sampling Dynamics of Diffusion-based Generative Models

Diffusion-based generative models employ stochastic differential equations (SDEs) and their equivalent probability flow ordinary differential equations (ODEs) to establish a smooth transformation between complex high-dimensional data distributions and tractable prior distributions. In this paper, we reveal a striking geometric regularity in the deterministic sampling dynamics of diffusion generative models: each simulated sampling trajectory along the gradient field lies within an extremely low-dimensional subspace, and all trajectories exhibit an almost identical boomerang shape, regardless of the model architecture, applied conditions, or generated content. We characterize several intriguing properties of these trajectories, particularly under closed-form solutions based on kernel-estimated data modeling. We also demonstrate a practical application of the discovered trajectory regularity by proposing a dynamic programming-based scheme to better align the sampling time schedule with the underlying trajectory structure. This simple strategy requires minimal modification to existing deterministic numerical solvers, incurs negligible computational overhead, and achieves superior image generation performance, especially in regions with only 5 - 10 function evaluations.

Updated: 2025-12-11 04:38:09

标题: 扩散型生成模型确定性采样动态中的几何规律

摘要: 扩散型生成模型利用随机微分方程(SDEs)及其等价的概率流常微分方程(ODEs)来建立复杂高维数据分布与可处理的先验分布之间的平滑转换。本文揭示了扩散生成模型确定性采样动态中的引人注目的几何规律:沿着梯度场的每个模拟采样轨迹都位于一个极低维度的子空间内,并且所有轨迹展现出几乎相同的回旋形状,不论模型架构、应用条件或生成内容如何。我们对这些轨迹的几个有趣属性进行了表征,特别是基于核估计数据建模的封闭形式解决方案下。我们还通过提出基于动态规划的方案展示了发现的轨迹规律的实际应用,以更好地使采样时间表与底层轨迹结构对齐。这一简单策略对现有确定性数值求解器的修改需求最小,计算开销微乎其微,并在图像生成性能方面取得卓越成果,尤其是在仅需5-10次函数评估的区域。

更新时间: 2025-12-11 04:38:09

领域: cs.LG,cond-mat.stat-mech,cs.CV,stat.ML

下载: http://arxiv.org/abs/2506.10177v3

Token Is All You Price

We build a mechanism design framework where a platform designs GenAI models to screen users who obtain instrumental value from the generated conversation and privately differ in their preference for latency. We show that the revenue-optimal mechanism is simple: deploy a single aligned (user-optimal) model and use a token cap as the only instrument to screen users. The design decouples model training from pricing, is readily implemented with token metering, and mitigates misalignment pressures.

Updated: 2025-12-11 04:25:41

标题: Token Is All You Price 令牌就是你的价格

摘要: 我们建立了一个机制设计框架,平台设计了GenAI模型来筛选那些从生成的对话中获得实际价值并在延迟偏好上私下有差异的用户。我们展示了收入最优机制是简单的:部署一个单一的对齐(用户最优)模型,并使用令牌上限作为唯一的筛选用户的工具。该设计将模型训练与定价分离开来,可以通过令牌计量轻松实现,并减轻了不一致的压力。

更新时间: 2025-12-11 04:25:41

领域: econ.TH,cs.AI

下载: http://arxiv.org/abs/2510.09859v3

Reverse Thinking Enhances Missing Information Detection in Large Language Models

Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning tasks, yet they often struggle with problems involving missing information, exhibiting issues such as incomplete responses, factual errors, and hallucinations. While forward reasoning approaches like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) have shown success in structured problem-solving, they frequently fail to systematically identify and recover omitted information. In this paper, we explore the potential of reverse thinking methodologies to enhance LLMs' performance on missing information detection tasks. Drawing inspiration from recent work on backward reasoning, we propose a novel framework that guides LLMs through reverse thinking to identify necessary conditions and pinpoint missing elements. Our approach transforms the challenging task of missing information identification into a more manageable backward reasoning problem, significantly improving model accuracy. Experimental results demonstrate that our reverse thinking approach achieves substantial performance gains compared to traditional forward reasoning methods, providing a promising direction for enhancing LLMs' logical completeness and reasoning robustness.

Updated: 2025-12-11 04:25:17

标题: 反向思维增强大型语言模型中缺失信息的检测

摘要: 大语言模型(LLMs)在各种推理任务中展现出卓越的能力,但它们经常在涉及缺失信息的问题上遇到困难,表现出不完整的响应、事实错误和幻觉等问题。虽然前向推理方法如“思维链”(CoT)和“思维树”(ToT)在结构化问题解决中取得成功,但它们经常未能系统地识别和恢复遗漏的信息。在本文中,我们探讨了逆向思维方法对增强LLMs在缺失信息检测任务中表现的潜力。受最近关于反向推理的工作启发,我们提出了一个新颖的框架,通过逆向思维引导LLMs识别必要条件并找出缺失的元素。我们的方法将具有挑战性的缺失信息识别任务转化为一个更易管理的逆向推理问题,显著提高了模型的准确性。实验结果表明,与传统的前向推理方法相比,我们的逆向思维方法取得了显著的性能提升,为增强LLMs的逻辑完整性和推理鲁棒性提供了一个有希望的方向。

更新时间: 2025-12-11 04:25:17

领域: cs.AI

下载: http://arxiv.org/abs/2512.10273v1

Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method -- differential smoothing -- that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7% improvements on the AIME24 dataset.
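
The structural point, that the modification touches only correct trajectories, fits in a few lines; the specific bonus below (a rarity bonus proportional to negative log-probability) is an illustrative stand-in, not the paper's exact smoothing term.

    def shaped_reward(base_reward, is_correct, logprob, alpha=0.1):
        """Apply the diversity-promoting modification only on correct trajectories:
        incorrect answers keep the plain reward, while rarer correct answers
        (lower log-probability) receive a larger bonus."""
        if not is_correct:
            return base_reward
        return base_reward - alpha * logprob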

Updated: 2025-12-11 04:20:51

Categories: cs.LG

Download: http://arxiv.org/abs/2511.19942v2

Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters

Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application characteristics pose major challenges for existing schedulers, which often rely on offline profiling or application-specific assumptions. We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates DL jobs on heterogeneous GPU clusters. RLTune integrates RL-driven prioritization with MILP-based job-to-node mapping to optimize system-wide objectives such as job completion time (JCT), queueing delay, and resource utilization. Trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba, RLTune improves GPU utilization by up to 20%, reduces queueing delay by up to 81%, and shortens JCT by as much as 70%. Unlike prior approaches, RLTune generalizes across diverse workloads without requiring per-job profiling, making it practical for cloud providers to deploy at scale for more efficient, fair, and sustainable DL workload management.

Updated: 2025-12-11 04:19:44

Categories: cs.DC,cs.AI,cs.LG

Download: http://arxiv.org/abs/2512.10271v1

HarnessAgent: Scaling Automatic Fuzzing Harness Construction with Tool-Augmented LLM Pipelines

Large language model (LLM)-based techniques have achieved notable progress in generating harnesses for program fuzzing. However, applying them to arbitrary functions (especially internal functions) at scale remains challenging due to the requirement of sophisticated contextual information, such as specifications, dependencies, and usage examples. State-of-the-art methods heavily rely on static or incomplete context provisioning, causing them to fail to generate functional harnesses. Furthermore, LLMs tend to exploit harness validation metrics, producing plausible yet logically useless code. Harness generation across large and diverse projects therefore continues to face challenges in reliable compilation, robust code retrieval, and comprehensive validation. To address these challenges, we present HarnessAgent, a tool-augmented agentic framework that achieves fully automated, scalable harness construction over hundreds of OSS-Fuzz targets. HarnessAgent introduces three key innovations: 1) a rule-based strategy to identify and minimize various compilation errors; 2) a hybrid tool pool for precise and robust symbol source code retrieval; and 3) an enhanced harness validation pipeline that detects fake definitions. We evaluate HarnessAgent on 243 target functions from OSS-Fuzz projects (65 C projects and 178 C++ projects). It improves the three-shot success rate by approximately 20% compared to state-of-the-art techniques, reaching 87% for C and 81% for C++. Our one-hour fuzzing results show that more than 75% of the harnesses generated by HarnessAgent increase the target function coverage, surpassing the baselines by over 10%. In addition, the hybrid tool-pool system of HarnessAgent achieves a response rate of over 90% for source code retrieval, outperforming Fuzz Introspector by more than 30%.

Updated: 2025-12-11 04:13:33

Categories: cs.CR,cs.SE

Download: http://arxiv.org/abs/2512.03420v3

On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet its design surface, namely the choice of KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimator ($k_1/k_2/k_3$), is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO's KL term; and (iv) introduces RPG-Style Clip, a clipped-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. We extend our experiments to 8K context length, and RPG-REINFORCE with RPG-Style Clip achieves 52% accuracy on AIME25, surpassing the official Qwen3-4B-Instruct model (47%). Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) clipped importance sampling, and (c) an iterative reference-policy update scheme.
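
For reference, the three per-token Monte Carlo KL estimators the abstract names are commonly written as follows (the standard $k_1/k_2/k_3$ formulation with samples drawn from the policy, independent of any paper-specific weighting):

```python
import torch

def kl_estimators(logp_policy, logp_ref):
    # x ~ pi; log importance ratio log r = log pi_ref(x) - log pi(x).
    log_r = logp_ref - logp_policy
    r = log_r.exp()
    k1 = -log_r                # unbiased for KL(pi || pi_ref), high variance
    k2 = 0.5 * log_r ** 2      # biased, low variance
    k3 = (r - 1.0) - log_r     # low variance and always non-negative
    return k1, k2, k3
```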

Updated: 2025-12-11 04:08:28

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.17508v3

Towards Foundation Models with Native Multi-Agent Intelligence

Foundation models (FMs) are increasingly assuming the role of the "brain" of AI agents. While recent efforts have begun to equip FMs with native single-agent abilities -- such as GUI interaction or integrated tool use -- we argue that the next frontier is endowing FMs with native multi-agent intelligence. We identify four core capabilities of FMs in multi-agent contexts: understanding, planning, efficient communication, and adaptation. Contrary to assumptions about the spontaneous emergence of such abilities, we provide extensive empirical evidence across 41 large language models showing that strong single-agent performance alone does not automatically yield robust multi-agent intelligence. To address this gap, we outline key research directions -- spanning dataset construction, evaluation, training paradigms, and safety considerations -- for building FMs with native multi-agent intelligence.

Updated: 2025-12-11 04:06:53

Categories: cs.AI,cs.MA

Download: http://arxiv.org/abs/2512.08743v2

HunyuanOCR Technical Report

This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.
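
The ViT-to-LLM connection described above is a standard MLP projector; a generic sketch, with placeholder dimensions rather than HunyuanOCR's actual configuration, looks like:

```python
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """Two-layer MLP that projects ViT patch features into the LLM's
    embedding space; 1024 and 2048 are illustrative dimensions."""
    def __init__(self, vit_dim=1024, llm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_tokens):    # (batch, num_patches, vit_dim)
        return self.proj(vit_tokens)  # (batch, num_patches, llm_dim)
```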

Updated: 2025-12-11 04:04:16

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.19575v2

AlphaOPT: Formulating Optimization Programs with Self-Improving LLM Experience Library

Optimization modeling enables critical decisions across industries but remains difficult to automate: informal language must be mapped to precise mathematical formulations and executable solver code. Prior LLM approaches either rely on brittle prompting or costly retraining with limited generalization. We present AlphaOPT, a self-improving experience library that enables an LLM to learn from limited demonstrations (even answers alone, without gold-standard programs) and solver feedback - without annotated reasoning traces or parameter updates. AlphaOPT operates in a continual two-phase cycle: (i) a Library Learning phase that reflects on failed attempts, extracting solver-verified, structured insights as {taxonomy, condition, explanation, example}; and (ii) a Library Evolution phase that diagnoses retrieval misalignments and refines the applicability conditions of stored insights, improving transfer across tasks. This design (1) learns efficiently from limited demonstrations without curated rationales, (2) expands continually without costly retraining by updating the library rather than model weights, and (3) makes knowledge explicit and interpretable for human inspection and intervention. Experiments show that AlphaOPT steadily improves with more data (65% to 72% from 100 to 300 training items) and surpasses the strongest baseline by 7.7% on the out-of-distribution OptiBench dataset when trained only on answers. Code and data are available at: https://github.com/Minw913/AlphaOPT.
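
The insight schema is stated explicitly in the abstract; a minimal sketch of a library entry, with a naive keyword match standing in for the paper's actual retrieval and evolution logic, could be:

```python
from dataclasses import dataclass

@dataclass
class Insight:
    taxonomy: str     # e.g. "constraint-modeling/big-M" (hypothetical label)
    condition: str    # applicability condition, refined during Library Evolution
    explanation: str  # why the fix works, verified against solver feedback
    example: str      # a worked formulation snippet

def retrieve(library: list[Insight], task: str) -> list[Insight]:
    # Placeholder retrieval: keyword overlap between task and conditions.
    words = set(task.lower().split())
    return [i for i in library if words & set(i.condition.lower().split())]
```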

Updated: 2025-12-11 03:59:43

Categories: cs.AI

Download: http://arxiv.org/abs/2510.18428v2

Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs

Large Language Models (LLMs) have been integrated into many applications (e.g., web agents) to perform more sophisticated tasks. However, LLM-empowered applications are vulnerable to Indirect Prompt Injection (IPI) attacks, where instructions are injected via untrustworthy external data sources. This paper presents Rennervate, a defense framework to detect and prevent IPI attacks. Rennervate leverages attention features to detect the covert injection at a fine-grained token level, enabling precise sanitization that neutralizes IPI attacks while maintaining LLM functionalities. Specifically, the token-level detector is materialized with a 2-step attentive pooling mechanism, which aggregates attention heads and response tokens for IPI detection and sanitization. Moreover, we establish a fine-grained IPI dataset, FIPI, to be open-sourced to support further research. Extensive experiments verify that Rennervate outperforms 15 commercial and academic IPI defense methods, achieving high precision on 5 LLMs and 6 datasets. We also demonstrate that Rennervate is transferable to unseen attacks and robust against adaptive adversaries.
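
A sketch of the 2-step attentive pooling idea, aggregating attention heads first and response tokens second to score each input token, is shown below; the shapes and layer choices are assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class TwoStepAttentivePooling(nn.Module):
    """Pools (heads, response tokens) attention into per-input-token scores."""
    def __init__(self, num_heads, hidden=64):
        super().__init__()
        self.head_pool = nn.Linear(num_heads, 1)              # step 1: pool heads
        self.token_proj = nn.Linear(1, hidden)
        self.token_query = nn.Parameter(torch.randn(hidden))  # step 2: pool tokens

    def forward(self, attn):  # attn: (num_heads, resp_len, input_len)
        a = self.head_pool(attn.permute(1, 2, 0)).squeeze(-1)  # (resp, input)
        h = torch.tanh(self.token_proj(a.unsqueeze(-1)))       # (resp, input, hidden)
        w = torch.softmax(h @ self.token_query, dim=0)         # weights over resp tokens
        return (w * a).sum(dim=0)                              # (input,) per-token score
```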

Updated: 2025-12-11 03:47:12

Categories: cs.CR

Download: http://arxiv.org/abs/2512.08417v2

GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation

Retrieval-augmented generation (RAG) has proven effective in integrating knowledge into large language models (LLMs). However, conventional RAGs struggle to capture complex relationships between pieces of knowledge, limiting their performance in intricate reasoning that requires integrating knowledge from multiple sources. Recently, graph-enhanced retrieval augmented generation (GraphRAG) builds graph structure to explicitly model these relationships, enabling more effective and efficient retrievers. Nevertheless, its performance is still hindered by the noise and incompleteness within the graph structure. To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for retrieval augmented generation. GFM-RAG is powered by an innovative graph neural network that reasons over graph structure to capture complex query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage training process on large-scale datasets, comprising 60 knowledge graphs with over 14M triples and 700k documents. This results in impressive performance and generalizability for GFM-RAG, making it the first graph foundation model applicable to unseen datasets for retrieval without any fine-tuning required. Extensive experiments on three multi-hop QA datasets and seven domain-specific RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance while maintaining efficiency and alignment with neural scaling laws, highlighting its potential for further improvement.

Updated: 2025-12-11 03:45:04

Categories: cs.IR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2502.01113v3

It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models

Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health.

Updated: 2025-12-11 03:40:00

Categories: cs.MM,cs.CV,cs.LG,eess.AS

Download: http://arxiv.org/abs/2511.19877v2

R^2-HGP: A Double-Regularized Gaussian Process for Heterogeneous Transfer Learning

Multi-output Gaussian process (MGP) models have attracted significant attention for their flexibility and uncertainty-quantification capabilities, and have been widely adopted in multi-source transfer learning scenarios due to their ability to capture inter-task correlations. However, they still face several challenges in transfer learning. First, the input spaces of the source and target domains are often heterogeneous, which makes direct knowledge transfer difficult. Second, potential prior knowledge and physical information are typically ignored during heterogeneous transfer, hampering the utilization of domain-specific insights and leading to unstable mappings. Third, inappropriate information sharing between the target and sources can easily lead to negative transfer. Traditional models fail to address these issues in a unified way. To overcome these limitations, this paper proposes a Double-Regularized Heterogeneous Gaussian Process framework (R^2-HGP). Specifically, a trainable prior probability mapping model is first proposed to align the heterogeneous input domains. The resulting aligned inputs are treated as latent variables, upon which a multi-source transfer GP model is constructed, and the entire structure is integrated into a novel conditional variational autoencoder (CVAE) based framework. Physical insight is further incorporated as a regularization term to ensure that the alignment results adhere to known physical knowledge. Next, within the multi-source transfer GP model, a sparsity penalty is imposed on the transfer coefficients, enabling the model to adaptively select the most informative source outputs and suppress negative transfer. Extensive simulations and real-world engineering case studies validate the effectiveness of our R^2-HGP, demonstrating consistent superiority over state-of-the-art benchmarks across diverse evaluation metrics.

Updated: 2025-12-11 03:38:20

Categories: cs.LG

Download: http://arxiv.org/abs/2512.10258v1

Error Analysis of Generalized Langevin Equations with Approximated Memory Kernels

We analyze prediction error in stochastic dynamical systems with memory, focusing on generalized Langevin equations (GLEs) formulated as stochastic Volterra equations. We establish that, under a strongly convex potential, trajectory discrepancies decay at a rate determined by the decay of the memory kernel and are quantitatively bounded by the estimation error of the kernel in a weighted norm. Our analysis integrates synchronized noise coupling with a Volterra comparison theorem, encompassing both subexponential and exponential kernel classes. For first-order models, we derive moment and perturbation bounds using resolvent estimates in weighted spaces. For second-order models with confining potentials, we prove contraction and stability under kernel perturbations using a hypocoercive Lyapunov-type distance. This framework accommodates non-translation-invariant kernels and white-noise forcing, explicitly linking improved kernel estimation to enhanced trajectory prediction. Numerical examples validate these theoretical findings.
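
For context, a second-order GLE with memory kernel $K$ is typically written in the standard form below; the paper's bounds control the trajectory discrepancy by the estimation error of $K$ in a weighted norm:

```latex
\dot{x}(t) = v(t), \qquad
m\,\dot{v}(t) = -\nabla U(x(t)) - \int_0^t K(t-s)\, v(s)\,\mathrm{d}s + \eta(t)
```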

Updated: 2025-12-11 03:27:58

Categories: stat.ML,cs.LG,math.DS,math.NA,math.PR

Download: http://arxiv.org/abs/2512.10256v1

Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning

Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to 15.97% in accuracy. The code is available at https://github.com/yaxinhou/CPG.
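
The Bayes-optimal classifier in step (ii) uses standard logit adjustment; because CPG keeps the updated labeled distribution known, the correction is just a prior term (a minimal sketch, with `tau` as the usual scaling knob):

```python
import torch

def logit_adjusted_predict(logits, class_counts, tau=1.0):
    # class_counts: label frequencies of the (known) updated labeled set.
    prior = class_counts.float() / class_counts.sum()
    return (logits - tau * prior.log()).argmax(dim=-1)
```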

Updated: 2025-12-11 03:26:54

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.03993v6

Explain First, Trust Later: LLM-Augmented Explanations for Graph-Based Crypto Anomaly Detection

The decentralized finance (DeFi) community has grown rapidly in recent years, pushed forward by cryptocurrency enthusiasts interested in the vast untapped potential of new markets. The surge in popularity of cryptocurrency has ushered in a new era of financial crime. Unfortunately, the novelty of the technology makes the task of catching and prosecuting offenders particularly challenging. Thus, it is necessary to implement automated detection tools related to policies to address the growing criminality in the cryptocurrency realm.

Updated: 2025-12-11 03:15:22

Categories: cs.CE,cs.AI,cs.CR

Download: http://arxiv.org/abs/2506.14933v2

RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, the benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.

Updated: 2025-12-11 03:12:56

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2512.10248v1

Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective

Semi-supervised few-shot learning (SSFSL) formulates real-world applications like ''auto-annotation'', as it aims to learn a model over a few labeled and abundant unlabeled examples to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather ''flat'' distributions of softmax probabilities. This results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. They jointly increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data, and strengthening supervision signals. Building on this, we propose: Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by $\sim$5 accuracy points. SWIFT even rivals supervised learning, which finetunes VLMs with the unlabeled data being labeled with ground truth!
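
Both fixes are indeed simple to express; a minimal sketch of temperature tuning combined with a FixMatch-style confidence threshold (0.95 is the conventional choice, not necessarily SWIFT's setting):

```python
import torch

def pseudo_labels(logits, T=0.05, threshold=0.95):
    # A small temperature sharpens the VLM's flat softmax distributions so
    # that confident pseudo-labels actually clear the threshold.
    probs = torch.softmax(logits / T, dim=-1)
    conf, labels = probs.max(dim=-1)
    mask = conf >= threshold  # only these unlabeled samples contribute
    return labels, mask
```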

Updated: 2025-12-11 03:06:16

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2512.10244v1

Functional Percolation: A Perspective on Criticality of Form and Function

Understanding the physical constraints and minimal conditions that enable information processing in extended systems remains a central challenge across disciplines, from neuroscience and artificial intelligence to social and physical networks. Here we study how network connectivity both limits and enables information processing by analyzing random networks across the structural percolation transition. Using cascade-mediated dynamics as a minimal and universal mechanism for propagating state-dependent responses, we examine structural, functional, and information-theoretic observables as functions of mean degree in Erdos-Renyi networks. We find that the emergence of a giant connected component coincides with a sharp transition in realizable information processing: complex input-output response functions become accessible, functional diversity increases rapidly, output entropy rises, and directed information flow quantified by transfer entropy extends beyond local neighborhoods. These coincident transitions define a regime of functional percolation, referring to a sharp expansion of the space of realizable input-output functions at the structural percolation transition. Near criticality, networks exhibit a Pareto-optimal tradeoff between functional complexity and diversity, suggesting that percolation criticality provides a universal organizing principle for information processing in systems with local interactions and propagating influences.

Updated: 2025-12-11 02:55:57

Categories: physics.soc-ph,cond-mat.stat-mech,cs.AI,physics.comp-ph

Download: http://arxiv.org/abs/2512.09317v2

Magnitude-Modulated Equivariant Adapter for Parameter-Efficient Fine-Tuning of Equivariant Graph Neural Networks

Pretrained equivariant graph neural networks based on spherical harmonics offer efficient and accurate alternatives to computationally expensive ab-initio methods, yet adapting them to new tasks and chemical environments still requires fine-tuning. Conventional parameter-efficient fine-tuning (PEFT) techniques, such as Adapters and LoRA, typically break symmetry, making them incompatible with those equivariant architectures. ELoRA, recently proposed, is the first equivariant PEFT method. It achieves improved parameter efficiency and performance on many benchmarks. However, the relatively high degrees of freedom it retains within each tensor order can still perturb pretrained feature distributions and ultimately degrade performance. To address this, we present Magnitude-Modulated Equivariant Adapter (MMEA), a novel equivariant fine-tuning method which employs lightweight scalar gating to modulate feature magnitudes on a per-order and per-multiplicity basis. We demonstrate that MMEA preserves strict equivariance and, across multiple benchmarks, consistently improves energy and force predictions to state-of-the-art levels while training fewer parameters than competing approaches. These results suggest that, in many practical scenarios, modulating channel magnitudes is sufficient to adapt equivariant models to new chemical environments without breaking symmetry, pointing toward a new paradigm for equivariant PEFT design.
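
The trick preserves equivariance because multiplying an entire order-$\ell$ vector by a scalar commutes with rotations; a minimal sketch over a dict of irrep features (shapes and initialization are assumptions):

```python
import torch
import torch.nn as nn

class MagnitudeGate(nn.Module):
    """One learned scalar per (order, multiplicity) channel; scaling whole
    order-l vectors leaves the network exactly equivariant."""
    def __init__(self, multiplicities):  # e.g. {0: 128, 1: 64, 2: 32}
        super().__init__()
        self.gates = nn.ParameterDict(
            {str(l): nn.Parameter(torch.ones(m)) for l, m in multiplicities.items()}
        )

    def forward(self, feats):  # feats[l]: (batch, m_l, 2l+1)
        return {l: f * self.gates[str(l)][None, :, None] for l, f in feats.items()}
```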

Updated: 2025-12-11 02:46:34

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2511.06696v2

Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length

The proliferation of Large Language Models (LLMs) necessitates valid evaluation methods to guide downstream applications and actionable future improvements. The Item Response Theory (IRT) has recently emerged as a promising framework for evaluating LLMs via their response accuracy. Beyond simple response accuracy, LLMs' chain of thought (CoT) lengths serve as a vital indicator of their reasoning ability. To leverage the CoT length information to assist the evaluation of LLMs, we propose Latency-Response Theory (LaRT) to jointly model the response accuracy and CoT length by introducing the latent ability, latent speed, and a key correlation parameter between them. We derive an efficient estimation algorithm and establish rigorous identifiability results for the population parameters to ensure the statistical validity of estimation. Theoretical asymptotic analyses and simulation studies demonstrate LaRT's advantages over IRT in terms of higher estimation accuracy and shorter confidence intervals for latent traits. A key finding is that the asymptotic estimation precision of the latent ability under LaRT exceeds that of IRT whenever the latent ability and latent speed are correlated. We collect real responses from diverse LLMs on popular benchmark datasets. The application of LaRT reveals a strong negative correlation between the latent ability and latent speed in all benchmarks, with stronger correlation for more difficult benchmarks. This finding supports the intuition that higher reasoning ability correlates with slower speed and longer response latency. LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency. Code and data are available at https://github.com/Toby-X/Latency-Response-Theory-Model.

Updated: 2025-12-11 02:45:56

Categories: stat.ME,cs.AI,stat.AP,stat.ML

Download: http://arxiv.org/abs/2512.07019v2

Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap

As both ML training and inference are increasingly distributed, parallelization techniques that shard (divide) an ML model across the GPUs of a distributed system are often deployed. With such techniques, there is a high prevalence of data-dependent communication and computation operations where communication is exposed, leaving up to 1.7x ideal performance on the table. Prior works harness the fact that ML model state and inputs are already sharded, and employ careful overlap of individual computation/communication shards. While such coarse-grain overlap is promising, in this work we instead make a case for finer-grain compute-communication overlap, which we term FiCCO: overlap one level deeper than shard granularity, unlocking compute/communication overlap for a wider set of network topologies, finer-grain dataflow, and more. We show that FiCCO opens up a wider design space of execution schedules than is possible at shard level alone. At the same time, decomposition of ML operations into smaller operations (done in both shard-based and finer-grain techniques) causes operation-level inefficiency losses. To balance the two, we first present a detailed characterization of these inefficiency losses, then present a design space of FiCCO schedules, and finally overlay the schedules with concomitant inefficiency signatures. Doing so helps us design heuristics that frameworks and runtimes can harness to select bespoke FiCCO schedules based on the nature of the underlying ML operations. Finally, to further minimize the contention inefficiencies inherent in operation overlap, we offload communication to GPU DMA engines. We evaluate several scenarios from realistic ML deployments and demonstrate that our proposed bespoke schedules deliver up to 1.6x speedup and our heuristics provide accurate guidance in 81% of unseen scenarios.

Updated: 2025-12-11 02:43:27

Categories: cs.DC,cs.AR,cs.LG

Download: http://arxiv.org/abs/2512.10236v1

Trustworthy scientific inference with generative models

Generative artificial intelligence (AI) excels at producing complex data structures (text, images, videos) by learning patterns from training examples. Across scientific disciplines, researchers are now applying generative models to "inverse problems" to directly predict hidden parameters from observed data along with measures of uncertainty. While these predictive or posterior-based methods can handle intractable likelihoods and large-scale studies, they can also produce biased or overconfident conclusions even without model misspecifications. We present a solution with Frequentist-Bayes (FreB), a mathematically rigorous protocol that reshapes AI-generated posterior probability distributions into (locally valid) confidence regions that consistently include true parameters with the expected probability, while achieving minimum size when training and target data align. We demonstrate FreB's effectiveness by tackling diverse case studies in the physical sciences: identifying unknown sources under dataset shift, reconciling competing theoretical models, and mitigating selection bias and systematics in observational studies. By providing validity guarantees with interpretable diagnostics, FreB enables trustworthy scientific inference across fields where direct likelihood evaluation remains impossible or prohibitively expensive.

Updated: 2025-12-11 02:41:40

Categories: stat.ML,astro-ph.IM,cs.LG,stat.AP,stat.ME

Download: http://arxiv.org/abs/2508.02602v2

InFerActive: Towards Scalable Human Evaluation of Large Language Models through Interactive Inference

Human evaluation remains the gold standard for evaluating outputs of Large Language Models (LLMs). The current evaluation paradigm reviews numerous individual responses, leading to significant scalability challenges. LLM outputs can be more efficiently represented as a tree structure, reflecting their autoregressive generation process and stochastic token selection. However, conventional tree visualization cannot scale to the exponentially large trees generated by modern sampling methods of LLMs. To address this problem, we present InFerActive, an interactive inference system for scalable human evaluation. InFerActive enables on-demand exploration through probability-based filtering and evaluation features, while bridging the semantic gap between computational tokens and human-readable text through adaptive visualization techniques. Through a technical evaluation and user study (N=12), we demonstrate that InFerActive significantly improves evaluation efficiency and enables more comprehensive assessment of model behavior. We further conduct expert case studies that demonstrate InFerActive's practical applicability and potential for transforming LLM evaluation workflows.

Updated: 2025-12-11 02:41:14

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2512.10234v1

RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs

Reinforcement learning (RL) has emerged as the de-facto paradigm for improving the reasoning capabilities of large language models (LLMs). We have developed RLAX, a scalable RL framework on TPUs. RLAX employs a parameter-server architecture. A master trainer periodically pushes updated model weights to the parameter server while a fleet of inference workers pull the latest weights and generates new rollouts. We introduce a suite of system techniques to enable scalable and preemptible RL for a diverse set of state-of-art RL algorithms. To accelerate convergence and improve model quality, we have devised new dataset curation and alignment techniques. Large-scale evaluations show that RLAX improves QwQ-32B's pass@8 accuracy by 12.8% in just 12 hours 48 minutes on 1024 v5p TPUs, while remaining robust to preemptions during training.

Updated: 2025-12-11 02:38:26

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2512.06392v2

Emotional Support with LLM-based Empathetic Dialogue Generation

Emotional Support Conversation (ESC) aims to provide empathetic and effective emotional assistance through dialogue, addressing the growing demand for mental health support. This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation, where we leverage large-scale language models enhanced by prompt engineering and finetuning techniques. We explore both parameter-efficient Low-Rank Adaptation and full-parameter fine-tuning strategies to improve the model's ability to generate supportive and contextually appropriate responses. Our best model ranked second in the competition, highlighting the potential of combining LLMs with effective adaptation methods for ESC tasks. Future work will focus on further enhancing emotional understanding and response personalization to build more practical and reliable emotional support systems.
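
Of the two strategies, Low-Rank Adaptation is easy to show in miniature; a standard LoRA linear layer in its generic formulation (not the competition configuration) looks like:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze the pretrained weight and learn a rank-r update scaled by alpha/r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```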

Updated: 2025-12-11 02:38:17

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2507.12820v2

DS FedProxGrad: Asymptotic Stationarity Without Noise Floor in Fair Federated Learning

Recent work [arifgroup] introduced Federated Proximal Gradient (FedProxGrad) for solving non-convex composite optimization problems in group-fair federated learning. However, the original analysis established convergence only to a noise-dominated neighborhood of stationarity, with explicit dependence on a variance-induced noise floor. In this work, we provide an improved asymptotic convergence analysis for a generalized FedProxGrad-type analytical framework with inexact local proximal solutions and explicit fairness regularization. We call this extended analytical framework DS FedProxGrad (Decay Step Size FedProxGrad). Under a Robbins-Monro step-size schedule [robbins1951stochastic] and a mild decay condition on local inexactness, we prove that $\liminf_{r\to\infty} \mathbb{E}[\|\nabla F(\mathbf{x}^r)\|^2] = 0$, i.e., the algorithm is asymptotically stationary and the convergence rate does not depend on a variance-induced noise floor.
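
The Robbins-Monro conditions require $\sum_r \eta_r = \infty$ and $\sum_r \eta_r^2 < \infty$; any schedule of the form $\eta_r = a/(b+r)^\gamma$ with $\gamma \in (0.5, 1]$ satisfies them, for example:

```python
def robbins_monro_steps(a=1.0, b=1.0, gamma=0.6):
    """Yields eta_r = a / (b + r)**gamma; for 0.5 < gamma <= 1 the steps
    sum to infinity while their squares sum to a finite value."""
    r = 0
    while True:
        yield a / (b + r) ** gamma
        r += 1
```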

Updated: 2025-12-11 02:35:40

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2512.08671v3

Deferred Poisoning: Making the Model More Vulnerable via Hessian Singularization

Recent studies have shown that deep learning models are very vulnerable to poisoning attacks. Many defense methods have been proposed to address this issue. However, traditional poisoning attacks are not as threatening as commonly believed. This is because they often cause differences in how the model performs on the training set compared to the validation set. Such inconsistency can alert defenders that their data has been poisoned, allowing them to take the necessary defensive actions. In this paper, we introduce a more threatening type of poisoning attack called the Deferred Poisoning Attack. This new attack allows the model to function normally during the training and validation phases but makes it very sensitive to evasion attacks or even natural noise. We achieve this by ensuring the poisoned model's loss function has a similar value as a normally trained model at each input sample but with a large local curvature. A similar model loss ensures that there is no obvious inconsistency between the training and validation accuracy, demonstrating high stealthiness. On the other hand, the large curvature implies that a small perturbation may cause a significant increase in model loss, leading to substantial performance degradation, which reflects a worse robustness. We fulfill this purpose by making the model have singular Hessian information at the optimal point via our proposed Singularization Regularization term. We have conducted both theoretical and empirical analyses of the proposed method and validated its effectiveness through experiments on image classification tasks. Furthermore, we have confirmed the hazards of this form of poisoning attack under more general scenarios using natural noise, offering a new perspective for research in the field of security.
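
The paper's Singularization Regularization term is not reproduced in the abstract, but the property it targets, low loss with large local curvature, can be probed with a symmetric finite-difference proxy (an illustrative diagnostic, assuming a differentiable `loss_fn`):

```python
import torch

def curvature_proxy(model, loss_fn, x, y, eps=1e-2):
    # Central second difference: approximates delta^T H delta around x.
    delta = eps * torch.randn_like(x)
    clean = loss_fn(model(x), y)
    bumped = loss_fn(model(x + delta), y) + loss_fn(model(x - delta), y)
    return bumped - 2.0 * clean  # large value => sharp, fragile loss surface
```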

Updated: 2025-12-11 02:35:39

Categories: cs.LG,cs.CR,cs.CV

Download: http://arxiv.org/abs/2411.03752v3

When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization

Current image generation methods are based on a two-stage training approach. In stage 1, an auto-encoder is trained to compress an image into a latent space; in stage 2, a generative model is trained to learn a distribution over that latent space. This reveals a fundamental trade-off: do we compress more aggressively to make the latent distribution easier for the stage 2 model to learn, even if it makes reconstruction worse? We study this problem in the context of discrete, auto-regressive image generation. Through the lens of scaling laws, we show that smaller stage 2 models can benefit from more compressed stage 1 latents even if reconstruction performance worsens, demonstrating that generation modeling capacity plays a role in this trade-off. Diving deeper, we rigorously study the connection between compute scaling and the stage 1 rate-distortion trade-off. Next, we introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. This regularization improves stage 2 generation performance by making the tokens easier to model, without affecting the stage 1 compression rate and while only marginally affecting distortion: we are able to improve compute efficiency 2-3$\times$ over baseline. Finally, we use CRT with further optimizations to the visual tokenizer setup, resulting in a generative pipeline that matches LlamaGen-3B generation performance (2.18 FID) with half the tokens per image (256 vs. 576) and a fourth of the total model parameters (775M vs. 3.1B) while using the same architecture and inference procedure.

Updated: 2025-12-11 02:35:08

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2412.16326v2

Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables

With the widespread application of multimodal large language models in scientific intelligence, there is an urgent need for more challenging evaluation benchmarks to assess their ability to understand complex scientific data. Scientific tables, as core carriers of knowledge representation, combine text, symbols, and graphics, forming a typical multimodal reasoning scenario. However, existing benchmarks are mostly focused on general domains, failing to reflect the unique structural complexity and domain-specific semantics inherent in scientific research. Chemical tables are particularly representative: they intertwine structured variables such as reagents, conditions, and yields with visual symbols like molecular structures and chemical formulas, posing significant challenges to models in cross-modal alignment and semantic parsing. To address this, we propose ChemTable-a large scale benchmark of chemical tables constructed from real-world literature, containing expert-annotated cell layouts, logical structures, and domain-specific labels. It supports two core tasks: (1) table recognition (structure and content extraction); and (2) table understanding (descriptive and reasoning-based question answering). Evaluation on ChemTable shows that while mainstream multimodal models perform reasonably well in layout parsing, they still face significant limitations when handling critical elements such as molecular structures and symbolic conventions. Closed-source models lead overall but still fall short of human-level performance. This work provides a realistic testing platform for evaluating scientific multimodal understanding, revealing the current bottlenecks in domain-specific reasoning and advancing the development of intelligent systems for scientific research.

Updated: 2025-12-11 02:34:29

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2506.11375v2

Adaptive Information Routing for Multimodal Time Series Forecasting

Time series forecasting is a critical task for artificial intelligence with numerous real-world applications. Traditional approaches primarily rely on historical time series data to predict the future values. However, in practical scenarios, this is often insufficient for accurate predictions due to the limited information available. To address this challenge, multimodal time series forecasting methods which incorporate additional data modalities, mainly text data, alongside time series data have been explored. In this work, we introduce the Adaptive Information Routing (AIR) framework, a novel approach for multimodal time series forecasting. Unlike existing methods that treat text data on par with time series data as interchangeable auxiliary features for forecasting, AIR leverages text information to dynamically guide the time series model by controlling how and to what extent multivariate time series information should be combined. We also present a text-refinement pipeline that employs a large language model to convert raw text data into a form suitable for multimodal forecasting, and we introduce a benchmark that facilitates multimodal forecasting experiments based on this pipeline. Experiment results with the real world market data such as crude oil price and exchange rates demonstrate that AIR effectively modulates the behavior of the time series model using textual inputs, significantly enhancing forecasting accuracy in various time series forecasting tasks.
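
A minimal PyTorch sketch of the routing idea: a text embedding produces per-channel gates that decide how strongly each series contributes before the forecasting backbone sees the input. Module and parameter names are hypothetical and the GRU backbone is a stand-in; this is not the authors' AIR architecture.

import torch
import torch.nn as nn

class TextRoutedForecaster(nn.Module):
    def __init__(self, n_series, text_dim, hidden=64, horizon=1):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, n_series))
        self.backbone = nn.GRU(n_series, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, series, text_emb):
        # series: (B, T, n_series); text_emb: (B, text_dim)
        gate = torch.sigmoid(self.router(text_emb))   # per-channel weights in [0, 1]
        routed = series * gate.unsqueeze(1)           # text controls the channel mix
        h, _ = self.backbone(routed)
        return self.head(h[:, -1])                    # forecast from the last state

model = TextRoutedForecaster(n_series=4, text_dim=16)
print(model(torch.randn(2, 48, 4), torch.randn(2, 16)).shape)  # (2, 1)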

Updated: 2025-12-11 02:25:27

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2512.10229v1

MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.
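
A minimal sketch of how format and accuracy rewards of this kind can be combined. The '<point>x,y</point>' output convention and the reward weights are hypothetical, and IoU stands in for the accuracy signal; the paper's exact prompt format and reward design may differ.

import re
import numpy as np

def format_reward(response: str) -> float:
    """1.0 if the reasoner emitted a parseable spatial prompt (hypothetical tag format)."""
    return 1.0 if re.search(r"<point>\s*\d+\s*,\s*\d+\s*</point>", response) else 0.0

def accuracy_reward(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between the frozen segmenter's mask and the reference mask."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union > 0 else 0.0

def total_reward(response, pred_mask, gt_mask, w_fmt=0.2, w_acc=0.8):
    return w_fmt * format_reward(response) + w_acc * accuracy_reward(pred_mask, gt_mask)

pred = np.zeros((8, 8), bool); pred[2:6, 2:6] = True
gt = np.zeros((8, 8), bool); gt[3:7, 3:7] = True
print(total_reward("<point>12,34</point>", pred, gt))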

Updated: 2025-12-11 02:20:50

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.08177v2

Federated Domain Generalization with Latent Space Inversion

Federated domain generalization (FedDG) addresses distribution shifts among clients in a federated learning framework. FedDG methods aggregate the parameters of locally trained client models to form a global model that generalizes to unseen clients while preserving data privacy. While improving the generalization capability of the global model, many existing approaches in FedDG jeopardize privacy by sharing statistics of client data among themselves. Our solution addresses this problem by contributing new ways to perform local client training and model aggregation. To improve local client training, we enforce (domain) invariance across local models with the help of a novel technique, latent space inversion, which enables better client privacy. When clients are not i.i.d., aggregating their local models may discard certain local adaptations. To overcome this, we propose an important-weight aggregation strategy to prioritize parameters that significantly influence predictions of local models during aggregation. Our extensive experiments show that our approach achieves superior results over state-of-the-art methods with less communication overhead.
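
The aggregation side is easy to sketch: instead of a plain parameter average, each client's contribution to every individual parameter is weighted by a nonnegative importance score (squared gradients are one plausible score; the paper's exact importance measure may differ). A minimal sketch in PyTorch:

import torch

def important_weight_aggregate(client_params, client_importance, eps=1e-8):
    """Per-parameter weighted average across clients.
    client_params / client_importance: one dict (name -> tensor) per client."""
    out = {}
    for name in client_params[0]:
        w = torch.stack([imp[name] for imp in client_importance])  # (C, ...)
        w = w / (w.sum(dim=0, keepdim=True) + eps)                 # normalize over clients
        p = torch.stack([cp[name] for cp in client_params])
        out[name] = (w * p).sum(dim=0)                             # important params dominate
    return out

clients = [{"fc.weight": torch.randn(4, 4)} for _ in range(3)]
scores = [{"fc.weight": torch.rand(4, 4)} for _ in range(3)]       # e.g., squared gradients
print(important_weight_aggregate(clients, scores)["fc.weight"].shape)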

Updated: 2025-12-11 02:17:03

Domains: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2512.10224v1

Galaxy Phase-Space and Field-Level Cosmology: The Strength of Semi-Analytic Models

Semi-analytic models are a widely used approach to simulate galaxy properties within a cosmological framework, relying on simplified yet physically motivated prescriptions. They have also proven to be an efficient alternative for generating accurate galaxy catalogs, offering a faster and less computationally expensive option compared to full hydrodynamical simulations. In this paper, we demonstrate that using only galaxy $3$D positions and radial velocities, we can train a graph neural network coupled to a moment neural network to obtain a robust machine learning based model capable of estimating the matter density parameter, $\Omega_{\rm m}$, with a precision of approximately 10%. The network is trained on ($25 h^{-1}$Mpc)$^3$ volumes of galaxy catalogs from L-Galaxies and can successfully extrapolate its predictions to other semi-analytic models (GAEA, SC-SAM, and Shark) and, more remarkably, to hydrodynamical simulations (Astrid, SIMBA, IllustrisTNG, and SWIFT-EAGLE). Our results show that the network is robust to variations in astrophysical and subgrid physics, cosmological and astrophysical parameters, and the different halo-profile treatments used across simulations. This suggests that the physical relationships encoded in the phase-space of semi-analytic models are largely independent of their specific physical prescriptions, reinforcing their potential as tools for the generation of realistic mock catalogs for cosmological parameter inference.
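
The input representation is simple to sketch: each galaxy becomes a node carrying its 3D position and radial velocity, with edges to its nearest neighbours. The k-nearest-neighbour construction below is a generic choice for illustration, not necessarily the authors' exact graph recipe.

import numpy as np

def knn_galaxy_graph(positions, radial_velocities, k=8):
    """Node features (x, y, z, v_r) and a COO edge list over k nearest neighbours."""
    n = positions.shape[0]
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                     # no self-loops
    neigh = np.argsort(d2, axis=1)[:, :k]            # (n, k)
    edges = np.stack([np.repeat(np.arange(n), k), neigh.ravel()])
    feats = np.concatenate([positions, radial_velocities[:, None]], axis=1)
    return feats, edges

pos = np.random.rand(100, 3) * 25.0                  # toy (25 Mpc/h)^3 box
vr = np.random.randn(100)
feats, edges = knn_galaxy_graph(pos, vr)
print(feats.shape, edges.shape)                      # (100, 4) (2, 800)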

Updated: 2025-12-11 02:13:38

Domains: astro-ph.CO,astro-ph.GA,cs.LG

Download: http://arxiv.org/abs/2512.10222v1

On Learning-Curve Monotonicity for Maximum Likelihood Estimators

The property of learning-curve monotonicity, highlighted in a recent series of work by Loog, Mey and Viering, describes algorithms which only improve in average performance given more data, for any underlying data distribution within a given family. We establish the first nontrivial monotonicity guarantees for the maximum likelihood estimator in a variety of well-specified parametric settings. For sequential prediction with log loss, we show monotonicity (in fact complete monotonicity) of the forward KL divergence for Gaussian vectors with unknown covariance and either known or unknown mean, as well as for Gamma variables with unknown scale parameter. The Gaussian setting was explicitly highlighted as open in the aforementioned works, even in dimension 1. Finally we observe that for reverse KL divergence, a folklore trick yields monotonicity for very general exponential families. All results in this paper were derived by variants of GPT-5.2 Pro. Humans did not provide any proof strategies or intermediate arguments, but only prompted the model to continue developing additional results, and verified and transcribed its proofs.
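
In symbols, one standard way to formalize the property for the MLE under forward KL (a sketch consistent with the abstract; the papers' precise setup may differ in details) is

\[
\mathbb{E}_{S_{n+1}}\left[\mathrm{KL}\left(p_{\theta^\star}\,\|\,p_{\hat\theta(S_{n+1})}\right)\right]
\;\le\;
\mathbb{E}_{S_{n}}\left[\mathrm{KL}\left(p_{\theta^\star}\,\|\,p_{\hat\theta(S_{n})}\right)\right]
\qquad \text{for all } n,
\]

where $S_n$ is an i.i.d. sample of size $n$ from $p_{\theta^\star}$ and $\hat\theta(\cdot)$ is the maximum likelihood estimator. Complete monotonicity is the stronger statement that the sequence $a_n = \mathbb{E}_{S_n}[\mathrm{KL}]$ satisfies $(-1)^k (\Delta^k a)_n \ge 0$ for all $k \ge 0$ and all $n$, with $\Delta$ the forward difference operator.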

Updated: 2025-12-11 02:12:12

Domains: math.ST,cs.LG,stat.ML

Download: http://arxiv.org/abs/2512.10220v1

Multi-Robot Path Planning Combining Heuristics and Multi-Agent Reinforcement Learning

Multi-robot path finding in dynamic environments is a highly challenging classic problem. In the movement process, robots need to avoid collisions with other moving robots while minimizing their travel distance. Previous methods for this problem either continuously replan paths using heuristic search methods to avoid conflicts or choose appropriate collision avoidance strategies based on learning approaches. The former may result in long travel distances due to frequent replanning, while the latter may suffer from low learning efficiency due to poor sample exploration and utilization, leading to high training costs for the model. To address these issues, we propose a path planning method, MAPPOHR, which combines heuristic search, empirical rules, and multi-agent reinforcement learning. The method consists of two layers: a real-time planner based on the multi-agent reinforcement learning algorithm MAPPO, which embeds empirical rules in the action output layer and reward functions, and a heuristic search planner used to create a global guiding path. During movement, the heuristic search planner replans new paths based on the instructions of the real-time planner. We tested our method in 10 different conflict scenarios. The experiments show that the planning performance of MAPPOHR is better than that of existing learning and heuristic methods. Due to the utilization of empirical knowledge and heuristic search, the learning efficiency of MAPPOHR is higher than that of existing learning methods.
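
The two-layer structure is easy to sketch: a heuristic planner supplies a global guide path, and a learned policy follows it while applying local collision-avoidance rules. The toy below replaces A* with greedy Manhattan steps and the MAPPO policy with a wait-if-occupied rule; every interface here is a hypothetical stand-in, not the paper's implementation.

from dataclasses import dataclass, field

@dataclass
class Robot:
    pos: tuple
    goal: tuple
    guide_path: list = field(default_factory=list)

def heuristic_replan(pos, goal):
    """Stand-in for the heuristic global planner (e.g., A*): greedy steps, x then y."""
    path, cur = [], pos
    while cur != goal:
        if cur[0] != goal[0]:
            cur = (cur[0] + (1 if goal[0] > cur[0] else -1), cur[1])
        else:
            cur = (cur[0], cur[1] + (1 if goal[1] > cur[1] else -1))
        path.append(cur)
    return path

def policy(robot, occupied):
    """Stand-in for the learned policy: follow the guide path, wait if blocked
    (an embedded empirical rule)."""
    nxt = robot.guide_path[0]
    return ("wait", robot.pos) if nxt in occupied else ("move", nxt)

def step_all(robots):
    occupied = {r.pos for r in robots}
    for r in robots:
        if r.pos == r.goal:
            continue
        if not r.guide_path:
            r.guide_path = heuristic_replan(r.pos, r.goal)   # global guiding path
        act, cell = policy(r, occupied - {r.pos})
        if act == "move":
            occupied.discard(r.pos)
            r.pos = cell
            occupied.add(cell)
            r.guide_path.pop(0)

robots = [Robot((0, 0), (3, 3)), Robot((3, 3), (0, 0))]
for _ in range(12):
    step_all(robots)
print([r.pos for r in robots])   # both reach their goals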

Updated: 2025-12-11 02:08:19

Domains: cs.AI,cs.LG,cs.RO

Download: http://arxiv.org/abs/2306.01270v2

ID-PaS : Identity-Aware Predict-and-Search for General Mixed-Integer Linear Programs

Mixed-Integer Linear Programs (MIPs) are powerful and flexible tools for modeling a wide range of real-world combinatorial optimization problems. Predict-and-Search methods operate by using a predictive model to estimate promising variable assignments and then guiding a search procedure toward high-quality solutions. Recent research has demonstrated that incorporating machine learning (ML) into the Predict-and-Search framework significantly enhances its performance. Still, prior work is restricted to binary problems and overlooks the fixed variables that commonly arise in practical settings. This work extends the Predict-and-Search (PaS) framework to parametric MIPs and introduces ID-PaS, an identity-aware learning framework that enables the ML model to handle heterogeneous variables more effectively. Experiments on several real-world large-scale problems demonstrate that ID-PaS consistently achieves superior performance compared to the state-of-the-art solver Gurobi and PaS.
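
The predict-and-search pattern itself is easy to sketch. The toy below hard-fixes the most confidently predicted binaries and brute-forces the rest of a small 0/1 program; real Predict-and-Search instead lets a MIP solver search a bounded neighborhood of the prediction, and an infeasible fixing would be relaxed rather than failed. All data and the confidence scores are synthetic stand-ins for an ML model's output.

import itertools
import numpy as np

def predict_and_search(c, A, b, probs, k_fix):
    """Toy PaS for max c.x s.t. Ax <= b, x binary: fix the k_fix most
    confident variables at their rounded prediction, enumerate the rest."""
    n = len(c)
    fixed = np.argsort(-np.abs(probs - 0.5))[:k_fix]
    free = [i for i in range(n) if i not in set(fixed)]
    x = np.zeros(n, dtype=int)
    x[fixed] = (probs[fixed] > 0.5).astype(int)
    best, best_x = -np.inf, None
    for bits in itertools.product([0, 1], repeat=len(free)):
        x[free] = bits
        if np.all(A @ x <= b):
            val = c @ x
            if val > best:
                best, best_x = val, x.copy()
    return best, best_x

rng = np.random.default_rng(0)
c = rng.integers(1, 10, size=10)
A = rng.integers(0, 5, size=(3, 10))
b = A.sum(axis=1) // 2
probs = rng.random(10)                    # stand-in for learned predictions
print(predict_and_search(c, A, b, probs, k_fix=4))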

Updated: 2025-12-11 01:58:28

Domains: cs.AI

Download: http://arxiv.org/abs/2512.10211v1

An exploration for higher efficiency in multi objective optimisation with reinforcement learning

Efficiency in optimisation and search processes remains one of the key challenges affecting the performance and adoption of optimisation algorithms. Utilising a pool of operators instead of a single operator to handle move operations within a neighbourhood remains promising, but finding an optimal or near-optimal sequence of operators requires further investigation. One promising idea is to generalise past experience and investigate how to reuse it. Although numerous works address this issue for single-objective optimisation, multi-objective cases have received little attention in this regard. A generalised approach based on multi-objective reinforcement learning appears to offer a remedy and good solutions for this issue. This paper overviews the proposed generalisation approach, with certain stages completed and other phases outstanding, aimed at demonstrating the efficiency of using multi-objective reinforcement learning.

Updated: 2025-12-11 01:58:04

Domains: cs.AI,cs.NE

Download: http://arxiv.org/abs/2512.10208v1

CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment

Medical care follows complex clinical pathways that extend beyond isolated physician-patient encounters, emphasizing decision-making and transitions between different stages. Current benchmarks focusing on static exams or isolated dialogues inadequately evaluate large language models (LLMs) in dynamic clinical scenarios. We introduce CP-Env, a controllable agentic hospital environment designed to evaluate LLMs across end-to-end clinical pathways. CP-Env simulates a hospital ecosystem with patient and physician agents, constructing scenarios ranging from triage and specialist consultation to diagnostic testing and multidisciplinary team meetings for agent interaction. Following the adaptive flow of care in real hospitals, it enables branching, long-horizon task execution. We propose a three-tiered evaluation framework encompassing Clinical Efficacy, Process Competency, and Professional Ethics. Results reveal that most models struggle with pathway complexity, exhibiting hallucinations and losing critical diagnostic details. Interestingly, excessive reasoning steps can sometimes prove counterproductive, while top models tend to exhibit reduced tool dependency through internalized knowledge. CP-Env advances the development of medical AI agents through comprehensive end-to-end clinical evaluation. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/CP-Env.

Updated: 2025-12-11 01:54:55

Domains: cs.AI

Download: http://arxiv.org/abs/2512.10206v1

On Sybil Proofness in Competitive Combinatorial Exchanges

We study Sybil manipulation in BRACE, a competitive equilibrium mechanism for combinatorial exchanges, by treating identity creation as a finite perturbation of the empirical distribution of reported types. Under standard regularity assumptions on the excess demand map and smoothness of principal utilities, we obtain explicit linear bounds on price and welfare deviations induced by bounded Sybil invasion. Using these bounds, we prove a sharp contrast: strategyproofness in the large holds if and only if each principal's share of identities vanishes, whereas any principal with a persistent positive share can construct deviations yielding strictly positive limiting gains. We further show that the feasibility of BRACE fails in the event of an unbounded population of Sybils and provide a precise cost threshold that ensures disincentivization of such attacks in large markets.

Updated: 2025-12-11 01:53:04

Domains: econ.TH,cs.CR

Download: http://arxiv.org/abs/2512.10203v1

How to Bridge Spatial and Temporal Heterogeneity in Link Prediction? A Contrastive Method

Temporal Heterogeneous Networks play a crucial role in capturing the dynamics and heterogeneity inherent in various real-world complex systems, rendering them a noteworthy research avenue for link prediction. However, existing methods fail to capture the fine-grained differential distribution patterns and temporal dynamic characteristics, which we refer to as spatial heterogeneity and temporal heterogeneity. To overcome such limitations, we propose a novel Contrastive Learning-based Link Prediction model, CLP, which employs a multi-view hierarchical self-supervised architecture to encode spatial and temporal heterogeneity. Specifically, aiming at spatial heterogeneity, we develop a spatial feature modeling layer to capture the fine-grained topological distribution patterns from node- and edge-level representations, respectively. Furthermore, aiming at temporal heterogeneity, we devise a temporal information modeling layer to perceive the evolutionary dependencies of dynamic graph topologies from time-level representations. Finally, we encode the spatial and temporal distribution heterogeneity from a contrastive learning perspective, enabling a comprehensive self-supervised hierarchical relation modeling for the link prediction task. Extensive experiments conducted on four real-world dynamic heterogeneous network datasets verify that CLP consistently outperforms the state-of-the-art models, demonstrating average improvements of 10.10% and 13.44% in terms of AUC and AP, respectively.
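
As a concrete reference point, the standard InfoNCE objective below contrasts two views of the same batch of representations (for example, a spatial view against a temporal view); it is a generic stand-in, not necessarily CLP's exact loss.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """Cross-entropy over a similarity matrix; matching rows are positives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                 # (B, B) cosine similarities
    targets = torch.arange(z1.size(0))         # positives on the diagonal
    return F.cross_entropy(logits, targets)

z_spatial = torch.randn(32, 64)                # e.g., node-level view
z_temporal = torch.randn(32, 64)               # e.g., time-level view
print(info_nce(z_spatial, z_temporal))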

Updated: 2025-12-11 01:43:48

Domains: cs.SI,cs.AI

Download: http://arxiv.org/abs/2411.00612v3

AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding

Evaluating large language models (LLMs) has recently emerged as a critical issue for safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in generating responses in dynamic, interactive clinical multi-turn conversation situations and the identification of multi-faceted evaluation strategies beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by its vast combinatorial space of possible patient states and interaction trajectories, making it difficult to standardize and quantitatively measure such scenarios. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi-turn clinical dialogues between LLM agents. The performance of various clinical conversational agents is then assessed based on our CARE metric, which provides a multi-faceted evaluation standard of clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.

Updated: 2025-12-11 01:25:36

Domains: cs.CL,cs.LG,cs.MA

Download: http://arxiv.org/abs/2512.10195v1

Brain-like emergent properties in deep networks: impact of network architecture, datasets and training

Despite the rapid pace at which deep networks are improving on standardized vision benchmarks, they are still outperformed by humans on real-world vision tasks. One solution to this problem is to make deep networks more brain-like. Although there are several benchmarks that compare the ability of deep networks to predict brain responses on natural images, they do not capture subtle but important emergent properties present in brains. It is also unclear which design principle -- architecture, training data, or training regime -- would have the greatest impact on these emergent properties. To investigate these issues, we systematically evaluated over 30 state-of-the-art networks with varying network architectures, training datasets, and training regimes for the presence or absence of brain-like properties. Our main findings are as follows. First, network architecture had the strongest impact on brain-like properties compared to dataset and training regime variations. Second, networks varied widely in their alignment to the brain with no single network outperforming all others. Taken together, our results offer a principled and interpretable path toward closing the gap between artificial and human vision.

Updated: 2025-12-11 01:23:40

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2411.16326v3

Optimizing the non-Clifford-count in unitary synthesis using Reinforcement Learning

In this paper we study the potential of using reinforcement learning (RL) in order to synthesize quantum circuits, while optimizing the T-count and CS-count, of unitaries that are exactly implementable by the Clifford+T and Clifford+CS gate sets, respectively. We have designed our RL framework to work with channel representation of unitaries, that enables us to perform matrix operations efficiently, using integers only. We have also incorporated pruning heuristics and a canonicalization of operators, in order to reduce the search complexity. As a result, compared to previous works, we are able to implement significantly larger unitaries, in less time, with much better success rate and improvement factor. Our results for Clifford+T synthesis on two qubit unitaries achieve close-to-optimal decompositions for up to 100 T gates, 5 times more than previous RL algorithms and to the best of our knowledge, the largest instances achieved with any method to date. Our RL algorithm is able to recover previously-known optimal linear complexity algorithm for T-count-optimal decomposition of 1 qubit unitaries. We illustrate significant reduction in the asymptotic T-count estimate of important primitives like controlled cyclic shift (43%), controlled adder (14.3%) and multiplier (14%), without adding any extra ancilla. For 2-qubit Clifford+CS unitaries, our algorithm achieves a linear complexity, something that could only be accomplished by a previous algorithm using SO(6) representation.

Updated: 2025-12-11 01:09:52

Domains: quant-ph,cs.AI

Download: http://arxiv.org/abs/2509.21709v2

Exact Recovery of Non-Random Missing Multidimensional Time Series via Temporal Isometric Delay-Embedding Transform

Non-random missing data is a ubiquitous yet undertreated flaw in multidimensional time series, fundamentally threatening the reliability of data-driven analysis and decision-making. Pure low-rank tensor completion, as a classical data recovery method, falls short in handling non-random missingness, both methodologically and theoretically. Hankel-structured tensor completion models provide a feasible approach for recovering multidimensional time series with non-random missing patterns. However, most Hankel-based multidimensional data recovery methods both suffer from unclear sources of Hankel tensor low-rankness and lack an exact recovery theory for non-random missing data. To address these issues, we propose the temporal isometric delay-embedding transform, which constructs a Hankel tensor whose low-rankness is naturally induced by the smoothness and periodicity of the underlying time series. Leveraging this property, we develop the Low-Rank Tensor Completion with Temporal Isometric Delay-embedding Transform (LRTC-TIDT) model, which characterizes the low-rank structure under the Tensor Singular Value Decomposition (t-SVD) framework. Once the prescribed non-random sampling conditions and mild incoherence assumptions are satisfied, the proposed LRTC-TIDT model achieves exact recovery, as confirmed by simulation experiments under various non-random missing patterns. Furthermore, LRTC-TIDT consistently outperforms existing tensor-based methods across multiple real-world tasks, including network flow reconstruction, urban traffic estimation, and temperature field prediction. Our implementation is publicly available at https://github.com/HaoShu2000/LRTC-TIDT.
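
The classical delay embedding underlying the transform is easy to demonstrate: stacking overlapping windows of a series yields a Hankel-structured matrix, and smoothness or periodicity of the series makes that matrix (numerically) low-rank. The sketch below shows the vanilla construction; the paper's temporal isometric variant adds structure beyond this.

import numpy as np

def delay_embed(x, dim, tau=1):
    """Hankel matrix of overlapping windows (rows) from a 1D series."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i : i + (dim - 1) * tau + 1 : tau] for i in range(n)])

t = np.arange(200)
x = np.sin(2 * np.pi * t / 20)                 # periodic signal
H = delay_embed(x, dim=32)
print(H.shape, np.linalg.matrix_rank(H, tol=1e-8))  # (169, 32), rank 2 for one sinusoid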

Updated: 2025-12-11 01:04:27

Domains: cs.LG

Download: http://arxiv.org/abs/2512.10191v1

The Interplay of Statistics and Noisy Optimization: Learning Linear Predictors with Random Data Weights

We analyze gradient descent with randomly weighted data points in a linear regression model, under a generic weighting distribution. This includes various forms of stochastic gradient descent, importance sampling, but also extends to weighting distributions with arbitrary continuous values, thereby providing a unified framework to analyze the impact of various kinds of noise on the training trajectory. We characterize the implicit regularization induced through the random weighting, connect it with weighted linear regression, and derive non-asymptotic bounds for convergence in first and second moments. Leveraging geometric moment contraction, we also investigate the stationary distribution induced by the added noise. Based on these results, we discuss how specific choices of weighting distribution influence both the underlying optimization problem and statistical properties of the resulting estimator, as well as some examples for which weightings that lead to fast convergence cause bad statistical performance.
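
A minimal sketch of the object under study: SGD on squared loss where each sampled point is multiplied by a random weight drawn from a generic distribution (exponential with mean 1 here, purely for illustration). In expectation the update targets a weighted least-squares objective; the weight distribution's variance shows up as extra gradient noise.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
for step in range(5000):
    i = rng.integers(n)
    w = rng.exponential(1.0)                    # random data weight, E[w] = 1
    grad = w * (X[i] @ theta - y[i]) * X[i]     # weighted squared-loss gradient
    theta -= 0.05 / (1 + step / 1000) * grad    # decaying step size
print(np.linalg.norm(theta - theta_star))       # close to the truth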

Updated: 2025-12-11 00:55:29

Domains: stat.ML,cs.LG,stat.CO

Download: http://arxiv.org/abs/2512.10188v1

MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification

We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover: Dafny. Previously, the benchmark existed only in interactive theorem provers (Lean, Isabelle, HOL Light, Metamath). We find that Dafny's automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs, requiring no manual proof steps. For problems where empty proofs fail, we evaluate 12 off-the-shelf LLMs on providing proof hints. The best model we test achieves a 55.7% pass@4 success rate when employing iterative error correction. These preliminary results highlight an effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F .

Updated: 2025-12-11 00:52:19

Domains: cs.LG

Download: http://arxiv.org/abs/2512.10187v1

Watermarks for Language Models via Probabilistic Automata

A recent watermarking scheme for language models achieves distortion-free embedding and robustness to edit-distance attacks. However, it suffers from limited generation diversity and high detection overhead. In parallel, recent research has focused on undetectability, a property ensuring that watermarks remain difficult for adversaries to detect and spoof. In this work, we introduce a new class of watermarking schemes constructed through probabilistic automata. We present two instantiations: (i) a practical scheme with exponential generation diversity and computational efficiency, and (ii) a theoretical construction with formal undetectability guarantees under cryptographic assumptions. Extensive experiments on LLaMA-3B and Mistral-7B validate the superior performance of our scheme in terms of robustness and efficiency.

Updated: 2025-12-11 00:49:06

Domains: cs.CR,cs.CL

Download: http://arxiv.org/abs/2512.10185v1

Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.

Updated: 2025-12-11 00:46:50

Domains: cs.CV,cs.AI

Download: http://arxiv.org/abs/2511.17844v2

Improved Segmentation of Polyps and Visual Explainability Analysis

Colorectal cancer (CRC) remains one of the leading causes of cancer-related morbidity and mortality worldwide, with gastrointestinal (GI) polyps serving as critical precursors according to the World Health Organization (WHO). Early and accurate segmentation of polyps during colonoscopy is essential for reducing CRC progression, yet manual delineation is labor-intensive and prone to observer variability. Deep learning methods have demonstrated strong potential for automated polyp analysis, but their limited interpretability remains a barrier to clinical adoption. In this study, we present PolypSeg-GradCAM, an explainable deep learning framework that integrates a U-Net architecture with a pre-trained ResNet-34 backbone and Gradient-weighted Class Activation Mapping (Grad-CAM) for transparent polyp segmentation. To ensure rigorous benchmarking, the model was trained and evaluated using 5-Fold Cross-Validation on the Kvasir-SEG dataset of 1,000 annotated endoscopic images. Experimental results show a mean Dice coefficient of 0.8902 +/- 0.0125, a mean Intersection-over-Union (IoU) of 0.8023, and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.9722. Advanced quantitative analysis using an optimal threshold yielded a Sensitivity of 0.9058 and Precision of 0.9083. Additionally, Grad-CAM visualizations confirmed that predictions were guided by clinically relevant regions, offering insight into the model's decision-making process. This study demonstrates that integrating segmentation accuracy with interpretability can support the development of trustworthy AI-assisted colonoscopy tools.
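
For reference, the two overlap metrics reported above are computed as follows for binary masks; the smoothing epsilon guards against empty masks.

import numpy as np

def dice_and_iou(pred, target, eps=1e-7):
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    dice = (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, target).sum() + eps)
    return float(dice), float(iou)

pred = np.zeros((64, 64)); pred[10:40, 10:40] = 1
gt = np.zeros((64, 64)); gt[15:45, 15:45] = 1
print(dice_and_iou(pred, gt))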

Updated: 2025-12-11 00:35:38

Domains: cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.18159v3

Assessing Neuromorphic Computing for Fingertip Force Decoding from Electromyography

High-density surface electromyography (HD-sEMG) provides a noninvasive neural interface for assistive and rehabilitation control, but mapping neural activity to user motor intent remains challenging. We assess a spiking neural network (SNN) as a neuromorphic architecture against a temporal convolutional network (TCN) for decoding fingertip force from motor-unit (MU) firing derived from HD-sEMG. Data were collected from a single participant (10 trials) with two forearm electrode arrays; MU activity was obtained via FastICA-based decomposition, and models were trained on overlapping windows with end-to-end causal convolutions. On held-out trials, the TCN achieved 4.44% MVC RMSE (Pearson r = 0.974) while the SNN achieved 8.25% MVC (r = 0.922). While the TCN was more accurate, we view the SNN as a realistic neuromorphic baseline that could close much of this gap with modest architectural and hyperparameter refinements.
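
The TCN side of the comparison rests on causal convolutions: left-padding makes the output at time t depend only on inputs up to t, which is what lets the decoder run online. The stack below is a generic sketch (channel counts and dilations are illustrative, not the paper's configuration).

import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Conv1d with left padding so output[t] sees only inputs <= t."""
    def __init__(self, c_in, c_out, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (B, C, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

tcn = nn.Sequential(                             # motor-unit channels -> force
    CausalConv1d(64, 32, 3, dilation=1), nn.ReLU(),
    CausalConv1d(32, 32, 3, dilation=2), nn.ReLU(),
    CausalConv1d(32, 1, 3, dilation=4),
)
spikes = torch.randn(4, 64, 200)                 # (batch, motor units, time)
print(tcn(spikes).shape)                         # torch.Size([4, 1, 200])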

Updated: 2025-12-11 00:33:31

Domains: cs.LG,eess.SP

Download: http://arxiv.org/abs/2512.10179v1

CIEGAD: Cluster-Conditioned Interpolative and Extrapolative Framework for Geometry-Aware and Domain-Aligned Data Augmentation

In practical deep learning deployment, the scarcity of data and the imbalance of label distributions often lead to semantically uncovered regions within the real-world data distribution, hindering model training and causing misclassification near class boundaries as well as unstable behaviors in peripheral areas. Although recent large language models (LLMs) show promise for data augmentation, an integrated framework that simultaneously achieves directional control of generation, domain alignment, and quality control has not yet been fully established. To address these challenges, we propose a Cluster-conditioned Interpolative and Extrapolative framework for Geometry-Aware and Domain-aligned data augmentation (CIEGAD), which systematically complements both in-distribution and out-of-distribution semantically uncovered regions. CIEGAD constructs domain profiles through cluster conditioning, allocates generation with a hierarchical frequency-geometric allocation integrating class frequency and geometric indicators, and finely controls generation directions via the coexistence of interpolative and extrapolative synthesis. It further performs quality control through geometry-constrained filtering combined with an LLM-as-a-Judge mechanism. Experiments on multiple classification tasks demonstrate that CIEGAD effectively extends the periphery of real-world data distributions while maintaining high alignment between generated and real-world data as well as semantic diversity. In particular, for long-tailed and multi-class classification tasks, CIEGAD consistently improves F1 and recall, validating the triple harmony of distributional consistency, diversity, and quality. These results indicate that CIEGAD serves as a practically oriented data augmentation framework that complements underrepresented regions while preserving alignment with real-world data.
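
Geometrically, the interpolative/extrapolative distinction reduces to where a synthesized point sits on the ray from a cluster centroid through a sample: inside the segment (in-distribution in-fill) or beyond it (peripheral coverage). The toy below illustrates that geometry only; the paper's actual generation step is LLM-based.

import numpy as np

def synth(center, x, lam):
    """0 < lam < 1 interpolates between centroid and sample;
    lam > 1 extrapolates past the sample toward the periphery."""
    return center + lam * (x - center)

center = np.array([0.0, 0.0])
x = np.array([1.0, 2.0])
print(synth(center, x, 0.5))   # interpolation  -> [0.5 1. ]
print(synth(center, x, 1.5))   # extrapolation  -> [1.5 3. ]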

Updated: 2025-12-11 00:32:37

Domains: cs.LG,cs.CL

Download: http://arxiv.org/abs/2512.10178v1

Unveiling the Latent Directions of Reflection in Large Language Models

Reflection, the ability of large language models (LLMs) to evaluate and revise their own reasoning, has been widely used to improve performance on complex reasoning tasks. Yet, most prior work emphasizes designing reflective prompting strategies or reinforcement learning objectives, leaving the inner mechanisms of reflection underexplored. In this paper, we investigate reflection through the lens of latent directions in model activations. We propose a methodology based on activation steering to characterize instructions with three levels of reflective intention: no reflection, intrinsic reflection, and triggered reflection. By constructing steering vectors between these reflection levels, we demonstrate that (1) new reflection-inducing instructions can be systematically identified, (2) reflective behavior can be directly enhanced or suppressed through activation interventions, and (3) suppressing reflection is considerably easier than stimulating it. Experiments on GSM8k-adv and Cruxeval-o-adv with Qwen2.5-3B and Gemma3-4B-IT reveal clear stratification across reflection levels, and steering interventions confirm the controllability of reflection. Our findings highlight both opportunities (e.g., reflection-enhancing defenses) and risks (e.g., adversarial inhibition of reflection in jailbreak attacks). This work opens a path toward mechanistic understanding of reflective reasoning in LLMs.
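
The basic activation-steering mechanics look like this: a steering vector is the difference of mean activations under the two instruction conditions, and it is added to a layer's output at inference time via a forward hook (positive strength to stimulate reflection, negative to suppress it). The linear layer and random activations below are placeholders for a transformer block and cached activations.

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16)                     # stand-in for a transformer block

acts_reflect = torch.randn(100, 16) + 1.0     # placeholder cached activations
acts_plain = torch.randn(100, 16)
v = acts_reflect.mean(0) - acts_plain.mean(0)
v = v / v.norm()                              # steering direction

alpha = 4.0                                   # strength; flip the sign to suppress
def steer(module, inputs, output):
    return output + alpha * v                 # add the direction to the output

handle = layer.register_forward_hook(steer)
out = layer(torch.randn(2, 16))
handle.remove()
print(out.shape)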

Updated: 2025-12-11 00:22:34

Domains: cs.LG

Download: http://arxiv.org/abs/2508.16989v2

Offscript: Automated Auditing of Instruction Adherence in LLMs

Large Language Models (LLMs) and generative search systems are increasingly used for information seeking by diverse populations with varying preferences for knowledge sourcing and presentation. While users can customize LLM behavior through custom instructions and behavioral prompts, no mechanism exists to evaluate whether these instructions are being followed effectively. We present Offscript, an automated auditing tool that efficiently identifies potential instruction following failures in LLMs. In a pilot study analyzing custom instructions sourced from Reddit, Offscript detected potential deviations from instructed behavior in 86.4% of conversations, 22.2% of which were confirmed as material violations through human review. Our findings suggest that automated auditing serves as a viable approach for evaluating compliance to behavioral instructions related to information seeking.

Updated: 2025-12-11 00:11:50

Domains: cs.HC,cs.AI,cs.CL

Download: http://arxiv.org/abs/2512.10172v1

Semantic-Aware Confidence Calibration for Automated Audio Captioning

Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation. We present a framework that addresses both limitations by integrating confidence prediction into audio captioning and redefining correctness through semantic similarity. Our approach augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. We employ CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality rather than surface-level text overlap. Experiments on Clotho v2 demonstrate that confidence-guided beam search with semantic evaluation achieves dramatically improved calibration (CLAP-based ECE of 0.071) compared to greedy decoding baselines (ECE of 0.488), while simultaneously improving caption quality across standard metrics. Our results establish that semantic similarity provides a more meaningful foundation for confidence calibration in audio captioning than traditional n-gram metrics.
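
The key change to ECE is only in the correctness labels: a caption counts as correct when its embedding similarity to the reference clears a threshold, rather than when n-grams overlap. A minimal sketch (the 0.5 threshold and the synthetic scores are illustrative):

import numpy as np

def semantic_ece(conf, sim, sim_thresh=0.5, n_bins=10):
    """ECE with semantic correctness: correct iff similarity >= threshold."""
    correct = (sim >= sim_thresh).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            ece += m.mean() * abs(conf[m].mean() - correct[m].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.random(1000)                                      # model confidences
sim = np.clip(conf + 0.1 * rng.normal(size=1000), 0, 1)      # e.g., CLAP/FENSE scores
print(semantic_ece(conf, sim))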

Updated: 2025-12-11 00:09:23

Domains: cs.SD,cs.LG

Download: http://arxiv.org/abs/2512.10170v1

An Introduction to Deep Reinforcement and Imitation Learning

Embodied agents, such as robots and virtual characters, must continuously select actions to execute tasks effectively, solving complex sequential decision-making problems. Given the difficulty of designing such controllers manually, learning-based approaches have emerged as promising alternatives, most notably Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL). DRL leverages reward signals to optimize behavior, while DIL uses expert demonstrations to guide learning. This document introduces DRL and DIL in the context of embodied agents, adopting a concise, depth-first approach to the literature. It is self-contained, presenting all necessary mathematical and machine learning concepts as they are needed. It is not intended as a survey of the field; rather, it focuses on a small set of foundational algorithms and techniques, prioritizing in-depth understanding over broad coverage. The material ranges from Markov Decision Processes to REINFORCE and Proximal Policy Optimization (PPO) for DRL, and from Behavioral Cloning to Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL) for DIL.
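
As a taste of the material, REINFORCE fits in a few lines: sample actions, score them by reward minus a baseline, and descend the negative log-probability-weighted return. The toy two-armed bandit below is the smallest instance of the policy-gradient recipe the text covers.

import torch

torch.manual_seed(0)
logits = torch.zeros(2, requires_grad=True)     # policy over two actions
opt = torch.optim.Adam([logits], lr=0.1)
true_reward = torch.tensor([0.2, 0.8])          # arm 1 pays off more often

for _ in range(300):
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample((64,))                      # a batch of one-step episodes
    r = torch.bernoulli(true_reward[a])         # stochastic rewards
    baseline = r.mean()                         # variance-reduction baseline
    loss = -(dist.log_prob(a) * (r - baseline)).mean()   # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, dim=0))             # mass shifts to the better arm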

Updated: 2025-12-11 00:05:29

Domains: cs.RO,cs.LG

Download: http://arxiv.org/abs/2512.08052v2

The 2025 Foundation Model Transparency Index

Foundation model developers are among the world's most important companies. As these companies become increasingly consequential, how do their transparency practices evolve? The 2025 Foundation Model Transparency Index is the third edition of an annual effort to characterize and quantify the transparency of foundation model developers. The 2025 FMTI introduces new indicators related to data acquisition, usage data, and monitoring and evaluates companies like Alibaba, DeepSeek, and xAI for the first time. The 2024 FMTI reported that transparency was improving, but the 2025 FMTI finds this progress has deteriorated: the average score out of 100 fell from 58 in 2024 to 40 in 2025. Companies are most opaque about their training data and training compute as well as the post-deployment usage and impact of their flagship models. In spite of this general trend, IBM stands out as a positive outlier, scoring 95, in contrast to the lowest scorers, xAI and Midjourney, at just 14. The five members of the Frontier Model Forum we score end up in the middle of the Index: we posit that these companies avoid reputational harms from low scores but lack incentives to be transparency leaders. As policymakers around the world increasingly mandate certain types of transparency, this work reveals the current state of transparency for foundation model developers, how it may change given newly enacted policy, and where more aggressive policy interventions are necessary to address critical information deficits.

Updated: 2025-12-11 00:01:53

Domains: cs.AI,cs.CY,cs.LG

Download: http://arxiv.org/abs/2512.10169v1

By Xinhai (Sean) Zou.