PlayWorld: Learning Robot World Models from Autonomous Play
Action-conditioned video models offer a promising path to building general-purpose robot simulators that can improve directly from data. Yet, despite training on large-scale robot datasets, current state-of-the-art video models still struggle to predict physically consistent robot-object interactions that are crucial in robotic manipulation. To close this gap, we present PlayWorld, a simple, scalable, and fully autonomous pipeline for training high-fidelity video world simulators from interaction experience. In contrast to prior approaches that rely on success-biased human demonstrations, PlayWorld is the first system capable of learning entirely from unsupervised robot self-play, enabling naturally scalable data collection while capturing complex, long-tailed physical interactions essential for modeling realistic object dynamics. Experiments across diverse manipulation tasks show that PlayWorld generates high-quality, physically consistent predictions for contact-rich interactions that are not captured by world models trained on human-collected data. We further demonstrate the versatility of PlayWorld in enabling fine-grained failure prediction and policy evaluation, with up to 40% improvements over human-collected data. Finally, we demonstrate how PlayWorld enables reinforcement learning in the world model, improving policy performance by 65% in success rates when deployed in the real world.
Updated: 2026-03-09 23:58:07
Categories: cs.RO,cs.AI
Automating Detection and Root-Cause Analysis of Flaky Tests in Quantum Software
Like classical software, quantum software systems rely on automated testing. However, their inherently probabilistic outputs make them susceptible to quantum flakiness -- tests that pass or fail inconsistently without code changes. Such quantum flaky tests can mask real defects and reduce developer productivity, yet systematic tooling for their detection and diagnosis remains limited. This paper presents an automated pipeline to detect flaky-test-related issues and pull requests in quantum software repositories and to support the identification of their root causes. We aim to expand an existing quantum flaky test dataset and evaluate the capability of Large Language Models (LLMs) for flakiness classification and root-cause identification. Building on a prior manual analysis of 14 quantum software repositories, we automate the discovery of additional flaky test cases using LLMs and cosine similarity. We further evaluate a variety of LLMs from OpenAI GPT, Meta LLaMA, Google Gemini, and Anthropic Claude suites for classifying flakiness and identifying root causes from issue descriptions and code context. Classification performance is assessed using standard performance metrics, including F1-score. Using our pipeline, we identify 25 previously unknown flaky tests, increasing the original dataset size by 54%. The best-performing model, Google Gemini, achieves an F1-score of 0.9420 for flakiness detection and 0.9643 for root-cause identification, demonstrating that LLMs can provide practical support for triaging flaky reports and understanding their underlying causes in quantum software. The expanded dataset and automated pipeline provide reusable artifacts for the quantum software engineering community. Future work will focus on improving detection robustness and exploring automated repair of quantum flaky tests.
Updated: 2026-03-09 23:57:55
Categories: cs.SE,cs.AI,cs.ET
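The flaky-test abstract above relies on two standard quantities, cosine similarity (for finding issues similar to known flaky reports) and the F1-score (for evaluating classification). The following is a minimal sketch of both in plain Python; the function names and the toy inputs are illustrative assumptions and are not taken from the paper's pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = flaky, 0 = not flaky)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a pipeline like the one described, the vectors would presumably be text embeddings of issue or pull-request descriptions; how they are produced is not specified in the abstract.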
Lockbox -- A Zero Trust Architecture for Secure Processing of Sensitive Cloud Workloads
Enterprises increasingly rely on cloud-based applications to process highly sensitive data artifacts. Although cloud adoption improves agility and scalability, it also introduces new security challenges such as expanded attack surfaces, a wider radius of attack from credential compromise, and challenges maintaining strict access controls across users, services, and workflows. These challenges are especially acute for applications that handle privileged data and execute security-critical analysis, where traditional trust boundaries and ad hoc safeguards are insufficient. This paper presents Lockbox, a Zero Trust architecture designed for secure processing of sensitive cloud workloads under strict enterprise security and governance requirements. Lockbox applies explicit trust verification, strong isolation, least-privilege access, and policy-driven enforcement throughout the entire application lifecycle, from user authentication and document ingestion to analysis execution and result storage. The system incorporates modern cloud security primitives, including role-based access control, centralized key management, encryption in transit and at rest, and controlled integration with cloud-based data processing services, ensuring that sensitive artifacts remain protected and accessible only to authorized users. We discuss the usage of Lockbox in processing highly sensitive cybersecurity reports and demonstrate how this architecture enables organizations to safely adopt advanced capabilities, including AI-assisted processing, without weakening their security posture.
Updated: 2026-03-09 23:45:00
Categories: cs.CR,cs.DC,cs.SE
When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency
Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER, a detector- and model-agnostic, data-only test that estimates the post-drift data size required for stable retraining. CALIPER exploits state dependence in streams generated by dynamical systems: we run a single-pass weighted local regression over the post-drift window and track a one-step proxy error as a function of a locality parameter $θ$. When an effective sample size gate is satisfied, a monotonically non-increasing trend in this error as the locality parameter increases indicates that the data size is sufficiently informative for retraining. We also provide a theoretical analysis of our method, and we show that the algorithm has low per-update time and memory costs. Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates. CALIPER closes the gap between drift detection and data-sufficient adaptation in streaming learning.
Updated: 2026-03-09 23:43:56
Categories: cs.LG
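The decision rule in the CALIPER abstract above (an effective sample size gate plus a non-increasing error trend over the locality parameter) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not say how the effective sample size is computed or how the gate is applied, so this sketch uses Kish's formula and requires the gate to hold at every value of θ. The names `enough_data` and `min_ess` are hypothetical.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    squares = sum(w * w for w in weights)
    return 0.0 if squares == 0 else total * total / squares


def is_non_increasing(errors, tol=0.0):
    """True if each error is no larger than its predecessor (within tol)."""
    return all(b <= a + tol for a, b in zip(errors, errors[1:]))


def enough_data(errors_by_theta, ess_by_theta, min_ess):
    """Hypothetical sufficiency test.

    errors_by_theta: one-step proxy errors, ordered by increasing theta.
    ess_by_theta:    effective sample size of the local fit at each theta.
    """
    gate_ok = all(ess >= min_ess for ess in ess_by_theta)
    return gate_ok and is_non_increasing(errors_by_theta)
```

For example, `enough_data([0.9, 0.7, 0.7, 0.5], [30, 25, 20, 15], min_ess=10)` passes both checks, while a bump in the error sequence or a too-small effective sample size makes the test fail.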
The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
The context window of a large language model is not memory. It is L1 cache: a small, fast, expensive resource that the field treats as the entire memory system. There is no L2, no virtual memory, no paging. Every tool definition, every system prompt, and every stale tool result occupies context for the lifetime of the session. The result is measurable: across 857 production sessions and 4.45 million effective input tokens, 21.8% is structural waste. We present Pichay, a demand paging system for LLM context windows. Implemented as a transparent proxy between client and inference API, Pichay interposes on the message stream to evict stale content, detect page faults when the model re-requests evicted material, and pin working-set pages identified by fault history. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%. In live production deployment over 681 turns, the system reduces context consumption by up to 93% (5,038KB to 339KB); under extreme sustained pressure, the system remains operational but exhibits the expected thrashing pathology, with repeated fault-in of evicted content. The key observation is that the problems the field faces, such as context limits, attention degradation, cost scaling, and lost state across sessions, are virtual memory problems wearing different clothes. The solutions exist: working set theory (Denning, 1968), demand paging, fault-driven replacement policies, and memory hierarchies with multiple eviction-managed levels. We describe the architecture of a full memory hierarchy for LLM systems (L1 through persistent storage), report on the first three levels deployed in production use (L1 eviction, L2 fault-driven pinning, L3 model-initiated conversation compaction), and identify cross-session memory as the remaining frontier.
Updated: 2026-03-09 23:38:32
Categories: cs.OS,cs.AI,cs.SE
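The three mechanisms named in the Pichay abstract above (evicting stale content, detecting a page fault when evicted material is requested again, and pinning pages based on fault history) can be sketched as a toy pager. This is not Pichay's implementation: the class `ContextPager`, the age-based staleness rule, and the `pin_after_faults` threshold are all hypothetical choices made to keep the example small.

```python
class ContextPager:
    """Toy demand pager for context 'pages' (e.g. tool results)."""

    def __init__(self, max_age, pin_after_faults=2):
        self.max_age = max_age                    # turns a page may sit unused
        self.pin_after_faults = pin_after_faults  # faults before pinning
        self.resident = {}   # page_id -> (content, last_used_turn)
        self.evicted = {}    # page_id -> content held outside the context
        self.faults = {}     # page_id -> number of page faults so far
        self.pinned = set()  # pages exempt from eviction
        self.turn = 0

    def add(self, page_id, content):
        self.resident[page_id] = (content, self.turn)

    def end_turn(self):
        """Advance one turn and evict unpinned pages that have gone stale."""
        self.turn += 1
        for page_id in list(self.resident):
            content, last_used = self.resident[page_id]
            stale = self.turn - last_used > self.max_age
            if stale and page_id not in self.pinned:
                self.evicted[page_id] = content
                del self.resident[page_id]

    def access(self, page_id):
        """Return page content, faulting it back in if it was evicted."""
        if page_id in self.resident:
            content, _ = self.resident[page_id]
        else:
            content = self.evicted.pop(page_id)  # KeyError for unknown pages
            self.faults[page_id] = self.faults.get(page_id, 0) + 1
            if self.faults[page_id] >= self.pin_after_faults:
                self.pinned.add(page_id)
        self.resident[page_id] = (content, self.turn)
        return content
```

A page that keeps being evicted and faulted back in is eventually pinned, which is one simple way to avoid the thrashing behaviour the abstract mentions; a real system would also need a bound on how much can be pinned.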
MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games
Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling. This biases win rate estimates and makes rankings unreliable across repeated tournaments. Prompt choice worsens this further by producing different effective policies. We address both instability and underperformance with MEMO (Memory-augmented MOdel context optimization), a self-play framework that optimizes inference-time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self-play trajectories and injects them as priors during later play. Exploration runs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit rare and decisive states. Across five text-based games, MEMO raises mean win rate from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct, using 2,000 self-play games per task. Run-to-run variance also drops, giving more stable rankings across prompt variations. These results suggest that multi-agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves the largest gains in negotiation and imperfect-information games, while RL remains more effective in perfect-information settings.
Updated: 2026-03-09 23:36:32
Categories: cs.AI
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: the model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
Updated: 2026-03-09 23:35:16
Categories: cs.CL,cs.AI,cs.LG
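The probe-guided early exit mentioned in the Reasoning Theater abstract above can be pictured as a stopping rule over per-step probe confidences. The abstract does not describe the actual rule, so the sketch below is only one plausible form: stop once the probe's confidence in its current answer has stayed above a threshold for a few consecutive steps. The function name, `threshold`, and `patience` are hypothetical.

```python
def probe_guided_exit(confidences, threshold=0.95, patience=2):
    """Return the index of the step at which to stop generating, or None.

    confidences: per-step probe confidence in the model's current answer,
                 assumed to come from a probe on the model's activations.
    """
    streak = 0
    for step, confidence in enumerate(confidences):
        # Count consecutive confident steps; any dip resets the count.
        streak = streak + 1 if confidence >= threshold else 0
        if streak >= patience:
            return step
    return None
```

Requiring a short streak rather than a single confident step is a design choice that guards against one-off spikes; it trades a few extra tokens for fewer premature exits.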
AI Phenomenology for Understanding Human-AI Experiences Across Eras
There is no 'ordinary' when it comes to AI. The human-AI experience is extraordinarily complex and specific to each person, yet dominant measures such as usability scales and engagement metrics flatten away nuance. We argue for AI phenomenology: a research stance that asks "How did it feel?" beyond the standard questions of "How well did it perform?" when interacting with AI systems. AI phenomenology acts as a paradigm for bidirectional human-AI alignment as it foregrounds users' first-person perceptions and interpretations of AI systems over time. We motivate AI phenomenology as a framework that captures how alignment is experienced, negotiated, and updated between users and AI systems. Tracing a lineage from Husserl through postphenomenology to Actor-Network Theory, and grounding our argument in three studies -- two longitudinal studies with "Day", an AI companion, and a multi-method study of agentic AI in software engineering -- we contribute a set of replicable methodological toolkits for conducting AI phenomenology research: instruments for capturing lived experience across personal and professional contexts, three design concepts (translucent design, agency-aware value alignment, temporal co-evolution tracking), and a concrete research agenda. We offer this toolkit not as a new paradigm but as a practical scaffold that researchers can adapt as AI systems -- and the humans who live alongside them -- continue to co-evolve.
Updated: 2026-03-09 23:26:46
Categories: cs.HC,cs.AI
Meissa: Multi-modal Medical Agentic Intelligence
Multi-modal large language models (MM-LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi-agent collaboration, enabling complex decision-making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API-based deployment incurs high cost, high latency, and privacy risks that conflict with on-premise clinical requirements. We present Meissa, a lightweight 4B-parameter medical MM-LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi-step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state-action-observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three-tier stratified supervision: the model's own errors trigger progressive escalation from direct reasoning to tool-augmented and multi-agent interaction, explicitly learning difficulty-aware strategy selection. (3) Prospective-retrospective supervision: pairing exploratory forward traces with hindsight-rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25x fewer parameters than typical frontier models like Gemini-3, Meissa operates fully offline with 22x lower end-to-end latency compared to API-based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.
Updated: 2026-03-09 23:22:55
Categories: cs.AI
An accurate flatness measure to estimate the generalization performance of CNN models
Flatness measures based on the spectrum or the trace of the Hessian of the loss are widely used as proxies for the generalization ability of deep networks. However, most existing definitions are either tailored to fully connected architectures, relying on stochastic estimators of the Hessian trace, or ignore the specific geometric structure of modern Convolutional Neural Networks (CNNs). In this work, we develop a flatness measure that is both exact and architecturally faithful for a broad and practically relevant class of CNNs. We first derive a closed-form expression for the trace of the Hessian of the cross-entropy loss with respect to convolutional kernels in networks that use global average pooling followed by a linear classifier. Building on this result, we then specialize the notion of relative flatness to convolutional layers and obtain a parameterization-aware flatness measure that properly accounts for the scaling symmetries and filter interactions induced by convolution and pooling. Finally, we empirically investigate the proposed measure on families of CNNs trained on standard image-classification benchmarks. The results obtained suggest that the proposed measure can serve as a robust tool to assess and compare the generalization performance of CNN models, and to guide the design of architecture and training choices in practice.
Updated: 2026-03-09 23:17:49
Categories: cs.LG,cs.CV,cs.NE
The Coupling Within: Flow Matching via Distilled Normalizing Flows
Flow models have rapidly become the go-to method for training and deploying large-scale generators, owing their success to inference-time flexibility via adjustable integration steps. A crucial ingredient in flow training is the choice of coupling measure for sampling noise/data pairs that define the flow matching (FM) regression loss. While FM training usually defaults to independent coupling, recent works show that adaptive couplings informed by noise/data distributions (e.g., via optimal transport, OT) improve both model training and inference. We radicalize this insight by shifting the paradigm: rather than computing adaptive couplings directly, we use distilled couplings from a different, pretrained model capable of placing noise and data spaces in bijection -- a property intrinsic to normalizing flows (NF) through their maximum likelihood and invertibility requirements. Leveraging recent advances in NF image generation via auto-regressive (AR) blocks, we propose Normalized Flow Matching (NFM), a new method that distills the quasi-deterministic coupling of pretrained NF models to train student flow models. These students achieve the best of both worlds: significantly outperforming flow models trained with independent or even OT couplings, while also improving on the teacher AR-NF model.
Updated: 2026-03-09 23:07:36
Categories: cs.LG,cs.CV
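To make the role of the coupling in the abstract above concrete, here is one common form of the flow matching regression loss, written for the straight-line interpolant between a noise sample $x_0$ and a data sample $x_1$. The notation ($\pi$ for the coupling, $v_\theta$ for the learned velocity field) is an assumption chosen for illustration; the paper may use a different interpolant or parameterization.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; (x_0, x_1) \sim \pi}
    \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2
```

Under independent coupling, $\pi$ is the product of the noise and data distributions. An OT coupling instead pairs samples to reduce transport cost, and a coupling distilled from an invertible model would, under this reading, concentrate on pairs related by that model's noise-to-data map.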
Improving through Interaction: Searching Behavioral Representation Spaces with CMA-ES-IG
Robots that interact with humans must adapt to individual users' preferences to operate effectively in human-centered environments. An intuitive and effective technique to learn non-expert users' preferences is through rankings of robot behaviors, e.g., trajectories, gestures, or voices. Existing techniques primarily focus on generating queries that optimize preference learning outcomes, such as sample efficiency or final preference estimation accuracy. However, the focus on outcome overlooks key user expectations in the process of providing these rankings, which can negatively impact users' adoption of robotic systems. This work proposes the Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG) algorithm. CMA-ES-IG explicitly incorporates user experience considerations into the preference learning process by suggesting perceptually distinct and informative trajectories for users to rank. We demonstrate these benefits through both simulated studies and real-robot experiments. CMA-ES-IG, compared to state-of-the-art alternatives, (1) scales more effectively to higher-dimensional preference spaces, (2) maintains computational tractability for high-dimensional problems, (3) is robust to noisy or inconsistent user feedback, and (4) is preferred by non-expert users in identifying their preferred robot behaviors. This project's code is available at github.com/interaction-lab/CMA-ES-IG
Updated: 2026-03-09 23:00:42
Categories: cs.RO,cs.AI,cs.HC
Lightening the Load: A Cluster-Based Framework for A Lower-Overhead, Provable Website Fingerprinting Defense
Website fingerprinting (WF) attacks remain a significant threat to encrypted traffic, prompting the development of a wide range of defenses. Among these, two prominent classes are regularization-based defenses, which shape traffic using fixed padding rules, and supersequence-based approaches, which conceal traces among predefined patterns. In this work, we present a unified framework for designing an adaptive WF defense that combines the effectiveness of regularization with the provable security of supersequence-style grouping. The scheme first extracts behavioural patterns from traces and clusters them into (k,l)-diverse anonymity sets; an early-time-series classifier (adapted from ECDIRE) then switches from a conservative global set of regularization parameters to the lighter, set-specific parameters. We instantiate the design as Adaptive Tamaraw, a variant of Tamaraw that assigns padding parameters on a per-cluster basis while retaining its original information-theoretic guarantee. Comprehensive experiments on public real-world datasets confirm the benefits. By tuning k, operators can trade privacy for efficiency: in its high-privacy mode Adaptive Tamaraw pushes the bound on any attacker's accuracy below 30%, whereas in efficiency-centred settings it cuts total overhead by 99% compared with classic Tamaraw.
Updated: 2026-03-09 22:59:47
Categories: cs.CR
Statistical Inference via Generative Models: Flow Matching and Causal Inference
Generative AI has achieved remarkable empirical success, but from the perspective of statistics it often remains opaque: its predictions may be accurate, yet the underlying mechanism is difficult to interpret, analyze, and trust. This book reinterprets generative AI in the language of statistics, using flow matching as a central example. The key idea is that generative models should be understood not merely as devices for producing plausible data, but as methods for the nonparametric learning of high-dimensional probability distributions. From this viewpoint, missing-data imputation becomes principled sampling from learned conditional distributions, counterfactual analysis becomes the estimation of intervention distributions, and distributional dynamics become statistically analyzable objects. Mathematically, flow matching represents distributional deformation through the continuity equation and a time-dependent velocity field, thereby extending score matching from the learning of static score fields to the learning of transport paths themselves. Building on this foundation, the book develops a statistical framework in which generative models are used to estimate nuisance components while inferential validity is maintained through orthogonalization and cross-fitting in the spirit of double/debiased machine learning. Applications to survival analysis, censoring, missingness, and causal inference show how generative models can be integrated into statistical inference for structured high-dimensional problems.
Updated: 2026-03-09 22:56:02
Categories: stat.ML,cs.LG
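The continuity equation referred to in the abstract above ties a time-dependent density $p_t$ to the velocity field $v_t$ that transports samples. In standard notation (the symbols are chosen here for illustration, not taken from the book):

```latex
\partial_t p_t(x) + \nabla_x \cdot \bigl( p_t(x)\, v_t(x) \bigr) = 0,
\qquad
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_t(x_t)
```

Score matching, by contrast, targets the static field $s(x) = \nabla_x \log p(x)$ of a single distribution. This is the sense in which learning $v_t$ amounts to learning the transport path itself rather than one score field.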
Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis
Audio deepfake detection aims to distinguish real human voices from those generated by Artificial Intelligence (AI) and has emerged as a significant problem in the field of voice biometrics systems. With the ever-improving quality of synthetic voice, the probability of such a voice being exploited for illicit practices like identity theft and impersonation increases. Although significant progress has been made in the field of audio deepfake detection in recent times, the issue of gender bias remains underexplored and in its nascent stage. In this paper, we attempt a thorough analysis of gender-dependent performance and fairness in audio deepfake detection models. We use the ASVspoof 5 dataset to train a ResNet-18 classifier, evaluate detection performance across four different audio features, and compare the performance with a baseline AASIST model. Beyond conventional metrics such as Equal Error Rate (EER %), we incorporate five established fairness metrics to quantify gender disparities in the model. Our results show that even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures. These findings demonstrate that reliance on standard metrics is unreliable, whereas fairness metrics provide critical insights into demographic-specific failure modes. This work highlights the importance of fairness-aware evaluation for developing a more equitable, robust, and trustworthy audio deepfake detection system.
Updated: 2026-03-09 22:52:12
Categories: cs.SD,cs.AI
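The claim in the abstract above, that similar per-gender EERs can hide different error distributions, can be illustrated with a small sketch. Everything here is an assumption for illustration: the toy scores, the group names, and the choice of false positive rate (FPR) gap as the fairness measure. The abstract does not say which five fairness metrics were used.

```python
def rates(scores, labels, threshold):
    """Return (FPR, FNR) when score >= threshold is flagged as spoof (label 1)."""
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return fp / negatives, fn / positives


def eer(scores, labels):
    """Approximate EER: try each observed score as a threshold and keep the
    one where FPR and FNR are closest, reporting their average."""
    best_gap, best_eer = None, None
    for t in sorted(set(scores)):
        fpr, fnr = rates(scores, labels, t)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2
    return best_eer


def fpr_gap(group_a, group_b, threshold):
    """Absolute FPR difference between two (scores, labels) groups at one shared threshold."""
    fpr_a, _ = rates(*group_a, threshold)
    fpr_b, _ = rates(*group_b, threshold)
    return abs(fpr_a - fpr_b)


# Hypothetical toy scores (higher = more likely spoof); label 1 = spoof, 0 = bona fide.
group_a = ([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
group_b = ([0.3, 0.8, 0.7, 0.95], [0, 0, 1, 1])
print(eer(*group_a), eer(*group_b), fpr_gap(group_a, group_b, threshold=0.65))
```

In this toy data both groups have the same EER, yet at the shared threshold of 0.65 group A's errors are all misses while group B's are all false alarms, which an aggregate EER comparison would not show.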
Security Considerations for Multi-agent Systems
Multi-agent artificial intelligence systems or MAS are systems of autonomous agents that exercise delegated tool authority, share persistent memory, and coordinate via inter-agent communication. MAS introduces qualitatively distinct security vulnerabilities from those documented for singular AI models. Existing security and governance frameworks were not designed for these emerging attack surfaces. This study systematically characterizes the threat landscape of MAS and quantitatively evaluates 16 security frameworks for AI against it. A four-phase methodology is proposed: constructing a deep technical knowledge base of production multi-agent architectures; conducting generative AI-assisted threat modeling scoped to MAS cybersecurity risks and validated by domain experts; structuring survey plans at individual-threat granularity; and scoring each framework on a three-point scale against the cybersecurity risks. The risks were organized into 193 distinct main threat items across nine risk categories. The expected minimal average score is 2. No reviewed framework achieves majority coverage of any single category. Non-Determinism (mean score 1.231 across all 16 frameworks) and Data Leakage (1.340) are the most under-addressed domains. The OWASP Agentic Security Initiative leads overall at 65.3\% coverage and in the design phase; the CDAO Generative AI Responsible AI Toolkit leads in development and operational coverage. These results provide the first empirical cross-framework comparison for MAS security and offer evidence-based guidance for framework selection.
Updated: 2026-03-09 22:46:27
Categories: cs.CR,cs.AI
SPDIM: Source-Free Unsupervised Conditional and Label Shift Adaptation in EEG
The non-stationary nature of electroencephalography (EEG) introduces distribution shifts across domains (e.g., days and subjects), posing a significant challenge to EEG-based neurotechnology generalization. Without labeled calibration data for target domains, the problem is a source-free unsupervised domain adaptation (SFUDA) problem. For scenarios with constant label distribution, Riemannian geometry-aware statistical alignment frameworks on the symmetric positive definite (SPD) manifold are considered state-of-the-art. However, many practical scenarios, including EEG-based sleep staging, exhibit label shifts. Here, we propose a geometric deep learning framework for SFUDA problems under specific distribution shifts, including label shifts. We introduce a novel, realistic generative model and show that prior Riemannian statistical alignment methods on the SPD manifold can compensate for specific marginal and conditional distribution shifts but hurt generalization under label shifts. As a remedy, we propose a parameter-efficient manifold optimization strategy termed SPDIM. SPDIM uses the information maximization principle to learn a single SPD-manifold-constrained parameter per target domain. In simulations, we demonstrate that SPDIM can compensate for the shifts under our generative model. Moreover, using public EEG-based brain-computer interface and sleep staging datasets, we show that SPDIM outperforms prior approaches.
Updated: 2026-03-09 22:45:22
Categories: eess.SP,cs.LG
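The information maximization principle mentioned in the SPDIM abstract above can be sketched as an objective on predicted class probabilities. This is the formulation commonly used in source-free adaptation (low per-sample entropy, high entropy of the average prediction) and is assumed here; the abstract does not give SPDIM's exact objective, and the manifold-constrained optimization is omitted.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)


def information_maximization_loss(probs):
    """probs: list of per-sample class-probability lists (each sums to 1).

    Loss = mean per-sample entropy - entropy of the mean prediction.
    It is lowest when predictions are confident AND spread over the classes.
    """
    n = len(probs)
    k = len(probs[0])
    conditional = sum(entropy(p) for p in probs) / n
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    return conditional - entropy(marginal)


confident_balanced = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]      # confident, but one class only
uncertain = [[0.5, 0.5], [0.5, 0.5]]
for name, p in [("balanced", confident_balanced), ("collapsed", collapsed), ("uncertain", uncertain)]:
    print(name, information_maximization_loss(p))
```

The confident and balanced case scores lowest, while both the collapsed and the uncertain cases score zero, which is why the second term is needed to discourage predicting a single class.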
Arbiter: Detecting Interference in LLM Agent System Prompts
System prompts for LLM-based coding agents are software artifacts that govern agent behavior, yet lack the testing infrastructure applied to conventional software. We present Arbiter, a framework combining formal evaluation rules with multi-model LLM scouring to detect interference patterns in system prompts. Applying it to three major coding agent system prompts, Claude Code (Anthropic), Codex CLI (OpenAI), and Gemini CLI (Google), we identify 152 findings across the undirected scouring phase and 21 hand-labeled interference patterns in directed analysis of one vendor. We show that prompt architecture (monolithic, flat, modular) strongly correlates with observed failure class but not with severity, and that multi-model evaluation discovers categorically different vulnerability classes than single-model analysis. One scourer finding, structural data loss in Gemini CLI's memory system, was consistent with an issue filed and patched by Google; the patch addressed the symptom without addressing the schema-level root cause identified by the scourer. Total cost of cross-vendor analysis: \$0.27 USD.
Updated: 2026-03-09 22:29:47
Categories: cs.SE,cs.AI,cs.CR,cs.PL
When Machine Learning Gets Personal: Evaluating Prediction and Explanation
In high-stakes domains like healthcare, users often expect that sharing personal information with machine learning systems will yield tangible benefits, such as more accurate diagnoses and clearer explanations of contributing factors. However, the validity of this assumption remains largely unexplored. We propose a unified framework to quantify how personalizing a model influences both prediction and explanation. We show that its impacts on prediction and explanation can diverge: a model may become more or less explainable even when prediction is unchanged. For practical settings, we study a standard hypothesis test for detecting personalization effects on demographic groups. We derive a finite-sample lower bound on its probability of error as a function of group sizes, number of personal attributes, and desired benefit from personalization. This provides actionable insights, such as which dataset characteristics are necessary to test an effect, or the maximum effect that can be tested given a dataset. We apply our framework to real-world tabular datasets using feature-attribution methods, uncovering scenarios where effects are fundamentally untestable due to the dataset statistics. Our results highlight the need for joint evaluation of prediction and explanation in personalized models and the importance of designing models and datasets with sufficient information for such evaluation.
Updated: 2026-03-09 22:23:29
Categories: cs.LG
MAPLE: Elevating Medical Reasoning from Statistical Consensus to Process-Led Alignment
Recent advances in medical large language models have explored Test-Time Reinforcement Learning (TTRL) to enhance reasoning. However, standard TTRL often relies on majority voting (MV) as a heuristic supervision signal, which can be unreliable in complex medical scenarios where the most frequent reasoning path is not necessarily the clinically correct one. In this work, we propose a novel and unified training paradigm that integrates medical process reward models with TTRL to bridge the gap between test-time scaling (TTS) and parametric model optimization. Specifically, we advance the TTRL framework by replacing the conventional MV with a fine-grained, expert-aligned supervision paradigm using Med-RPM. This integration ensures that reinforcement learning is guided by medical correctness rather than mere consensus, effectively distilling search-based intelligence into the model's parametric memory. Extensive evaluations on four different benchmarks demonstrate that our method consistently and significantly outperforms current TTRL and standalone PRM selection. Our findings establish that transitioning from stochastic heuristics to structured, step-wise rewards is essential for developing reliable and scalable medical AI systems.
Updated: 2026-03-09 22:22:57
Categories: cs.LG
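The contrast drawn in the MAPLE abstract above, between majority voting and selection guided by step-wise process rewards, can be sketched as follows. The candidate answers, the step scores, and the choice to aggregate a reasoning path by its weakest step are all illustrative assumptions; the abstract does not describe how the reward model's step scores are combined.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent final answer, ignoring how it was reached."""
    return Counter(answers).most_common(1)[0][0]


def process_reward_select(candidates):
    """candidates: list of (answer, step_scores) pairs.

    Score each reasoning path by its weakest step (an assumed aggregation)
    and return the answer of the best-scoring path.
    """
    best_answer, _ = max(candidates, key=lambda c: min(c[1]))
    return best_answer


# Hypothetical sampled reasoning paths with per-step scores from a reward model.
candidates = [
    ("diagnosis A", [0.9, 0.3]),   # frequent answer, but one weak step
    ("diagnosis A", [0.8, 0.2]),
    ("diagnosis B", [0.9, 0.85]),  # rare answer, every step well supported
]
print(majority_vote([answer for answer, _ in candidates]))
print(process_reward_select(candidates))
```

Here the two selection rules disagree: voting returns the frequent answer, while the step-wise scores favor the path whose every step is well supported.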
Data-driven robust Markov decision processes on Borel spaces: performance guarantees via an axiomatic approach
We consider Markov decision processes (MDPs) with unknown disturbance distribution and address this problem using the robust Markov decision process (RMDP) approach. We construct the empirical distribution of the unknown disturbance distribution and characterize our ambiguity set of distributions as the sublevel set of a nonnegative distance function from the empirical distribution. By connecting the weak convergence of distributions to convergence with respect to the distance function, we prove that the robust optimal value function and the out-of-sample value function converge to the true optimal value function with increasing sample-sizes. We establish that, for finite sample-sizes, the robust optimal value function serves as a high probability upper bound on the out-of-sample value function. We also obtain probabilistic convergence rates, sample complexity bounds, and out-of-distribution performance bounds. The finite sample performance guarantees rely on the distance function satisfying a certain concentration type inequality. Several well-studied distances in the literature meet the requirements imposed on the distance function. We also analyze the data-driven properties of empirical MDPs and demonstrate that, unlike our data-driven RMDPs, empirical MDPs fail to satisfy some of the finite sample performance guarantees.
Updated: 2026-03-09 22:13:38
Categories: math.OC,cs.LG,stat.ML
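The construction described in the robust MDP abstract above can be written compactly. The notation is assumed here for illustration and is not taken from the paper: $w_1, \dots, w_N$ are the observed disturbance samples, $\delta_w$ is the point mass at $w$, $d$ is the nonnegative distance function, and $\varepsilon \ge 0$ is the radius of the sublevel set.

```latex
\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{w_i},
\qquad
\mathcal{P}_N(\varepsilon) = \bigl\{\, \mu : d(\mu, \hat{\mu}_N) \le \varepsilon \,\bigr\}
```

The robust problem then optimizes against the worst-case distribution in $\mathcal{P}_N(\varepsilon)$ rather than against $\hat{\mu}_N$ alone, which is what distinguishes the data-driven RMDP from the empirical MDP discussed in the abstract.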
Building Privacy-and-Security-Focused Federated Learning Infrastructure for Global Multi-Centre Healthcare Research
Collaborative healthcare research across multiple institutions increasingly requires diverse clinical datasets, but cross-border data sharing is strictly constrained by privacy regulations. Federated learning (FL) enables model training while keeping data local; however, many existing frameworks remain proof-of-concept and do not adequately address governance risks such as unauthorised participation, misuse, and lack of accountability. In particular, enforceable mechanisms for authentication, authorisation, and accounting (AAA) are often missing, limiting real-world clinical deployment. This paper presents FLA$^3$ (Federated Learning with Authentication, Authorisation, and Accounting), a governance-aware federated learning platform that operationalises regulatory obligations through runtime policy enforcement. FLA$^3$ integrates eXtensible Access Control Markup Language (XACML) compliant attribute-based access control (ABAC), cryptographic accounting, and study-scoped federation directly into the federated learning orchestration layer to enforce institutional sovereignty and protocol adherence. We evaluate FLA$^3$ through two complementary studies. First, we demonstrate operational feasibility by deploying the platform infrastructure across five BloodCounts! Consortium institutions in four countries: United Kingdom, Netherlands, India, and The Gambia. Second, we assess clinical utility using simulated federation of full blood count (FBC) data from 54,446 samples from 35,315 subjects across 25 centres in the INTERVAL study. Results show that FLA$^3$ achieves predictive performance comparable to centralised training while strictly enforcing governance constraints. These results show that enforceable governance can function as a first-class privacy-preserving control, improving trustworthiness for scalable artificial intelligence (AI) in cross-jurisdictional healthcare deployments.
Updated: 2026-03-09 22:13:00
Categories: cs.CR,cs.SE
MAcPNN: Mutual Assisted Learning on Data Streams with Temporal Dependence
Internet of Things (IoT) Analytics often involves applying machine learning (ML) models on data streams. In such scenarios, traditional ML paradigms face obstacles related to continuous learning: handling concept drift and temporal dependence while avoiding forgetting. Moreover, in IoT, different edge devices build up a network. When learning models on those devices, connecting them could be useful in improving performance and reusing others' knowledge. This work proposes Mutual Assisted Learning, a learning paradigm grounded in Vygotsky's popular Sociocultural Theory of Cognitive Development. Each device is autonomous and does not need a central orchestrator. Whenever a device's performance degrades due to concept drift, it asks for assistance from others and decides whether their knowledge is useful for solving the new problem. This way, the number of connections is drastically reduced compared to classical Federated Learning approaches, where the devices communicate at each training round. Every device is equipped with a Continuous Progressive Neural Network (cPNN) to handle the dynamic nature of data streams. We call this implementation Mutual Assisted cPNN (MAcPNN). To implement it, we allow cPNNs for single data point predictions and apply quantization to reduce the memory footprint. Experimental results demonstrate the effectiveness of MAcPNN in boosting performance on synthetic and real data streams.
Updated: 2026-03-09 22:03:37
Categories: cs.LG
WebAccessVL: Violation-Aware VLM for Web Accessibility
We present a vision-language model (VLM) that automatically edits website HTML to address violations of the Web Content Accessibility Guidelines 2 (WCAG2) while preserving the original design. We formulate this as a supervised image-conditioned program synthesis task, where the model learns to correct HTML given both the code and its visual rendering. We create WebAccessVL, a website dataset with manually corrected accessibility violations. We then propose a violation-conditioned VLM that further takes the detected violations' descriptions from a checker as input. This conditioning enables an iterative checker-in-the-loop refinement strategy at test time. We conduct extensive evaluation on both open API and open-weight models. Empirically, our method achieves 0.211 violations per website, a 96.0\% reduction from the 5.34 violations in raw data and 87\% better than GPT-5. A perceptual study also confirms that our edited websites better maintain the original visual appearance and content.
Updated: 2026-03-09 21:59:43
Categories: cs.HC,cs.AI,cs.CV
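Two points in the WebAccessVL abstract above lend themselves to a short sketch: the reported reduction from 5.34 to 0.211 violations per website, and the checker-in-the-loop refinement at test time. The `toy_check` and `toy_edit` functions are hypothetical stand-ins for an accessibility checker and the violation-conditioned model, and the round budget is an assumed parameter.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


def refine(html, check, edit, max_rounds=3):
    """Checker-in-the-loop: feed the checker's violation descriptions back
    to the editor until none remain or the round budget is spent."""
    for _ in range(max_rounds):
        violations = check(html)
        if not violations:
            break
        html = edit(html, violations)
    return html


# Hypothetical stand-ins: a real setup would call an accessibility checker and the VLM.
def toy_check(html):
    return ["image is missing alt text"] if "<img " in html and "alt=" not in html else []


def toy_edit(html, violations):
    return html.replace("<img ", '<img alt="image" ') if violations else html


print(round(100 * relative_reduction(5.34, 0.211), 1))
print(refine('<img src="a.png">', toy_check, toy_edit))
```

The first printed value reproduces the 96.0% reduction stated in the abstract; the second shows the loop stopping once the toy checker reports no remaining violations.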
Semantic Level of Detail: Multi-Scale Knowledge Representation via Heat Kernel Diffusion on Hyperbolic Manifolds
AI memory systems increasingly organize knowledge into graph structures -- knowledge graphs, entity relations, community hierarchies -- yet lack a principled mechanism for continuous resolution control: where do the qualitative boundaries between abstraction levels lie, and how should an agent navigate them? We introduce Semantic Level of Detail (SLoD), a framework that answers both questions by defining a continuous zoom operator via heat kernel diffusion on the Poincaré ball $\mathbb{B}^d$. At coarse scales ($σ\to \infty$), diffusion aggregates embeddings into high-level summaries; at fine scales ($σ\to 0$), local semantic detail is preserved. We prove hierarchical coherence with bounded approximation error $O(σ)$ and $(1+\varepsilon)$ distortion for tree-structured hierarchies under Sarkar embedding. Crucially, we show that spectral gaps in the graph Laplacian induce emergent scale boundaries -- scales where the representation undergoes qualitative transitions -- which can be detected automatically without manual resolution parameters. On synthetic hierarchies (HSBM), our boundary scanner recovers planted levels with ARI up to 1.00, with detection degrading gracefully near the information-theoretic Kesten-Stigum threshold. On the full WordNet noun hierarchy (82K synsets), detected boundaries align with true taxonomic depth ($τ= 0.79$), demonstrating that the method discovers meaningful abstraction levels in real-world knowledge graphs without supervision.
Updated: 2026-03-09 21:54:08
Categories: cs.LG,cs.AI
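The zoom behavior described in the SLoD abstract above (fine detail at small scales, aggregated summaries at large scales) can be illustrated with plain heat diffusion on a small graph. This is a deliberate simplification: it diffuses a scalar signal on an unweighted graph in Euclidean space rather than embeddings on the Poincaré ball, it approximates the heat kernel with explicit Euler steps, and the graph and the scale name `t` are assumptions for illustration.

```python
def laplacian_apply(adj, x):
    """Compute L x for the combinatorial Laplacian L = D - A of an unweighted graph."""
    return [len(adj[i]) * x[i] - sum(x[j] for j in adj[i]) for i in range(len(x))]


def diffuse(adj, x, t, dt=0.01):
    """Approximate exp(-t L) x by explicit Euler steps of size dt."""
    for _ in range(int(round(t / dt))):
        lx = laplacian_apply(adj, x)
        x = [xi - dt * li for xi, li in zip(x, lx)]
    return x


# Two triangles {0,1,2} and {3,4,5} joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
signal = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(diffuse(adj, signal, t=0.1))   # fine scale: mass stays near node 0
print(diffuse(adj, signal, t=2.0))   # intermediate: spreads within the first triangle
print(diffuse(adj, signal, t=50.0))  # coarse scale: close to the global average
```

The step size must stay small relative to the largest Laplacian eigenvalue for the Euler scheme to remain stable; with a maximum degree of 3, `dt=0.01` is comfortably within that range.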
The FABRIC Strategy for Verifying Neural Feedback Systems
Forward reachability analysis is a dominant approach for verifying reach-avoid specifications in neural feedback systems, i.e., dynamical systems controlled by neural networks, and a number of directions have been proposed and studied. In contrast, far less attention has been given to backward reachability analysis for these systems, in part because of the limited scalability of known techniques. In this work, we begin to address this gap by introducing new algorithms for computing both over- and underapproximations of backward reachable sets for nonlinear neural feedback systems. We also describe and implement an integration of these backward reachability techniques with existing ones for forward analysis. We call the resulting algorithm Forward and Backward Reachability Integration for Certification (FaBRIC). We evaluate our algorithms on a representative set of benchmarks and show that they significantly outperform the prior state of the art.
Updated: 2026-03-09 21:54:07
Categories: cs.AI,eess.SY
The $qs$ Inequality: Quantifying the Double Penalty of Mixture-of-Experts at Inference
Mixture-of-Experts (MoE) models deliver high quality at low training FLOPs, but this efficiency often vanishes at inference. We identify a double penalty that structurally disadvantages MoE architectures during decoding: first, expert routing fragments microbatches and reduces weight reuse; second, massive resident expert pools reduce high-bandwidth memory (HBM) headroom for the KV cache. This phenomenon, formalized as reuse fragmentation, pushes feed-forward networks (FFNs) into a bandwidth-bound regime, especially at long context lengths. We introduce the $qs$ inequality, a predictive criterion that identifies when MoE is structurally disadvantaged relative to a quality-matched dense model. This criterion unifies sparsity ($s$), the fraction of parameters activated per token, and the quality-equivalence factor ($q$), the size multiplier required for a dense model to match MoE performance. Our evaluation across frontier models including DeepSeek-V3, Qwen3-235B, Grok-1, and Switch-C demonstrates that this fragmentation is a general architectural phenomenon. For DeepSeek-V3 at 128k context, this results in a 4.5x throughput advantage for a quality-matched dense baseline. Crucially, massive architectures like Switch-C can become infeasible on cluster sizes where a quality-matched dense model remains viable. Our results suggest that training-time FLOP efficiency is an incomplete proxy for inference-time performance in long-context serving. They also indicate that MoE may be best viewed as a training-time optimization, with distillation into dense models as a possible path toward inference-efficient deployment.
Updated: 2026-03-09 21:48:04
Categories: cs.LG,cs.AR,cs.DC,cs.PF
Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation
A \emph{tensor-relational} computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case \texttt{EinSum}, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case \texttt{EinSum} so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.
Updated: 2026-03-09 21:43:39
Categories: cs.MS,cs.AI,cs.DB
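The tensor-relational idea in the abstract above, where tuples carry dense blocks and sparsity is handled relationally, can be sketched for the Einstein-notation product $C_{ij} = \sum_k A_{ik} B_{kj}$. The representation (dictionaries keyed by block index) and the function names are assumptions for illustration; the abstract does not specify the syntax of the upper-case-lower-case EinSum notation or the rewriting algorithm.

```python
from collections import defaultdict


def matmul(a, b):
    """Dense kernel: multiply two small matrices given as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
            for r in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally shaped blocks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def block_matmul(A, B):
    """A: {(i, k): block}, B: {(k, j): block}; absent keys are all-zero blocks.

    Relational part: join A and B on k, then group by (i, j) and sum.
    Numerical part: each joined pair is multiplied by the dense kernel.
    """
    by_k = defaultdict(list)
    for (k, j), blk in B.items():
        by_k[k].append((j, blk))
    out = {}
    for (i, k), a_blk in A.items():
        for j, b_blk in by_k.get(k, []):
            prod = matmul(a_blk, b_blk)
            out[(i, j)] = add(out[(i, j)], prod) if (i, j) in out else prod
    return out


A = {(0, 0): [[1, 2], [3, 4]], (1, 1): [[1, 0], [0, 1]]}   # block-diagonal, off-diagonal blocks absent
B = {(0, 0): [[1, 0], [0, 1]], (1, 0): [[2, 0], [0, 2]]}   # only block column 0 is stored
print(block_matmul(A, B))
```

Zero blocks never enter the join, so no work is spent on them, while every block that does match is handled by the dense kernel.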
A Survey of Reinforcement Learning For Economics
This survey (re)introduces reinforcement learning methods to economists. The curse of dimensionality limits how far exact dynamic programming can be effectively applied, forcing us to rely on suitably "small" problems or our ability to convert "big" problems into smaller ones. While this reduction has been sufficient for many classical applications, a growing class of economic models resists such reduction. Reinforcement learning algorithms offer a natural, sample-based extension of dynamic programming, extending tractability to problems with high-dimensional states, continuous actions, and strategic interactions. I review the theory connecting classical planning to modern learning algorithms and demonstrate their mechanics through simulated examples in pricing, inventory control, strategic games, and preference elicitation. I also examine the practical vulnerabilities of these algorithms, noting their brittleness, sample inefficiency, sensitivity to hyperparameters, and the absence of global convergence guarantees outside of tabular settings. The successes of reinforcement learning remain strictly bounded by these constraints, as well as a reliance on accurate simulators. When guided by economic structure, reinforcement learning provides a remarkably flexible framework. It stands as an imperfect, but promising, addition to the computational economist's toolkit. A companion survey (Rust and Rawat, 2026b) covers the inverse problem of inferring preferences from observed behavior.
Updated: 2026-03-09 21:43:10
Categories: econ.GN,cs.LG
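The survey abstract above describes reinforcement learning as a sample-based extension of dynamic programming. A minimal sketch of that relationship, on a made-up two-state inventory problem, compares value iteration (which uses the full model) with tabular Q-learning (which only sees sampled transitions). The rewards, the discount factor, the deterministic demand, and the exploration rule are all assumptions chosen to keep the example small; none of them come from the survey.

```python
import random

GAMMA = 0.9
# Hypothetical deterministic inventory toy: state 0 = empty shelf, 1 = stocked.
# MODEL[state][action] = (reward, next_state)
MODEL = {
    0: {0: (0.0, 0), 1: (-1.0, 1)},   # wait / order one unit at cost 1
    1: {0: (3.0, 0), 1: (-0.5, 1)},   # sell for 3 / hold at cost 0.5
}


def value_iteration(tol=1e-12):
    """Exact dynamic programming: iterate the Bellman operator to a fixed point."""
    v = {s: 0.0 for s in MODEL}
    while True:
        new_v = {s: max(r + GAMMA * v[s2] for r, s2 in MODEL[s].values()) for s in MODEL}
        if max(abs(new_v[s] - v[s]) for s in MODEL) < tol:
            return new_v
        v = new_v


def q_learning(steps=20000, alpha=0.5, seed=0):
    """Sample-based counterpart: update Q from single observed transitions only."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in MODEL[s]} for s in MODEL}
    state = 0
    for _ in range(steps):
        action = rng.choice([0, 1])            # uniform random exploration
        reward, nxt = MODEL[state][action]     # one "environment" step
        target = reward + GAMMA * max(q[nxt].values())
        q[state][action] += alpha * (target - q[state][action])
        state = nxt
    return q


v = value_iteration()
q = q_learning()
for s in MODEL:
    print(s, v[s], max(q[s].values()))
```

In a tabular, deterministic problem like this one the two agree closely; the vulnerabilities the abstract lists (sample inefficiency, hyperparameter sensitivity, missing convergence guarantees) arise once function approximation replaces the table.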
A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations
The first 72 hours of a missing-person investigation are critical for successful recovery. Guardian is an end-to-end system designed to support missing-child investigation and early search planning. This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction and processing related to missing-person search operations. The pipeline coordinates end-to-end execution across task-specialized LLMs and invokes a consensus LLM engine that compares multiple model outputs and resolves disagreements. The pipeline is further strengthened by QLoRA-based fine-tuning, using curated datasets. The presented design aligns with prior work on weak supervision and LLM-assisted annotation, emphasizing conservative, auditable use of LLMs as structured extractors and labelers rather than unconstrained end-to-end decision makers.
Updated: 2026-03-09 21:40:17
Categories: cs.AI,cs.CL,cs.DC,cs.IR,cs.LG
Towards Reliable Simulation-based Inference
Scientific knowledge expands by observing the world, hypothesizing some theories about it, and testing them against collected data. When those theories take the form of statistical models, statistical analyses are involved in the process of testing and refining scientific hypotheses. In this thesis, we focus on statistical models that take the form of scientific simulators and provide background about how machine learning can be used for statistical analyses in this context. The first part of this thesis is about showing empirically that performing statistical analyses with machine learning involves a degree of approximation. Specifically, all statistical analyses involve a level of uncertainty in the conclusions drawn, and we show that approximations can lead to overconfident conclusions. We draw caution regarding such overconfident conclusions and introduce a criterion to diagnose overconfident approximations. In the second part, we introduce balancing, a way to regularize machine learning models to reduce overconfidence and favor calibrated or underconfident approximations. Balancing is first introduced for neural ratio estimation algorithms and then extended to other algorithms. Intuition about why balancing leads to less overconfident solutions is provided, and it is shown empirically that balanced algorithms are often either close to calibrated or underconfident. The third part shows that Bayesian neural networks can also be used to mitigate the overconfidence of approximations. Unlike balancing, no regularization is required, and this solution can then work with few training samples and, hence, computationally expensive simulators. To that end, a new Bayesian neural network prior tailored for simulation-based inference is developed, and empirical results show a reduction in overconfidence compared to similar solutions without Bayesian neural networks.
Updated: 2026-03-09 21:29:13
Categories: stat.ML,cs.LG
Kernel Debiased Plug-in Estimation based on the Universal Least Favorable Submodel
We propose ULFS-KDPE, a kernel debiased plug-in estimator based on the universal least favorable submodel, for estimating pathwise differentiable parameters in nonparametric models. The method constructs a data-adaptive debiasing flow in a reproducing kernel Hilbert space (RKHS), producing a plug-in estimator that achieves semiparametric efficiency without requiring explicit derivation or evaluation of efficient influence functions. We place ULFS-KDPE on a rigorous functional-analytic foundation by formulating the universal least favorable update as a nonlinear ordinary differential equation on probability densities. We establish existence, uniqueness, stability, and finite-time convergence of the empirical score along the induced flow. Under standard regularity conditions, the resulting estimator is regular, asymptotically linear, and attains the semiparametric efficiency bound simultaneously for a broad class of pathwise differentiable parameters. The method admits a computationally tractable implementation based on finite-dimensional kernel representations and principled stopping criteria. In finite samples, the combination of solving a rich collection of score equations with RKHS-based smoothing and avoidance of direct influence-function evaluation leads to improved numerical stability. Simulation studies illustrate the method and support the theoretical results.
Updated: 2026-03-09 21:27:26
标题: 基于通用最不利子模型的核修正插值估计
摘要: 我们提出了基于通用最不利子模型的核去偏插值估计器ULFS-KDPE,用于估计非参数模型中的逐路径可微参数。该方法在再生核希尔伯特空间(RKHS)中构造了一个数据自适应去偏流,产生了一个插值估计器,实现了半参数效率,而无需显式推导或评估有效的影响函数。我们通过将通用最不利更新公式定义为概率密度上的非线性常微分方程,为ULFS-KDPE建立了严格的函数分析基础。我们建立了经验得分在诱导流中的存在性、唯一性、稳定性和有限时间收敛性。在标准正则条件下,由此得到的估计器是正则的、渐近线性的,并同时对一类广泛的逐路径可微参数达到半参数效率界限。该方法采用基于有限维核表示和原则性停止准则的计算可行实现。在有限样本中,通过解决丰富的得分方程集合,结合RKHS平滑和避免直接影响函数评估,提高了数值稳定性。模拟研究展示了该方法并支持理论结果。
更新时间: 2026-03-09 21:27:26
领域: math.ST,cs.LG,stat.ML
Kite: How to Delegate Voting Power Privately
Ensuring the privacy of votes in an election is crucial for the integrity of a democratic process. Often, voting power is delegated to representatives (e.g., in congress) who subsequently vote on behalf of voters on specific issues. This delegation model is also widely used in Decentralized Autonomous Organizations (DAOs). Although several existing voting systems used in DAOs support private voting, they only offer public delegation. In this paper, we introduce Kite, a new protocol that enables $\textit{private}$ delegation of voting power for DAO members. Voters can freely delegate, revoke, and re-delegate their power without revealing any information about who they delegated to. Even the delegate does not learn who delegated to them. The only information that is recorded publicly is that the voter delegated or re-delegated their vote to someone. Kite accommodates both public and private voting for the delegates themselves. We analyze the security of our protocol within the Universal Composability (UC) framework. We implement Kite as an extension to the existing Governor Bravo smart contract on the Ethereum blockchain, which is widely used for DAO governance. Furthermore, we provide an evaluation of our implementation that demonstrates the practicality of the protocol. The most expensive operation is delegation due to the required zero-knowledge proofs. On a consumer-grade laptop, delegation takes between 7 and 167 seconds depending on the requested level of privacy.
Updated: 2026-03-09 21:27:24
标题: 风筝:如何私密委托投票权
摘要: 确保选举中投票的隐私对于民主进程的诚信至关重要。通常,投票权被委托给代表(例如在国会中),代表随后代表选民就特定问题进行投票。这种委托模型在分散自治组织(DAOs)中也被广泛使用。尽管DAOs中使用的几种现有投票系统支持私人投票,但它们只提供公开委托。在本文中,我们介绍了一种新协议Kite,该协议允许DAO成员进行$\textit{私人}$投票权委托。选民可以自由地委托、撤销和重新委托他们的权力,而不泄露任何关于他们委托给谁的信息。即使受托人也不会知道是谁委托给了他们。唯一公开记录的信息是选民已将其投票权委托或重新委托给了某人,而不包括受托人的身份。Kite为受托人本身提供了公开和私人投票的两种选择。我们在通用可组合性(UC)框架内分析了我们协议的安全性。我们将Kite实现为现有以太坊区块链上广泛用于DAO治理的Governor Bravo智能合约的一个扩展。此外,我们提供了一个演示我们协议实用性的评估。由于需要的零知识证明,委托是最昂贵的操作。在一台消费级笔记本电脑上,委托所需的时间在7到167秒之间,具体取决于所请求的隐私级别。
更新时间: 2026-03-09 21:27:24
领域: cs.CR
BiCLIP: Domain Canonicalization via Structured Geometric Transformation
Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
Updated: 2026-03-09 21:26:15
标题: BiCLIP:通过结构化几何转换实现域规范化
摘要: 最近对视觉-语言模型(VLMs)的进展展示了显著的零样本能力,但将这些模型调整到专业领域仍然是一个重大挑战。基于最近的理论洞察,表明独立训练的VLMs通过一个规范变换相关联,我们将这一理解扩展到领域的概念。我们假设不同领域之间的图像特征通过一个经规范化的几何变换相关联,可以使用少量锚点恢复这个变换。少样本分类为这种对齐提供了一个自然的环境,因为有限的标记样本可以作为估计这个变换所需的锚点。受到这一假设的启发,我们引入了BiCLIP,这是一个框架,可以将有针对性的变换应用于多模态特征,以增强跨模态对齐。我们的方法以其极端简单性和低参数占用量而著称。在包括EuroSAT、DTD和FGVCAircraft在内的11个标准基准测试中进行了广泛评估,结果显示BiCLIP始终取得最先进的成果。此外,通过分析所学转换的正交性和角度分布,我们提供了现有几何发现的经验证明,结构化对齐是强大的领域适应的关键。代码可在https://github.com/QuantitativeImagingLaboratory/BilinearCLIP找到。
更新时间: 2026-03-09 21:26:15
领域: cs.CV,cs.AI,cs.CL,cs.LG
Convergence Rate for the Last Iterate of Stochastic Gradient Descent Schemes
We study the convergence rate of the last iterate of stochastic gradient descent (SGD) and stochastic heavy ball (SHB) in the parametric setting when the objective function $F$ is globally convex or non-convex with a $γ$-Hölder gradient. Using only the discrete Gronwall inequality, without the Robbins-Siegmund theorem, we recover results for both SGD and SHB: $\min_{s\leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$ for non-convex objectives and $F(w_{τ\wedge t}) - F_* = o(t^{2γ/(1+γ) \cdot \max(p-1,-2p+1)-ε})$ for $β\in (0, 1)$, $τ:= \inf \{ t > 0 : F(w_t) = F_*\}$, and $\min_{s \leq t} F(w_s) - F_* = o(t^{p-1})$ for convex objectives $F$ whose minimum is $F_*$. In addition, we prove that SHB with constant momentum parameter $β\in (0, 1)$ attains a convergence rate of $F(w_t) - F_* = O(t^{\max(p-1,-2p+1)} \log^2 \frac{t}{δ})$ with probability at least $1-δ$ when $F$ is convex, $γ= 1$, and the step size is $α_t = Θ(t^{-p})$ with $p \in (\frac{1}{2}, 1)$.
Updated: 2026-03-09 21:24:06
标题: 随机梯度下降方案最后迭代的收敛速率
摘要: 我们研究了在参数设置下,当目标函数$F$是全局凸或非凸的,其梯度为$γ$-Hölder时,随机梯度下降(SGD)和随机重球(SHB)的最后迭代收敛速率。仅使用离散Gronwall不等式而不使用Robbins-Siegmund定理,我们恢复了SGD和SHB的结果:对于非凸目标函数,$\min_{s \leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1})$,对于凸目标函数,其最小值为$F_*$,$β\in (0, 1)$时,$τ:= \inf \{ t > 0 : F(w_t) = F_*\}$,$\min_{s \leq t} F(w_s) - F_* = o(t^{p-1})$。此外,我们证明了具有恒定动量参数$β\in (0, 1)$的SHB在$F$为凸函数、$γ= 1$且步长$α_t = Θ(t^{-p})$(其中$p \in (\frac{1}{2}, 1)$)时,以至少$1-δ$的概率达到$F(w_t) - F_* = O(t^{\max(p-1,-2p+1)} \log^2 \frac{t}{δ})$的收敛速率。
更新时间: 2026-03-09 21:24:06
领域: math.OC,cs.LG
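The stochastic heavy ball recursion named in the abstract above can be illustrated with a small sketch. It assumes the common form $w_{t+1} = w_t - α_t g_t + β (w_t - w_{t-1})$ with step size $α_t = t^{-p}$, a toy quadratic objective, and Gaussian gradient noise; the objective, the noise model, and all function names are illustrative assumptions, not taken from the paper.

```python
import random


def stochastic_heavy_ball(grad, w0, steps, p=0.75, beta=0.5, noise=0.1, seed=0):
    """Run SHB with step size alpha_t = t**(-p) and constant momentum beta.

    Assumed update: w_{t+1} = w_t - alpha_t * g_t + beta * (w_t - w_{t-1}),
    where g_t is the true gradient plus zero-mean Gaussian noise.
    """
    rng = random.Random(seed)
    w_prev = list(w0)
    w = list(w0)
    for t in range(1, steps + 1):
        alpha = t ** (-p)  # p in (1/2, 1), as in the abstract
        g = [gi + rng.gauss(0.0, noise) for gi in grad(w)]
        w_next = [
            wi - alpha * gi + beta * (wi - wpi)
            for wi, gi, wpi in zip(w, g, w_prev)
        ]
        w_prev, w = w, w_next
    return w  # the last iterate


def F(w):
    """Toy convex objective (hypothetical): F(w) = 0.5 * ||w||^2, so F_* = 0."""
    return 0.5 * sum(x * x for x in w)


def grad_F(w):
    """Exact gradient of the toy objective."""
    return list(w)


w_last = stochastic_heavy_ball(grad_F, [5.0, -3.0], steps=5000)
print("F(w_last) - F_* =", F(w_last))
```

The sketch returns the last iterate rather than a running average or the best iterate, which is the quantity the stated rates refer to.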
Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method
Facial expression recognition (FER) models are widely used in video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting performance in real-world settings. Source-free domain adaptation (SFDA) has been proposed to personalize a pretrained source model using only unlabeled target data, avoiding privacy, storage, and transmission constraints. We address a particularly challenging setting where source data is unavailable and the target data contains only neutral expressions. Existing SFDA methods are not designed for adaptation from a single target class, while generating non-neutral facial images is often unstable and expensive. To address this, we propose Source-Free Domain Adaptation with Personalized Feature Translation (SFDA-PFT), a lightweight latent-space approach. A translator is first pretrained on source data to map subject-specific style features between subjects while preserving expression information through expression-consistency and style-aware objectives. It is then adapted to neutral target data without source data or image synthesis. By operating in the latent space, SFDA-PFT avoids noisy facial image generation, reduces computation, and learns discriminative embeddings for classification. Experiments on BioVid, StressID, BAH, and Aff-Wild2 show that SFDA-PFT consistently outperforms state-of-the-art SFDA methods in privacy-sensitive FER scenarios. Our code is publicly available at: https://github.com/MasoumehSharafi/SFDA-PFT
Updated: 2026-03-09 21:18:03
标题: 个性化特征转换用于表情识别:一种高效的无源域自适应方法
摘要: 面部表情识别(FER)模型广泛应用于基于视频的情感计算应用程序,如人机交互和医疗监测。然而,深度FER模型通常难以处理微妙的表情和高度的个体间变异性,在现实世界的环境中限制了性能。无源域自适应(SFDA)被提出,使用仅未标记的目标数据来个性化预训练的源模型,避免了隐私、存储和传输约束。我们面临一个特别具有挑战性的情况,即源数据不可用,目标数据仅包含中性表情。现有的SFDA方法并不是设计用于从单个目标类进行适应,而生成非中性面部图像通常是不稳定且昂贵的。为了解决这个问题,我们提出了Source-Free Domain Adaptation with Personalized Feature Translation(SFDA-PFT),这是一种轻量级的潜在空间方法。翻译器首先在源数据上进行预训练,在不同个体之间映射个体特定的风格特征,同时通过表情一致性和风格感知目标保留表情信息。然后,它被适应到中性目标数据,而无需源数据或图像合成。通过在潜在空间中操作,SFDA-PFT避免了嘈杂的面部图像生成,减少了计算量,并学习了用于分类的判别性嵌入。在BioVid、StressID、BAH和Aff-Wild2上的实验证明,SFDA-PFT在隐私敏感的FER场景中始终优于最先进的SFDA方法。我们的代码可在以下网址公开获取:https://github.com/MasoumehSharafi/SFDA-PFT
更新时间: 2026-03-09 21:18:03
领域: cs.CV,cs.AI
Computational Multi-Agents Society Experiments: Social Modeling Framework Based on Generative Agents
This paper introduces CMASE, a framework for Computational Multi-Agent Society Experiments that integrates generative agent-based modeling with virtual ethnographic methods to support researcher embedding, interactive participation, and mechanism-oriented intervention in virtual social environments. By transforming the simulation into a simulated ethnographic field, CMASE shifts the researcher from an external operator to an embedded participant. Specifically, the framework is designed to achieve three core capabilities: (1) enabling real-time human-computer interaction that allows researchers to dynamically embed themselves into the system to characterize complex social intervention processes; (2) reconstructing the generative logic of social phenomena by combining the rigor of computational experiments with the interpretative depth of traditional ethnography; and (3) providing a predictive foundation with causal explanatory power to make forward-looking judgments without sacrificing empirical accuracy. Experimental results show that CMASE can not only simulate complex phenomena, but also generate behavior trajectories consistent with both statistical patterns and mechanistic explanations. These findings demonstrate CMASE's methodological value for intervention modeling, highlighting its potential to advance interdisciplinary integration in the social sciences. The official code is available at: https://github.com/armihia/CMASE .
Updated: 2026-03-09 21:16:43
标题: 计算多智能体社会实验:基于生成智能体的社会建模框架
摘要: 本文介绍了CMASE,这是一个用于计算多代理社会实验的框架,它将生成式基于代理的建模与虚拟民族学方法整合在一起,以支持研究者嵌入、互动参与和机制导向的干预在虚拟社会环境中。通过将模拟转化为模拟的民族学领域,CMASE将研究者从外部操作者转变为内部参与者。具体地,该框架旨在实现三个核心能力:(1)实现实时人机交互,使研究者能够动态地嵌入系统以表征复杂的社会干预过程;(2)通过将计算实验的严谨性与传统民族学的解释深度相结合,重建社会现象的生成逻辑;(3)提供具有因果解释能力的预测基础,使得能够做出前瞻性判断而不损失经验的准确性。实验结果表明,CMASE不仅可以模拟复杂现象,还可以生成与统计模式和机制解释一致的行为轨迹。这些发现展示了CMASE在干预建模方面的方法论价值,突出了它在社会科学跨学科整合中的潜力。官方代码可在以下链接获取:https://github.com/armihia/CMASE。
更新时间: 2026-03-09 21:16:43
领域: cs.AI,cs.CY,cs.MA
Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead
As LLM agents evolve into collaborative multi-agent systems, their memory requirements grow rapidly in complexity. This position paper frames multi-agent memory as a computer architecture problem. We distinguish shared and distributed memory paradigms, propose a three-layer memory hierarchy (I/O, cache, and memory), and identify two critical protocol gaps: cache sharing across agents and structured memory access control. We argue that the most pressing open challenge is multi-agent memory consistency. Our architectural framing provides a foundation for building reliable, scalable multi-agent systems.
Updated: 2026-03-09 21:16:12
标题: 从计算机架构的角度看多智能体记忆:未来的愿景和挑战
摘要: 随着LLM代理逐渐发展成协作的多代理系统,它们对内存的需求在复杂性上迅速增长。本文将多代理系统的内存视为一项计算机架构问题。我们区分了共享和分布式内存范式,提出了一个三层内存层次结构(I/O、缓存和内存),并确定了两个关键的协议漏洞:代理间的缓存共享和结构化内存访问控制。我们认为目前最紧迫的挑战是多代理内存一致性。我们的架构框架为构建可靠、可扩展的多代理系统提供了基础。
更新时间: 2026-03-09 21:16:12
领域: cs.AR,cs.AI,cs.MA
AgentOS: From Application Silos to a Natural Language-Driven Data Ecosystem
The rapid emergence of open-source, locally hosted intelligent agents marks a critical inflection point in human-computer interaction. Systems such as OpenClaw demonstrate that Large Language Model (LLM)-based agents can autonomously operate local computing environments, orchestrate workflows, and integrate external tools. However, within the current paradigm, these agents remain conventional applications running on legacy operating systems originally designed for Graphical User Interfaces (GUIs) or Command Line Interfaces (CLIs). This architectural mismatch leads to fragmented interaction models, poorly structured permission management (often described as "Shadow AI"), and severe context fragmentation. This paper proposes a new paradigm: a Personal Agent Operating System (AgentOS). In AgentOS, traditional GUI desktops are replaced by a Natural User Interface (NUI) centered on a unified natural language or voice portal. The system core becomes an Agent Kernel that interprets user intent, decomposes tasks, and coordinates multiple agents, while traditional applications evolve into modular Skills-as-Modules enabling users to compose software through natural language rules. We argue that realizing AgentOS fundamentally becomes a Knowledge Discovery and Data Mining (KDD) problem. The Agent Kernel must operate as a real-time engine for intent mining and knowledge discovery. Viewed through this lens, the operating system becomes a continuous data mining pipeline involving sequential pattern mining for workflow automation, recommender systems for skill retrieval, and dynamically evolving personal knowledge graphs. These challenges define a new research agenda for the KDD community in building the next generation of intelligent computing systems.
Updated: 2026-03-09 21:13:52
标题: AgentOS:从应用程序孤立到自然语言驱动的数据生态系统
摘要: 开源、本地托管的智能代理的迅速出现标志着人机交互中的一个关键转折点。OpenClaw等系统表明,基于大型语言模型(LLM)的代理可以自主运行本地计算环境、编排工作流程并集成外部工具。然而,在当前范式中,这些代理仍然是在最初设计用于图形用户界面(GUI)或命令行界面(CLI)的传统操作系统上运行的应用程序。这种架构不匹配导致了碎片化的交互模型、结构不良的权限管理(通常被描述为“Shadow AI”)和严重的上下文碎片化。本文提出了一个新的范式:个人代理操作系统(AgentOS)。在AgentOS中,传统的GUI桌面被自然用户界面(NUI)取代,该界面以统一的自然语言或语音门户为中心。系统核心变成一个代理内核,解释用户意图、分解任务并协调多个代理,而传统的应用程序演变为模块化的技能作为模块,使用户能够通过自然语言规则组合软件。我们认为,实现AgentOS基本上是一个知识发现和数据挖掘(KDD)问题。代理内核必须作为一个实时引擎进行意图挖掘和知识发现。通过这个视角来看,操作系统成为一个连续的数据挖掘管道,涉及用于工作流自动化的序列模式挖掘、用于技能检索的推荐系统以及动态演化的个人知识图。这些挑战为KDD社区构建下一代智能计算系统定义了一个新的研究议程。
更新时间: 2026-03-09 21:13:52
领域: cs.AI
Provable Acceleration of Distributed Optimization with Local Updates
In conventional distributed optimization, each agent performs a single local update between two communication rounds with its neighbors to synchronize solutions. Inspired by the success of using multiple local updates in federated learning, incorporating local updates into distributed optimization has recently attracted increasing attention. However, unlike federated learning, where multiple local updates can accelerate learning by improving gradient estimation under mini-batch settings, it remains unclear whether similar benefits hold in distributed optimization when gradients are exact. Moreover, existing theoretical results typically require reducing the step size when multiple local updates are employed, which can entirely offset any potential benefit of these additional local updates and obscure their true impact on convergence. In this paper, we focus on the classic DIGing algorithm and leverage the tight performance bounds provided by Performance Estimation Problems (PEP) to show that incorporating local updates can indeed accelerate distributed optimization. To the best of our knowledge, this is the first rigorous demonstration of such acceleration for a broad class of objective functions. Our analysis further reveals that, under an appropriate step size, performing only two local updates is sufficient to achieve the maximal possible improvement, and that additional local updates provide no further gains. Because more updates increase computational cost, these findings offer practical guidance for efficient implementation. Extensive experiments on both synthetic and real-world datasets corroborate the theoretical findings.
Updated: 2026-03-09 21:12:53
标题: 可证明的分布式优化加速与本地更新
摘要: 在传统的分布式优化中,每个代理在与邻居进行两次通信回合之间执行单个本地更新,以同步解决方案。受在联邦学习中使用多个本地更新取得成功的启发,将本地更新纳入分布式优化最近引起了越来越多的关注。然而,与联邦学习不同的是,在联邦学习中,多个本地更新可以通过改善在小批量设置下的梯度估计来加速学习,但在梯度精确的分布式优化中是否具有类似的好处仍不清楚。此外,现有的理论结果通常要求在使用多个本地更新时减小步长,这可能会完全抵消这些额外本地更新的任何潜在好处,并且混淆它们对收敛的真实影响。在本文中,我们专注于经典的DIGing算法,并利用性能估计问题(PEP)提供的严格性能界限,以展示将本地更新纳入分布式优化确实可以加速。据我们所知,这是对广泛类目标函数的这种加速的首次严格证明。我们的分析进一步揭示,在适当的步长下,仅执行两个本地更新就足以实现可能的最大改进,并且额外的本地更新不会带来进一步的收益。由于更多的更新会增加计算成本,这些发现为高效实现提供了实用指导。对合成和真实数据集的大量实验证实了理论发现。
更新时间: 2026-03-09 21:12:53
领域: eess.SY,cs.LG
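For orientation, the baseline DIGing iteration that the abstract above builds on can be sketched as follows. This is the commonly stated gradient-tracking form with a single local update per communication round, $x_{k+1} = W x_k - α y_k$ and $y_{k+1} = W y_k + \nabla f(x_{k+1}) - \nabla f(x_k)$. The abstract does not specify the paper's multi-local-update variant, so it is not reproduced here; the mixing matrix, step size, and quadratic local costs are illustrative assumptions.

```python
def diging(W, grads, x0, alpha, iters):
    """Baseline DIGing (gradient tracking) with scalar decision variables.

    W      : doubly stochastic mixing matrix (list of rows)
    grads  : one gradient function per agent
    x0     : initial local estimates, one per agent
    """
    n = len(x0)
    x = list(x0)
    g = [grads[i](x[i]) for i in range(n)]
    y = list(g)  # tracker initialised with the local gradients
    for _ in range(iters):
        # Consensus step on x, followed by a step along the tracked gradient.
        x_new = [
            sum(W[i][j] * x[j] for j in range(n)) - alpha * y[i]
            for i in range(n)
        ]
        g_new = [grads[i](x_new[i]) for i in range(n)]
        # Tracker update: mix neighbours' trackers, add the local gradient change.
        y = [
            sum(W[i][j] * y[j] for j in range(n)) + g_new[i] - g[i]
            for i in range(n)
        ]
        x, g = x_new, g_new
    return x


# Hypothetical setup: agent i holds f_i(x) = 0.5 * (x - b_i)^2, so the
# minimiser of the sum is the mean of the b_i.
targets = [1.0, 2.0, 6.0]
grads = [lambda x, b=b: x - b for b in targets]
W = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
x_final = diging(W, grads, x0=[0.0, 0.0, 0.0], alpha=0.1, iters=300)
print(x_final)
```

In this toy problem every agent's estimate approaches the mean of the targets, which is the minimiser of the summed local costs.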
VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
Updated: 2026-03-09 21:10:34
标题: VoxEmo:使用语音LLMs对语音情感识别进行基准测试
摘要: 大型语言模型(LLMs)在通过生成接口进行语音情感识别(SER)方面表现出巨大潜力。然而,从封闭集分类转向开放文本生成引入了零样本随机性,使评估对提示非常敏感。此外,传统的语音LLMs基准忽略了人类情感的固有歧义。因此,我们提出了VoxEmo,这是一个涵盖15种语言的35个情绪语料库的全面SER基准。VoxEmo提供了一个标准化工具包,其中包含各种提示复杂性,从直接分类到语用推理。为了反映现实世界的感知/应用,我们引入了一个分布感知的软标签协议和一个模拟注释者不一致的提示集成策略。实验表明,尽管零样本语音LLMs在硬标签准确度上落后于监督基准,但它们独特地与人类主观分布相一致。
更新时间: 2026-03-09 21:10:34
领域: cs.SD,cs.AI,cs.CL,cs.MM,eess.AS
PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration
Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement to human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review. This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.
Updated: 2026-03-09 21:09:24
标题: PathoScribe:将病理学数据转化为一个活动的图书馆,采用统一的LLM驱动框架进行语义检索和临床整合
摘要: 病理学是现代诊断和癌症护理的基础,然而,其中最有价值的资产——数百万篇叙述报告中蕴含的积累经验——仍然大部分无法获取。尽管机构正在快速将病理学工作流数字化,但存储数据却缺乏有效的检索和推理机制,这使得档案库有可能变成一个被动的数据存储库,其中机构知识存在但无法有意义地指导患者护理。真正的进步不仅需要数字化,还需要病理学家有能力在评估新的诊断困境时实时查询先前类似的病例。我们提出了PathoScribe,这是一个统一的检索增强的大语言模型(LLM)框架,旨在将静态病理档案转变为可搜索、能够推理的生活图书馆。PathoScribe可以实现自然语言案例探索、自动建立队列、临床问题回答、免疫组织化学(IHC)面板推荐以及在单一体系结构内进行提示控制的报告转换。在对70,000份跨机构外科病理学报告进行评估时,PathoScribe实现了自然语言案例检索的完美Recall@10,并展示了高质量的检索支撑推理(平均评审分数4.56/5)。关键是,该系统实现了从自由文本资格标准中自动构建队列,仅需几分钟就能组建出可用于研究的队列(平均9.2分钟),与人工审阅者的一致性达到91.3%,且没有将有资格的病例错误地排除,相对于传统的手动病历查阅,这代表了时间和成本的数量级降低。这项工作为将数字病理档案从被动存储系统转变为主动临床智能平台奠定了可扩展的基础。
更新时间: 2026-03-09 21:09:24
领域: cs.CV,cs.AI,cs.CL,cs.DL,cs.IR
Interpretable Markov-Based Spatiotemporal Risk Surfaces for Missing-Child Search Planning with Reinforcement Learning and LLM-Based Quality Assurance
The first 72 hours of a missing-child investigation are critical for successful recovery. However, law enforcement agencies often face fragmented, unstructured data and a lack of dynamic, geospatial predictive tools. Our system, Guardian, provides an end-to-end decision-support system for missing-child investigation and early search planning. It converts heterogeneous, unstructured case documents into a schema-aligned spatiotemporal representation, enriches cases with geocoding and transportation context, and provides probabilistic search products spanning 0-72 hours. In this paper, we present an overview of Guardian as well as a detailed description of a three-layer predictive component of the system. The first layer is a Markov chain, a sparse, interpretable model with transitions incorporating road accessibility costs, seclusion preferences, and corridor bias with separate day/night parameterizations. The Markov chain's output prediction distributions are then transformed into operationally useful search plans by the second layer's reinforcement learning. Finally, the third layer's LLM performs post hoc validation of layer 2 search plans prior to their release. Using a synthetic but realistic case study, we report quantitative outputs across 24/48/72-hour horizons and analyze sensitivity, failure modes, and tradeoffs. Results show that the proposed predictive system with the three-layer architecture produces interpretable priors for zone optimization and human review.
Updated: 2026-03-09 21:08:29
标题: 可解释的基于马尔可夫的时空风险表面:基于强化学习和LLM的质量保证的失踪儿童搜寻规划
摘要: 在一个失踪儿童调查中,头72小时对于成功找回是至关重要的。然而,执法机构经常面临分散的、非结构化的数据以及缺乏动态、地理空间预测工具。我们的系统Guardian为失踪儿童调查和早期搜索规划提供了端到端的决策支持系统。它将异构的、非结构化的案件文档转化为与模式对齐的时空表示,丰富案件的地理编码和交通背景,并提供跨越0-72小时的概率搜索产品。在本文中,我们介绍了Guardian的概述以及系统的三层预测组件的详细描述。第一层是一个马尔可夫链,一个稀疏的、可解释的模型,其转换考虑了道路可达成本、隐蔽偏好和走廊偏差,并具有分别的白天/黑夜参数化。马尔可夫链的输出预测分布然后通过第二层的强化学习转换为操作上有用的搜索计划。最后,第三层的LLM在释放之前对第二层的搜索计划进行事后验证。通过一个合成但现实的案例研究,我们报告了24/48/72小时的数量输出,并分析了灵敏度、失败模式和权衡。结果表明,所提出的具有三层架构的预测系统产生了可解释的区域优化和人工审查的先验。
更新时间: 2026-03-09 21:08:29
领域: cs.AI,cs.IR,cs.LG
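The first layer described in the abstract above, a Markov chain with separate day/night parameterizations, can be illustrated with a minimal sketch that propagates a location distribution to the 24/48/72-hour horizons. The three zones, the hourly time step, the 06:00-18:00 day window, and both transition matrices are hypothetical placeholders; the abstract does not give Guardian's actual state space or parameters.

```python
def step(dist, P):
    """One Markov step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def propagate(dist, P_day, P_night, hours, start_hour=8):
    """Propagate a zone distribution hour by hour, switching matrices by time of day."""
    for h in range(hours):
        hour = (start_hour + h) % 24
        P = P_day if 6 <= hour < 18 else P_night  # assumed day window
        dist = step(dist, P)
    return dist


# Hypothetical row-stochastic matrices over three zones.
P_day = [
    [0.6, 0.3, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
P_night = [  # less movement at night (assumption)
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.9],
]

start = [1.0, 0.0, 0.0]  # last known position in zone 0
horizons = {h: propagate(start, P_day, P_night, h) for h in (24, 48, 72)}
for h, dist in horizons.items():
    print(h, [round(p, 3) for p in dist])
```

Each output is a probability vector over zones, which is the kind of interpretable prior a downstream planner could rank or threshold.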
Reinforced Generation of Combinatorial Structures: Hardness of Approximation
Can AI-based methods help us make advances in complexity theory? We provide evidence towards answering this in the affirmative, using AlphaEvolve (an LLM code mutation agent) to obtain new results in three settings: a) We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as $163$ vertices, and our upper bounds are obtained via analytical arguments. b) We obtain new inapproximability results for MAX-4-CUT and MAX-3-CUT, proving that it is NP-hard to approximate them within factors of $0.987$ and $0.9649$ respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of $0.9883$, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of $0.9853$, but falls short of the SOTA of $16/17$ that relies on a custom PCP (rather than a reduction from "standard" Håstad-style PCPs). c) Inapproximability for the metric Traveling Salesman Problem (TSP): We show that it is NP-hard to approximate the minimum cost tour within a factor of $111/110$ using AlphaEvolve to discover a new gadget, thus improving the SOTA of $117/116$. Along the way, we provide new modular soundness and completeness arguments that can be of independent interest. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (sometimes requiring time exponential in the size of the construction). We used AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by $10,000\times$ for our gadgets). Our results suggest that gadget-based proofs would benefit from a pass through AI-based tools to obtain stronger results.
Updated: 2026-03-09 21:00:18
标题: 增强组合结构的生成:逼近难度
摘要: 人工智能方法是否可以帮助我们在复杂性理论方面取得进展?我们提供证据肯定地回答了这个问题,使用AlphaEvolve(一个LLM代码突变代理)在三个设置中获得了新的结果: a) 我们改进了Kunisky和Yu最近的研究结果,获得了关于MAX-CUT和MAX-Independent Set在随机3-和4-正则图上的认证算法的近乎最优上界和(条件)下界。我们通过构建几乎是极端的Ramanujan图在多达163个顶点上获得了改进的下界,我们的上界是通过分析论证获得的。 b) 我们获得了MAX-4-CUT和MAX-3-CUT的新近似难度结果,证明了在AlphaEvolve发现新的小工具缩减的情况下,近似它们分别为0.987和0.9649是NP难的。我们的MAX-4-CUT结果改进了0.9883的SOTA,我们的MAX-3-CUT结果改进了当前最佳的基于小工具的难近似结果0.9853,但达不到依赖于自定义PCP(而不是从“标准”Håstad风格PCPs中减少)的16/17的SOTA。 c) 旅行推销员问题(TSP)的难近似性:我们展示了使用AlphaEvolve发现新的小工具,使得在111/110的因素内近似最小成本旅游是NP难的,从而改进了117/116的SOTA。在此过程中,我们提供了可以独立引起兴趣的新的模块化正确性和完整性论证。 我们面临的一个关键技术挑战是:验证AlphaEvolve生成的候选构造是昂贵的(有时需要与构造的大小指数成比例的时间)。我们使用AlphaEvolve本身来使验证过程更快(有时我们的小工具提速了10,000倍)。我们的结果表明,基于小工具的证明将受益于通过人工智能工具获得更强大的结果。
更新时间: 2026-03-09 21:00:18
领域: cs.LG,cs.AI,cs.CC,math.CO
Optimizing Reinforcement Learning Training over Digital Twin Enabled Multi-fidelity Networks
In this paper, we investigate a novel digital network twin (DNT) assisted deep learning (DL) model training framework. In particular, we consider a physical network where a base station (BS) uses several antennas to serve multiple mobile users, and a DNT that is a virtual representation of the physical network. The BS must adjust its antenna tilt angles to optimize the data rates of all users. Due to user mobility, the BS may not be able to accurately track network dynamics such as wireless channels and user mobilities. Hence, a reinforcement learning (RL) approach is used to dynamically adjust the antenna tilt angles. To train the RL, we can use data collected from the physical network and the DNT. The data collected from the physical network is more accurate but incurs more communication overhead compared to the data collected from the DNT. Therefore, it is necessary to determine the ratio of data collected from the physical network and the DNT to improve the training of the RL model. We formulate this problem as an optimization problem whose goal is to jointly optimize the tilt angle adjustment policy and the data collection strategy, aiming to maximize the data rates of all users while constraining the time delay introduced by collecting data from the physical network. To solve this problem, we propose a hierarchical RL framework that integrates robust adversarial loss and proximal policy optimization (PPO). Simulation results show that our proposed method reduces the physical network data collection delay by up to 28.01% and 1x compared to a hierarchical RL that uses vanilla PPO as the first level RL, and the baseline that uses robust-RL at the first level and selects the data collection ratio randomly.
Updated: 2026-03-09 20:59:23
标题: 优化强化学习训练在数字孪生技术支持的多精度网络上的应用
摘要: 在本文中,我们研究了一种新颖的数字网络双胞胎(DNT)辅助深度学习(DL)模型训练框架。特别地,我们考虑了一个物理网络,其中一个基站(BS)使用多个天线为多个移动用户提供服务,以及一个DNT,它是物理网络的虚拟表示。BS必须调整其天线倾斜角以优化所有用户的数据速率。由于用户的移动性,BS可能无法准确跟踪网络动态,如无线信道和用户移动性。因此,采用强化学习(RL)方法动态调整天线倾斜角。为了训练RL,我们可以使用从物理网络和DNT收集的数据。与从DNT收集的数据相比,从物理网络收集的数据更准确,但产生更多的通信开销。因此,有必要确定从物理网络和DNT收集的数据比例,以改善RL模型的训练。我们将这个问题形式化为一个优化问题,其目标是联合优化倾斜角调整策略和数据收集策略,旨在最大化所有用户的数据速率,同时限制由从物理网络收集数据引入的时间延迟。为了解决这个问题,我们提出了一个集成了强对抗性损失和近端策略优化(PPO)的分层RL框架。模拟结果表明,我们提出的方法将物理网络数据收集延迟降低了高达28.01%和1倍,与使用普通PPO作为第一层RL的分层RL,以及使用在第一层使用强化RL并随机选择数据收集比例的基线相比。
更新时间: 2026-03-09 20:59:23
领域: cs.NI,cs.LG,eess.SY
Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning
This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs -- Gemma 3 and Qwen3-VL -- to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models' reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.
Updated: 2026-03-09 20:58:43
标题: 使用视觉语言基础模型通过上下文学习生成植物模拟配置
摘要: 本文介绍了一种用于评估视觉语言模型(VLMs)在为数字孪生生成植物模拟配置性能的合成基准。尽管功能-结构植物模型(FSPMs)是模拟农业环境中生物物理过程的有用工具,但它们的高复杂性和低吞吐量在规模部署时会造成瓶颈。我们提出了一种新的方法,利用最先进的开源VLMs -- Gemma 3和Qwen3-VL -- 直接从基于无人机的遥感图像生成JSON格式的模拟参数。使用通过Helios 3D程序化植物生成库生成的合成豇豆样地数据集,我们测试了五种上下文学习方法,并在JSON完整性、几何评估和生物物理评估三个类别上评估了模型。我们的结果表明,虽然VLMs可以解释结构元数据并估计参数如植物数量和太阳方位角,但由于上下文偏见或在视觉线索不足时依赖数据集均值,它们通常表现出性能下降。在真实世界的无人机正射影像数据集上进行验证,并使用一个盲基准进行消蚀研究,进一步描述了模型的推理能力与它们对上下文先验的依赖。据我们所知,这是第一项利用VLMs生成植物模拟结构JSON配置的研究,为在农业中重建数字孪生的3D样地提供了可扩展的框架。
更新时间: 2026-03-09 20:58:43
领域: cs.CV,cs.AI
Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement
AI-powered answer engines are inherently non-deterministic: identical queries submitted at different times can produce different responses and cite different sources. Despite this stochastic behavior, current approaches to measuring domain visibility in generative search typically rely on single-run point estimates of citation share and prevalence, implicitly treating them as fixed values. This paper argues that citation visibility metrics should be treated as sample estimators of an underlying response distribution rather than fixed values. We conduct an empirical study of citation variability across three generative search platforms--Perplexity Search, OpenAI SearchGPT, and Google Gemini--using repeated sampling across three consumer product topics. Two sampling regimes are employed: daily collections over nine days and high-frequency sampling at ten-minute intervals. We show that citation distributions follow a power-law form and exhibit substantial variability across repeated samples. Bootstrap confidence intervals reveal that many apparent differences between domains fall within the noise floor of the measurement process. Distribution-wide rank stability analysis further demonstrates that citation rankings are unstable across samples, not only among top-ranked domains but throughout the frequently cited domain set. These findings demonstrate that single-run visibility metrics provide a misleadingly precise picture of domain performance in generative search. We argue that citation visibility must be reported with uncertainty estimates and provide practical guidance for sample sizes required to achieve interpretable confidence intervals.
Updated: 2026-03-09 20:47:22
标题: 量化人工智能可见性中的不确定性:一种用于生成搜索测量的统计框架
摘要: AI驱动的答案引擎在本质上是非确定性的:在不同时间提交相同的查询可能会产生不同的响应并引用不同的来源。尽管存在这种随机行为,当前衡量生成式搜索中领域可见性的方法通常依赖于引文占比和普遍性的单次运行点估计,隐含地将它们视为固定值。本文认为,引文可见性指标应被视为对潜在响应分布的样本估计器,而不是固定值。我们进行了一项关于引文变异性的实证研究,跨越三个生成式搜索平台——Perplexity Search、OpenAI SearchGPT和Google Gemini——在三个消费产品主题上进行了重复采样。采用了两种采样制度:在九天内进行每日采集和在十分钟间隔内进行高频采样。我们展示了引文分布遵循幂律形式,并在重复样本中展现出相当大的变异性。Bootstrap置信区间显示许多领域之间的明显差异落在测量过程的噪声阈值之内。分布范围内的排名稳定性分析进一步表明,引文排名在样本之间是不稳定的,不仅在排名靠前的领域之间,而且在经常被引用的领域集中。这些发现表明,单次运行的可见性指标提供了一个误导性精确的领域性能图像。我们认为引文可见性必须附带不确定性估计,并提供了实用指导,以确定需要达到可解释置信区间所需的样本大小。
更新时间: 2026-03-09 20:47:22
领域: stat.AP,cs.AI,cs.IR
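The abstract above argues that citation share should be reported with a bootstrap confidence interval rather than as a single-run point estimate. A minimal sketch of that idea follows, assuming a percentile bootstrap that resamples whole responses; the function names, the sample data, and the choice of percentile interval are illustrative assumptions, not the paper's implementation.

```python
import random


def citation_share(responses, domain):
    """Fraction of all citations across the responses that point to `domain`."""
    total = sum(len(r) for r in responses)
    hits = sum(r.count(domain) for r in responses)
    return hits / total if total else 0.0


def bootstrap_ci(responses, domain, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling whole responses with replacement."""
    rng = random.Random(seed)
    n = len(responses)
    stats = sorted(
        citation_share([responses[rng.randrange(n)] for _ in range(n)], domain)
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical data: each inner list holds the domains cited by one sampled response.
runs = [
    ["a.com", "b.com"],
    ["a.com", "c.com", "b.com"],
    ["b.com"],
    ["a.com", "a.com", "d.com"],
    ["c.com", "b.com"],
]
point = citation_share(runs, "a.com")
low, high = bootstrap_ci(runs, "a.com")
print(f"share={point:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only a handful of repeated samples the interval is wide, which is the abstract's point: two domains whose intervals overlap heavily cannot be ranked from a single run.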
Tool Receipts, Not Zero-Knowledge Proofs: Practical Hallucination Detection for AI Agents
AI agents that execute tasks via tool calls frequently hallucinate results - fabricating tool executions, misstating output counts, or presenting inferences as facts. Recent approaches to verifiable AI inference rely on zero-knowledge proofs, which provide cryptographic guarantees but impose minutes of proving time per query, making them impractical for interactive agents. We propose NabaOS, a lightweight verification framework inspired by Indian epistemology (Nyaya Shastra), which classifies every claim in an LLM response by its epistemic source (pramana): direct tool output (pratyaksha), inference (anumana), external testimony (shabda), absence (abhava), or ungrounded opinion. Our runtime generates HMAC-signed tool execution receipts that the LLM cannot forge, then cross-references claims against these receipts to detect hallucinations in real time. We evaluate on NyayaVerifyBench, a new benchmark of 1,800 agent response scenarios across four languages with injected hallucinations of six types. NabaOS detects 94.2% of fabricated tool references, 87.6% of count misstatements, and 91.3% of false absence claims, with <15ms verification overhead per response. For deep delegation (agents performing multi-step web tasks), our cross-checking protocol catches 78.4% of URL fabrications via independent re-fetching. We compare against five approaches: zkLLM (cryptographic proofs, 180s/query), TOPLOC (locality-sensitive hashing), SPEX (sampling-based proof of execution), tensor commitments, and self-consistency checking. NabaOS achieves the best cost-latency-coverage trade-off for interactive agents: 94.2% coverage at <15ms versus zkLLM's near-perfect coverage at 180,000ms. For interactive agents, practical receipt-based verification provides better cost-benefit than cryptographic proofs, and epistemic classification gives users actionable trust signals rather than binary judgments.
Updated: 2026-03-09 20:45:41
标题: 工具收据,而非零知识证明:面向AI代理的实用幻觉检测
摘要: AI代理通过工具调用执行任务时经常会产生幻觉结果 - 制造工具执行、错误陈述输出计数,或将推断呈现为事实。最近可验证AI推理方法依赖于零知识证明,它提供了密码学保证,但每个查询需要几分钟的证明时间,使其对交互式代理不切实际。我们提出了NabaOS,这是一个受印度认识论(Nyaya Shastra)启发的轻量级验证框架,它通过其认识来源(pramana)对LLM响应中的每个声明进行分类:直接工具输出(pratyaksha)、推断(anumana)、外部证词(shabda)、缺席(abhava)或无根据的意见。我们的运行时生成HMAC签名的工具执行收据,LLM无法伪造,然后交叉引用声明以实时检测幻觉。我们在NyayaVerifyBench上进行评估,这是一个新的基准,涵盖了涉及六种类型幻觉的1,800个代理响应场景,跨越四种语言。NabaOS检测出94.2%的制造工具引用,87.6%的计数错误陈述,以及91.3%的虚假缺席声明,每个响应的验证开销<15ms。对于深度委托(代理执行多步网络任务),我们的交叉检查协议通过独立重新获取捕获了78.4%的URL制造。我们与五种方法进行比较:zkLLM(密码学证明,180秒/查询)、TOPLOC(局部敏感哈希)、SPEX(基于抽样的执行证明)、张量承诺和自洽性检查。NabaOS在交互式代理方面实现了最佳的成本-延迟-覆盖度权衡:在<15ms时获得94.2%的覆盖度,而zkLLM在180,000ms时接近完美的覆盖度。对于交互式代理,实际基于收据的验证提供了比密码学证明更好的成本效益,而认识分类为用户提供了可操作的信任信号,而不是二元判断。
更新时间: 2026-03-09 20:45:41
领域: cs.CR,cs.AI,cs.CL
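The receipt mechanism described in the abstract above (HMAC-signed tool execution receipts that the LLM cannot forge, cross-referenced against claims) can be sketched with Python's standard `hmac` module. The receipt layout, the key handling, and the helper names below are assumptions made for illustration; the abstract does not specify NabaOS's actual format.

```python
import hashlib
import hmac
import json

# Hypothetical key: held by the runtime only and never placed in the LLM's context.
SECRET_KEY = b"runtime-only-secret"


def _canonical(tool, args, output):
    """Serialize a tool call deterministically so the signature is reproducible."""
    record = {"tool": tool, "args": args, "output": output}
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def issue_receipt(tool, args, output):
    """Called by the runtime right after a tool actually executes."""
    tag = hmac.new(SECRET_KEY, _canonical(tool, args, output), hashlib.sha256).hexdigest()
    return {"tool": tool, "args": args, "output": output, "tag": tag}


def verify_receipt(receipt):
    """Recompute the tag and compare in constant time."""
    payload = _canonical(receipt["tool"], receipt["args"], receipt["output"])
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt.get("tag", ""))


def check_count_claim(claimed_count, receipt):
    """A count claim is grounded only if the receipt is genuine and the count matches."""
    return verify_receipt(receipt) and claimed_count == len(receipt["output"])


receipt = issue_receipt("search", {"query": "llm agents"}, ["r1", "r2", "r3"])
print(verify_receipt(receipt))                       # genuine receipt
print(verify_receipt(dict(receipt, output=["r1"])))  # tampered output
print(check_count_claim(5, receipt))                 # misstated count
```

Verification costs one hash over the serialized call, which is consistent with the millisecond-scale overhead the abstract reports and contrasts with the per-query proving time of zero-knowledge approaches.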
Personalized Collaborative Learning with Affinity-Based Variance Reduction
Multi-agent learning faces a fundamental tension: leveraging distributed collaboration without sacrificing the personalization needed for diverse agents. This tension intensifies when aiming for full personalization while adapting to unknown heterogeneity levels -- gaining collaborative speedup when agents are similar, without performance degradation when they are different. Embracing the challenge, we propose personalized collaborative learning (PCL), a novel framework for heterogeneous agents to collaboratively learn personalized solutions with seamless adaptivity. Through carefully designed bias correction and importance correction mechanisms, our method AffPCL robustly handles both environment and objective heterogeneity. We prove that AffPCL reduces sample complexity over independent learning by a factor of $\max\{n^{-1}, δ\}$, where $n$ is the number of agents and $δ\in[0,1]$ measures their heterogeneity. This affinity-based acceleration automatically interpolates between the linear speedup of federated learning in homogeneous settings and the baseline of independent learning, without requiring prior knowledge of the system. Our analysis further reveals that an agent may obtain linear speedup even by collaborating with arbitrarily dissimilar agents, unveiling new insights into personalization and collaboration in the high heterogeneity regime.
Updated: 2026-03-09 20:43:37
标题: 基于亲和度方差缩减的个性化协作学习
摘要: 多智能体学习面临着一个基本的张力:在不牺牲多样化智能体所需的个性化的情况下利用分布式协作。当旨在实现完全个性化并适应未知的异质性水平时,这种张力会加剧 -- 在智能体相似时获得协作加速,而在它们不同的情况下不会导致性能下降。面对这一挑战,我们提出了个性化协作学习(PCL),这是一个新颖的框架,用于异质智能体协作学习具有无缝适应性的个性化解决方案。通过精心设计的偏差校正和重要性校正机制,我们的方法AffPCL能够稳健地处理环境和目标的异质性。我们证明AffPCL通过$\max\{n^{-1}, δ\}$的因素减少了独立学习的样本复杂性,其中$n$是智能体的数量,$δ\in[0,1]$衡量它们的异质性。这种基于亲和力的加速可以自动在同质设置中的联邦学习的线性加速和独立学习的基准之间进行插值,而不需要先前对系统的了解。我们的分析进一步揭示了即使与任意不同的智能体合作,一个智能体也可以获得线性加速,在高异质性环境中揭示了个性化和协作的新见解。
更新时间: 2026-03-09 20:43:37
领域: stat.ML,cs.LG,cs.MA,eess.SY
MolCrystalFlow: Molecular Crystal Structure Prediction via Flow Matching
Molecular crystal structure prediction represents a grand challenge in computational chemistry due to the large size of the constituent molecules and complex intra- and intermolecular interactions. While generative modeling has revolutionized structure discovery for molecules, inorganic solids, and metal-organic frameworks, extending such approaches to fully periodic molecular crystals remains elusive. Here, we present MolCrystalFlow, a flow-based generative model for molecular crystal structure prediction. The framework disentangles intramolecular complexity from intermolecular packing by embedding molecules as rigid bodies and jointly learning the lattice matrix, molecular orientations, and centroid positions. Centroids and orientations are represented on their native Riemannian manifolds, allowing geodesic flow construction and graph neural network operations that respect geometric symmetries. We benchmark our model against a state-of-the-art generative model (MOFFlow) for large-size periodic crystals and a rule-based structure generation method (Genarris) on two open-source molecular crystal datasets. MolCrystalFlow outperforms MOFFlow while achieving competitive performance against Genarris. We also demonstrate the integration of MolCrystalFlow with a universal machine learning potential to accelerate molecular crystal structure prediction, paving the way for data-driven generative discovery of molecular crystals.
Updated: 2026-03-09 20:41:50
标题: MolCrystalFlow:通过流匹配进行分子晶体结构预测
摘要: 分子晶体结构预测在计算化学中是一个巨大的挑战,因为构成分子的尺寸较大,而且存在复杂的分子内和分子间相互作用。虽然生成建模已经在分子、无机固体和金属有机框架的结构发现方面取得了革命性进展,但将这些方法扩展到完全周期性的分子晶体仍然是困难的。在这里,我们提出了MolCrystalFlow,一种用于分子晶体结构预测的基于流的生成模型。该框架通过将分子嵌入为刚性体,并联合学习晶格矩阵、分子方向和质心位置,将分子内复杂性与分子间堆积分离开来。质心和方向在其原生黎曼流形上表示,从而支持测地线流的构造以及尊重几何对称性的图神经网络操作。我们在两个开源分子晶体数据集上将我们的模型与面向大尺寸周期性晶体的最先进生成模型(MOFFlow)和基于规则的结构生成方法(Genarris)进行了基准测试。MolCrystalFlow在性能上优于MOFFlow,同时取得了可与Genarris相当的性能。我们还展示了将MolCrystalFlow模型与通用机器学习势集成,以加速分子晶体结构预测,为数据驱动的分子晶体生成式发现铺平了道路。
更新时间: 2026-03-09 20:41:50
领域: cs.LG,cond-mat.mtrl-sci
Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning
Concept Bottleneck Models (CBMs) are a prominent framework for interpretable AI that map learned visual features to a set of meaningful concepts for task-specific downstream predictions. Their sequential structure enhances transparency by connecting model predictions to the underlying concepts that support them. In medical imaging, where transparency is essential, CBMs offer an appealing foundation for explainable model design. However, discrete concept representations often overlook broader clinical context such as diagnostic guidelines and expert heuristics, reducing reliability in complex cases. We propose MedCBR, a concept-based reasoning framework that integrates clinical guidelines with vision-language and reasoning models. Labeled clinical descriptors are transformed into guideline-conformant text, and a concept-based model is trained with a multitask objective combining multimodal contrastive alignment, concept supervision, and diagnostic classification to jointly ground image features, concepts, and pathology. A reasoning model then converts these predictions into structured clinical narratives that explain the diagnosis, emulating expert reasoning based on established guidelines. MedCBR achieves superior diagnostic and concept-level performance, with AUROCs of 94.2% on ultrasound and 84.0% on mammography. Further experiments on non-medical datasets achieve 86.1% accuracy. Our framework enhances interpretability and forms an end-to-end bridge from medical image analysis to decision-making.
Updated: 2026-03-09 20:39:46
标题: 视觉语言模型编码临床指南以进行基于概念的医学推理
摘要: 概念瓶颈模型(CBMs)是一个重要的框架,用于可解释的人工智能,将学习到的视觉特征映射到一组有意义的概念,用于特定任务的下游预测。它们的顺序结构通过将模型预测与支持它们的基本概念相连接,增强了透明度。在医学影像领域,透明度至关重要,CBMs为可解释的模型设计提供了一个吸引人的基础。然而,离散的概念表示通常忽视了更广泛的临床背景,如诊断指南和专家启发,降低了在复杂情况下的可靠性。我们提出了MedCBR,一个基于概念的推理框架,将临床指南与视觉语言和推理模型整合在一起。标记的临床描述符被转换为符合指南的文本,并且通过多模态对比对齐、概念监督和诊断分类的多任务目标训练了一个基于概念的模型,共同地基于图像特征、概念和病理。然后,一个推理模型将这些预测转换为解释诊断的结构化临床叙述,模拟基于已建立指南的专家推理。MedCBR实现了卓越的诊断和概念水平性能,在超声和乳腺X光上分别达到了94.2%和84.0%的AUROC。在非医学数据集上的进一步实验达到了86.1%的准确率。我们的框架增强了可解释性,并形成了从医学图像分析到决策制定的端到端桥梁。
更新时间: 2026-03-09 20:39:46
领域: cs.CV,cs.LG
Uncovering a Winning Lottery Ticket with Continuously Relaxed Bernoulli Gates
Over-parameterized neural networks incur prohibitive memory and computational costs for resource-constrained deployment. The Strong Lottery Ticket (SLT) hypothesis suggests that randomly initialized networks contain sparse subnetworks achieving competitive accuracy without weight training. Existing SLT methods, notably edge-popup, rely on non-differentiable score-based selection, limiting optimization efficiency and scalability. We propose using continuously relaxed Bernoulli gates to discover SLTs through fully differentiable, end-to-end optimization - training only gating parameters while keeping all network weights frozen at their initialized values. Continuous relaxation enables direct gradient-based optimization of an $\ell_0$-regularization objective, eliminating the need for non-differentiable gradient estimators or iterative pruning cycles. To our knowledge, this is the first fully differentiable approach for SLT discovery that avoids straight-through estimator approximations. Experiments across fully connected networks, CNNs (ResNet, Wide-ResNet), and Vision Transformers (ViT, Swin-T) demonstrate up to 90% sparsity with minimal accuracy loss - nearly double the sparsity achieved by edge-popup at comparable accuracy - establishing a scalable framework for pre-training network sparsification.
Updated: 2026-03-09 20:33:16
标题: 利用连续松弛伯努利门发现中奖彩票
摘要: 过参数化神经网络在资源受限的部署中会产生过高的内存和计算成本。强彩票(SLT)假设表明,随机初始化的网络包含稀疏子网络,可以在没有权重训练的情况下达到有竞争力的准确性。现有的SLT方法,尤其是edge-popup,依赖于不可微的基于得分的选择,限制了优化效率和可扩展性。我们提出使用连续松弛的伯努利门,通过完全可微的端到端优化发现SLT - 只训练门控参数,同时保持所有网络权重冻结在其初始化的值。连续松弛可以直接基于梯度优化$\ell_0$正则化目标,消除了对不可微梯度估计器或迭代剪枝循环的需求。据我们所知,这是第一个完全可微的SLT发现方法,避免了直通估计器的近似。在全连接网络、CNNs(ResNet、Wide-ResNet)和Vision Transformers(ViT、Swin-T)上的实验表明,可以实现高达90%的稀疏性而几乎没有准确性损失 - 在相似准确性下,几乎是edge-popup所实现稀疏性的两倍 - 建立了一个可扩展的框架用于预训练网络稀疏化。
更新时间: 2026-03-09 20:33:16
领域: cs.LG,cs.AI
Quantifying Memorization and Privacy Risks in Genomic Language Models
Genomic language models (GLMs) have emerged as powerful tools for learning representations of DNA sequences, enabling advances in variant prediction, regulatory element identification, and cross-task transfer learning. However, as these models are increasingly trained or fine-tuned on sensitive genomic cohorts, they risk memorizing specific sequences from their training data, raising serious concerns around privacy, data leakage, and regulatory compliance. Despite growing awareness of memorization risks in general-purpose language models, little systematic evaluation exists for these risks in the genomic domain, where data exhibit unique properties such as a fixed nucleotide alphabet, strong biological structure, and individual identifiability. We present a comprehensive, multi-vector privacy evaluation framework designed to quantify memorization risks in GLMs. Our approach integrates three complementary risk assessment methodologies: perplexity-based detection, canary sequence extraction, and membership inference. These are combined into a unified evaluation pipeline that produces a worst-case memorization risk score. To enable controlled evaluation, we plant canary sequences at varying repetition rates into both synthetic and real genomic datasets, allowing precise quantification of how repetition and training dynamics influence memorization. We evaluate our framework across multiple GLM architectures, examining the relationship between sequence repetition, model capacity, and memorization risk. Our results establish that GLMs exhibit measurable memorization and that the degree of memorization varies across architectures and training regimes. These findings reveal that no single attack vector captures the full scope of memorization risk, underscoring the need for multi-vector privacy auditing as a standard practice for genomic AI systems.
Updated: 2026-03-09 20:30:37
标题: 量化基因组语言模型中的记忆和隐私风险
摘要: 基因组语言模型(GLMs)已经成为学习DNA序列表示的强大工具,促进了变异预测、调控元件识别和跨任务迁移学习的进展。然而,随着这些模型在敏感基因组队列上的训练或微调日益增多,它们存在着记忆其训练数据中特定序列的风险,引发了有关隐私、数据泄漏和法规合规性的严重担忧。尽管人们对通用语言模型中的记忆风险日益增强,但在基因组领域很少存在系统性评估这些风险的研究,因为该领域的数据具有固定的核苷酸字母表、强烈的生物结构和个体可识别性等独特属性。我们提出了一个全面的、多向量隐私评估框架,旨在量化GLMs中的记忆风险。我们的方法集成了三种互补的风险评估方法:基于困惑度的检测、金丝雀序列提取和成员推断。这些方法结合成一个统一的评估流程,生成一个最坏情况下的记忆风险评分。为了进行受控评估,我们在合成和真实的基因组数据集中以不同重复率种植金丝雀序列,从而精确量化重复和训练动态如何影响记忆。我们评估了我们的框架跨多个GLM架构,考察序列重复、模型容量和记忆风险之间的关系。我们的结果表明GLMs具有可测量的记忆,记忆程度在不同架构和训练方案之间存在差异。这些发现揭示了没有单一攻击向量能够完全捕捉记忆风险的全部范围,强调了多向量隐私审计作为基因组AI系统的标准实践的必要性。
更新时间: 2026-03-09 20:30:37
领域: cs.LG,cs.CR,q-bio.GN
FedLECC: Cluster- and Loss-Guided Client Selection for Federated Learning under Non-IID Data
Federated Learning (FL) enables distributed Artificial Intelligence (AI) across cloud-edge environments by allowing collaborative model training without centralizing data. In cross-device deployments, FL systems face strict communication and participation constraints, as well as strongly non-independent and identically distributed (non-IID) data that degrades convergence and model quality. Since only a subset of devices (a.k.a. clients) can participate per training round, intelligent client selection becomes a key systems challenge. This paper proposes FedLECC (Federated Learning with Enhanced Cluster Choice), a lightweight, cluster-aware, and loss-guided client selection strategy for cross-device FL. FedLECC groups clients by label-distribution similarity and prioritizes clusters and clients with higher local loss, enabling the selection of a small yet informative and diverse set of clients. Experimental results under severe label skew show that FedLECC improves test accuracy by up to 12%, while reducing communication rounds by approximately 22% and overall communication overhead by up to 50% compared to strong baselines. These results demonstrate that informed client selection improves the efficiency and scalability of FL workloads in cloud-edge systems.
Updated: 2026-03-09 20:28:17
标题: FedLECC:在非独立同分布数据下的联邦学习中基于簇和损失指导的客户端选择
摘要: Federated Learning (FL)通过允许协作模型训练而不集中数据,实现了在云边环境中分布式人工智能(AI)。在跨设备部署中,FL系统面临严格的通信和参与约束,以及强非独立和同分布(non-IID)数据,这些因素会降低收敛性和模型质量。由于每轮训练只有一小部分设备(也称为客户端)能够参与,智能客户端选择成为一个关键的系统挑战。本文提出了FedLECC(具有增强集群选择的联邦学习),这是一个轻量级、集群感知和损失引导的跨设备FL客户端选择策略。FedLECC通过标签分布相似性对客户端进行分组,并优先考虑具有更高本地损失的集群和客户端,从而选择出一小部分富有信息且多样化的客户端集。在严重标签倾斜的实验结果中显示,与强基线相比,FedLECC提高了测试准确度高达12%,同时将通信轮次减少了约22%,整体通信开销降低了高达50%。这些结果表明,有信息的客户端选择提高了在云边系统中FL工作负载的效率和可扩展性。
更新时间: 2026-03-09 20:28:17
领域: cs.DC,cs.AI,cs.LG
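The selection rule summarized in the abstract above (group clients by label-distribution similarity, then prioritize clusters and clients with higher local loss) can be illustrated with a small sketch. It assumes cluster assignments are already computed, and it ranks clusters by mean loss and clients by individual loss; both the ranking rule and every name below are hypothetical, since the abstract does not give FedLECC's exact scoring.

```python
from collections import defaultdict


def select_clients(cluster_of, loss_of, n_clusters, per_cluster):
    """Pick the highest-loss clients from the highest-loss clusters.

    cluster_of: client id -> cluster id (assumed precomputed from label distributions)
    loss_of:    client id -> most recent local loss
    """
    groups = defaultdict(list)
    for client, cluster in cluster_of.items():
        groups[cluster].append(client)

    def mean_loss(cluster):
        members = groups[cluster]
        return sum(loss_of[c] for c in members) / len(members)

    # Assumption: clusters are ranked by mean local loss, highest first.
    top_clusters = sorted(groups, key=mean_loss, reverse=True)[:n_clusters]

    selected = []
    for cluster in top_clusters:
        ranked = sorted(groups[cluster], key=lambda c: loss_of[c], reverse=True)
        selected.extend(ranked[:per_cluster])
    return selected


# Hypothetical round: six clients in three label-distribution clusters.
cluster_of = {"c1": 0, "c2": 0, "c3": 1, "c4": 1, "c5": 2, "c6": 2}
loss_of = {"c1": 0.9, "c2": 0.4, "c3": 0.2, "c4": 0.3, "c5": 1.1, "c6": 0.7}
print(select_clients(cluster_of, loss_of, n_clusters=2, per_cluster=1))
```

Drawing from several clusters keeps the selected set diverse across label distributions, while the loss ranking steers each round toward clients the current model fits worst.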
Cross-Domain Uncertainty Quantification for Selective Prediction: A Comprehensive Bound Ablation with Transfer-Informed Betting
We present a comprehensive ablation of nine finite-sample bound families for selective prediction with risk control, combining concentration inequalities (Hoeffding, Empirical Bernstein, Clopper-Pearson, Wasserstein DRO, CVaR) with multiple-testing corrections (union bound, Learn Then Test fixed-sequence) and betting-based confidence sequences (WSR). Our main theoretical contribution is Transfer-Informed Betting (TIB), which warm-starts the WSR wealth process using a source domain's risk profile, achieving tighter bounds in data-scarce settings with a formal dominance guarantee. We prove that the TIB wealth process remains a valid supermartingale under all source-target divergences, that TIB dominates standard WSR when domains match, and that no data-independent warm-start can achieve better convergence. The combination of betting-based confidence sequences, LTT monotone testing, and cross-domain transfer is, to our knowledge, a three-way novelty not present in the literature. We evaluate all nine bound families on four benchmarks: MASSIVE (n=1,102), NyayaBench (n=280), CLINC-150 (n=22.5K), and Banking77 (n=13K), across 18 (alpha, delta) configurations. On MASSIVE at alpha=0.10, LTT eliminates the ln(K) union-bound penalty, achieving 94.0% guaranteed coverage versus 73.8% for Hoeffding, a 27% relative improvement. On NyayaBench, where the small calibration set makes Hoeffding-family bounds infeasible below alpha=0.20, Transfer-Informed Betting achieves 18.5% coverage at alpha=0.10, a 5.4x improvement over LTT + Hoeffding. We additionally compare with split-conformal prediction, showing that conformal methods produce prediction sets (avg. 1.67 classes) whereas selective prediction provides single-prediction risk guarantees. We apply these methods to agentic caching systems, formalizing a progressive trust model where the guarantee determines when cached responses can be served autonomously.
Updated: 2026-03-09 20:25:18
标题: 面向选择性预测的跨领域不确定性量化:结合迁移信息投注的全面界限消融
摘要: 我们提出了对具有风险控制的选择性预测的九种有限样本界限家族的广泛消融,将集中不等式(Hoeffding、经验伯恩斯坦、Clopper-Pearson、Wasserstein DRO、CVaR)与多重检验校正(并集界限、先学习后测试固定序列)和基于赌注的置信序列(WSR)相结合。我们的主要理论贡献是传输信息赌注(TIB),通过使用源域的风险配置文件对 WSR 财富过程进行热启动,在数据稀缺的情况下实现更紧密的边界,并具有正式的支配保证。我们证明了 TIB 财富过程在所有源-目标分歧下仍然是有效的超马丁格尔,并且当域匹配时,TIB 优于标准 WSR,没有数据独立的热启动可以实现更好的收敛。基于赌注的置信序列、LTT 单调测试和跨域转移的结合,在我们所知的范围内,是文献中不存在的三向新颖性。我们在四个基准测试中评估了所有九个边界家族-MASSIVE(n=1,102)、NyayaBench(n=280)、CLINC-150(n=22.5K)和Banking77(n=13K)-跨越 18 个(alpha、delta)配置。在 alpha=0.10 下的 MASSIVE 上,LTT 消除了 ln(K) 并集界限惩罚,实现了94.0% 的保证覆盖率,而 Hoeffding 为 73.8% - 相对改进 27%。在 NyayaBench 上,由于较小的校准集使得 Hoeffding 家族的边界在 alpha=0.20 以下不可行,传输信息赌注在 alpha=0.10 时实现了18.5% 的覆盖率,比 LTT + Hoeffding 改善了 5.4 倍。我们还将其与 split-conformal 预测进行比较,结果显示,conformal 方法产生预测集(平均 1.67 个类别),而选择性预测提供单个预测的风险保证。我们将这些方法应用于主动缓存系统,形成一个渐进信任模型,其中保证确定了何时可以自主提供缓存响应。
更新时间: 2026-03-09 20:25:18
领域: cs.LG,cs.AI,stat.ML
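The ln(K) union-bound penalty mentioned in the abstract above can be made concrete with the textbook Hoeffding upper confidence bound for a loss in [0, 1]. The sketch below is a generic illustration, not the paper's code: the sample sizes and candidate counts are hypothetical, and `n` stands for however many calibration points the bound is computed on.

```python
import math


def hoeffding_ucb(emp_risk, n, delta, k=1):
    """Upper confidence bound on the true risk of a [0, 1]-bounded loss.

    With probability at least 1 - delta the bound holds simultaneously for all
    k candidates (union bound), which is where the ln(k) term comes from.
    """
    return emp_risk + math.sqrt(math.log(k / delta) / (2 * n))


def is_certified(emp_risk, n, alpha, delta, k=1):
    """A candidate is certified when its risk bound does not exceed the target alpha."""
    return hoeffding_ucb(emp_risk, n, delta, k) <= alpha


# Hypothetical setting: zero empirical risk on 280 calibration points, delta = 0.05.
for k in (1, 10, 100):
    bound = hoeffding_ucb(0.0, 280, 0.05, k)
    print(k, round(bound, 4), is_certified(0.0, 280, alpha=0.10, delta=0.05, k=k))
```

Because the slack grows with ln(k), testing many candidate thresholds on a small calibration set can make a low target risk unreachable even when the empirical risk is zero. This is the kind of penalty that, per the abstract, fixed-sequence testing avoids.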
Discovering Symbolic Differential Equations with Symmetry Invariants
Discovering symbolic differential equations from data uncovers fundamental dynamical laws underlying complex systems. However, existing methods often struggle with the vast search space of equations and may produce equations that violate known physical laws. In this work, we address these problems by introducing the concept of symmetry invariants in equation discovery. We leverage the fact that differential equations admitting a symmetry group can be expressed in terms of differential invariants of symmetry transformations. Thus, we propose to use these invariants as atomic entities in equation discovery, ensuring the discovered equations satisfy the specified symmetry. Our approach integrates seamlessly with existing equation discovery methods such as sparse regression and genetic programming, improving their accuracy and efficiency. We validate the proposed method through applications to various physical systems, such as fluid and reaction-diffusion, demonstrating its ability to recover parsimonious and interpretable equations that respect the laws of physics.
Updated: 2026-03-09 20:19:27
标题: 发现具有对称不变性的符号微分方程
摘要: 从数据中发现符号微分方程揭示了复杂系统基本动力学规律。然而,现有方法常常难以处理方程的广阔搜索空间,可能会产生违反已知物理定律的方程。在这项工作中,我们通过引入方程发现中的对称不变量的概念来解决这些问题。我们利用微分方程可以表达为对称变换的微分不变量的事实。因此,我们建议在方程发现中使用这些不变量作为原子实体,确保发现的方程符合指定的对称性。我们的方法与现有的方程发现方法(如稀疏回归和遗传编程)无缝集成,提高了它们的准确性和效率。我们通过应用于各种物理系统(如流体和反应扩散)来验证所提出的方法,展示了其恢复简练且可解释的方程,并遵守物理定律的能力。
更新时间: 2026-03-09 20:19:27
领域: cs.LG
ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining its performance for the high-resource languages.
Updated: 2026-03-09 20:16:21
标题: ConLID:低资源语言识别的监督对比学习
摘要: 语言识别(LID)是从网络爬虫中筛选多语言LLM预训练语料库的关键步骤。虽然许多关于LID模型训练的研究侧重于收集多样化的训练数据以提高性能,但低资源语言——通常仅限于单一领域数据,如圣经——仍然表现不佳。为了解决这些不平衡和偏见问题,我们提出了一种新颖的监督对比学习(SCL)方法,用于学习低资源语言的域不变表示。我们展示了我们的方法将低资源语言在超出领域数据上的LID性能提高了3.2个百分点,同时保持了对高资源语言的性能。
更新时间: 2026-03-09 20:16:21
领域: cs.CL,cs.AI,cs.LG
NetDiffuser: Deceiving DNN-Based Network Attack Detection Systems with Diffusion-Generated Adversarial Traffic
Deep learning (DL)-based Network Intrusion Detection Systems (NIDS) have demonstrated great promise in detecting malicious network traffic. However, they face significant security risks due to their vulnerability to adversarial examples (AEs). Most existing adversarial attacks maliciously perturb data to maximize misclassification errors. Among AEs, natural adversarial examples (NAEs) are particularly difficult to detect because they closely resemble real data, making them challenging for both humans and machine learning models to distinguish from legitimate inputs. Creating NAEs is crucial for testing and strengthening NIDS defenses. This paper proposes NetDiffuser, a novel framework for generating NAEs capable of deceiving NIDS. NetDiffuser consists of two novel components. First, a new feature categorization algorithm is designed to identify relatively independent features in network traffic. Perturbing these features minimizes changes while preserving network flow validity. The second component is a novel application of diffusion models to inject semantically consistent perturbations for generating NAEs. NetDiffuser's performance was extensively evaluated using three benchmark NIDS datasets across various model architectures and state-of-the-art adversarial detectors. Our experimental results show that NetDiffuser achieves up to a 29.93% higher attack success rate and reduces AE detection performance by at least 0.267 (in some cases up to 0.534) in the Area under the Receiver Operating Characteristic Curve (AUC-ROC) score compared to the baseline attacks.
Updated: 2026-03-09 20:13:51
标题: NetDiffuser:使用扩散生成的对抗性流量欺骗基于DNN的网络攻击检测系统
摘要: 基于深度学习(DL)的网络入侵检测系统(NIDS)已经展示出在检测恶意网络流量方面具有巨大潜力。然而,它们面临着由于对对抗样本(AEs)的脆弱性而带来的重大安全风险。大多数现有的对抗攻击都是恶意扰乱数据以最大化误分类错误。在AEs中,自然对抗样本(NAEs)特别难以检测,因为它们与真实数据非常相似,使得它们对于人类和机器学习模型都具有挑战性,难以与合法输入区分开来。创建NAEs对于测试和加强NIDS的防御至关重要。本文提出了一种名为NetDiffuser的新颖框架,用于生成能够欺骗NIDS的NAEs。NetDiffuser由两个新颖组件组成。首先,设计了一种新的特征分类算法,用于识别网络流量中相对独立的特征。扰动这些特征可以最大程度地减少变化,同时保持网络流的有效性。第二个组件是对扩散模型的新颖应用,用于注入语义一致的扰动以生成NAEs。通过在各种模型架构和最先进的对抗检测器上使用三个基准NIDS数据集对NetDiffuser性能进行了广泛评估。我们的实验结果表明,与基准攻击相比,NetDiffuser的攻击成功率最高提高了29.93%,并且在接收器操作特性曲线下的面积(AUC-ROC)分数中,至少降低了AE检测性能0.267(在某些情况下甚至高达0.534)。
更新时间: 2026-03-09 20:13:51
领域: cs.CR,cs.AI
A New Modeling to Feature Selection Based on the Fuzzy Rough Set Theory in Normal and Optimistic States on Hybrid Information Systems
Considering the high volume, wide variety, and rapid speed of data generation, investigating feature selection methods for big data presents various applications and advantages. By removing irrelevant and redundant features, feature selection reduces data dimensions, thereby facilitating optimal decision-making within decision systems. One of the key tools for feature selection in hybrid information systems is fuzzy rough set theory. However, this theory faces two significant challenges. First, obtaining fuzzy equivalence relations through intersection operations in high-dimensional spaces can be both time-consuming and memory-intensive. Second, this method may produce noisy data, complicating the feature selection process. The purpose and innovation of this paper are to address these issues. We propose a new feature selection model that calculates the combined distance between objects and subsequently uses this information to derive the fuzzy equivalence relation. Rather than directly solving the feature selection problem, this approach reformulates it into an optimization problem that can be tackled using appropriate meta-heuristic algorithms. We have named this new approach FSbuHD. The FSbuHD model operates in two modes - normal and optimistic - based on the selection of one of the two introduced fuzzy equivalence relations. The model is then tested on standard datasets from the UCI repository and compared with other algorithms. The results of this research demonstrate that FSbuHD is one of the most efficient and effective methods for feature selection when compared to previous methods and algorithms.
Updated: 2026-03-09 20:12:44
标题: 基于模糊粗糙集理论在混合信息系统中正常和乐观状态下的特征选择新建模
摘要: 鉴于数据生成的高速、广泛和大量,对大数据进行特征选择方法的研究具有各种应用和优势。通过消除不相关和冗余特征,特征选择减少了数据维度,从而有助于在决策系统内进行最佳决策。在混合信息系统中进行特征选择的关键工具之一是模糊粗糙集理论。然而,该理论面临两个重大挑战:首先,在高维空间中通过交集运算获得模糊等价关系可能耗时且占用内存。此外,该方法可能产生嘈杂数据,使特征选择过程复杂化。本文的目的和创新在于解决这些问题。我们提出了一个新的特征选择模型,该模型计算对象之间的组合距离,然后利用这些信息推导模糊等价关系。与直接解决特征选择问题不同,这种方法将其重新构建为一个可以使用适当的元启发算法来解决的优化问题。我们将这种新方法命名为FSbuHD。FSbuHD模型以两种模式运行 - 正常和乐观 - 基于引入的两个模糊等价关系之一的选择。然后,在UCI仓库的标准数据集上测试该模型,并与其他算法进行比较。本研究结果表明,与先前的方法和算法相比,FSbuHD是特征选择的最有效和有效方法之一。
更新时间: 2026-03-09 20:12:44
领域: cs.LG,cs.AI
ConFu: Contemplate the Future for Better Speculative Sampling
Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8-11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
Updated: 2026-03-09 20:11:06
标题: ConFu:思考未来以实现更好的推测采样
摘要: 推测解码已经成为一种强大的方法,通过利用轻量级草稿模型提出候选标记,随后由目标模型验证,加速大型语言模型(LLM)推理。这种范式的有效性在很大程度上取决于草稿模型的质量。尽管最近的进展,如EAGLE系列实现了最先进的加速,但现有的草稿模型仍然受到错误积累的限制:它们仅在当前前缀的条件下运作,导致其预测在步骤中与目标模型偏离。在这项工作中,我们提出了一种新颖的推测解码框架ConFu(Contemplate the Future),使草稿模型能够预测生成的未来方向。ConFu引入(i)思考标记和软提示,允许草稿模型利用来自目标模型的未来导向信号,成本微乎其微,(ii)具有MoE的动态思考标记机制,以实现上下文感知的未来预测,以及(iii)具有锚标记采样和未来预测复制的训练框架,学习稳健的未来预测。实验证明,ConFu在各种下游任务中,使用Llama-3 3B和8B模型,比EAGLE-3提高了8-11%的标记接受率和生成速度。我们相信我们的工作是首次将推测解码与连续推理标记相结合,为加速LLM推理提供了一个新方向。
更新时间: 2026-03-09 20:11:06
领域: cs.CL,cs.LG
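The draft-then-verify loop that the abstract above takes as its starting point can be sketched in a few lines. This shows only the generic greedy baseline, not ConFu's contemplate tokens or training scheme; the two toy "models" are hypothetical stand-ins that map a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy draft-then-verify step; returns the tokens to append to `prefix`."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase: keep draft tokens while they match the target's greedy choice.
    # (A real system scores all k positions in one parallel target forward pass.)
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when every draft token is accepted
    return accepted


# Toy stand-ins: the draft agrees with the target only on short contexts,
# mimicking the drift over steps that the abstract describes.
target_next = lambda ctx: len(ctx) % 5
draft_next = lambda ctx: len(ctx) % 5 if len(ctx) < 3 else 0

print(speculative_step([0], draft_next, target_next, k=4))
```

Each step emits at least one target-approved token, so the output matches plain greedy decoding with the target model; the speedup depends on how many draft tokens survive verification, which is the acceptance rate the abstract reports improving.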
The Gaussian-Multinoulli Restricted Boltzmann Machine: A Potts Model Extension of the GRBM
Many real-world tasks, from associative memory to symbolic reasoning, benefit from discrete, structured representations that standard continuous latent models can struggle to express. We introduce the Gaussian-Multinoulli Restricted Boltzmann Machine (GM-RBM), a generative energy-based model that extends the Gaussian-Bernoulli RBM (GB-RBM) by replacing binary hidden units with q-state categorical (Potts) units, yielding a richer latent state space for multivalued concepts. We provide a self-contained derivation of the energy, conditional distributions, and learning rules, and detail practical training choices (contrastive divergence with temperature annealing and intra-slot diversity constraints) that avoid state collapse. To separate architectural effects from sheer latent capacity, we evaluate under both capacity-matched and parameter-matched setups, comparing GM-RBM with GB-RBM configured to have the same number of possible latent assignments. On analogical recall and structured memory benchmarks, GM-RBM achieves competitive, and in several regimes improved, recall at equal capacity with comparable training cost, despite using only Gibbs updates. The discrete q-ary formulation is also amenable to efficient implementation. These results clarify when categorical hidden units provide a simple, scalable alternative to binary latents for discrete inference within tractable RBMs.
Updated: 2026-03-09 20:00:05
标题: 高斯-多努利受限玻尔兹曼机:GRBM的Potts模型扩展
摘要: 许多现实世界任务,从关联记忆到符号推理,受益于离散、结构化的表示,而标准的连续潜在模型可能难以表达。我们引入了高斯-多努利受限玻尔兹曼机(GM-RBM),这是一种生成能量基模型,通过用 q 状态分类(波茨)单元替换二进制隐藏单元来扩展高斯-伯努利 RBM(GB-RBM),从而为多值概念提供更丰富的潜在状态空间。我们提供了能量、条件分布和学习规则的自包含推导,并详细介绍了实际的训练选择(对比散度与温度退火以及内部槽位多样性约束),以避免状态崩溃。为了区分架构效果与纯粹的潜在容量,我们在容量匹配和参数匹配的设置下进行评估,将 GM-RBM 与配置为具有相同潜在分配可能的 GB-RBM 进行比较。在类比回忆和结构化记忆基准测试中,GM-RBM 在相同容量下取得了竞争性、并且在几个领域上取得了改善的回忆,尽管只使用了 Gibbs 更新。离散的 q-元公式也适合于高效实现。这些结果澄清了何时分类隐藏单元为可计算的 RBM 提供了一个简单、可扩展的替代方案,用于离散推理。
更新时间: 2026-03-09 20:00:05
领域: cs.LG
Scalable Message Passing Neural Networks: No Need for Attention in Large Graph Representation Learning
We propose Scalable Message Passing Neural Networks (SMPNNs) and demonstrate that, by integrating standard convolutional message passing into a Pre-Layer Normalization Transformer-style block instead of attention, we can produce high-performing deep message-passing-based Graph Neural Networks (GNNs). This modification yields results competitive with the state-of-the-art in large graph transductive learning, particularly outperforming the best Graph Transformers in the literature, without requiring the otherwise computationally and memory-expensive attention mechanism. Our architecture not only scales to large graphs but also makes it possible to construct deep message-passing networks, unlike simple GNNs, which have traditionally been constrained to shallow architectures due to oversmoothing. Moreover, we provide a new theoretical analysis of oversmoothing based on universal approximation which we use to motivate SMPNNs. We show that in the context of graph convolutions, residual connections are necessary for maintaining the universal approximation properties of downstream learners and that removing them can lead to a loss of universality.
Updated: 2026-03-09 19:58:38
标题: 可扩展的消息传递神经网络:在大规模图表示学习中无需注意力
摘要: 我们提出了可扩展的消息传递神经网络(SMPNNs),并证明通过将标准的卷积消息传递集成到一个预层归一化变换器风格的块中,而不是使用注意力机制,我们可以生成性能优越的基于深度消息传递的图神经网络(GNNs)。这种修改产生的结果与大规模图的传导学习中最先进的方法相竞争,尤其是在文献中表现出色的最佳图变换器,而不需要计算和内存昂贵的注意力机制。我们的架构不仅可以扩展到大图,而且可以构建深度消息传递网络,而简单的GNNs传统上由于过度平滑而受限于浅层结构。此外,我们提供了一个基于通用逼近的过度平滑的新理论分析,我们用它来激励SMPNNs。我们展示了在图卷积的背景下,残差连接对于维持下游学习器的通用逼近性质是必要的,并且消除它们可能导致通用性的丧失。
更新时间: 2026-03-09 19:58:38
领域: cs.LG
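To make the SMPNN entry above more concrete, the following is a minimal NumPy sketch of a Pre-Layer-Normalization, Transformer-style block in which the attention sublayer is replaced by a standard graph convolution. The function names, the symmetric GCN normalization, the ReLU feed-forward sublayer, and the exact ordering of operations are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each node's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def normalized_adjacency(adj):
    # Symmetric GCN-style normalization with self-loops: D^-1/2 (A + I) D^-1/2.
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def smpnn_style_block(x, a_hat, w_conv, w_ff1, w_ff2):
    # Sublayer 1: pre-LN graph convolution (in place of attention) plus residual.
    h = x + a_hat @ layer_norm(x) @ w_conv
    # Sublayer 2: pre-LN two-layer feed-forward network plus residual (assumed).
    ff = np.maximum(layer_norm(h) @ w_ff1, 0.0) @ w_ff2
    return h + ff
```

The residual connections mean that a block with all-zero weights reduces to the identity map, which is one informal way to see why stacking many such blocks need not wash out the input features.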
Prognostics for Autonomous Deep-Space Habitat Health Management under Multiple Unknown Failure Modes
Deep-space habitats (DSHs) are safety-critical systems that must operate autonomously for long periods, often beyond the reach of ground-based maintenance or expert intervention. Monitoring health and anticipating failures are essential for safe operations. Prognostics based on remaining useful life (RUL) prediction support this goal by estimating how long a subsystem can operate before failure. Critical DSH subsystems, including environmental control and life support, power generation, and thermal control, are monitored by many sensors and can degrade through multiple failure modes. In practice, these failure modes are often unknown, and the sensors providing useful information may vary across modes, making accurate RUL prediction challenging when failure data are unlabeled. We propose an unsupervised prognostics framework for RUL prediction that jointly identifies latent failure modes and selects informative sensors using unlabeled run-to-failure data. The framework has two phases: offline sensor selection and failure mode identification, and online diagnosis and RUL prediction. In the offline phase, failure times are modeled using a mixture of Gaussian regressions, and an Expectation-Maximization algorithm simultaneously clusters degradation trajectories and selects mode-specific sensors. In the online phase, low-dimensional features from selected sensors diagnose the active failure mode and predict RUL through a weighted functional regression model. The framework is evaluated on a simulated dataset capturing key telemetry challenges in DSH systems and on the NASA C-MAPSS benchmark. Results show improved prediction accuracy and clearer identification of informative sensors and failure modes than existing methods.
Updated: 2026-03-09 19:58:25
标题: 自主深空航天生活环境健康管理在多种未知故障模式下的预测
摘要: 深空航天生活舱(DSHs)是安全关键系统,必须在长时间内实现自主运行,通常超出地面维护或专家干预的范围。监测健康状况并预测故障对安全运行至关重要。基于剩余有用寿命(RUL)预测的预测支持这一目标,通过估计子系统在故障之前可以运行多长时间来实现。关键的DSH子系统,包括环境控制和生命支持、发电和热控制,由许多传感器监测,并可通过多种故障模式退化。在实践中,这些故障模式通常是未知的,并且提供有用信息的传感器可能因模式而异,使得在故障数据未标记时准确预测RUL具有挑战性。我们提出了一种无监督的RUL预测的预测框架,通过使用未标记的故障数据共同识别潜在的故障模式和选择信息传感器。该框架分为两个阶段:离线传感器选择和故障模式识别,以及在线诊断和RUL预测。在离线阶段,使用高斯回归混合模型来建模故障时间,并利用期望最大化算法同时对退化轨迹进行聚类和选择特定模式的传感器。在在线阶段,来自选定传感器的低维特征诊断活动故障模式,并通过加权函数回归模型预测RUL。该框架在捕获DSH系统中关键遥测挑战的模拟数据集和NASA C-MAPSS基准测试中进行了评估。结果显示,与现有方法相比,提高了预测准确性,并清晰地识别了信息传感器和故障模式。
更新时间: 2026-03-09 19:58:25
领域: stat.ML,cs.LG,eess.SY,stat.AP
Latent Speech-Text Transformer
Auto-regressive speech-text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to the much longer sequences of speech tokens relative to text. This modality imbalance disproportionately allocates pre-training and inference compute to speech, potentially hindering effective cross-modal alignment and slowing performance scaling by orders of magnitude. We introduce the Latent Speech-Text Transformer (LST), which aggregates speech tokens into latent speech patches that serve as higher-level autoregressive units. This design aligns the sequence-modeling granularity between speech and text while improving computational efficiency. The resulting patches can align with textual units to facilitate cross-modal knowledge transfer and compactly capture recurring acoustic patterns such as silence. Across story-completion benchmarks under both compute-controlled and data-controlled settings, LST consistently improves speech accuracy while also improving text performance, achieving up to +6.5% absolute gain on speech HellaSwag in compute-controlled training (+5.3% in data-controlled training). Under compute-controlled scaling from 420M to 1.8B parameters in a near compute-optimal regime, gains grow with scale, and improvements persist up to 7B parameters under fixed-token budgets. These benefits extend to downstream tasks: LST stabilizes ASR adaptation and reduces the effective autoregressive sequence length during ASR and TTS inference, lowering computational cost without degrading reconstruction quality. The code is available at https://github.com/facebookresearch/lst.
Updated: 2026-03-09 19:57:30
标题: 潜在语音文本转换器
摘要: 经过预先训练的自回归语音-文本模型,这些模型在交错的文本标记和离散的语音标记上表现出强大的语音理解和生成能力,然而与文本LLMs相比,计算效率仍然明显较低,部分原因是语音标记相对于文本标记具有更长的序列。这种模态不平衡不成比例地分配了预训练和推断计算资源给语音,可能会阻碍有效的跨模态对齐,并使性能放缓数个数量级。我们引入了潜在语音-文本Transformer(LST),它将语音标记聚合成用作更高级自回归单元的潜在语音块。这种设计在改善计算效率的同时,也使语音和文本之间的序列建模粒度保持一致。结果产生的块可以与文本单元对齐,以促进跨模态知识转移,并紧凑地捕捉重复的声学模式,如静音。在计算受控和数据受控的情况下进行的故事完成基准测试中,LST始终提高了语音准确性,同时也提高了文本性能,在计算受控训练中,HellaSwag的语音准确率可提高多达+6.5%(在数据受控训练中为+5.3%)。在接近计算最优区域的情况下,从420M扩展到1.8B参数的计算受控缩放中,随着规模的增加,收益也增加,且在固定标记预算下,改进一直持续到7B参数。这些好处延伸到下游任务:LST稳定了ASR适应性,并减少了ASR和TTS推断期间的有效自回归序列长度,降低了计算成本,而不降低重建质量。代码可在https://github.com/facebookresearch/lst获得。
更新时间: 2026-03-09 19:57:30
领域: cs.CL,cs.AI,cs.LG,eess.AS
Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search
Agentic Retrieval-Augmented Generation (RAG) systems combine iterative search, planning prompts, and retrieval backends, but deployed settings impose explicit budgets on tool calls and completion tokens. We present a controlled measurement study of how search depth, retrieval strategy, and completion budget affect accuracy and cost under fixed constraints. Using Budget-Constrained Agentic Search (BCAS), a model-agnostic evaluation harness that surfaces remaining budget and gates tool use, we run comparisons across six LLMs and three question-answering benchmarks. Across models and datasets, accuracy improves with additional searches up to a small cap, hybrid lexical and dense retrieval with lightweight re-ranking produces the largest average gains in our ablation grid, and larger completion budgets are most helpful on HotpotQA-style synthesis. These results provide practical guidance for configuring budgeted agentic retrieval pipelines and are accompanied by reproducible prompts and evaluation settings.
Updated: 2026-03-09 19:42:21
标题: 量化在预算受限的代理LLM搜索中设计决策的准确性和成本影响
摘要: Agentic Retrieval-Augmented Generation(RAG)系统结合迭代搜索、规划提示和检索后端,但是在部署设置中对工具调用和完成令牌施加明确的预算。我们提出了一个受控的测量研究,研究搜索深度、检索策略和完成预算在固定约束条件下对准确性和成本的影响。使用Budget-Constrained Agentic Search(BCAS),这是一个与模型无关的评估工具,显示剩余预算并控制工具使用,我们在六个LLMs和三个问答基准上进行比较。在各种模型和数据集中,准确性随着额外搜索的增加而提高,直到达到一个小的上限,混合词汇和密集检索与轻量级重新排序产生了我们的消融格中最大的平均增益,较大的完成预算对HotpotQA风格的综合最有帮助。这些结果为配置有预算的主动检索管道提供了实用指导,并附带可复制的提示和评估设置。
更新时间: 2026-03-09 19:42:21
领域: cs.AI
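As a rough illustration of the budget gating described in the BCAS entry above, the sketch below tracks a search budget and a completion-token budget, exposes the remaining amounts, and refuses tool calls once the cap is reached. The class name `Budget`, its fields, and the `run_episode` helper are hypothetical and are not taken from the BCAS harness itself.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_searches: int
    max_completion_tokens: int
    searches_used: int = 0
    tokens_used: int = 0

    def remaining(self):
        # Remaining budget, e.g. to surface in the agent's prompt.
        return {
            "searches": self.max_searches - self.searches_used,
            "completion_tokens": self.max_completion_tokens - self.tokens_used,
        }

    def try_search(self):
        # Gate a tool call: allow it only while the search cap is not reached.
        if self.searches_used >= self.max_searches:
            return False
        self.searches_used += 1
        return True

    def try_spend_tokens(self, n):
        # Gate a completion: allow it only if it fits in the token budget.
        if self.tokens_used + n > self.max_completion_tokens:
            return False
        self.tokens_used += n
        return True

def run_episode(queries, search_fn, budget):
    # Issue searches in order until the budget refuses the next one.
    results = []
    for query in queries:
        if not budget.try_search():
            break
        results.append(search_fn(query))
    return results
```

With `Budget(max_searches=2, max_completion_tokens=100)`, a third query is never sent to `search_fn`, which mirrors the idea of a small cap on search depth.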
Why Channel-Centric Models are not Enough to Predict End-to-End Performance in Private 5G: A Measurement Campaign and Case Study
Communication-aware robot planning requires accurate predictions of wireless network performance. Current approaches rely on channel-level metrics such as received signal strength and signal-to-noise ratio, assuming these translate reliably into end-to-end throughput. We challenge this assumption through a measurement campaign in a private 5G industrial environment. We evaluate throughput predictions from a commercial ray-tracing simulator as well as data-driven Gaussian process regression models against measurements collected using a mobile robot. The study uses off-the-shelf user equipment in an underground, radio-shielded facility with detailed 3D modeling, representing a best-case scenario for prediction accuracy. The ray-tracing simulator captures the spatial structure of indoor propagation and predicts channel-level metrics with reasonable fidelity. However, it systematically over-predicts throughput, even in line-of-sight regions. The dominant error source is shown to be over-estimation of sustainable MIMO spatial layers: the simulator assumes near-uniform four-layer transmission while measurements reveal substantial adaptation between one and three layers. This mismatch inflates predicted throughput even when channel metrics appear accurate. In contrast, a Gaussian process model with a rational quadratic kernel achieves approximately two-thirds reduction in prediction error with near-zero bias by learning end-to-end throughput directly from measurements. These findings demonstrate that favorable channel conditions do not guarantee high throughput; communication-aware planners relying solely on channel-centric predictions risk overly optimistic trajectories that violate reliability requirements. Accurate throughput prediction for 5G systems requires either extensive calibration of link-layer models or data-driven approaches that capture real system behavior.
Updated: 2026-03-09 19:27:00
标题: 为什么基于通道的模型不足以预测私人5G端到端性能:一个测量活动和案例研究
摘要: 通信感知的机器人规划需要准确预测无线网络性能。当前的方法依赖于通道级度量,如接收信号强度和信噪比,假设这些可靠地转化为端到端吞吐量。我们通过在私人5G工业环境中进行测量活动来挑战这一假设。我们评估了商用射线跟踪模拟器以及基于数据驱动的高斯过程回归模型对使用移动机器人收集的测量数据进行吞吐量预测。该研究在一个地下、射频屏蔽设施中使用现成的用户设备和详细的3D建模,代表了预测准确性的最佳情况。射线跟踪模拟器捕捉了室内传播的空间结构,并以合理的忠实度预测通道级度量。然而,即使在视距区域,它也会系统性地高估吞吐量。主要的误差来源被显示为对可持续MIMO空间层的过度估计:模拟器假设几乎均匀的四层传输,而测量显示在一到三层之间有很大的调整。即使通道度量看起来准确,这种不匹配也会夸大预测的吞吐量。相反,具有合理二次核的高斯过程模型通过直接从测量中学习端到端吞吐量,实现了约三分之二的预测误差减少和接近零偏差。这些发现表明,有利的通道条件并不保证高吞吐量;仅依赖于通道中心预测的通信感知规划者可能会冒过于乐观的轨迹,违反可靠性要求。对于5G系统的准确吞吐量预测要么需要广泛的链路层模型校准,要么需要捕捉真实系统行为的数据驱动方法。
更新时间: 2026-03-09 19:27:00
领域: cs.NI,cs.LG,cs.RO
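For the private 5G entry above, a Gaussian process with a rational quadratic kernel can be sketched in a few lines of NumPy. The kernel below uses the standard form k(x, x') = variance * (1 + ||x - x'||^2 / (2 * alpha * length_scale^2))^(-alpha); the hyperparameter values, the noise level, and the function names are placeholders, since the abstract does not report the authors' settings.

```python
import numpy as np

def rational_quadratic(x1, x2, length_scale=1.0, alpha=1.0, variance=1.0):
    # Pairwise squared Euclidean distances between rows of x1 and x2.
    sq_dist = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(axis=-1)
    return variance * (1.0 + sq_dist / (2.0 * alpha * length_scale ** 2)) ** (-alpha)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-2, **kernel_args):
    # Standard GP regression mean: K_* (K + noise * I)^-1 y.
    k_train = rational_quadratic(x_train, x_train, **kernel_args)
    k_train = k_train + noise * np.eye(len(x_train))
    k_cross = rational_quadratic(x_test, x_train, **kernel_args)
    return k_cross @ np.linalg.solve(k_train, y_train)
```

In the setting of the abstract, `x_train` would hold measurement positions and `y_train` the measured end-to-end throughput, so the model learns throughput directly instead of deriving it from channel-level metrics.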
Research and Prototyping Study of an LLM-Based Chatbot for Electromagnetic Simulations
This work addresses the question of how generative artificial intelligence can be used to reduce the time required to set up electromagnetic simulation models. A chatbot based on a large language model is presented, enabling the automated generation of simulation models with various functional enhancements. A chatbot-driven workflow based on the large language model Google Gemini 2.0 Flash automatically generates and solves two-dimensional finite element eddy current models using Gmsh and GetDP. Python is used to coordinate and automate interactions between the workflow components. The study considers conductor geometries with circular cross-sections of variable position and number. Additionally, users can define custom post-processing routines and receive a concise summary of model information and simulation results. Each functional enhancement includes the corresponding architectural modifications and illustrative case studies.
Updated: 2026-03-09 19:25:05
标题: 基于LLM的聊天机器人用于电磁仿真的研究和原型设计
摘要: 这项工作探讨了如何利用生成式人工智能来缩短建立电磁仿真模型所需的时间。文中提出了基于大型语言模型的聊天机器人,可以自动生成具有各种功能增强的仿真模型。基于大型语言模型Google Gemini 2.0 Flash的聊天机器人驱动工作流程,使用Gmsh和GetDP自动生成并解决二维有限元涡流模型。Python用于协调和自动化工作流程组件之间的交互。研究考虑了具有可变位置和数量的圆形横截面导体几何形状。此外,用户可以定义自定义后处理例程,并接收模型信息和仿真结果的简明摘要。每个功能增强都包括相应的架构修改和说明性案例研究。
更新时间: 2026-03-09 19:25:05
领域: cs.CE,cs.AI
APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model
Autonomous navigation in highly constrained environments remains challenging for mobile robots. Classical navigation approaches offer safety assurances but require environment-specific parameter tuning; end-to-end learning bypasses parameter tuning but struggles with precise control in constrained spaces. To this end, recent robot learning approaches automate parameter tuning while retaining classical systems' safety, yet still face challenges in generalizing to unseen environments. Recently, Vision-Language-Action (VLA) models have shown promise by leveraging foundation models' scene understanding capabilities, but still struggle with precise control and inference latency in navigation tasks. In this paper, we propose Adaptive Planner Parameter Learning from Vision-Language-Action Model (\textsc{applv}). Unlike traditional VLA models that directly output actions, \textsc{applv} leverages pre-trained vision-language models with a regression head to predict planner parameters that configure classical planners. We develop two training strategies: supervised learning fine-tuning from collected navigation trajectories and reinforcement learning fine-tuning to further optimize navigation performance. We evaluate \textsc{applv} across multiple motion planners on the simulated Benchmark Autonomous Robot Navigation (BARN) dataset and in physical robot experiments. Results demonstrate that \textsc{applv} outperforms existing methods in both navigation performance and generalization to unseen environments.
Updated: 2026-03-09 19:23:09
标题: APPLV:基于视觉-语言-动作模型的自适应规划器参数学习
摘要: 在高度受限制的环境中自主导航对移动机器人仍然具有挑战性。经典的导航方法提供了安全保障,但需要特定环境参数调整;端到端学习虽然可以避开参数调整,但在受限空间中控制精度方面仍然困难重重。为此,最近的机器人学习方法自动化了参数调整,同时保留了经典系统的安全性,但仍然面临着推广到未知环境的挑战。最近,视觉-语言-动作(VLA)模型利用基础模型的场景理解能力已显示出潜力,但在导航任务中仍然面临着精确控制和推理延迟的挑战。在本文中,我们提出了一种从视觉-语言-动作模型中学习自适应规划器参数的方法(\textsc{applv})。与直接输出动作的传统VLA模型不同,\textsc{applv}利用预训练的视觉-语言模型,通过回归头来预测配置经典规划器的规划器参数。我们开发了两种训练策略:从收集的导航轨迹进行监督学习微调和通过强化学习微调进一步优化导航性能。我们评估了\textsc{applv}在模拟的Benchmark Autonomous Robot Navigation(BARN)数据集上的多个运动规划器,并进行了实际机器人实验。结果表明,\textsc{applv}在导航性能和对未知环境的泛化能力方面均优于现有方法。
更新时间: 2026-03-09 19:23:09
领域: cs.RO,cs.LG
Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models
Hybrid sequence models--combining Transformer and state-space model layers--seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic understanding of the settings where--and underlying mechanisms through which--they offer benefits over their constituent models. In this paper, we study this question, focusing on a broad family of core synthetic tasks. For this family of tasks, we prove the existence of fundamental limitations for non-hybrid models. Specifically, any Transformer or state-space model that solves the underlying task requires either a large number of parameters or a large working memory. On the other hand, for two prototypical tasks within this family--namely selective copying and associative recall--we construct hybrid models of small size and working memory that provably solve these tasks, thus achieving the best of both worlds. Our experimental evaluation empirically validates our theoretical findings. Importantly, going beyond the settings in our theoretical analysis, we empirically show that learned--rather than constructed--hybrids outperform non-hybrid models with up to 6x as many parameters. We additionally demonstrate that hybrid models exhibit stronger length generalization and out-of-distribution robustness than non-hybrids.
Updated: 2026-03-09 19:20:01
标题: 混合序列模型的表现效率权衡
摘要: 混合序列模型——结合Transformer和状态空间模型层——旨在获得注意力的表达多样性以及状态空间模型层的计算效率。尽管对混合模型的兴趣日益增长,但我们缺乏对它们何时以及通过何种基本机制比其组成模型提供优势的基本理解。在本文中,我们研究了这个问题,重点关注一系列核心合成任务。对于这个任务系列,我们证明了非混合模型存在基本限制。具体来说,任何解决基础任务的Transformer或状态空间模型都需要大量参数或大容量工作内存。另一方面,对于这个任务系列中的两个典型任务——即选择性复制和联想回忆——我们构建了小尺寸和工作内存的混合模型,可证实解决这些任务,从而实现最佳效果。我们的实验评估从实证上验证了我们的理论发现。重要的是,超出我们理论分析的设定,我们还实证地展示了学习而非构建的混合模型比具有多达6倍参数的非混合模型表现更好。此外,我们还证明混合模型表现出比非混合模型更强的长度泛化和超出分布的稳健性。
更新时间: 2026-03-09 19:20:01
领域: cs.LG
A Consequentialist Critique of Binary Classification Evaluation: Theory, Practice, and Tools
Machine learning-supported decisions, such as ordering diagnostic tests or determining preventive custody, often require converting probabilistic forecasts into binary classifications. We adopt a consequentialist perspective from decision theory to argue that evaluation methods should prioritize forecast quality across thresholds and base rates. This motivates the use of proper scoring rules such as the Brier score and log loss. However, our empirical review of practices at major ML venues (ICML, FAccT, CHIL) reveals a dominant reliance on top-K metrics or fixed-threshold evaluations. To bridge this disconnect, we introduce a decision-theoretic framework that maps evaluation metrics to their appropriate use cases, accompanied by a practical Python package, \texttt{briertools}, which lowers the barrier to applying proper scoring rules in practice. Methodologically, we derive and implement a clipped Brier score variant that avoids full integration and better reflects bounded, interpretable threshold ranges. Theoretically, we reconcile the Brier score with decision curve analysis, directly addressing the critique of Assel et al. (2017) regarding the clinical utility of proper scoring rules.
Updated: 2026-03-09 19:19:38
标题: 二元分类评估的后果主义批判:理论、实践和工具
摘要: 机器学习支持的决策,如订购诊断测试或确定预防性羁押,通常需要将概率预测转化为二元分类。我们采用决策理论中的后果主义视角,认为评估方法应该优先考虑跨阈值和基础率的预测质量。这促使我们使用适当的评分规则,如Brier分数和对数损失。然而,我们对主要机器学习会议(ICML,FAccT,CHIL)上的实践进行的实证回顾显示,人们主要依赖于top-K指标或固定阈值评估。为了弥合这种脱节,我们引入了一个决策理论框架,将评估指标映射到其适当的使用案例,配备了一个实用的Python包\texttt{briertools},降低了在实践中应用适当评分规则的门槛。在方法上,我们推导并实施了一个修剪的Brier分数变体,避免完全整合,并更好地反映有界的、可解释的阈值范围。从理论上讲,我们调和了Brier分数与决策曲线分析,直接回应了关于适当评分规则的临床效用的批评(Assel等,2017)。
更新时间: 2026-03-09 19:19:38
领域: cs.LG,cs.AI,stat.ME,stat.ML
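The contrast drawn in the binary classification entry above, between fixed-threshold metrics and proper scoring rules, can be illustrated with a small NumPy sketch. It implements only the textbook Brier score and log loss; it does not use or reproduce the `briertools` package or the clipped Brier variant, whose definitions are not given in the abstract.

```python
import numpy as np

def brier_score(y_true, p_pred):
    # Mean squared error between forecast probabilities and binary outcomes.
    y_true, p_pred = np.asarray(y_true, float), np.asarray(p_pred, float)
    return float(np.mean((p_pred - y_true) ** 2))

def log_loss(y_true, p_pred, eps=1e-12):
    # Negative mean log-likelihood; probabilities are clipped to avoid log(0).
    y_true = np.asarray(y_true, float)
    p = np.clip(np.asarray(p_pred, float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

def threshold_accuracy(y_true, p_pred, threshold=0.5):
    # Accuracy after converting forecasts to labels at one fixed threshold.
    y_true, p_pred = np.asarray(y_true), np.asarray(p_pred)
    return float(np.mean((p_pred >= threshold) == (y_true == 1)))
```

For outcomes `[1, 0, 1, 0]`, the forecasts `[0.9, 0.1, 0.9, 0.1]` and `[0.6, 0.4, 0.6, 0.4]` have the same accuracy at a threshold of 0.5, but the first has a lower Brier score and log loss, a difference that a single fixed threshold cannot register.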
Unpacking Interpretability: Human-Centered Criteria for Optimal Combinatorial Solutions
Algorithmic support systems often return optimal solutions that are hard to understand. Effective human-algorithm collaboration, however, requires interpretability. When machine solutions are equally optimal, humans must select one, but a precise account of what makes one solution more interpretable than another remains missing. To identify structural properties of interpretable machine solutions, we present an experimental paradigm in which participants chose which of two equally optimal solutions for packing items into bins was easier to understand. We show that preferences reliably track three quantifiable properties of solution structure: alignment with a greedy heuristic, simple within-bin composition, and ordered visual representation. The strongest associations were observed for ordered representations and heuristic alignment, with compositional simplicity also showing a consistent association. Reaction-time evidence was mixed, with faster responses observed primarily when heuristic differences were larger, and aggregate webcam-based gaze did not show reliable effects of complexity. These results provide a concrete, feature-based account of interpretability in optimal packing solutions, linking solution structure to human preference. By identifying actionable properties (simple compositions, ordered representation, and heuristic alignment), our findings enable interpretability-aware optimization and presentation of machine solutions, and outline a path to quantify trade-offs between optimality and interpretability in real-world allocation and design tasks.
Updated: 2026-03-09 19:18:52
标题: 拆解可解释性:人类中心准则对于最佳组合解决方案
摘要: 算法支持系统通常会返回难以理解的最优解。然而,有效的人机协作需要可解释性。当机器解决方案同样优化时,人类必须选择一个,但什么使一个解决方案比另一个更可解释的精确说明仍然缺失。为了识别可解释的机器解决方案的结构特性,我们提出了一个实验范式,在这个范式中,参与者选择将物品装入箱子的两个同样优化的解决方案中哪一个更容易理解。我们展示了偏好可靠地跟踪解决方案结构的三个可量化属性:与贪婪启发式对齐、简单的箱内组成、有序的视觉表示。观察到最强的关联是有序表示和启发式对齐,简单的组成也显示出一致的关联。反应时间的证据是混合的,主要是当启发式差异较大时观察到更快的响应,而聚合的基于网络摄像头的注视并没有显示出复杂性的可靠效应。这些结果提供了优化装箱解决方案中可解释性的具体、基于特征的说明,将解决方案结构与人类偏好联系起来。通过识别可操作的属性(简单构成、有序表示和启发式对齐),我们的发现使得在机器解决方案的优化和呈现中考虑可解释性,并概述了在现实世界的分配和设计任务中量化最优性和可解释性之间的权衡的途径。
更新时间: 2026-03-09 19:18:52
领域: cs.HC,cs.AI
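The interpretability entry above refers to alignment with a greedy heuristic without naming one. As an illustrative assumption, the sketch below uses first-fit decreasing, a common greedy heuristic for bin packing: items are sorted from largest to smallest and each is placed in the first bin that still has room. The study may have used a different heuristic.

```python
def first_fit_decreasing(sizes, capacity):
    # Greedy bin packing: largest items first, each into the first bin that fits.
    bins = []
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"item of size {size} exceeds bin capacity {capacity}")
        for contents in bins:
            if sum(contents) + size <= capacity:
                contents.append(size)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([size])
    return bins
```

A solution produced this way also tends to list items within each bin in descending order, which relates loosely to the ordered-representation property discussed in the abstract.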
DeZent: Decentralized z-Anonymity with Privacy-Preserving Coordination
Analyzing large volumes of sensor network data, such as electricity consumption measurements from smart meters, is essential for modern applications but raises significant privacy concerns. Privacy-enhancing technologies like z-anonymity offer efficient anonymization for continuous data streams by suppressing rare values that could lead to re-identification, making it particularly suited for resource-constrained environments. Originally designed for centralized architectures, z-anonymity assumes a trusted central entity. In this paper, we introduce deZent, a decentralized implementation of z-anonymity that minimizes trust in the central entity by realizing local z-anonymity with lightweight coordination. We develop deZent using a stochastic counting structure and secure sum to coordinate private anonymization across the network. Our results show that deZent achieves comparable performance to centralized z-anonymity in terms of publication ratio, while reducing the communication overhead towards the central entity. Thus, deZent presents a promising approach for enhancing privacy in sensor networks while preserving system efficiency.
Updated: 2026-03-09 19:14:23
标题: DeZent:具有保护隐私的协调的去中心化z-匿名化
摘要: 分析大量传感器网络数据,例如来自智能电表的电力消耗测量,对于现代应用至关重要,但引发了重大的隐私问题。隐私增强技术如z-匿名性通过抑制可能导致重新识别的稀有值,为连续数据流提供了有效的匿名化,特别适用于资源受限的环境。最初设计用于集中式架构的z-匿名性假定一个可信的中心实体。在本文中,我们介绍了deZent,z-匿名性的分散实现,通过实现轻量级协调实现本地z-匿名性,从而最大程度地减少对中心实体的信任。我们使用随机计数结构和安全求和开发了deZent,以协调网络上的私密匿名化。我们的结果表明,deZent在发布比率方面与集中式z-匿名性实现了可比性的性能,同时减少了向中心实体的通信开销。因此,deZent提供了一种有望增强传感器网络隐私性同时保持系统效率的方法。
更新时间: 2026-03-09 19:14:23
领域: cs.DC,cs.CR
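To illustrate the centralized baseline that the DeZent entry above builds on, the sketch below assumes the usual sliding-window formulation of z-anonymity: a value is published only if at least z distinct users have reported it within the last `window` time units. This is a simplified, single-process illustration; it does not implement deZent's stochastic counting structure or secure sum, and the function and parameter names are hypothetical.

```python
def z_anonymize(stream, z, window):
    # stream: iterable of (timestamp, user, value) tuples in time order.
    last_seen = {}  # value -> {user: most recent timestamp}
    published = []
    for t, user, value in stream:
        users = last_seen.setdefault(value, {})
        users[user] = t
        # Forget users whose last report of this value fell out of the window.
        for stale in [u for u, ts in users.items() if t - ts > window]:
            del users[stale]
        # Publish only if enough distinct users share the value; otherwise suppress.
        if len(users) >= z:
            published.append((t, user, value))
    return published
```

Rare values never accumulate enough distinct users inside the window and are therefore suppressed, which is the property the abstract describes.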
LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems
As multi-agent AI systems grow in complexity, the protocols connecting them constrain their capabilities. Current protocols such as A2A and MCP do not expose model-level properties as first-class primitives, ignoring properties fundamental to effective delegation: model identity, reasoning profile, quality calibration, and cost characteristics. We present the LLM Delegate Protocol (LDP), an AI-native communication protocol introducing five mechanisms: (1) rich delegate identity cards with quality hints and reasoning profiles; (2) progressive payload modes with negotiation and fallback; (3) governed sessions with persistent context; (4) structured provenance tracking confidence and verification status; (5) trust domains enforcing security boundaries at the protocol level. We implement LDP as a plugin for the JamJet agent runtime and evaluate against A2A and random baselines using local Ollama models and LLM-as-judge evaluation. Identity-aware routing achieves ~12x lower latency on easy tasks through delegate specialization, though it does not improve aggregate quality in our small delegate pool; semantic frame payloads reduce token count by 37% (p=0.031) with no observed quality loss; governed sessions eliminate 39% token overhead at 10 rounds; and noisy provenance degrades synthesis quality below the no-provenance baseline, arguing that confidence metadata is harmful without verification. Simulated analyses show architectural advantages in attack detection (96% vs. 6%) and failure recovery (100% vs. 35% completion). This paper contributes a protocol design, reference implementation, and initial evidence that AI-native protocol primitives enable more efficient and governable delegation.
Updated: 2026-03-09 19:13:17
标题: LDP:面向多Agent LLM系统的身份感知协议
摘要: 随着多智能体人工智能系统的复杂性不断增长,连接它们的协议限制了它们的能力。当前的协议,如A2A和MCP,并未将模型级属性作为第一类原语暴露出来,忽略了有效委托的基本属性:模型身份、推理配置文件、质量校准和成本特征。我们提出了LLM委托协议(LDP),这是一种AI原生通信协议,引入了五种机制:(1)具有质量提示和推理配置文件的丰富委托身份卡片;(2)具有谈判和回退的渐进式有效载荷模式;(3)具有持久上下文的受管会话;(4)结构化的溯源追踪信心和验证状态;(5)在协议级别强制安全边界的信任域。我们将LDP实现为JamJet代理运行时的插件,并使用本地Ollama模型和LLM作为评委进行评估,与A2A和随机基准进行对比。具有身份感知路由的委托专业化通过易任务实现了约12倍的较低延迟,尽管在我们的小委托池中没有改善总体质量;语义框架有效载荷将令牌数减少了37%(p=0.031),没有观察到质量损失;受管会话在10轮中消除了39%的令牌开销;嘈杂的溯源使合成质量低于无溯源基线,表明信心元数据在没有验证的情况下是有害的。模拟分析显示在攻击检测(96% vs. 6%)和故障恢复(100% vs. 35%完成)方面的架构优势。本文提供了一个协议设计、参考实现,并初步证据表明AI原生协议原语能够实现更有效和可管理的委托。
更新时间: 2026-03-09 19:13:17
领域: cs.AI,cs.MA,cs.SE
SBOMs into Agentic AIBOMs: Schema Extensions, Agentic Orchestration, and Reproducibility Evaluation
Software supply-chain security requires provenance mechanisms that support reproducibility and vulnerability assessment under dynamic execution conditions. Conventional Software Bills of Materials (SBOMs) provide static dependency inventories but cannot capture runtime behaviour, environment drift, or exploitability context. This paper introduces agentic Artificial Intelligence Bills of Materials (AIBOMs), extending SBOMs into active provenance artefacts through autonomous, policy-constrained reasoning. We present an agentic AIBOM framework based on a multi-agent architecture comprising (i) a baseline environment reconstruction agent (MCP), (ii) a runtime dependency and drift-monitoring agent (A2A), and (iii) a policy-aware vulnerability and VEX reasoning agent (AGNTCY). These agents generate contextual exploitability assertions by combining runtime execution evidence, dependency usage, and environmental mitigations with ISO/IEC 20153:2025 Common Security Advisory Framework (CSAF) v2.0 semantics. Exploitability is expressed via structured VEX assertions rather than enforcement actions. The framework introduces minimal, standards-aligned schema extensions to CycloneDX and SPDX, capturing execution context, dependency evolution, and agent decision provenance while preserving interoperability. Evaluation across heterogeneous analytical workloads demonstrates improved runtime dependency capture, reproducibility fidelity, and stability of vulnerability interpretation compared with established provenance systems, with low computational overhead. Ablation studies confirm that each agent contributes distinct capabilities unavailable through deterministic automation.
Updated: 2026-03-09 19:11:45
标题: 从SBOMs到主体AIBOMs:模式扩展、主体编排和可重现性评估
摘要: 软件供应链安全需要支持在动态执行条件下的可再现性和漏洞评估的来源机制。传统的软件材料清单(SBOMs)提供静态依赖清单,但无法捕获运行时行为、环境漂移或利用上下文。本文介绍了主动型人工智能材料清单(AIBOMs),通过自主的、受政策约束的推理将SBOMs扩展为活动性来源工件。我们提出了一个基于多智能体架构的主动型AIBOM框架,包括(i)基线环境重建智能体(MCP),(ii)运行时依赖和漂移监控智能体(A2A),以及(iii)政策感知漏洞和VEX推理智能体(AGNTCY)。这些智能体通过将运行时执行证据、依赖使用和环境缓解与ISO/IEC 20153:2025通用安全咨询框架(CSAF)v2.0语义相结合,生成上下文可利用性断言。可利用性通过结构化的VEX断言来表达,而不是执行行动。该框架引入了最小的、与标准对齐的CycloneDX和SPDX模式扩展,捕获执行上下文、依赖演变和智能体决策来源,同时保持互操作性。评估跨异构分析工作负载显示,与已建立的来源系统相比,该框架改进了运行时依赖捕获、可再现性忠实度和漏洞解释的稳定性,且计算开销低。消融研究证实,每个智能体都提供了通过确定性自动化无法获得的独特能力。
更新时间: 2026-03-09 19:11:45
领域: cs.CR,cs.AI,cs.SE
Bradley-Terry Policy Optimization for Generative Preference Modeling
Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models for tasks with verifiable answers. However, extending RL-based thought training to more general non-verifiable tasks, where supervision is provided only through pairwise human preferences, remains challenging. Existing approaches typically apply RL objectives designed for verifiable rewards to preference-based settings in a heuristic manner. In this work, we show that introducing CoT reasoning into preference modeling fundamentally changes the structure of the Bradley-Terry (BT) likelihood, as the reasoning process must be treated as a latent variable. This results in a preference likelihood expressed as a ratio of expectations over stochastic generation trajectories, which cannot be optimized using Jensen-style bounds or standard RL objectives. To address this challenge, we derive a consistent Monte Carlo estimator for the gradient of the resulting likelihood, leading to Bradley-Terry Policy Optimization (BTPO). Empirically, BTPO enables stable and effective training of generative preference models with CoT reasoning, consistently outperforming prior heuristic approaches across multiple benchmarks and model scales.
Updated: 2026-03-09 19:10:21
标题: 布拉德利-特里政策优化用于生成偏好建模
摘要: 强化学习(RL)最近在大型语言模型中证明了在具有可验证答案的任务中扩展思维链(CoT)推理的有效性。然而,将基于RL的思维训练扩展到更一般的非可验证任务-其中仅通过人类偏好进行监督-仍然具有挑战性。现有方法通常以启发式方式将设计用于可验证奖励的RL目标应用于基于偏好的设置。在这项工作中,我们展示了将CoT推理引入偏好建模会从根本上改变Bradley-Terry(BT)似然的结构,因为推理过程必须被视为潜变量。这导致了一个表达为概率生成轨迹期望比值的偏好似然,这不能使用Jensen样式的边界或标准RL目标进行优化。为了解决这一挑战,我们推导出了结果似然的梯度的一致蒙特卡洛估计器,从而导致了Bradley-Terry策略优化(BTPO)。实证上,BTPO使得具有CoT推理的生成式偏好模型的训练稳定且有效,在多个基准测试和模型规模上始终优于先前的启发式方法。
更新时间: 2026-03-09 19:10:21
领域: cs.LG
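As background for the BTPO entry above, the sketch below shows the standard Bradley-Terry preference probability for two scalar rewards, together with the corresponding negative log-likelihood. It covers only the classical model without latent reasoning; the ratio-of-expectations likelihood and the Monte Carlo gradient estimator from the paper are not reproduced here, and the function names are illustrative.

```python
import math

def bt_preference_prob(reward_a, reward_b):
    # Bradley-Terry: P(a preferred over b) = exp(r_a) / (exp(r_a) + exp(r_b)),
    # written as a logistic function of the reward difference.
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def bt_negative_log_likelihood(pairs):
    # pairs: iterable of (reward_chosen, reward_rejected) for labeled preferences.
    pairs = list(pairs)
    total = sum(-math.log(bt_preference_prob(rc, rr)) for rc, rr in pairs)
    return total / len(pairs)
```

When the reward depends on a sampled chain of thought, each reward above would itself involve an average over generation trajectories, which is the complication the abstract highlights.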
A Lightweight Multi-Cancer Tumor Localization Framework for Deployable Digital Pathology
Accurate localization of tumor regions from hematoxylin and eosin-stained whole-slide images is fundamental for translational research including spatial analysis, molecular profiling, and tissue architecture investigation. However, deep learning-based tumor detection trained within specific cancers may exhibit reduced robustness when applied across different tumor types. We investigated whether balanced training across cancers at modest scale can achieve high performance and generalize to unseen tumor types. A multi-cancer tumor localization model (MuCTaL) was trained on 79,984 non-overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer) using transfer learning with DenseNet169. The model achieved a tile-level ROC-AUC of 0.97 in validation data from the four training cancers, and 0.71 on an independent pancreatic ductal adenocarcinoma cohort. A scalable inference workflow was built to generate spatial tumor probability heatmaps compatible with existing digital pathology tools. Code and models are publicly available at https://github.com/AivaraX-AI/MuCTaL.
Updated: 2026-03-09 19:00:04
标题: 一个轻量级多癌症肿瘤定位框架,用于可部署的数字病理学
摘要: 准确定位肿瘤区域是从用hematoxylin和eosin染色的全切片图像中进行空间分析、分子分析和组织结构研究的转化研究的基础。然而,基于深度学习的肿瘤检测在特定癌症中训练时,可能在应用于不同类型的肿瘤时表现出降低的稳健性。我们调查了在适度规模下跨癌症进行平衡训练是否能够实现高性能并推广到未见过的肿瘤类型。一个多癌症肿瘤定位模型(MuCTaL)使用DenseNet169进行迁移学习,从四种癌症(黑色素瘤、肝细胞癌、结直肠癌和非小细胞肺癌)的79,984个非重叠瓷砖进行训练。该模型在来自四种训练癌症的验证数据中实现了0.97的瓷砖级ROC-AUC,并在独立的胰腺导管腺癌队列中达到0.71。建立了一个可扩展的推断工作流程,用于生成与现有数字病理学工具兼容的空间肿瘤概率热图。代码和模型可以在https://github.com/AivaraX-AI/MuCTaL公开获取。
更新时间: 2026-03-09 19:00:04
领域: cs.CV,cs.AI
MASEval: Extending Multi-Agent Evaluation from Models to Systems
The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model-centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework-agnostic library that treats the entire system as the unit of analysis. Through a systematic system-level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case. MASEval is available under the MIT licence at https://github.com/parameterlab/MASEval.
Updated: 2026-03-09 18:46:17
标题: MASEval:从模型到系统扩展多智能体评估
摘要: 基于LLM的代理系统的快速采用已经产生了一个丰富的框架生态系统(smolagents,LangGraph,AutoGen,CAMEL,LlamaIndex等)。然而,现有的基准测试是基于模型的:它们固定代理设置,并不比较其他系统组件。我们认为,实现决策对性能产生重大影响,包括拓扑结构、编排逻辑和错误处理等选择。MASEval通过一个与框架无关的库来解决这一评估差距,将整个系统视为分析单位。通过在3个基准测试、3个模型和3个框架之间进行系统级比较,我们发现框架选择与模型选择同样重要。MASEval使研究人员能够探索代理系统的所有组件,为原则性系统设计开辟新途径,使从业者能够确定最适合其用例的实现方式。 MASEval可在MIT许可证下获取https://github.com/parameterlab/MASEval。
更新时间: 2026-03-09 18:46:17
领域: cs.AI,cs.CL,cs.LG
FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization
Machine learning (ML) has shown promise for tackling combinatorial optimization (CO), but much of the reported progress relies on small-scale, synthetic benchmarks that fail to capture real-world structure and scale. A core limitation is that ML methods are typically trained and evaluated on synthetic instance generators, leaving open how they perform on irregular, competition-grade, or industrial datasets. We present FrontierCO, a benchmark for evaluating ML-based CO solvers under real-world structure and extreme scale. FrontierCO spans eight CO problems, including routing, scheduling, facility location, and graph problems, with instances drawn from competitions and public repositories (e.g., DIMACS, TSPLib). Each task provides both easy sets (historically challenging but now solvable) and hard sets (open or computationally intensive), alongside standardized training/validation resources. Using FrontierCO, we evaluate 16 representative ML solvers--graph neural approaches, hybrid neural-symbolic methods, and LLM-based agents--against state-of-the-art classical solvers. We find a persistent performance gap that widens under structurally challenging and large instance sizes (e.g., TSP up to 10M nodes; MIS up to 8M), while also identifying cases where ML methods outperform classical solvers. By centering evaluation on real-world structure and orders-of-magnitude larger instances, FrontierCO provides a rigorous basis for advancing ML for CO. Our benchmark is available at https://huggingface.co/datasets/CO-Bench/FrontierCO.
Updated: 2026-03-09 18:46:03
Categories: cs.LG
HeteroFedSyn: Differentially Private Tabular Data Synthesis for Heterogeneous Federated Settings
Traditional Differential Privacy (DP) mechanisms are typically tailored to specific analysis tasks, which limits the reusability of protected data. DP tabular data synthesis overcomes this by generating synthetic datasets that can be shared for arbitrary downstream tasks. However, existing synthesis methods predominantly assume centralized or local settings and overlook the more practical horizontal federated scenario. Naively synthesizing data locally or perturbing individual records either produces biased mixtures or introduces excessive noise, especially under heterogeneous data distributions across participants. We propose HeteroFedSyn, the first DP tabular data synthesis framework designed specifically for the horizontal federated setting. Built upon the PrivSyn paradigm of 2-way marginal-based synthesis, HeteroFedSyn introduces three key innovations for distributed marginal selection: (i) an L2-based dependency metric with random projection for noise-efficient correlation measurement, (ii) an unbiased estimator to correct multiplicative noise, and (iii) an adaptive selection strategy that dynamically updates dependency scores to avoid redundancy. Extensive experiments on range queries, Wasserstein fidelity, and machine learning tasks show that, despite the increased noise inherent to federated execution, HeteroFedSyn achieves utility comparable to centralized synthesis. Our code is open-sourced via the link.
Updated: 2026-03-09 18:44:01
Categories: cs.CR
OAuthHub: Mitigating OAuth Data Overaccess through a Local Data Hub
Most OAuth service providers, such as Google and Microsoft, offer only a limited range of coarse-grained data access. As a result, third-party OAuth applications often end up accessing more user data than necessary, even if their developers want to minimize data access. We present OAuthHub, a development framework that leverages users' personal devices as the intermediary controller for OAuth-based data sharing between cloud services. The key innovations of OAuthHub are: (1) the insight that discretionary data access is largely unnecessary for most OAuth apps, which typically only require access at three well-defined moments: during installation, in response to user actions, and at scheduled intervals; (2) a development framework that requires explicit declarations of intended data access and supports the three common access patterns through intermittently available personal devices; and (3) a centralized runtime permission model for managing OAuth access across providers. We evaluated OAuthHub with three real-world apps on both PCs and mobile phones and found that OAuthHub requires moderate changes to the application code and imposes insignificant performance overheads. Our study with 18 developers showed that participants completed programming tasks significantly faster (9.1 vs. 18.0 minutes) and with less code (4.7 vs. 15.8 lines) using OAuthHub than with conventional OAuth APIs.
Updated: 2026-03-09 18:43:12
Categories: cs.CR,cs.NI,cs.SE
Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning
As organizations scale adoption of generative AI, model cost optimization and operational efficiency have emerged as critical factors determining sustainability and accessibility. While Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, their extensive computational requirements make them cost-prohibitive for routine enterprise use. This limitation motivates the exploration of Small Language Models (SLMs), which can deliver comparable performance in targeted applications while drastically reducing infrastructure overhead (Irugalbandara et al., 2023). In this work, we investigate the feasibility of replacing LLM-driven workflows with optimized SLMs. We trained a domain-adapted SLM to execute representative tasks traditionally handled by LLMs, such as document summarization, query answering, and structured data interpretation. As part of the experiment, we investigated fine-tuning the facebook/opt-350m model (single epoch only) using Hugging Face TRL (Transformer Reinforcement Learning), specifically its Supervised Fine-Tuning (SFT) trainer. The OPT-350M model was released by Meta AI in 2022 as part of the OPT (Open Pretrained Transformer) family of models. Similar studies demonstrate that even models at the 350M parameter scale can meaningfully contribute to instruction-tuning pipelines (Mekala et al., 2024). Experimental results demonstrated that our fine-tuned SLM achieves a 77.55% pass rate on the ToolBench evaluation, significantly outperforming all baseline models, including ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and ToolLLaMA-CoT (16.27%). These findings emphasize that thoughtful design and targeted training of SLMs can significantly lower barriers to adoption, enabling cost-effective, large-scale integration of generative AI into production systems.
Updated: 2026-03-09 18:40:10
Categories: cs.AI
Are Expressive Encoders Necessary for Discrete Graph Generation?
Discrete graph generation has emerged as a powerful paradigm for modeling graph data, often relying on highly expressive neural backbones such as transformers or higher-order architectures. We revisit this design choice by introducing GenGNN, a modular message-passing framework for graph generation. Diffusion models with GenGNN achieve more than 90% validity on Tree and Planar datasets, within margins of graph transformers, at 2-5x faster inference speed. For molecule generation, DiGress with a GenGNN backbone achieves 99.49% validity. A systematic ablation study shows the benefit provided by each GenGNN component, indicating the need for residual connections to mitigate oversmoothing on complicated graph structures. Through scaling analyses, we apply a principled metric-space view to investigate learned diffusion representations and uncover whether GNNs can be expressive neural backbones for discrete diffusion.
Updated: 2026-03-09 18:36:06
Categories: cs.LG,cs.AI
SoftJAX & SoftTorch: Empowering Automatic Differentiation Libraries with Informative Gradients
Automatic differentiation (AD) frameworks such as JAX and PyTorch have enabled gradient-based optimization for a wide range of scientific fields. Yet, many "hard" primitives in these libraries such as thresholding, Boolean logic, discrete indexing, and sorting operations yield zero or undefined gradients that are not useful for optimization. While numerous "soft" relaxations have been proposed that provide informative gradients, the respective implementations are fragmented across projects, making them difficult to combine and compare. This work introduces SoftJAX and SoftTorch, open-source, feature-complete libraries for soft differentiable programming. These libraries provide a variety of soft functions as drop-in replacements for their hard JAX and PyTorch counterparts. This includes (i) elementwise operators such as clip or abs, (ii) utility methods for manipulating Booleans and indices via fuzzy logic, (iii) axiswise operators such as sort or rank, based on optimal transport or permutahedron projections, and (iv) full support for straight-through gradient estimation. Overall, SoftJAX and SoftTorch make the toolbox of soft relaxations easily accessible to differentiable programming, as demonstrated through benchmarking and a practical case study. Code is available at github.com/a-paulus/softjax and github.com/a-paulus/softtorch.
Updated: 2026-03-09 18:35:51
Categories: cs.LG
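To make the "hard" versus "soft" distinction in the SoftJAX and SoftTorch abstract concrete, here is a minimal NumPy sketch of two common relaxations. It does not use or reproduce the SoftJAX/SoftTorch API; the function names, the `temperature` parameter, and the `eps` parameter are illustrative assumptions.

```python
import numpy as np


def hard_step(x):
    """Hard threshold: 1 where x > 0, else 0. Its gradient is zero almost everywhere."""
    return (np.asarray(x, dtype=float) > 0).astype(float)


def soft_step(x, temperature=0.1):
    """Sigmoid relaxation of the threshold (hypothetical helper, not a library call).

    Written with tanh, which equals sigmoid(x / temperature) but avoids overflow.
    """
    x = np.asarray(x, dtype=float)
    return 0.5 * (1.0 + np.tanh(x / (2.0 * temperature)))


def soft_step_grad(x, temperature=0.1):
    """Analytic derivative of soft_step with respect to x: s * (1 - s) / temperature."""
    s = soft_step(x, temperature)
    return s * (1.0 - s) / temperature


def soft_abs(x, eps=1e-2):
    """Smooth surrogate for |x| that is differentiable at x = 0."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(x * x + eps * eps)
```

Lowering `temperature` or `eps` moves the soft function closer to its hard counterpart, at the cost of gradients that concentrate near the discontinuity.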
Automating Forecasting Question Generation and Resolution for AI Evaluation
Forecasting future events is highly valuable in decision-making and is a robust measure of general intelligence. As forecasting is probabilistic, developing and evaluating AI forecasters requires generating large numbers of diverse and difficult questions, and accurately resolving them. Previous efforts to automate this laborious work relied on recurring data sources (e.g., weather, stocks), limiting diversity and utility. In this work, we present a system for generating and resolving high-quality forecasting questions automatically and at scale using LLM-powered web research agents. We use this system to generate 1499 diverse, real-world forecasting questions, and to resolve them several months later. We estimate that our system produces verifiable, unambiguous questions approximately 96% of the time, exceeding the rate of Metaculus, a leading human-curated forecasting platform. We also find that our system resolves questions at approximately 95% accuracy. We verify that forecasting agents powered by more intelligent LLMs perform better on these questions (Brier score of 0.134 for Gemini 3 Pro, 0.149 for GPT-5, and 0.179 for Gemini 2.5 Flash). Finally, we demonstrate how our system can be leveraged to directly improve forecasting, by evaluating a question decomposition strategy on a generated question set, yielding a significant improvement in Brier scores (0.132 vs. 0.141).
Updated: 2026-03-09 18:35:12
Categories: cs.LG,cs.AI
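The forecasting abstract above reports Brier scores. For binary questions, the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome, so lower is better. The sketch below uses that standard definition; the function name and input format are assumptions, not the authors' evaluation code.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (0 or 1)."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("at least one forecast is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)
```

A forecaster who always answers 0.5 scores 0.25 under this definition, which gives a reference point for the values quoted in the abstract.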
Fish Audio S2 Technical Report
We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
Updated: 2026-03-09 18:34:33
Categories: cs.SD,cs.AI,cs.CL
VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
General-purpose vision-language models (VLMs) such as LLaVA and QwenVL produce descriptions of disaster imagery that lack domain-specific vocabulary and actionable detail. We propose the Vision-Language Caption Enhancer (VLCE), a framework that integrates external semantic knowledge from ConceptNet and WordNet into the caption generation process for post-disaster satellite and UAV imagery. VLCE operates in two stages: first, a baseline VLM generates an initial caption conditioned on YOLOv8 object detections; second, a knowledge-enriched sequential model, a CNN-LSTM or a hierarchical cross-modal Transformer, refines the caption using a vocabulary augmented with 1,566 domain-relevant terms extracted from knowledge graphs. We evaluate VLCE on two disaster benchmarks: xBD (satellite, 6,369 images, 3 damage classes) and RescueNet (UAV, 4,494 images, 12 damage classes), using CLIPScore for semantic alignment and InfoMetIC for informativeness. On RescueNet with the Transformer decoder, VLCE with knowledge graph enrichment produces captions preferred over QwenVL baselines in 95.33% of image pairs on InfoMetIC and 73.64% on CLIPScore. Qualitative analysis shows that without knowledge graph integration, generated captions exhibit hallucinations, word repetition, and semantic incoherence, whereas knowledge-enriched captions maintain factual consistency and domain-appropriate vocabulary.
Updated: 2026-03-09 18:22:03
Categories: cs.CV,cs.LG
CuriousBot: Interactive Mobile Exploration via Actionable 3D Relational Object Graph
Mobile exploration is a longstanding challenge in robotics, yet current methods primarily focus on active perception instead of active interaction, limiting the robot's ability to interact with and fully explore its environment. Existing robotic exploration approaches via active interaction are often restricted to tabletop scenes, neglecting the unique challenges posed by mobile exploration, such as large exploration spaces, complex action spaces, and diverse object relations. In this work, we introduce a 3D relational object graph that encodes diverse object relations and enables exploration through active interaction. We develop a system based on this representation and evaluate it across diverse scenes. Our qualitative and quantitative results demonstrate the system's effectiveness and generalization across object instances, relations, and scenes, outperforming methods solely relying on vision-language models (VLMs).
Updated: 2026-03-09 18:21:05
Categories: cs.RO,cs.CV,cs.LG
Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage
Retrieval-augmented generation (RAG) systems combine document retrieval with a generative model to address complex information seeking tasks like report generation. While the relationship between retrieval quality and generation effectiveness seems intuitive, it has not been systematically studied. We investigate whether upstream retrieval metrics can serve as reliable early indicators of the final generated response's information coverage. Through experiments across two text RAG benchmarks (TREC NeuCLIR 2024 and TREC RAG 2024) and one multimodal benchmark (WikiVideo), we analyze 15 text retrieval stacks and 10 multimodal retrieval stacks across four RAG pipelines and multiple evaluation frameworks (Auto-ARGUE and MiRAGE). Our findings demonstrate strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses at both topic and system levels. This relationship holds most strongly when retrieval objectives align with generation goals, though more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness. These findings provide empirical support for using retrieval metrics as proxies for RAG performance.
Updated: 2026-03-09 18:20:20
Categories: cs.IR,cs.AI
Training Language Models via Neural Cellular Automata
Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs (training on synthetic data first, then on natural language). NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
Updated: 2026-03-09 18:14:26
Categories: cs.LG,cs.AI,cs.CL
Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams
Long-horizon task planning for heterogeneous multi-robot systems is essential for deploying collaborative teams in real-world environments; yet, it remains challenging due to the large volume of perceptual information, much of which is irrelevant to task objectives and burdens planning. Traditional symbolic planners rely on manually constructed problem specifications, limiting scalability and adaptability, while recent large language model (LLM)-based approaches often suffer from hallucinations and weak grounding (i.e., poor alignment between generated plans and actual environmental objects and constraints) in object-rich settings. We present Scale-Plan, a scalable LLM-assisted framework that generates compact, task-relevant problem representations from natural language instructions. Given a PDDL domain specification, Scale-Plan constructs an action graph capturing domain structure and uses shallow LLM reasoning to guide a structured graph search that identifies a minimal subset of relevant actions and objects. By filtering irrelevant information prior to planning, Scale-Plan enables efficient decomposition, allocation, and long-horizon plan generation. We evaluate our approach on complex multi-agent tasks and introduce MAT2-THOR, a cleaned benchmark built on AI2-THOR for reliable evaluation of multi-robot planning systems. Scale-Plan outperforms pure LLM and hybrid LLM-PDDL baselines across all metrics, improving scalability and reliability.
Updated: 2026-03-09 18:13:18
Categories: cs.RO,cs.AI,cs.ET,cs.MA
Robust Training of Neural Networks at Arbitrary Precision and Sparsity
The discontinuous operations inherent in quantization and sparsification introduce a long-standing obstacle to backpropagation, particularly in ultra-low precision and sparse regimes. While the community has long viewed quantization as unfriendly to gradient descent due to its lack of smoothness, we pinpoint, for the first time, that the key issue is the absence of a proper gradient path that allows training to learn robustness to quantization noise. The standard Straight-Through Estimator (STE) exacerbates this with its well-understood mismatch: a quantization-aware forward pass but oblivious backward pass, leading to unmanaged error and instability. We solve this by explicitly modeling quantization as additive noise, making the full forward-backward path well-defined without heuristic gradient estimation. As one natural solution, we introduce a denoising dequantization transform derived from a principled ridge regression objective, creating an explicit, corrective gradient path that makes learning robust to the noise STE bypasses. We extend this to sparsification by treating it as a special form of quantization that zeros out small values. Our unified framework trains models at arbitrary precisions and sparsity levels with off-the-shelf recipes, enabling stable A1W1 and sub-1-bit networks where others falter. It yields state-of-the-art results, mapping efficiency frontiers for modern LLMs and providing a theoretically grounded path to hyper-efficient neural networks.
Updated: 2026-03-09 18:13:08
Categories: cs.LG,cs.AI,cs.CL,cs.CV,math.NA
Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet ↔ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.
Updated: 2026-03-09 18:10:16
Categories: cs.LG,cs.AI,cs.CY
Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications
We present Test-Driven AI Agent Definition (TDAD), a methodology that treats agent prompts as compiled artifacts: engineers provide behavioral specifications, a coding agent converts them into executable tests, and a second coding agent iteratively refines the prompt until tests pass. Deploying tool-using LLM agents in production requires measurable behavioral compliance that current development practices cannot provide. Small prompt changes cause silent regressions, tool misuse goes undetected, and policy violations emerge only after deployment. To mitigate specification gaming, TDAD introduces three mechanisms: (1) visible/hidden test splits that withhold evaluation tests during compilation, (2) semantic mutation testing via a post-compilation agent that generates plausible faulty prompt variants, with the harness measuring whether the test suite detects them, and (3) spec evolution scenarios that quantify regression safety when requirements change. We evaluate TDAD on SpecSuite-Core, a benchmark of four deeply-specified agents spanning policy compliance, grounded analytics, runbook adherence, and deterministic enforcement. Across 24 independent trials, TDAD achieves 92% v1 compilation success with 97% mean hidden pass rate; evolved specifications compile at 58%, with most failed runs passing all visible tests except 1-2, and show 86-100% mutation scores, 78% v2 hidden pass rate, and 97% regression safety scores. The implementation is available as an open benchmark at https://github.com/f-labs-io/tdad-paper-code.
Updated: 2026-03-09 18:04:54
Categories: cs.SE,cs.AI
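The mutation scores quoted in the TDAD abstract can be read as the fraction of deliberately faulty prompt variants that the test suite catches. The toy sketch below illustrates that reading only: the tests are plain predicates on a prompt string, and `mutation_score`, the sample suite, and the sample mutants are all hypothetical, not taken from the TDAD harness.

```python
def mutation_score(suite, mutants):
    """Fraction of faulty variants on which at least one test in the suite fails.

    `suite` is a list of callables returning True when a prompt passes the test;
    `mutants` is a list of faulty prompt variants. Both are illustrative stand-ins.
    """
    if not mutants:
        raise ValueError("at least one mutant is required")
    detected = sum(1 for mutant in mutants if any(not test(mutant) for test in suite))
    return detected / len(mutants)


# Toy suite: the prompt must mention both the refund policy and identity verification.
toy_suite = [
    lambda prompt: "refund" in prompt,
    lambda prompt: "verify" in prompt,
]
toy_mutants = ["refund verify", "refund", "verify identity"]
toy_score = mutation_score(toy_suite, toy_mutants)
```

In this toy case the first mutant slips past both tests, so a score below 1.0 signals a gap in the suite rather than a problem with the prompt.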
The Temporal Markov Transition Field
The Markov Transition Field (MTF), introduced by Wang and Oates (2015), encodes a time series as a two-dimensional image by mapping each pair of time steps to the transition probability between their quantile states, estimated from a single global transition matrix. This construction is efficient when the transition dynamics are stationary, but produces a misleading representation when the process changes regime over time: the global matrix averages across regimes and the resulting image loses all information about when each dynamical regime was active. In this paper we introduce the Temporal Markov Transition Field (TMTF), an extension that partitions the series into $K$ contiguous temporal chunks, estimates a separate local transition matrix for each chunk, and assembles the image so that each row reflects the dynamics local to its chunk rather than the global average. The resulting $T \times T$ image has $K$ horizontal bands of distinct texture, each encoding the transition dynamics of one temporal segment. We develop the formal definition, establish the key structural properties of the representation, work through a complete numerical example that makes the distinction from the global MTF concrete, analyse the bias-variance trade-off introduced by temporal chunking, and discuss the geometric interpretation of the local transition matrices in terms of process properties such as persistence, mean reversion, and trending behaviour. The TMTF is amplitude-agnostic and order-preserving, making it suitable as an input channel for convolutional neural networks applied to time series characterisation tasks.
Updated: 2026-03-09 18:04:40
标题: 时间马尔可夫转移场
摘要: 马尔可夫转移场(MTF),由王和奥茨(2015年)引入,通过将每对时间步骤映射到它们的分位状态之间的转移概率,从单个全局转移矩阵中估计,将时间序列编码为二维图像。当转移动力学是稳定的时,这种构造是有效的,但当过程随时间改变制度时,会产生误导性的表示:全局矩阵对制度进行了平均处理,导致的图像丢失了关于每个动力制度何时活跃的所有信息。在本文中,我们引入了\emph{时间马尔可夫转移场}(TMTF),这是一个扩展,将时间序列分成$K$个连续的时间段,为每个时间段估计一个单独的局部转移矩阵,并组装图像,使得每行反映其时间段内的局部动态而不是全局平均值。结果是一个$T \times T$的图像,有$K$个水平条带具有不同的纹理,每个条带编码一个时间段的转移动力学。我们制定了正式定义,建立了表示的关键结构属性,通过一个完整的数值示例,明确了与全局MTF的区别,分析了时间分块引入的偏差-方差权衡,并讨论了局部转移矩阵的几何解释,以及与过程性质,如持久性,均值回归和趋势行为。TMTF是幅度不可知的和保持顺序的,使其适用于卷积神经网络应用于时间序列表征任务的输入通道。
更新时间: 2026-03-09 18:04:40
领域: cs.LG,stat.ML
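The construction described in the Temporal Markov Transition Field abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal-width chunking, and the choice to ignore transitions that straddle a chunk boundary are all assumptions made for illustration. Each value is mapped to a quantile state, one transition matrix is estimated per contiguous chunk, and row i of the image is filled from the matrix of the chunk that contains time step i.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_states(x, n_bins):
    """Map each value to a quantile bin index in 0..n_bins-1."""
    # n_bins - 1 interior cut points computed over the whole series
    edges = quantiles(x, n=n_bins, method="inclusive")
    return [bisect_right(edges, v) for v in x]


def transition_matrix(states, n_bins):
    """Row-normalised first-order transition frequencies; unvisited rows stay zero."""
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def temporal_mtf(x, n_bins=4, n_chunks=2):
    """T x T image whose row i uses the transition matrix of the chunk containing i."""
    T = len(x)
    states = quantile_states(x, n_bins)
    # Assumption: K contiguous chunks of (nearly) equal length
    bounds = [k * T // n_chunks for k in range(n_chunks + 1)]
    image = []
    for start, stop in zip(bounds, bounds[1:]):
        # Assumption: only transitions inside the chunk are counted
        local = transition_matrix(states[start:stop], n_bins)
        for i in range(start, stop):
            image.append([local[states[i]][states[j]] for j in range(T)])
    return image


# A series that alternates in its first half and is persistent in its second half
series = [0, 1] * 4 + [0] * 4 + [1] * 4
tmtf = temporal_mtf(series, n_bins=2, n_chunks=2)
global_mtf = temporal_mtf(series, n_bins=2, n_chunks=1)
```

With `n_chunks=1` the sketch reduces to the global MTF, so the same 0-to-1 transition receives one averaged probability everywhere; with two chunks the upper and lower bands of the image carry different probabilities for that transition.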
Large Language Model-Assisted Superconducting Qubit Experiments
Superconducting circuits have demonstrated significant potential in quantum information processing and quantum sensing. Implementing novel control and measurement sequences for superconducting qubits is often a complex and time-consuming process, requiring extensive expertise in both the underlying physics and the specific hardware and software. In this work, we introduce a framework that leverages a large language model (LLM) to automate qubit control and measurement. Specifically, our framework conducts experiments by generating and invoking schema-less tools on demand via a knowledge base on instrumental usage and experimental procedures. We showcase this framework with two experiments: an autonomous resonator characterization and a direct reproduction of a quantum non-demolition (QND) characterization of a superconducting qubit from literature. This framework enables rapid deployment of standard control-and-measurement protocols and facilitates implementation of novel experimental procedures, offering a more flexible and user-friendly paradigm for controlling complex quantum hardware.
Updated: 2026-03-09 18:03:10
标题: 大型语言模型辅助超导量子比特实验
摘要: 超导电路在量子信息处理和量子传感方面展示出了显著的潜力。为超导量子比特实施新颖的控制和测量序列通常是一个复杂且耗时的过程,需要对基础物理学和具体硬件软件都有深入的专业知识。在这项工作中,我们引入了一个利用大型语言模型(LLM)来自动化量子比特控制和测量的框架。具体而言,我们的框架通过在需求时生成和调用无模式工具来进行实验,通过仪器使用和实验程序的知识库。我们通过两个实验展示了这个框架:一个自主谐振器特性测试和一个直接复制文献中超导量子比特的量子非摧毁(QND)特性测试。这个框架能够快速部署标准的控制和测量协议,并促进新颖实验程序的实施,为控制复杂量子硬件提供更灵活和用户友好的范式。
更新时间: 2026-03-09 18:03:10
领域: quant-ph,cs.AI
Scale Space Diffusion
Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states contain no more information than small, downsampled images - raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed Scale Space Diffusion. To support Scale Space Diffusion, we introduce Flexi-UNet, a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths. Our project website (https://prateksha.github.io/projects/scale-space-diffusion/) is available publicly.
Updated: 2026-03-09 17:59:42
标题: 尺度空间扩散
摘要: 扩散模型通过噪声降低图像质量,逆向这一过程揭示了跨时间步的信息层次结构。尺度空间理论通过低通滤波展示了类似的层次结构。我们正式化这一联系,并展示高度噪声扩散状态包含的信息不比小型、降采样的图像更多 - 这引发了一个问题,为什么它们必须以全分辨率进行处理。为了解决这个问题,我们将尺度空间融入到扩散过程中,通过制定一系列带有广义线性退化和实际实现的扩散模型。使用降采样作为退化方式产生了我们提出的尺度空间扩散。为了支持尺度空间扩散,我们引入了Flexi-UNet,这是一种UNet变体,仅使用网络的必要部分来执行保留分辨率和增加分辨率的去噪。我们在CelebA和ImageNet上评估了我们的框架,并分析了其在不同分辨率和网络深度下的扩展行为。我们的项目网站(https://prateksha.github.io/projects/scale-space-diffusion/)可以公开访问。
更新时间: 2026-03-09 17:59:42
领域: cs.CV,cs.AI
Multi-level meta-reinforcement learning with skill-based curriculum
We consider problems in sequential decision making with natural multi-level structure, where sub-tasks are assembled together to accomplish complex goals. Systematically inferring and leveraging hierarchical structure has remained a longstanding challenge; we describe an efficient multi-level procedure for repeatedly compressing Markov decision processes (MDPs), wherein a parametric family of policies at one level is treated as single actions in the compressed MDPs at higher levels, while preserving the semantic meanings and structure of the original MDP, and mimicking the natural logic to address a complex MDP. Higher-level MDPs are themselves independent MDPs with less stochasticity, and may be solved using existing algorithms. As a byproduct, spatial or temporal scales may be coarsened at higher levels, making it more efficient to find long-term optimal policies. The multi-level representation delivered by this procedure decouples sub-tasks from each other and usually greatly reduces unnecessary stochasticity and the policy search space, leading to fewer iterations and computations when solving the MDPs. A second fundamental aspect of this work is that these multi-level decompositions plus the factorization of policies into embeddings (problem-specific) and skills (including higher-order functions) yield new transfer opportunities of skills across different problems and different levels. This whole process is framed within curriculum learning, wherein a teacher organizes the student agent's learning process in a way that gradually increases the difficulty of tasks and promotes transfer across MDPs and levels within and across curricula. The consistency of this framework and its benefits can be guaranteed under mild assumptions. We demonstrate abstraction, transferability, and curriculum learning in examples, including MazeBase+, a more complex variant of the MazeBase example.
Updated: 2026-03-09 17:59:39
标题: 基于技能课程的多层次元强化学习
摘要: 我们考虑在具有自然多层结构的顺序决策制定中的问题,其中子任务被组合在一起以实现复杂目标。系统地推断和利用层次结构一直是一个长期存在的挑战;我们描述了一种有效的多层过程,用于重复压缩马尔可夫决策过程(MDPs),其中一个级别上的参数化策略系列被视为在更高级别的压缩MDPs中的单个动作,同时保留原始MDP的语义含义和结构,并模仿自然逻辑以解决复杂MDP。更高级别的MDPs本身是独立的MDPs,具有较少的随机性,并且可以使用现有算法来解决。作为副产品,空间或时间尺度可以在更高级别上粗化,从而更有效地找到长期最优策略。此过程提供的多层表示将子任务相互分离,并通常大大减少不必要的随机性和策略搜索空间,从而在解决MDPs时减少迭代和计算次数。这项工作的第二个基本方面是这些多层分解以及将策略分解为嵌入(特定于问题)和技能(包括高阶函数)产生了跨不同问题和不同级别的技能转移新机会。整个过程被框定在课程学习中,其中教师以逐渐增加任务难度的方式组织学生代理的学习过程,并促进跨MDPs和课程内外级别的转移。在温和的假设下,这个框架的一致性和好处可以得到保证。我们通过示例展示了抽象、可转移性和课程学习,包括MazeBase+,这是MazeBase示例的更复杂变体。
更新时间: 2026-03-09 17:59:39
领域: cs.LG,cs.AI,stat.ML
Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting
Recent advances in time-series forecasting increasingly rely on pre-trained foundation-style models. While these models often claim broad generalization, existing evaluation protocols provide limited evidence. Indeed, most current benchmarks use static train-test splits that can easily lead to contamination as foundation models can inadvertently train on test data or perform model selection using test scores, which can inflate performance. We introduce Impermanent, a live benchmark that evaluates forecasting models under open-world temporal change by scoring forecasts sequentially over time on continuously updated data streams, enabling the study of temporal robustness, distributional shift, and performance stability rather than one-off accuracy on a frozen test set. Impermanent is instantiated on GitHub open-source activity, providing a naturally live and highly non-stationary dataset shaped by releases, shifting contributor behavior, platform/tooling changes, and external events. We focus on the top 400 repositories by star count and construct time series from issues opened, pull requests opened, push events, and new stargazers, evaluated over a rolling window with daily updates, alongside standardized protocols and leaderboards for reproducible, ongoing comparison. By shifting evaluation from static accuracy to sustained performance, Impermanent takes a concrete step toward assessing when and whether foundation-level generalization in time-series forecasting can be meaningfully claimed. Code and a live dashboard are available at https://github.com/TimeCopilot/impermanent and https://impermanent.timecopilot.dev.
Updated: 2026-03-09 17:59:00
标题: 《Impermanent:时间序列预测中时间泛化的实时基准》
摘要: 最近时间序列预测方面的进展越来越依赖于预训练的基础模型。尽管这些模型通常声称具有广泛的泛化能力,但现有的评估协议提供的证据有限。事实上,大多数当前的基准测试使用静态的训练-测试分割,这很容易导致污染,因为基础模型可能会无意中在测试数据上进行训练,或者使用测试分数进行模型选择,这可能会夸大性能。我们引入了Impermanent,一个实时基准测试,通过在不断更新的数据流上连续评分来评估时间序列预测模型,从而评估在开放世界时间变化下的鲁棒性、分布转移和性能稳定性,而不是仅仅对一个冻结的测试集进行一次性精度评估。Impermanent 是基于 GitHub 的开源活动实例化的,提供了一个自然实时且高度非平稳的数据集,由发布、变化的贡献者行为、平台/工具变化和外部事件塑造。我们关注星标数量排名前 400 的仓库,并构建由问题打开次数、拉取请求打开次数、推送事件和新星标者组成的时间序列,通过每日更新的滚动窗口进行评估,同时提供标准化协议和排行榜,以便进行可重复的持续比较。通过从静态精度评估转向持续性性能评估,Impermanent 在评估时间序列预测中的基础泛化能力是否可以被有意义地声称方面迈出了实质性的一步。代码和实时仪表板可在 https://github.com/TimeCopilot/impermanent 和 https://impermanent.timecopilot.dev 上找到。
更新时间: 2026-03-09 17:59:00
领域: cs.LG
Agentic Critical Training
Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
Updated: 2026-03-09 17:58:56
标题: 自主批判训练
摘要: 训练大型语言模型(LLMs)作为自主代理通常从模仿学习开始,但它只教给代理该做什么而不理解为什么:代理从不将成功的行动与次优的替代方案进行对比,因此缺乏对行动质量的意识。最近的方法尝试通过引入从专家和替代行动之间的对比中衍生出的自我反思监督来解决这个问题。然而,训练范式在本质上仍然是模仿学习:模型模仿预先构建的反思文本,而不是学会自主推理。我们提出了自主批判训练(ACT),这是一种强化学习范式,训练代理来识别替代方案中的更好行动。通过奖励模型的判断是否正确,ACT推动模型自主发展关于行动质量的推理,产生真正的自我反思而不是模仿它。在三个具有挑战性的代理基准测试中,ACT结合不同的训练后方法时始终改善代理性能。它比模仿学习平均提高了5.07点,比强化学习提高了4.62点。与通过知识蒸馏注入反思能力的方法相比,ACT还展示了明显优势,平均提高了2.42点。此外,ACT使代理基准测试中的强大分布外泛化,并在没有任何推理特定训练数据的情况下提高了通用推理基准测试的性能,突显了我们方法的价值。这些结果表明,ACT是朝着开发更具反思性和能力的LLM代理的有希望的路径。
更新时间: 2026-03-09 17:58:56
领域: cs.AI,cs.CL,cs.LG
Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines
Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework designed to assess financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting, using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks due to live information access but exhibit weaker analytical synthesis and consistency. Overall, the results highlight that financial intelligence in large language models is inherently multi-dimensional, and systems that combine structured financial data access with analytical reasoning capabilities provide the most reliable performance for complex investment research workflows.
Updated: 2026-03-09 17:58:54
标题: 评估大型语言模型中的金融智能:用LLM引擎对超级投资AI进行基准测试
摘要: 大型语言模型越来越多地被用于金融分析和投资研究,然而对它们的金融推理能力进行系统评估的研究仍然有限。在这项工作中,我们引入了AI金融智能基准(AFIB),这是一个多维评估框架,旨在评估金融分析能力在五个维度上:事实准确性、分析完整性、数据时效性、模型一致性和失败模式。我们使用来自真实股票研究任务的95个以上结构化金融分析问题的数据集,评估了五个AI系统:GPT、Gemini、Perplexity、Claude和SuperInvesting。结果显示,在模型之间的性能差异很大。在这个基准设置中,SuperInvesting获得了最高的综合性能,平均事实准确性得分为8.96/10,最高完整性得分为56.65/70,同时也展示了在评估系统中最低的幻觉率。检索导向系统如Perplexity在数据时效性任务上表现强劲,因为可以访问实时信息,但在分析综合和一致性方面表现较弱。总体而言,结果突显出大型语言模型中的金融智能本质上是多维的,结合了结构化金融数据访问和分析推理能力的系统为复杂的投资研究工作流提供了最可靠的性能。
更新时间: 2026-03-09 17:58:54
领域: cs.AI
DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy
We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Unlike approaches based on online reinforcement learning or paired human-robot data, our method enables robust adaptation to new tasks and scenes with minimal effort. In real-world experiments across 8 diverse manipulation tasks, DemoDiffusion achieves 83.8\% average success rate, compared to 13.8\% for the pre-trained policy and 52.5\% for kinematic retargeting, succeeding even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/
Updated: 2026-03-09 17:58:20
标题: DemoDiffusion:使用预训练的扩散策略进行一次性人类模仿
摘要: 我们提出了DemoDiffusion,这是一种简单的方法,可以使机器人通过模仿单个人类演示来执行操作任务,而无需进行特定任务的训练或配对的人机数据。我们的方法基于两个见解。首先,人类演示中的手部运动为机器人的末端执行器轨迹提供了有用的先验知识,我们可以通过运动再定位将其转换为粗略的开环机器人运动轨迹。其次,虽然这种再定位的运动捕捉了任务的总体结构,但可能与上下文中的合理机器人动作不太匹配。为了解决这个问题,我们利用预训练的通用扩散策略来修改轨迹,确保它既遵循人类运动,又保持在合理机器人动作的分布范围内。与基于在线强化学习或配对的人机数据的方法不同,我们的方法能够在几乎不费力气的情况下对新任务和场景进行稳健的适应。在8个不同的操作任务上进行的真实世界实验中,DemoDiffusion实现了83.8\%的平均成功率,而预训练策略为13.8%,再定位为52.5%,即使在预训练的通用策略完全失败的任务上也能成功。项目页面:https://demodiffusion.github.io/
更新时间: 2026-03-09 17:58:20
领域: cs.RO,cs.LG
Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods do not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 44.2% higher attack success rate (ASR) across 12 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
Updated: 2026-03-09 17:54:56
标题: 《基于树的对话强化策略优化在红队攻击中的应用》
摘要: 尽管近年来在AI安全领域取得了快速进展,但当前的大型语言模型在多轮交互环境中仍然容易受到对抗攻击的影响,攻击者在对话轮次中策略性地调整他们的提示,提出一个更为严峻但现实的挑战。现有的发现安全漏洞的方法要么依赖于人工专家进行手动红队测试,要么使用预定义模板和人工策划的攻击数据的自动化方法,其中大多数集中在单轮攻击上。然而,这些方法并未探索可能的多轮攻击的广阔空间,未能考虑复杂对话动态和策略性对话规划中产生的新型攻击轨迹。鉴于最近的研究发现,LLMs相比于单轮攻击,对多轮攻击更易受攻击,因此这一差距尤为重要。我们提出DialTree,这是一个基于策略的强化学习框架,集成了树搜索,通过将对话视为一个顺序决策问题,自主发现多样化的多轮攻击策略,实现系统化的探索,而无需手动策划的数据。通过大量实验,我们的方法不仅在12个目标模型上比以往最先进的方法实现了超过44.2%的ASR提高,还通过学习最大化多轮攻击成功的最优对话策略,有效地发现了新的攻击策略。
更新时间: 2026-03-09 17:54:56
领域: cs.LG,cs.AI,cs.CL
A Multi-Objective Optimization Approach for Sustainable AI-Driven Entrepreneurship in Resilient Economies
The rapid advancement of artificial intelligence (AI) technologies presents both unprecedented opportunities and significant challenges for sustainable economic development. While AI offers transformative potential for addressing environmental challenges and enhancing economic resilience, its deployment often involves substantial energy consumption and environmental costs. This research introduces the EcoAI-Resilience framework, a multi-objective optimization approach designed to maximize the sustainability benefits of AI deployment while minimizing environmental costs and enhancing economic resilience. The framework addresses three critical objectives through mathematical optimization: sustainability impact maximization, economic resilience enhancement, and environmental cost minimization. The methodology integrates diverse data sources, including energy consumption metrics, sustainability indicators, economic performance data, and entrepreneurship outcomes across 53 countries and 14 sectors from 2015-2024. Our experimental validation demonstrates exceptional performance with R scores exceeding 0.99 across all model components, significantly outperforming baseline methods, including Linear Regression (R = 0.943), Random Forest (R = 0.957), and Gradient Boosting (R = 0.989). The framework successfully identifies optimal AI deployment strategies featuring 100% renewable energy integration, 80% efficiency improvement targets, and optimal investment levels of $202.48 per capita. Key findings reveal strong correlations between economic complexity and resilience (r = 0.82), renewable energy adoption and sustainability outcomes (r = 0.71), and demonstrate significant temporal improvements in AI readiness (+1.12 points/year) and renewable energy adoption (+0.67 year) globally.
Updated: 2026-03-09 17:54:32
标题: 一个多目标优化方法:面向韧性经济中可持续的人工智能驱动的创业的研究
摘要: 人工智能(AI)技术的快速发展为可持续经济发展带来了前所未有的机遇和重大挑战。虽然AI具有转型潜力,可以解决环境挑战并增强经济韧性,但其部署往往涉及大量能源消耗和环境成本。本研究引入了EcoAI-Resilience框架,这是一种多目标优化方法,旨在最大程度地提高AI部署的可持续性益处,同时最小化环境成本并增强经济韧性。该框架通过数学优化解决了三个关键目标:最大化可持续性影响、增强经济韧性和最小化环境成本。该方法整合了各种数据源,包括能源消耗指标、可持续性指标、经济绩效数据以及2015年至2024年间来自53个国家和14个部门的创业成果。我们的实验验证表明,在所有模型组件上,R分数均超过0.99,显著优于基准方法,包括线性回归(R = 0.943)、随机森林(R = 0.957)和梯度提升(R = 0.989)。该框架成功地确定了最佳的AI部署策略,包括100%可再生能源整合、80%效率改进目标和每人202.48美元的最佳投资水平。主要发现揭示了经济复杂性与韧性之间的强相关性(r = 0.82)、可再生能源采用与可持续性结果之间的关联(r = 0.71),并且在全球范围内AI准备度(+1.12分/年)和可再生能源采用(+0.67年)均有显著的时间改进。
更新时间: 2026-03-09 17:54:32
领域: cs.AI
Split Federated Learning Architectures for High-Accuracy and Low-Delay Model Training
Can we find a network architecture for ML model training so as to optimize training loss (and thus, accuracy) in Split Federated Learning (SFL)? And can this architecture also reduce training delay and communication overhead? While accuracy is not influenced by how we split the model in ordinary, state-of-the-art SFL, in this work we answer the questions above in the affirmative. Recent Hierarchical SFL (HSFL) architectures adopt a three-tier training structure consisting of clients, (local) aggregators, and a central server. In this architecture, the model is partitioned at two partitioning layers into three sub-models, which are executed across the three tiers. Despite their merits, HSFL architectures overlook the impact of the partitioning layers and client-to-aggregator assignments on accuracy, delay, and overhead. This work explicitly captures the impact of the partitioning layers and client-to-aggregator assignments on accuracy, delay and overhead by formulating a joint optimization problem. We prove that the problem is NP-hard and propose the first accuracy-aware heuristic algorithm that explicitly accounts for model accuracy, while remaining delay-efficient. Simulation results on public datasets show that our approach can improve accuracy by 3%, while reducing delay by 20% and overhead by 50%, compared to state-of-the-art SFL and HSFL schemes.
Updated: 2026-03-09 17:53:20
标题: 分裂式联邦学习架构用于高精度和低延迟模型训练
摘要: 我们能否找到一个网络架构,用于在分裂式联邦学习(SFL)中优化训练损失(从而提高准确性)的机器学习模型训练?这种架构是否还能减少训练延迟和通信开销?尽管在普通的、最先进的SFL中,模型如何分裂不会影响准确性,但在这项工作中,我们肯定地回答了上述问题。最近的分层式SFL(HSFL)架构采用了一个由客户端、(本地)聚合器和中央服务器组成的三层训练结构。在这种架构中,模型在两个分区层被分为三个子模型,这些子模型在三个层之间执行。尽管具有优点,HSFL架构忽视了分区层和客户端到聚合器分配对准确性、延迟和开销的影响。本工作通过制定一个联合优化问题,明确捕捉了分区层和客户端到聚合器分配对准确性、延迟和开销的影响。我们证明了这个问题是NP难的,并提出了第一个明确考虑模型准确性的启发式算法,同时保持延迟高效。对公共数据集的模拟结果显示,与最先进的SFL和HSFL方案相比,我们的方法可以提高准确性3%,同时将延迟减少20%,开销减少50%。
更新时间: 2026-03-09 17:53:20
领域: cs.LG,cs.AI
Linear probes rely on textual evidence: Results from leakage mitigation studies in language models
White-box monitors are a popular technique for detecting potentially harmful behaviours in language models. While they perform well in general, their effectiveness in detecting text-ambiguous behaviour is disputed. In this work, we find evidence that removing textual evidence of a behaviour significantly decreases probe performance. The AUROC reduction ranges from $10$ to $30$ points depending on the setting. We evaluate probe monitors across three setups (Sandbagging, Sycophancy, and Bias), finding that when probes rely on textual evidence of the target behaviour (such as system prompts or CoT reasoning), performance degrades once these tokens are filtered. This filtering procedure is standard practice for output monitor evaluation. As further evidence of this phenomenon, we train Model Organisms which produce outputs without any behaviour verbalisations. We validate that probe performance on Model Organisms is substantially lower than unfiltered evaluations: $0.57$ vs $0.74$ AUROC for Bias, and $0.57$ vs $0.94$ AUROC for Sandbagging. Our findings suggest that linear probes may be brittle in scenarios where they must detect non-surface-level patterns.
Updated: 2026-03-09 17:52:49
标题: 线性探针依赖于文本证据:语言模型中泄漏减轻研究的结果
摘要: 白盒监视器是一种用于检测语言模型中潜在有害行为的流行技术。尽管它们在一般情况下表现良好,但它们在检测文本模糊行为方面的有效性存在争议。在这项工作中,我们发现,去除行为的文本证据会显著降低探针的性能。根据设置的不同,AUROC减少范围为10到30个百分点。我们评估了三种设置下的探测监视器(Sandbagging(故意示弱)、奉承和偏见),发现当探测器依赖于目标行为的文本证据(例如系统提示或CoT推理)时,一旦这些标记被过滤,性能就会下降。这种过滤程序是输出监视器评估的标准做法。作为进一步证据,我们训练了能够产生没有任何行为文本表达的模型生物。我们验证了在模型生物上的探针性能明显低于未经过滤的评估:对于偏见,AUROC分别为0.57和0.74,对于Sandbagging,AUROC分别为0.57和0.94。我们的研究结果表明,在必须检测非表层模式的情况下,线性探测器可能会变得脆弱。
更新时间: 2026-03-09 17:52:49
领域: cs.AI,cs.CL,cs.LG
Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
Updated: 2026-03-09 17:52:02
标题: 基准语言建模用于无损压缩全保真音频
摘要: 基于原始波形训练的自回归“语言”模型(LMs)可以用于无损音频压缩,但先前的工作仅限于8位音频,因此是否这种方法适用于实际环境(16/24位),并且能否与现有编解码器竞争仍然是个问题。我们在各个领域(音乐、语音、生物声学)、采样率(16kHz-48kHz)和位深度(8、16、24位)的全保真音频上对基于LM的压缩进行基准测试。在更高的位深度下,由于词汇量过大(16位为65K;24位为16.7M),标准的样本级标记化变得难以处理。我们提出了Trilobyte,一种适用于全分辨率音频的字节级标记化模式,将词汇的扩展性从$O(2^{b})$提高到$O(1)$,实现了首个可行的24位LM-based无损压缩。虽然LMs在8位和16位上稳定优于FLAC并实现了最先进的压缩效果,但我们观察到随着位深度超过8位,压缩效果变得更为适度。
更新时间: 2026-03-09 17:52:02
领域: cs.SD,cs.AI,cs.LG,eess.AS
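The vocabulary argument in the audio-compression abstract above can be made concrete with a small sketch. It is not Trilobyte itself: the function names, the big-endian byte order, and the offset used to turn signed PCM into unsigned integers are assumptions for illustration only. The point it shows is that splitting each b-bit sample into b/8 byte tokens keeps the vocabulary at 256 symbols for any bit depth, at the cost of a longer token sequence, while remaining exactly invertible.

```python
def samples_to_byte_tokens(samples, bit_depth):
    """Split signed PCM samples into byte tokens in 0..255 (most significant byte first)."""
    n_bytes = bit_depth // 8
    offset = 1 << (bit_depth - 1)  # assumption: shift signed values into the unsigned range
    tokens = []
    for s in samples:
        tokens.extend((s + offset).to_bytes(n_bytes, "big"))
    return tokens


def byte_tokens_to_samples(tokens, bit_depth):
    """Inverse mapping: regroup byte tokens into signed PCM samples."""
    n_bytes = bit_depth // 8
    offset = 1 << (bit_depth - 1)
    return [
        int.from_bytes(bytes(tokens[i:i + n_bytes]), "big") - offset
        for i in range(0, len(tokens), n_bytes)
    ]


# Extreme and near-zero 24-bit samples
pcm_24bit = [-8388608, -1, 0, 1, 8388607]
byte_tokens = samples_to_byte_tokens(pcm_24bit, 24)
restored = byte_tokens_to_samples(byte_tokens, 24)
```

Sample-level tokenization of the same 24-bit audio would need about 16.7M distinct symbols, whereas here every token is one of 256 values and each sample becomes three tokens.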
Structural Causal Bottleneck Models
We introduce structural causal bottleneck models (SCBMs), a novel class of structural causal models. At the core of SCBMs lies the assumption that causal effects between high-dimensional variables only depend on low-dimensional summary statistics, or bottlenecks, of the causes. SCBMs provide a flexible framework for task-specific dimension reduction while being estimable via standard, simple learning algorithms in practice. We analyse identifiability in SCBMs, connect them to information bottlenecks in the sense of Tishby & Zaslavsky (2015), and illustrate how to estimate them experimentally. We also demonstrate the benefit of bottlenecks for effect estimation in low-sample transfer learning settings. We argue that SCBMs provide an alternative to existing causal dimension reduction frameworks like causal representation learning or causal abstraction learning.
Updated: 2026-03-09 17:50:10
标题: 结构性因果瓶颈模型
摘要: 我们引入结构因果瓶颈模型(SCBM),这是一种新颖的结构因果模型类别。在SCBM的核心是假设高维变量之间的因果效应仅取决于导致原因的低维摘要统计数据,或者瓶颈。SCBM提供了一个灵活的框架,用于特定任务的维度缩减,同时在实践中可以通过标准、简单的学习算法进行估计。我们分析了SCBM中的可识别性,将它们与Tishby和Zaslavsky(2015)的信息瓶颈联系起来,并说明了如何通过实验估计它们。我们还展示了在低样本转移学习环境中瓶颈对效应估计的好处。我们认为SCBM为现有的因果维度缩减框架(如因果表示学习或因果抽象学习)提供了一种替代方案。
更新时间: 2026-03-09 17:50:10
领域: stat.ML,cs.LG
A New Lower Bound for the Random Offerer Mechanism in Bilateral Trade using AI-Guided Evolutionary Search
The celebrated Myerson--Satterthwaite theorem shows that in bilateral trade, no mechanism can be simultaneously fully efficient, Bayesian incentive compatible (BIC), and budget balanced (BB). This naturally raises the question of how closely the gains from trade (GFT) achievable by a BIC and BB mechanism can approximate the first-best (fully efficient) benchmark. The optimal BIC and BB mechanism is typically complex and highly distribution-dependent, making it difficult to characterize directly. Consequently, much of the literature analyzes simpler mechanisms such as the Random-Offerer (RO) mechanism and establishes constant-factor guarantees relative to the first-best GFT. An important open question concerns the worst-case performance of the RO mechanism relative to first-best (FB) efficiency. While it was originally hypothesized that the approximation ratio $\frac{\text{GFT}_{\text{FB}}}{\text{GFT}_{\text{RO}}}$ is bounded by $2$, recent work provided counterexamples to this conjecture: Cai et al. proved that the ratio can be strictly larger than $2$, and Babaioff et al. exhibited an explicit example with ratio approximately $2.02$. In this work, we employ AlphaEvolve, an AI-guided evolutionary search framework, to explore the space of value distributions. We identify a new worst-case instance that yields an improved lower bound of $\frac{\text{GFT}_{\text{FB}}}{\text{GFT}_{\text{RO}}} \ge \textbf{2.0749}$. This establishes a new lower bound on the worst-case performance of the Random-Offerer mechanism, demonstrating a wider efficiency gap than previously known.
Updated: 2026-03-09 17:49:02
标题: 一种使用AI引导的进化搜索技术在双边交易中随机报价者机制的新下界
摘要: 著名的Myerson-Satterthwaite定理表明,在双边贸易中,没有任何机制可以同时达到完全有效、贝叶斯激励兼容(BIC)和预算平衡(BB)。这自然引发了一个问题,即BIC和BB机制可以实现的贸易收益(GFT)如何接近第一最佳(完全有效)基准。最佳的BIC和BB机制通常是复杂的,高度依赖于分布,很难直接描述。因此,大部分文献分析了更简单的机制,如随机报价者(RO)机制,并建立了相对于第一最佳GFT的常数因子保证。一个重要的未解问题涉及RO机制相对于第一最佳(FB)效率的最坏情况表现。尽管最初假设近似比例GFT_FB / GFT_RO受到2的限制,但最近的工作提供了这一猜想的反例:Cai等人证明了比例可以严格大于2,而Babaioff等人展示了一个比例约为2.02的明确例子。 在这项工作中,我们使用AlphaEvolve,一个人工智能引导的进化搜索框架,来探索价值分配空间。我们确定了一个新的最坏情况实例,得到了一个改进的下界GFT_FB / GFT_RO≥2.0749。这建立了随机报价者机制最坏情况表现的新下界,显示了比以前已知的更广泛的效率差距。
更新时间: 2026-03-09 17:49:02
领域: cs.LG,cs.AI,cs.GT,econ.TH
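The ratio studied in the Random-Offerer abstract above can be computed directly for small discrete instances. The sketch below is only an illustration under stated assumptions: independent, finitely supported distributions given as `(value, probability)` pairs, ties between equally good prices resolved in favour of the first listed support point, and no trade when the offerer cannot make a strictly positive expected profit. It has no connection to AlphaEvolve or to the worst-case instance reported in the abstract, and all function names are hypothetical.

```python
def first_best_gft(seller, buyer):
    """Expected gains from trade when trade happens whenever value exceeds cost."""
    return sum(ps * pb * max(v - c, 0.0) for c, ps in seller for v, pb in buyer)


def seller_offer_gft(seller, buyer):
    """Seller with cost c posts the price maximising (p - c) * P(v >= p)."""
    total = 0.0
    for c, ps in seller:
        def profit(p):
            return (p - c) * sum(pb for v, pb in buyer if v >= p)
        # For discrete values an optimal price lies on the buyer's support
        price = max((v for v, _ in buyer), key=profit)
        if profit(price) <= 0:
            continue  # assumption: no offer without strictly positive expected profit
        total += ps * sum(pb * (v - c) for v, pb in buyer if v >= price)
    return total


def buyer_offer_gft(seller, buyer):
    """Buyer with value v posts the price maximising (v - q) * P(c <= q)."""
    total = 0.0
    for v, pb in buyer:
        def surplus(q):
            return (v - q) * sum(ps for c, ps in seller if c <= q)
        price = max((c for c, _ in seller), key=surplus)
        if surplus(price) <= 0:
            continue  # assumption: no offer without strictly positive expected surplus
        total += pb * sum(ps * (v - c) for c, ps in seller if c <= price)
    return total


def random_offerer_ratio(seller, buyer):
    """First-best GFT divided by the GFT of a fair coin flip between the two offers."""
    ro = 0.5 * seller_offer_gft(seller, buyer) + 0.5 * buyer_offer_gft(seller, buyer)
    return first_best_gft(seller, buyer) / ro  # assumes ro > 0


# Toy instance: seller cost is always 0, buyer value is 1 or 3 with equal probability
toy_seller = [(0.0, 1.0)]
toy_buyer = [(1.0, 0.5), (3.0, 0.5)]
toy_ratio = random_offerer_ratio(toy_seller, toy_buyer)
```

In the toy instance the seller prefers the high price and loses the low-value buyer, so the mechanism captures 1.75 of the 2.0 available; a search procedure such as the one in the abstract would look for distributions that push this ratio as high as possible.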
Momentum SVGD-EM for Accelerated Maximum Marginal Likelihood Estimation
Maximum marginal likelihood estimation (MMLE) can be formulated as the optimization of a free energy functional. From this viewpoint, the Expectation-Maximisation (EM) algorithm admits a natural interpretation as a coordinate descent method over the joint space of model parameters and probability measures. Recently, a significant body of work has adopted this perspective, leading to interacting particle algorithms for MMLE. In this paper, we propose an accelerated version of one such procedure, based on Stein variational gradient descent (SVGD), by introducing Nesterov acceleration in both the parameter updates and in the space of probability measures. The resulting method, termed Momentum SVGD-EM, consistently accelerates convergence in terms of required iterations across various tasks of increasing difficulty, demonstrating effectiveness in both low- and high-dimensional settings.
Updated: 2026-03-09 17:47:36
标题: 动量SVGD-EM用于加速最大边际似然估计
摘要: 最大边际似然估计(MMLE)可以被表述为自由能量泛函的优化。从这个观点来看,期望最大化(EM)算法可以被解释为在模型参数和概率测度的联合空间上的坐标下降方法。最近,大量研究采纳了这种观点,导致了用于MMLE的相互作用粒子算法。在本文中,我们提出了一种加速版本的这种过程,基于Stein变分梯度下降(SVGD),通过在参数更新和概率测度空间中引入Nesterov加速。由此产生的方法,称为动量SVGD-EM,在不同难度的任务中一贯加速收敛所需的迭代次数,展示了在低维和高维环境中的效力。
更新时间: 2026-03-09 17:47:36
领域: stat.ML,cs.LG,stat.CO
Quantization of Ricci Curvature in Information Geometry
In 2004, while studying the information geometry of binary Bayesian networks (bitnets), the author conjectured that the volume-averaged Ricci scalar $\langle R \rangle$ computed with respect to the Fisher information metric is universally quantized to positive half-integers: $\langle R \rangle \in \tfrac{1}{2}\mathbb{Z}$. This paper resolves the conjecture after 20 years. We prove it for tree-structured and complete-graph bitnets via a universal Beta function cancellation mechanism, and disprove it in general by exhibiting explicit loop counterexamples. We extend the program to Gaussian DAG networks, where a sign dichotomy holds: discrete bitnets have positive curvature, while Gaussian networks form solvable Lie groups with negative curvature.
Updated: 2026-03-09 17:44:10
标题: Ricci曲率在信息几何中的量子化
摘要: 在2004年,当研究二元贝叶斯网络(bitnets)的信息几何学时,作者猜测用费舍尔信息度量计算的体积平均里奇标量<R>普遍量子化为正的半整数:<R>在(1/2)Z。这篇论文在20年后解决了这一猜想。我们通过一个通用的Beta函数抵消机制证明了树状结构和完全图bitnets的这一猜想,并通过展示明确的回路反例来证明了一般情况下这一猜想是错误的。 我们将这个方案扩展到高斯DAG网络,这里存在一个符号二分法:离散bitnets具有正曲率,而高斯网络形成可解的李群并具有负曲率。
更新时间: 2026-03-09 17:44:10
领域: cs.IT,cs.LG,quant-ph
Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? A Study of Hierarchical Gating and Calibration
Human value detection from single sentences is a sparse, imbalanced multi-label task. We study whether Schwartz higher-order (HO) categories help this setting on ValueEval'24 / ValuesML (74K English sentences) under a compute-frugal budget. Rather than proposing a new architecture, we compare direct supervised transformers, hard HO$\rightarrow$values pipelines, Presence$\rightarrow$HO$\rightarrow$values cascades, compact instruction-tuned large language models (LLMs), QLoRA, and low-cost upgrades such as threshold tuning and small ensembles. HO categories are learnable: the easiest bipolar pair, Growth vs. Self-Protection, reaches Macro-$F_1=0.58$. The most reliable gains come from calibration and ensembling: threshold tuning improves Social Focus vs. Personal Focus from $0.41$ to $0.57$ ($+0.16$), transformer soft voting lifts Growth from $0.286$ to $0.303$, and a Transformer+LLM hybrid reaches $0.353$ on Self-Protection. In contrast, hard hierarchical gating does not consistently improve the end task. Compact LLMs also underperform supervised encoders as stand-alone systems, although they sometimes add useful diversity in hybrid ensembles. Under this benchmark, the HO structure is more useful as an inductive bias than as a rigid routing rule.
Updated: 2026-03-09 17:41:24
Categories: cs.CL,cs.AI,cs.LG
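The threshold-tuning upgrade mentioned in the value-detection abstract above can be illustrated with a minimal sketch. It assumes per-label sigmoid scores on a validation split and a plain grid search that maximizes per-label F1. The function names `f1_binary` and `tune_thresholds`, the grid, and the 0.5 default are hypothetical and not taken from the paper.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for one binary label; returns 0.0 when there are no true positives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tune_thresholds(y_true, scores, grid=None):
    """For each label column, pick the grid threshold with the best validation F1.

    Assumption: the default cut-off of 0.5 is kept unless a grid value is strictly better.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    n_labels = y_true.shape[1]
    best = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best_f1 = f1_binary(y_true[:, j], (scores[:, j] >= 0.5).astype(int))
        for t in grid:
            f1 = f1_binary(y_true[:, j], (scores[:, j] >= t).astype(int))
            if f1 > best_f1:
                best_f1, best[j] = f1, t
    return best
```

For sparse labels, scores often sit well below 0.5 even for positive sentences, which is one plausible reason a tuned cut-off can help more than a change of architecture.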
Characterization and upgrade of a quantum graph neural network for charged particle tracking
In the forthcoming years the LHC experiments are going to be upgraded to benefit from the substantial increase of the LHC instantaneous luminosity, which will lead to larger, denser events, and, consequently, greater complexity in reconstructing charged particle tracks, motivating frontier research in new technologies. Quantum machine learning models are being investigated as potential new approaches to high energy physics (HEP) tasks. We characterize and upgrade a quantum graph neural network (QGNN) architecture for charged particle track reconstruction on a simulated high luminosity dataset. The model operates on a set of event graphs, each built from the hits generated in tracking detector layers by particles produced in proton collisions, performing a classification of the possible hit connections between adjacent layers. In this approach the QGNN is designed as a hybrid architecture, interleaving classical feedforward networks with parametrized quantum circuits. We characterize the interplay between the classical and quantum components. We report on the principal upgrades to the original design, and present new evidence of improved training behavior, specifically in terms of convergence toward the final trained configuration.
Updated: 2026-03-09 17:41:08
Categories: quant-ph,cs.LG,hep-ex
Cybersecurity AI: Hacking Consumer Robots in the AI Era
Is robot cybersecurity broken by AI? Consumer robots -- from autonomous lawnmowers to powered exoskeletons and window cleaners -- are rapidly entering homes and workplaces, yet their security remains rooted in assumptions of specialized attacker expertise. This paper presents evidence that Generative AI has fundamentally disrupted robot cybersecurity: what historically required deep knowledge of ROS, ROS 2, and robotic system internals can now be automated by anyone with access to state-of-the-art GenAI tools spearheaded by the open source CAI (Cybersecurity AI). We provide empirical evidence through three case studies: (1) compromising a Hookii autonomous lawnmower robot, uncovering fleet-wide vulnerabilities and data protection violations affecting 267+ connected devices, (2) exploiting a Hypershell powered exoskeleton, demonstrating safety-critical motor control weaknesses and credential exposure including access to over 3,300 internal support emails, and (3) breaching a HOBOT S7 Pro window cleaning robot, achieving unauthenticated BLE command injection and OTA firmware exploitation. Across these platforms, CAI discovered in an automated manner 38 vulnerabilities that would have previously required months of specialized security research. Our findings reveal a stark asymmetry: while offensive capabilities have been democratized through AI, defensive measures often remain lagging behind. We argue that traditional defense-in-depth architectures like the Robot Immune System (RIS) must evolve toward GenAI-native defensive agents capable of matching the speed and adaptability of AI-powered attacks.
Updated: 2026-03-09 17:40:47
Categories: cs.CR
How Far Can Unsupervised RLVR Scale LLM Training?
Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory and extensive experiments. We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned. Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by model prior rather than engineering choices. Despite these scaling limits, we find intrinsic rewards remain valuable in test-time training on small datasets, and propose Model Collapse Step to measure model prior, serving as a practical indicator for RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence they may escape the confidence-correctness ceiling. Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
Updated: 2026-03-09 17:38:11
Categories: cs.LG,cs.CL
Context-free Self-Conditioned GAN for Trajectory Forecasting
In this paper, we present a context-free unsupervised approach based on a self-conditioned GAN to learn different modes from 2D trajectories. Our intuition is that each mode indicates a different behavioral moving pattern in the discriminator's feature space. We apply this approach to the problem of trajectory forecasting. We present three different training settings based on self-conditioned GAN, which produce better forecasters. We test our method in two data sets: human motion and road agents. Experimental results show that our approach outperforms previous context-free methods in the least representative supervised labels while performing well in the remaining labels. In addition, our approach outperforms globally in human motion, while performing well in road agents.
Updated: 2026-03-09 17:37:03
Categories: cs.LG
From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
Updated: 2026-03-09 17:35:57
Categories: cs.RO,cs.AI,cs.CV,cs.LG
OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.
Updated: 2026-03-09 17:34:53
Categories: cs.AI,cs.CL,cs.IR
Cluster-Aware Attention-Based Deep Reinforcement Learning for Pickup and Delivery Problems
The Pickup and Delivery Problem (PDP) is a fundamental and challenging variant of the Vehicle Routing Problem, characterized by tightly coupled pickup--delivery pairs, precedence constraints, and spatial layouts that often exhibit clustering. Existing deep reinforcement learning (DRL) approaches either model all nodes on a flat graph, relying on implicit learning to enforce constraints, or achieve strong performance through inference-time collaborative search at the cost of substantial latency. In this paper, we propose \emph{CAADRL} (Cluster-Aware Attention-based Deep Reinforcement Learning), a DRL framework that explicitly exploits the multi-scale structure of PDP instances via cluster-aware encoding and hierarchical decoding. The encoder builds on a Transformer and combines global self-attention with intra-cluster attention over depot, pickup, and delivery nodes, producing embeddings that are both globally informative and locally role-aware. Based on these embeddings, we introduce a Dynamic Dual-Decoder with a learnable gate that balances intra-cluster routing and inter-cluster transitions at each step. The policy is trained end-to-end with a POMO-style policy gradient scheme using multiple symmetric rollouts per instance. Experiments on synthetic clustered and uniform PDP benchmarks show that CAADRL matches or improves upon strong state-of-the-art baselines on clustered instances and remains highly competitive on uniform instances, particularly as problem size increases. Crucially, our method achieves these results with substantially lower inference time than neural collaborative-search baselines, suggesting that explicitly modeling cluster structure provides an effective and efficient inductive bias for neural PDP solvers.
Updated: 2026-03-09 17:31:34
Categories: cs.LG
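The POMO-style training signal named in the CAADRL abstract above can be sketched as follows. In POMO, several rollouts of the same instance share a baseline equal to their mean reward. How CAADRL wires this into its dual decoder is not stated in the abstract, so the function names and array shapes here are assumptions made for illustration only.

```python
import numpy as np

def pomo_advantages(rewards):
    """rewards: (batch, n_rollouts) array; subtract each instance's mean reward."""
    baseline = rewards.mean(axis=1, keepdims=True)  # shared baseline per instance
    return rewards - baseline

def pomo_surrogate_loss(rewards, log_probs):
    """REINFORCE surrogate; minimizing it raises the log-probability of
    rollouts that beat their instance's average reward."""
    adv = pomo_advantages(rewards)
    return -(adv * log_probs).mean()
```

Here `rewards` would typically be negative route lengths and `log_probs` the summed log-probabilities of each rollout's actions. Because the baseline comes from sibling rollouts, no separate critic network is required.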
CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation
Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene, which is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset containing structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other generation methods empowered by CoT. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo
Updated: 2026-03-09 17:31:16
Categories: cs.AI
Group Entropies and Mirror Duality: A Class of Flexible Mirror Descent Updates for Machine Learning
We introduce a comprehensive theoretical and algorithmic framework that bridges formal group theory and group entropies with modern machine learning, paving the way for an infinite, flexible family of Mirror Descent (MD) optimization algorithms. Our approach exploits the rich structure of group entropies, which are generalized entropic functionals governed by group composition laws, encompassing and significantly extending all trace-form entropies such as the Shannon, Tsallis, and Kaniadakis families. By leveraging group-theoretical mirror maps (or link functions) in MD, expressed via multi-parametric generalized logarithms and their inverses (group exponentials), we achieve highly flexible and adaptable MD updates that can be tailored to diverse data geometries and statistical distributions. To this end, we introduce the notion of \textit{mirror duality}, which allows us to seamlessly switch or interchange group-theoretical link functions with their inverses, subject to specific learning rate constraints. Tuning or learning the hyperparameters of the group logarithms enables us to adapt the model to the statistical properties of the training distribution, while simultaneously ensuring desirable convergence characteristics via fine-tuning. This generality not only provides greater flexibility and improved convergence properties, but also opens new perspectives for applications in machine learning and deep learning by expanding the design of regularizers and natural gradient algorithms. We extensively evaluate the validity, robustness, and performance of the proposed updates on large-scale, simplex-constrained quadratic programming problems.
Updated: 2026-03-09 17:31:03
Categories: cs.LG,hep-th,math-ph
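As a point of reference for the mirror descent abstract above, the sketch below shows the classical Shannon-entropy special case (exponentiated gradient) on a simplex-constrained quadratic program. It is not the paper's group-entropy update: the link function here is the ordinary logarithm, and the names `eg_step` and `solve_simplex_qp`, the step size, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

def eg_step(x, grad, lr):
    """One entropic mirror descent (exponentiated gradient) step on the simplex."""
    z = np.log(x) - lr * grad   # gradient step in the dual (mirror) space
    z -= z.max()                # shift for numerical stability
    w = np.exp(z)               # map back with the inverse link function
    return w / w.sum()          # normalize, i.e. project back onto the simplex

def solve_simplex_qp(Q, c, lr=0.1, steps=500):
    """Minimize 0.5 * x^T Q x + c^T x over the probability simplex."""
    n = len(c)
    x = np.full(n, 1.0 / n)     # start from the uniform distribution
    for _ in range(steps):
        x = eg_step(x, Q @ x + c, lr)
    return x
```

A generalized-logarithm variant would swap `np.log` and `np.exp` for a parametric pair of link functions. In that case the final normalization is in general no longer the exact Bregman projection, so that step would have to be revisited.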
Divide and Predict: An Architecture for Input Space Partitioning and Enhanced Accuracy
In this article the authors develop an intrinsic measure for quantifying heterogeneity in training data for supervised learning. This measure is the variance of a random variable which factors through the influences of pairs of training points. The variance is shown to capture data heterogeneity and can thus be used to assess if a sample is a mixture of distributions. The authors prove that the data itself contains key information that supports a partitioning into blocks. Several proof of concept studies are provided that quantify the connection between variance and heterogeneity for EMNIST image data and synthetic data. The authors establish that variance is maximal for equal mixes of distributions, and detail how variance-based data purification followed by conventional training over blocks can lead to significant increases in test accuracy.
Updated: 2026-03-09 17:26:56
Categories: cs.LG
HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases
Retrieval Augmented Generation (RAG) is an essential agent for Large Language Model (LLM) aided Hardware Description Language (HDL) tasks, addressing the challenges of limited training data and prohibitively long prompts. However, its performance in handling ambiguous queries and real-world, repository-level HDL projects containing thousands or even tens of thousands of code lines remains limited. Our analysis demonstrates two fundamental mismatches, structural and vocabulary, between conventional semantic similarity-based RAGs and HDL codes. To this end, we propose HDLxGraph, the first framework that integrates the inherent graph characteristics of HDLs with RAGs for LLM-assisted tasks. Specifically, HDLxGraph incorporates Abstract Syntax Trees (ASTs) to capture HDLs' hierarchical structures and Data Flow Graphs (DFGs) to address the vocabulary mismatch. In addition, to overcome the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, an LLM-generated dataset derived from real-world, repository-level HDL projects. Evaluations show that HDLxGraph improves search, debugging, and completion accuracy by 12.04%/12.22%/5.04% and by 11.59%/8.18%/4.07% over state-of-the-art similarity-based RAG and software-code Graph RAG baselines, respectively. The code for HDLxGraph and the HDLSearch benchmark is available at https://github.com/UMN-ZhaoLab/HDLxGraph.
Updated: 2026-03-09 17:26:13
Categories: cs.AR,cs.CL,cs.LG
Grow, Don't Overwrite: Fine-tuning Without Forgetting
Adapting pre-trained models to specialized tasks often leads to catastrophic forgetting, where new knowledge overwrites foundational capabilities. Existing methods either compromise performance on the new task or struggle to balance training stability with efficient reuse of pre-trained knowledge. We introduce a novel function-preserving expansion method that resolves this dilemma. Our technique expands model capacity by replicating pre-trained parameters within transformer submodules and applying a scaling correction that guarantees the expanded model is mathematically identical to the original at initialization, enabling stable training while exploiting existing knowledge. Empirically, our method eliminates the trade-off between plasticity and stability, matching the performance of full fine-tuning on downstream tasks without any degradation of the model's original capabilities. Furthermore, we demonstrate the modularity of our approach, showing that by selectively expanding a small subset of layers we can achieve the same performance as full fine-tuning at a fraction of the computational cost.
Updated: 2026-03-09 17:26:03
Categories: cs.LG
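The function-preserving expansion described in the fine-tuning abstract above can be illustrated on a single feed-forward block. The sketch assumes one simple scheme: every hidden unit is copied once and the output weights are halved, so the widened block computes the same function at initialization. The abstract does not say which submodules are expanded or what scaling is used, so `expand_mlp` and the factor 0.5 are hypothetical.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU block, standing in for a transformer feed-forward submodule."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

def expand_mlp(W1, b1, W2, b2):
    """Double the hidden width by copying units and halving the output weights."""
    W1_big = np.concatenate([W1, W1], axis=0)        # replicate input weights
    b1_big = np.concatenate([b1, b1], axis=0)        # replicate hidden biases
    W2_big = np.concatenate([W2, W2], axis=1) * 0.5  # scaling correction
    return W1_big, b1_big, W2_big, b2

# Toy parameters (random, for illustration only).
rng = np.random.default_rng(0)
d_model, d_hidden = 4, 8
W1, b1 = rng.normal(size=(d_hidden, d_model)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_model, d_hidden)), rng.normal(size=d_model)
x = rng.normal(size=d_model)
```

Each copied unit contributes half of the original unit's output, so the two halves sum to the original value. In this toy version exact copies also receive identical gradients. The sketch does not show how the paper handles that symmetry.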
Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization
Template-free animatable head avatars can achieve high visual fidelity by learning expression-dependent facial deformation directly from a subject's capture, avoiding parametric face templates and hand-designed blendshape spaces. However, since learned deformation is supervised only by the expressions observed for a single identity, these models suffer from limited expression coverage and often struggle when driven by motions that deviate from the training distribution. We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation designed for template-free head avatars that learn deformation from data. RAF constructs a large unlabeled expression bank and, during training, replaces a subset of the subject's expression features with nearest-neighbor expressions retrieved from this bank while still reconstructing the subject's original frames. This exposes the deformation field to a broader range of expression conditions, encouraging stronger identity-expression decoupling and improving robustness to expression distribution shift without requiring paired cross-identity data, additional annotations, or architectural changes. We further analyze how retrieval augmentation increases expression diversity and validate retrieval quality with a user study showing that retrieved neighbors are perceptually closer in expression and pose. Experiments on the NeRSemble benchmark demonstrate that RAF consistently improves expression fidelity over the baseline, in both self-driving and cross-driving scenarios.
Updated: 2026-03-09 17:24:11
Categories: cs.CV,cs.GR,cs.LG
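The retrieval step described in the RAF abstract above can be sketched as a nearest-neighbor swap over expression codes. The sketch assumes expression features are fixed-length vectors compared with Euclidean distance and that each training sample is swapped independently with probability `replace_prob`. These choices, and the function names, are illustrative assumptions and not details from the paper.

```python
import numpy as np

def retrieve_nearest(bank, queries):
    """Return, for each query row, the closest bank row under Euclidean distance."""
    d2 = ((queries[:, None, :] - bank[None, :, :]) ** 2).sum(axis=-1)
    return bank[d2.argmin(axis=1)]

def augment_batch(expr, bank, replace_prob, rng):
    """Swap a random subset of expression codes for their nearest bank neighbors.

    The reconstruction target (the subject's original frame) is left unchanged
    by the caller; only the conditioning expression code is replaced.
    """
    swap = rng.random(len(expr)) < replace_prob
    out = expr.copy()
    if swap.any():
        out[swap] = retrieve_nearest(bank, expr[swap])
    return out, swap
```

The brute-force distance matrix is fine for a small bank. A large unlabeled bank would more likely need an approximate nearest-neighbor index.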
Exploring Embedding Priors in Prompt-Tuning for Improved Interpretability and Control
Prompt-Tuning is an efficient method for adapting pre-trained language models to new tasks with minimal computational overhead by modifying prompt embeddings. In this work, we investigate how crucial the phenomenon of embedding collapse, frequently observed in Prompt-Tuning, is for the final performance of the model. To address this question, we designed embedding priors and compared them with posteriors of the converged Soft and Deep Prompt-Tuning methods. Our findings suggest that priors strongly affect the position of the tuned embeddings, and models can effectively work with embeddings from different parts of activation spaces, including completely new regions. As the final Prompt-Tuning capabilities are limited, we hypothesize that controllable Prompt-Tuning posteriors may serve as a good starting point for tasks such as chain-of-thought (COT) distillation. Our experiments also show that generated trajectories are not localized in the activation space of the models. However, there are distinct clusters of activations for distant tasks (e.g., NLP and arithmetic), while activations between NLP tasks (e.g., Question-Answering and MLM) lie in the same cluster. These observations raise questions about the importance of a single activation cluster for the generalization abilities of large language models.
Updated: 2026-03-09 17:23:47
Categories: cs.CL,cs.LG
X-SYS: A Reference Architecture for Interactive Explanation Systems
The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.
Updated: 2026-03-09 17:21:38
Categories: cs.AI,cs.HC,cs.SE
OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies
Vision-language-action (VLA) models have shown great promise as generalist policies for a large range of relatively simple tasks. However, they demonstrate limited performance on more complex tasks, such as those requiring complex spatial or semantic understanding, manipulation in clutter, or precise manipulation. We propose OMNIGUIDE, a flexible framework that improves VLA performance on such tasks by leveraging arbitrary sources of guidance, such as 3D foundation models, semantic-reasoning VLMs, and human pose models. We show how many kinds of guidance can be naturally expressed as differentiable energy functions with task-specific attractors and repellers located in 3D space that influence the sampling of VLA actions. In this way, OMNIGUIDE enables guidance sources with complementary task-relevant strengths to improve a VLA model's performance on challenging tasks. Extensive experiments in both simulation and real-world environments, across diverse sources of guidance, demonstrate that OMNIGUIDE enhances the performance of state-of-the-art generalist policies (e.g., $π_{0.5}$, GR00T N1.6) significantly across success and safety rates. Critically, our unified framework matches or surpasses the performance of prior methods designed to incorporate specific sources of guidance into VLA policies. Project Page: https://omniguide.github.io/
Updated: 2026-03-09 17:18:13
标题: OmniGuide:用于增强通用机器人策略的通用引导场
摘要: 视觉-语言-动作(VLA)模型已经显示出在一系列相对简单任务上作为通用策略的巨大潜力。然而,在更复杂的任务上,比如需要复杂的空间或语义理解、在混乱环境中操作或精确操作的任务上,它们表现出有限的性能。我们提出了OMNIGUIDE,这是一个灵活的框架,通过利用任意的引导源(如3D基础模型、语义推理VLM和人体姿势模型)来提高VLA在这些任务上的性能。我们展示了许多种引导方式可以自然地表达为在3D空间中具有任务特定吸引子和排斥子的可微能量函数,这些吸引子和排斥子影响了VLA动作的采样。通过这种方式,OMNIGUIDE使得具有互补任务相关优势的引导源能够提高VLA模型在挑战性任务上的表现。在模拟和真实环境中进行的广泛实验,跨越不同引导源,展示了OMNIGUIDE显著提高了最先进的通用策略(如$π_{0.5}$,GR00T N1.6)在成功率和安全率上的性能。至关重要的是,我们的统一框架与先前设计用于将特定引导源纳入VLA策略的方法的性能相匹配或超越了它们。项目页面:https://omniguide.github.io/
更新时间: 2026-03-09 17:18:13
领域: cs.RO,cs.LG
PostTrainBench: Can LLM Agents Automate LLM Post-Training?
AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.
Updated: 2026-03-09 17:18:00
标题: PostTrainBench:LLM代理能自动化LLM后训练吗?
摘要: 人工智能代理在过去一年里在软件工程方面变得异常熟练,这主要归功于推理能力的提高。这引发了一个更深层次的问题:这些系统能否扩展其能力以自动化人工智能研究本身?在本文中,我们探讨了后训练这一关键阶段,将基础LLMs转变为有用的助手。我们引入了PostTrainBench来评估LLM代理在受限计算约束条件下(在一台H100 GPU上进行10小时)能否自主执行后训练。我们要求前沿代理(例如,带有Opus 4.6的Claude Code)优化基础LLM在特定基准测试(例如,在AIME上的Qwen3-4B)上的性能。重要的是,我们不向代理提供任何预定义策略,而是让他们完全自主地在网络上查找必要信息,运行实验并整理数据。我们发现前沿代理取得了实质性进展,但通常落后于主要供应商的经过指导调整的LLMs:最佳代理为23.2%,而官方经过指导调整的模型为51.1%。然而,在特定场景下,代理可以超过经过指导调整的模型:GPT-5.1 Codex Max在BFCL上达到了89%,而官方模型为67%。我们还观察到几种值得注意的失败模式。代理有时会进行奖励欺骗:在测试集上训练,下载现有的经过指导调整的检查点而不是训练自己的,以及使用他们找到的API密钥生成未经授权的合成数据。这些行为令人担忧,并突显了在这些系统变得更加有能力时谨慎设置沙盒的重要性。总的来说,我们希望PostTrainBench能有助于跟踪人工智能研发自动化的进展,并研究伴随其而来的风险。网站和代码可在https://posttrainbench.com/ 上找到。
更新时间: 2026-03-09 17:18:00
领域: cs.SE,cs.AI,cs.LG
UNBOX: Unveiling Black-box visual models with Natural-language
Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces human-interpretable text descriptors that maximally activate each class, revealing the concepts a model has implicitly learned, the training distribution it reflects, and potential sources of bias. We evaluate UNBOX on ImageNet-1K, Waterbirds, and CelebA through semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing. Despite operating under the strictest black-box constraints, UNBOX performs competitively with state-of-the-art white-box interpretability methods. This demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.
Updated: 2026-03-09 17:16:39
标题: UNBOX:使用自然语言揭示黑盒视觉模型
摘要: 确保开放世界视觉识别的可信度需要具有可解释性、公平性和对分布变化具有鲁棒性的模型。然而,现代视觉系统越来越多地部署为专有的黑匣子API,仅暴露输出概率,隐藏架构、参数、梯度和训练数据。这种不透明性阻碍了有意义的审计、偏见检测和故障分析。现有的解释方法假定具有白盒或灰盒访问权限或对训练分布的知识,使它们在这些现实世界设置中无法使用。我们介绍了UNBOX,这是一个在完全无数据、无梯度和无反向传播约束下进行类别模型解剖的框架。UNBOX利用大型语言模型和文本到图像扩散模型,将激活最大化重新构建为仅由输出概率驱动的纯语义搜索。该方法产生了人类可解释的文本描述符,最大程度地激活每个类别,揭示了模型隐含学习的概念、反映的训练分布以及潜在的偏见来源。我们通过语义保真度测试、视觉特征相关性分析和切片发现审计,在ImageNet-1K、Waterbirds和CelebA上评估了UNBOX。尽管在最严格的黑匣子约束下运行,UNBOX与最先进的白盒可解释性方法竞争力强。这表明可以在没有任何内部访问的情况下恢复对模型内部推理的有意义洞察,从而使视觉识别系统更加可信赖和负责。
更新时间: 2026-03-09 17:16:39
领域: cs.CV,cs.AI
Integral Formulas for Vector Spherical Tensor Products
We derive integral formulas that simplify the Vector Spherical Tensor Product recently introduced by Xie et al., which generalizes the Gaunt tensor product to antisymmetric couplings. In particular, we obtain explicit closed-form expressions for the antisymmetric analogues of the Gaunt coefficients. This enables us to simulate the Clebsch-Gordan tensor product using a single Vector Spherical Tensor Product, yielding a $9\times$ reduction in the required tensor product evaluations. Our results enable efficient and practical implementations of the Vector Spherical Tensor Product, paving the way for applications of this generalization of Gaunt tensor products in $\mathrm{SO}(3)$-equivariant neural networks. Moreover, we discuss how the Gaunt and the Vector Spherical Tensor Products allow one to control the expressivity-runtime tradeoff associated with the usual Clebsch-Gordan Tensor Products. Finally, we investigate low-rank decompositions of the normalizations of the considered tensor products in view of their use in equivariant neural networks.
Updated: 2026-03-09 17:09:28
标题: 矢量球张量积的积分公式
摘要: 我们推导了一个积分公式,简化了谢等人最近引入的矢量球张量积,该公式将高特张量积推广到了反对称耦合。特别地,我们得到了高特系数的反对称模拟的明确闭合形式表达式。这使我们能够使用单个矢量球张量积模拟Clebsch-Gordan张量积,从而将所需的张量积评估减少了$9\times$。我们的结果实现了高效且实用的矢量球张量积,为在$\mathrm{SO}(3)$-等价神经网络中应用高特张量积的这种推广铺平了道路。此外,我们讨论了高特和矢量球张量积如何允许控制与通常的Clebsch-Gordan张量积相关的表达性和运行时之间的权衡。最后,我们调查了所考虑张量积的归一化的低秩分解,以便在等变神经网络中使用。
更新时间: 2026-03-09 17:09:28
领域: cs.LG,physics.comp-ph
Why No Consensus on Consensus? A Deep Dive into Blockchain Consensus Protocols
Blockchain technology has revolutionized the digital landscape, driving innovations across industries through its decentralized and transparent infrastructure. These networks are primarily categorized as public or private, based on user access permissions. Public blockchains are open to all and fully decentralized, while private blockchains restrict access to authorized participants and are usually centralized or partially decentralized. Consensus protocols are at the heart of blockchain networks, playing a pivotal role in maintaining security, ensuring consistency, and achieving agreement among distributed nodes. This paper provides a critical and unified analysis, including detailed workflows, that addresses the limitations of recent literature. Furthermore, this research investigates the strengths and limitations of each protocol, shedding light on their suitability for various applications, including financial transactions, supply chain management, healthcare, and beyond. A critical analysis of ongoing challenges, such as security vulnerabilities, scalability bottlenecks, and energy consumption, is provided. Finally, the paper identifies key research gaps in the field, offering insights into potential areas for future work and emerging trends aimed at addressing these issues. This comprehensive analysis serves as a valuable resource for researchers, practitioners, and organizations seeking to understand the role of consensus protocols in shaping the future of blockchain technology.
Updated: 2026-03-09 17:08:41
标题: 为什么没有关于共识的共识?深入探讨区块链共识协议
摘要: 区块链技术已经彻底改变了数字领域,通过其去中心化和透明的基础设施推动了各行业的创新。这些网络主要分为公共和私有两类,基于用户访问权限。公共区块链对所有人开放,完全去中心化,而私有区块链仅限授权参与者访问,通常是集中化或部分去中心化的。共识协议是区块链网络的核心,对于维护安全性、确保一致性和在分布式节点之间达成一致起着至关重要的作用。本文提供了一项批判性和统一的分析,包括详细的工作流程,以解决最近文献的局限性。此外,本研究调查了每个协议的优势和局限性,阐明了它们在各种应用中的适用性,包括金融交易、供应链管理、医疗保健等领域。提供了对安全漏洞、可扩展性瓶颈和能源消耗等持续挑战的批判性分析。最后,本文确定了该领域的关键研究空白,为未来工作和应对这些问题的新兴趋势提供了见解。这种全面的分析是研究人员、从业者和组织寻求了解共识协议在塑造区块链技术未来的作用的宝贵资源。
更新时间: 2026-03-09 17:08:41
领域: cs.CR
Less is More: On Copy Complexity in Quantum Cryptography
Quantum cryptographic definitions are often sensitive to the number of copies of the cryptographic states revealed to an adversary. Making definitional changes to the number of copies accessible to an adversary can drastically affect various aspects including the computational hardness, feasibility, and applicability of the resulting cryptographic scheme. This phenomenon appears in many places in quantum cryptography, including quantum pseudorandomness and unclonable cryptography. To address this, we present a generic approach to boost single-copy security to multi-copy security and apply this approach to many settings. As a consequence, we obtain the following new results: (1) One-copy stretch pseudorandom state generators (under mild assumptions) imply the existence of t-copy stretch pseudorandom state generators, for any fixed polynomial t. (2) One-query pseudorandom unitaries with short keys (under mild assumptions) imply the existence of t-query pseudorandom unitaries with short keys, for any fixed polynomial t. (3) Assuming indistinguishability obfuscation and other standard cryptographic assumptions, there exist identical-copy secure unclonable primitives such as public-key quantum money and quantum copy-protection.
Updated: 2026-03-09 17:04:32
标题: 少即是多:关于量子密码学中的复制复杂性
摘要: 量子密码学的定义通常对暴露给对手的密码状态的副本数量非常敏感。对对手可访问的副本数量进行定义性更改可以极大地影响各个方面,包括计算难度、可行性和生成的密码方案的适用性。这种现象在量子密码学的许多领域中都出现,包括量子伪随机性和不可克隆密码学。为了解决这个问题,我们提出了一种通用方法来将单副本安全性提升至多副本安全性,并将此方法应用于许多情境。作为结果,我们得到了以下新结果:-一副本拉伸伪随机状态生成器(在温和假设下)意味着存在t副本拉伸伪随机状态生成器,对于任意固定多项式t。-一次查询伪随机酉算子与短密钥(在温和假设下)意味着存在t次查询伪随机酉算子与短密钥,对于任意固定多项式t。-假设不可区分性混淆和其他标准密码学假设,存在相同副本安全的不可克隆原语,如公钥量子货币和量子复制保护。
更新时间: 2026-03-09 17:04:32
领域: quant-ph,cs.CR
Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation Attacks
The rapid deployment of AI systems in high-stakes domains, including those classified as high-risk under the EU AI Act (Regulation (EU) 2024/1689), has intensified the need for reliable compliance auditing. For binary classifiers, regulatory risk assessment often relies on global fairness metrics such as the Disparate Impact ratio, widely used to evaluate potential discrimination. In typical auditing settings, the auditee provides a subset of its dataset to an auditor, while a supervisory authority may verify whether this subset is representative of the full underlying distribution. In this work, we investigate to what extent a malicious auditee can construct a fairness-compliant yet representative-looking sample from a non-compliant original distribution, thereby creating an illusion of fairness. We formalize this problem as a constrained distributional projection task and introduce mathematically grounded manipulation strategies based on entropic and optimal transport projections. These constructions characterize the minimal distributional shift required to satisfy fairness constraints. To counter such attacks, we formalize representativeness through statistical tests based on distributional distances and systematically evaluate their ability to detect manipulated samples. Our analysis highlights the conditions under which fairness manipulation can remain statistically undetected and provides practical guidelines for strengthening supervisory verification. We validate our theoretical findings through experiments on standard tabular datasets for bias detection. Code is publicly available at https://github.com/ValentinLafargue/Inspection.
Updated: 2026-03-09 17:01:59
标题: 揭示公平的幻觉:审计分配操纵攻击的漏洞
摘要: AI系统在高风险领域的快速部署,包括在欧盟AI法案(Regulation (EU) 2024/1689)下被归类为高风险的领域,加剧了对可靠合规审计的需求。对于二元分类器,监管风险评估通常依赖于全局公平性指标,如不同影响比例,被广泛用于评估潜在的歧视。在典型的审计设置中,被审计方向审计员提供其数据集的子集,而监管机构可能验证该子集是否代表完整的基础分布。在本研究中,我们调查了一个恶意的被审计方在不符合原始分布的情况下能够构建一个符合公平性标准且具有代表性外观的样本,从而制造公平性的幻觉。我们将这个问题形式化为一个受限的分布投影任务,并引入了基于熵和最优传输投影的数学基础的操纵策略。这些构造描述了满足公平性约束所需的最小分布变化。为了对抗这种攻击,我们通过基于分布距离的统计检验形式化代表性,并系统评估它们检测操纵样本的能力。我们的分析突出了公平性操纵可能保持统计上未被检测的条件,并为加强监管验证提供了实用指南。我们通过在标准表格数据集上进行偏倚检测实验证实了我们的理论发现。代码可在https://github.com/ValentinLafargue/Inspection上公开获取。
更新时间: 2026-03-09 17:01:59
领域: cs.LG,math.OC,stat.AP
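Illustration for the entry above: the Disparate Impact ratio mentioned in the abstract is commonly defined as the positive-prediction rate of one group divided by that of a reference group. The following is a minimal sketch under that common definition, not code from the paper's repository; the function name `disparate_impact` and the 0/1 group encoding (0 for the unprivileged group, 1 for the reference group) are assumptions made for this example.

```python
from typing import Sequence


def disparate_impact(y_pred: Sequence[int], group: Sequence[int]) -> float:
    """Positive-prediction rate of group 0 divided by that of group 1.

    Assumed encoding (hypothetical, for illustration only):
    y_pred[i] is 0 or 1; group[i] is 0 (unprivileged) or 1 (reference).
    """
    n0 = sum(1 for g in group if g == 0)
    n1 = sum(1 for g in group if g == 1)
    pos0 = sum(1 for y, g in zip(y_pred, group) if g == 0 and y == 1)
    pos1 = sum(1 for y, g in zip(y_pred, group) if g == 1 and y == 1)
    if n0 == 0 or n1 == 0 or pos1 == 0:
        raise ValueError("ratio undefined: empty group or zero reference rate")
    return (pos0 / n0) / (pos1 / n1)


# Toy audit sample: 1 of 4 positives in group 0, 3 of 4 in group 1.
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
print(disparate_impact(y_pred, group))
```

Because the metric depends only on these group-wise rates, a subsample that shifts them can change the ratio, which is the kind of manipulation the abstract studies.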
Coverage-Guided Multi-Agent Harness Generation for Java Library Fuzzing
Coverage-guided fuzzing has proven effective for software testing, but targeting library code requires specialized fuzz harnesses that translate fuzzer-generated inputs into valid API invocations. Manual harness creation is time-consuming and requires deep understanding of API semantics, initialization sequences, and exception handling contracts. We present a multi-agent architecture that automates fuzz harness generation for Java libraries through specialized LLM-powered agents. Five ReAct agents decompose the workflow into research, synthesis, compilation repair, coverage analysis, and refinement. Rather than preprocessing entire codebases, agents query documentation, source code, and callgraph information on demand through the Model Context Protocol, maintaining focused context while exploring complex dependencies. To enable effective refinement, we introduce method-targeted coverage that tracks coverage only during target method execution to isolate target behavior, and agent-guided termination that examines uncovered source code to distinguish productive refinement opportunities from diminishing returns. We evaluated our approach on seven target methods from six widely-deployed Java libraries totaling 115,000+ Maven dependents. Our generated harnesses achieve a median 26% improvement over OSS-Fuzz baselines and outperform Jazzer AutoFuzz by 5% in package-scope coverage. Generation costs average 3.20 USD and 10 minutes per harness, making the approach practical for continuous fuzzing workflows. During a 12-hour fuzzing campaign, our generated harnesses discovered 3 bugs in projects that are already integrated into OSS-Fuzz, demonstrating the effectiveness of the generated harnesses.
Updated: 2026-03-09 16:59:30
标题: 使用覆盖引导的多代理器生成器进行Java库模糊测试
摘要: 覆盖引导模糊测试已被证明对软件测试有效,但针对库代码需要专门的模糊测试框架,将由模糊测试器生成的输入转化为有效的API调用。手动创建模糊测试框架耗时且需要深入理解API语义、初始化序列和异常处理协议。我们提出了一个多代理体系结构,通过专门的LLM驱动代理自动化生成Java库的模糊测试框架。五个ReAct代理将工作流程分解为研究、综合、编译修复、覆盖分析和改进。代理不是预处理整个代码库,而是通过模型上下文协议按需查询文档、源代码和调用图信息,保持专注的上下文同时探索复杂的依赖关系。为了实现有效的改进,我们引入了方法定向覆盖,仅在目标方法执行期间跟踪覆盖率以隔离目标行为,并引入了代理引导终止,检查未覆盖的源代码以区分有益的改进机会和递减的回报。我们在六个广泛部署的Java库中的七个目标方法上评估了我们的方法,总计有115,000多个Maven依赖项。我们生成的模糊测试框架与OSS-Fuzz基线相比,中位数提高了26%,在包范围覆盖率方面比Jazzer AutoFuzz高出5%。生成成本平均为每个模糊测试框架3.20美元和10分钟,使该方法适用于连续的模糊测试工作流程。在为期12小时的模糊测试活动中,我们生成的模糊测试框架发现了已经集成到OSS-Fuzz中的项目中的3个漏洞,证明了生成的模糊测试框架的有效性。
更新时间: 2026-03-09 16:59:30
领域: cs.SE,cs.CR
Micro-Diffusion Compression -- Binary Tree Tweedie Denoising for Online Probability Estimation
We present Midicoth, a lossless compression system that introduces a micro-diffusion denoising layer for improving probability estimates produced by adaptive statistical models. In compressors such as Prediction by Partial Matching (PPM), probability estimates are smoothed by a prior to handle sparse observations. When contexts have been seen only a few times, this prior dominates the prediction and produces distributions that are significantly flatter than the true source distribution, leading to compression inefficiency. Midicoth addresses this limitation by treating prior smoothing as a shrinkage process and applying a reverse denoising step that corrects predicted probabilities using empirical calibration statistics. To make this correction data-efficient, the method decomposes each byte prediction into a hierarchy of binary decisions along a bitwise tree. This converts a single 256-way calibration problem into a sequence of binary calibration tasks, enabling reliable estimation of correction terms from relatively small numbers of observations. The denoising process is applied in multiple successive steps, allowing each stage to refine residual prediction errors left by the previous one. The micro-diffusion layer operates as a lightweight post-blend calibration stage applied after all model predictions have been combined, allowing it to correct systematic biases in the final probability distribution. Midicoth combines five fully online components: an adaptive PPM model, a long-range match model, a trie-based word model, a high-order context model, and the micro-diffusion denoiser applied as the final stage.
Updated: 2026-03-09 16:59:24
标题: 微扩散压缩--用于在线概率估计的二叉树Tweedie去噪
摘要: 我们提出了Midicoth,这是一个无损压缩系统,引入了微扩散去噪层,以改善自适应统计模型产生的概率估计。在像部分匹配预测(PPM)这样的压缩器中,概率估计通过先验进行平滑处理,以处理稀疏观测。当上下文只见过几次时,这种先验会主导预测,并产生比真实源分布显着平坦的分布,导致压缩效率低下。Midicoth通过将先验平滑视为收缩过程,并应用反向去噪步骤来使用经验校准统计数据纠正预测概率,从而解决了这一限制。为了使这种校正数据有效,该方法将每个字节预测分解为沿着比特树的二进制决策层次结构。这将一个256路校准问题转换为一系列二进制校准任务,从而能够可靠地从相对较少的观测中估计校正项。去噪过程在多个连续步骤中应用,允许每个阶段调整前一个阶段留下的残余预测误差。微扩散层作为一个轻量级后混合校准阶段操作,在所有模型预测已经组合之后应用,使其能够校正最终概率分布中的系统偏差。Midicoth结合了五个完全在线组件:自适应PPM模型,长距离匹配模型,基于trie的词模型,高阶上下文模型,以及作为最终阶段应用的微扩散去噪器。
更新时间: 2026-03-09 16:59:24
领域: stat.ML,cs.IT,cs.LG
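Illustration for the entry above: the abstract describes decomposing each byte prediction into a hierarchy of binary decisions along a bitwise tree. The sketch below shows only that decomposition step, not Midicoth's calibration or denoising, and it is not the authors' code. The function name `bit_path_probs` and the most-significant-bit-first ordering are assumptions; the input distribution is assumed strictly positive, as a smoothed estimate would be.

```python
import math


def bit_path_probs(p, byte):
    """Split P(byte) into 8 conditional bit probabilities, MSB first.

    p is a length-256 distribution with strictly positive entries.
    Entry k of the result is P(bit k of `byte` | the higher bits of `byte`).
    """
    probs = []
    lo, hi = 0, 256  # byte values still consistent with the bits seen so far
    for k in range(8):
        mid = (lo + hi) // 2
        total = sum(p[lo:hi])
        p_one = sum(p[mid:hi]) / total  # mass of the subtree where the bit is 1
        bit = (byte >> (7 - k)) & 1
        probs.append(p_one if bit else 1.0 - p_one)
        if bit:
            lo = mid
        else:
            hi = mid
    return probs


# Toy distribution: probability proportional to (value + 1).
weights = [i + 1 for i in range(256)]
norm = sum(weights)
p = [w / norm for w in weights]

path = bit_path_probs(p, 65)
# The product of the 8 binary probabilities recovers the byte probability.
print(len(path), math.prod(path), p[65])
```

Each of the 8 factors is a binary quantity, so a correction can be estimated per tree node from far fewer observations than a 256-way table would need, which is the data-efficiency argument in the abstract.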
Hinge Regression Tree: A Newton Method for Oblique Regression Tree Splitting
Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries, but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics. We present the Hinge Regression Tree (HRT), which reframes each split as a non-linear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like expressive power. The resulting alternating fitting procedure is exactly equivalent to a damped Newton (Gauss-Newton) method within fixed partitions. We analyze this node-level optimization and, for a backtracking line-search variant, prove that the local objective decreases monotonically and converges; in practice, both fixed and adaptive damping yield fast, stable convergence and can be combined with optional ridge regularization. We further prove that HRT's model class is a universal approximator with an explicit $O(δ^2)$ approximation rate, and show on synthetic and real-world benchmarks that it matches or outperforms single-tree baselines with more compact structures.
Updated: 2026-03-09 16:55:02
标题: 铰链回归树:用于斜向回归树分裂的牛顿方法
更新时间: 2026-03-09 16:55:02
领域: cs.LG
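Illustration for the Hinge Regression Tree entry above: the abstract describes a split as a least-squares fit over two linear predictors combined by a max/min envelope, fitted by an alternating procedure that acts as a damped Newton step while the partition is fixed. The sketch below is a one-dimensional toy version under several assumptions: only the max envelope, fixed damping, and hypothetical names `fit_line`, `fit_hinge`, and `predict`. It is not the authors' implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def fit_hinge(xs, ys, init, damping=1.0, iters=20):
    """Fit y ~ max(a1*x + b1, a2*x + b2) by alternating two steps:
    (1) assign each point to the line that is currently active there,
    (2) move each line part of the way (damping) toward its
        least-squares fit on the points assigned to it.
    """
    (a1, b1), (a2, b2) = init
    for _ in range(iters):
        first = [(x, y) for x, y in zip(xs, ys) if a1 * x + b1 >= a2 * x + b2]
        second = [(x, y) for x, y in zip(xs, ys) if a1 * x + b1 < a2 * x + b2]
        if len(first) < 2 or len(second) < 2:
            break  # degenerate partition; a real method needs a fallback here
        t1 = fit_line(*zip(*first))
        t2 = fit_line(*zip(*second))
        a1, b1 = a1 + damping * (t1[0] - a1), b1 + damping * (t1[1] - b1)
        a2, b2 = a2 + damping * (t2[0] - a2), b2 + damping * (t2[1] - b2)
    return (a1, b1), (a2, b2)


def predict(params, x):
    (a1, b1), (a2, b2) = params
    return max(a1 * x + b1, a2 * x + b2)


# Toy target: y = |x|, which is exactly max(-x, x).
xs = [-2.0 + 0.25 * i for i in range(17)]
ys = [abs(x) for x in xs]
params = fit_hinge(xs, ys, init=((-0.5, 0.0), (0.5, 0.0)), damping=0.5)
print(params, predict(params, 1.5))
```

With the partition held fixed the objective is quadratic in each line's parameters, so the full step to the least-squares solution is a Newton step and `damping` below 1 takes only part of it.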
Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation
Background and objectives: Colorectal cancer histopathological grading depends on accurate segmentation of glandular structures. Current deep learning approaches rely on large-scale pixel-level annotations that are labor-intensive and difficult to obtain in routine clinical practice. Weakly supervised semantic segmentation offers a promising alternative. However, class-activation-map-based methods often produce incomplete pseudo-masks that emphasize highly discriminative regions and fail to supervise unannotated glandular structures. We propose a weakly supervised teacher-student framework that leverages sparse pathologist annotations and an Exponential Moving Average stabilized teacher network to generate refined pseudo-masks. Methods: The framework integrates confidence-based filtering, adaptive fusion of teacher predictions with limited ground truth, and curriculum-guided refinement to progressively segment unannotated glandular regions. The method was evaluated on an institutional colorectal cancer cohort from The Ohio State University Wexner Medical Center consisting of 60 hematoxylin-and-eosin-stained whole-slide images and on public datasets including the Gland Segmentation dataset, TCGA COAD, TCGA READ, and SPIDER. Results: On the Gland Segmentation dataset, the framework achieved a mean Intersection over Union of 80.10 and a mean Dice coefficient of 89.10. Cross-cohort evaluation demonstrated robust generalization on TCGA COAD and TCGA READ without additional annotations, while reduced performance on SPIDER reflected domain shift. Conclusions: The proposed framework provides an annotation-efficient and generalizable approach for gland segmentation in colorectal histopathology.
Updated: 2026-03-09 16:54:05
标题: 使用渐进伪掩码细化的弱监督师生框架进行腺体分割
摘要: 背景和目标:结直肠癌组织病理分级取决于对腺体结构的准确分割。当前的深度学习方法依赖于大规模的像素级注释,这在日常临床实践中是费时费力且难以获得的。弱监督语义分割提供了一个有前途的替代方案。然而,基于类激活映射的方法通常会产生不完整的伪蒙版,强调高度区分性的区域,并无法监督未注释的腺体结构。我们提出了一个弱监督的师生框架,利用稀疏的病理学家注释和指数移动平均稳定的师生网络来生成精细化的伪蒙版。 方法:该框架整合了基于置信度的过滤、师生预测与有限真值的自适应融合,以及课程引导的细化,逐步分割未注释的腺体区域。该方法在俄亥俄州立大学维克斯纳医学中心的一个结直肠癌队列中进行了评估,包括60张用伊红染和伊森染色的全玻片图像,以及公共数据集包括腺体分割数据集、TCGA COAD、TCGA READ和SPIDER。 结果:在腺体分割数据集上,该框架实现了80.10的平均交集联合和89.10的平均Dice系数。跨队列评估显示,在没有额外注释的情况下,在TCGA COAD和TCGA READ上具有强大的泛化能力,而在SPIDER上表现较差反映了领域转移。 结论:所提出的框架提供了一种在结直肠病理学中进行腺体分割的注释有效和可泛化的方法。
更新时间: 2026-03-09 16:54:05
领域: cs.CV,cs.AI
Don't Look Back in Anger: MAGIC Net for Streaming Continual Learning with Temporal Dependence
Concept drift, temporal dependence, and catastrophic forgetting represent major challenges when learning from data streams. While Streaming Machine Learning and Continual Learning (CL) address these issues separately, recent efforts in Streaming Continual Learning (SCL) aim to unify them. In this work, we introduce MAGIC Net, a novel SCL approach that integrates CL-inspired architectural strategies with recurrent neural networks to tame temporal dependence. MAGIC Net continuously learns, looks back at past knowledge by applying learnable masks over frozen weights, and expands its architecture when necessary. It performs all operations online, ensuring inference availability at all times. Experiments on synthetic and real-world streams show that it improves adaptation to new concepts, limits memory usage, and mitigates forgetting.
Updated: 2026-03-09 16:49:04
标题: 不要愤怒地回头看:带有时间依赖性的MAGIC Net用于流式连续学习
摘要: 概念漂移、时间依赖和灾难性遗忘是从数据流中学习时面临的主要挑战。虽然流式机器学习和持续学习(CL)分别解决了这些问题,但最近的努力将它们统一起来,形成了流式持续学习(SCL)。在这项工作中,我们介绍了一种新颖的SCL方法MAGIC Net,它将受持续学习启发的架构策略与循环神经网络相结合,以控制时间依赖。MAGIC Net持续学习,通过在冻结权重上应用可学习的掩码来回顾过去的知识,并在必要时扩展其架构。它在线执行所有操作,确保始终可用推断。对合成和真实数据流进行的实验表明,它改善了对新概念的适应性,限制了内存使用,并减轻了遗忘的影响。
更新时间: 2026-03-09 16:49:04
领域: cs.LG,cs.AI
Towards Batch-to-Streaming Deep Reinforcement Learning for Continuous Control
State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious hyperparameter tuning. Finally, we further investigate the practical challenges of transitioning from batch to streaming learning during finetuning and propose concrete strategies to tackle them.
Updated: 2026-03-09 16:40:06
标题: 朝向连续控制的批处理到流式深度强化学习
摘要: 最先进的深度强化学习(RL)方法在连续控制任务中取得了显著的性能,然而它们的计算复杂性通常与资源有限的硬件的约束不兼容,因为它们依赖于重播缓冲区、批量更新和目标网络。新兴的流式深度RL范式通过纯在线更新来解决这一限制,在标准基准测试中取得了强大的实证性能。在这项工作中,我们提出了两种新颖的流式深度RL算法,Streaming Soft Actor-Critic(S2AC)和Streaming Deterministic Actor-Critic(SDAC),明确设计为与最先进的批量RL方法兼容,使它们特别适用于设备上的微调应用,如Sim2Real转移。这两种算法在标准基准测试中实现了与最先进的流式基线相当的性能,而无需繁琐的超参数调整。最后,我们进一步研究了从批量学习过渡到流式学习在微调过程中的实际挑战,并提出了具体策略来解决这些挑战。
更新时间: 2026-03-09 16:40:06
领域: cs.LG,cs.AI
DualFlexKAN: Dual-stage Kolmogorov-Arnold Networks with Independent Function Control
Multi-Layer Perceptrons (MLPs) rely on pre-defined, fixed activation functions, imposing a static inductive bias that forces the network to approximate complex topologies solely through increased depth and width. Kolmogorov-Arnold Networks (KANs) address this limitation through edge-centric learnable functions, yet their formulation suffers from quadratic parameter scaling and architectural rigidity that hinders the effective integration of standard regularization techniques. This paper introduces the DualFlexKAN (DFKAN), a flexible architecture featuring a dual-stage mechanism that independently controls pre-linear input transformations and post-linear output activations. This decoupling enables hybrid networks that optimize the trade-off between expressiveness and computational cost. Unlike standard formulations, DFKAN supports diverse basis function families, including orthogonal polynomials, B-splines, and radial basis functions, integrated with configurable regularization strategies that stabilize training dynamics. Comprehensive evaluations across regression benchmarks, physics-informed tasks, and function approximation demonstrate that DFKAN outperforms both MLPs and conventional KANs in accuracy, convergence speed, and gradient fidelity. The proposed hybrid configurations achieve superior performance with one to two orders of magnitude fewer parameters than standard KANs, effectively mitigating the parameter explosion problem while preserving KAN-style expressiveness. DFKAN provides a principled, scalable framework for incorporating adaptive non-linearities, proving particularly advantageous for data-efficient learning and interpretable function discovery in scientific applications.
Updated: 2026-03-09 16:36:04
标题: DualFlexKAN:具有独立功能控制的双阶段科尔莫戈洛夫-阿诺德网络
摘要: 多层感知器(MLPs)依赖于预定义的固定激活函数,施加了一种静态归纳偏差,迫使网络仅通过增加深度和宽度来逼近复杂拓扑结构。Kolmogorov-Arnold网络(KANs)通过以边为中心的可学习函数来解决这一限制,然而它们的形式化存在二次参数缩放和架构刚度的问题,这阻碍了标准正则化技术的有效整合。本文介绍了DualFlexKAN(DFKAN),这是一种灵活的架构,具有独立控制预线性输入变换和后线性输出激活的双阶段机制。这种解耦使得可以优化表达力和计算成本之间的权衡的混合网络。与标准形式不同,DFKAN支持多种基函数家族,包括正交多项式、B样条和径向基函数,结合可配置的正则化策略,稳定训练动力学。在回归基准、物理相关任务和函数逼近方面的综合评估表明,DFKAN在准确性、收敛速度和梯度保真度方面优于MLPs和传统KANs。所提出的混合配置比标准KANs少一个到两个数量级的参数,有效缓解了参数爆炸问题,同时保持了KAN风格的表达能力。DFKAN为融合自适应非线性提供了一个原则性、可扩展的框架,在科学应用中对数据高效学习和可解释函数发现特别有利。
更新时间: 2026-03-09 16:36:04
领域: cs.LG,cs.CV
SmartGraphical: A Human-in-the-Loop Framework for Detecting Smart Contract Logical Vulnerabilities via Pattern-Driven Static Analysis and Visual Abstraction
Smart contracts are fundamental components of blockchain ecosystems; however, their security remains a critical concern due to inherent vulnerabilities. While existing detection methodologies are predominantly syntax-oriented, targeting reentrancy and arithmetic errors, they often overlook logical flaws arising from defective business logic. This paper introduces SmartGraphical, a novel security framework specifically engineered to identify logical attack surfaces. By synthesizing automated static analysis with an interactive graphical representation of contract architectures, SmartGraphical facilitates a comprehensive inspection of a contract's functional control flow. To mitigate the context-dependent nature of logical bugs, the tool adopts a human-in-the-loop approach, empowering developers to interpret heuristic warnings within a visualized structural context. The efficacy of SmartGraphical was validated through a rigorous empirical evaluation involving a large dataset of real-world contracts and a large-scale user study with 100 developers of varying expertise. Furthermore, the framework's performance was demonstrated through case studies on high-profile exploits, such as the SYFI rebase failure and farming protocol flash swap attacks, proving that SmartGraphical identifies intricate vulnerabilities that elude state-of-the-art automated detectors. Our findings indicate that this hybrid methodology significantly enhances the interpretability and detection rate of non-trivial logical security threats in smart contracts.
Updated: 2026-03-09 16:35:17
标题: 智能图形化:一种通过基于模式驱动的静态分析和视觉抽象检测智能合约逻辑漏洞的人机协作框架
摘要: 智能合约是区块链生态系统的基本组件;然而,由于固有漏洞,它们的安全性仍然是一个关键问题。虽然现有的检测方法主要是基于语法的,针对重入和算术错误,但它们通常忽视由于有缺陷的业务逻辑而产生的逻辑缺陷。本文介绍了SmartGraphical,这是一个专门设计用于识别逻辑攻击面的新型安全框架。通过将自动静态分析与合同架构的交互式图形表示综合,SmartGraphical促进了对合同功能控制流的全面检查。为了减轻逻辑错误的上下文依赖性,该工具采用了一种人在环路方法,使开发人员能够在可视化的结构上下文中解释启发式警告。通过对真实世界合同的大型数据集进行严格的实证评估以及与100名不同专业知识水平开发人员进行的大规模用户研究,验证了SmartGraphical的有效性。此外,通过对SYFI重新基准失败和农业协议闪电交换攻击等重大漏洞的案例研究,证明SmartGraphical能够识别超越现有自动检测器的复杂漏洞。我们的研究结果表明,这种混合方法显著提高了智能合约中非平凡逻辑安全威胁的可解释性和检测率。
更新时间: 2026-03-09 16:35:17
领域: cs.CR
Drift-to-Action Controllers: Budgeted Interventions with Online Risk Certificates
Deployed machine learning systems face distribution drift, yet most monitoring pipelines stop at alarms and leave the response underspecified under labeling, compute, and latency constraints. We introduce Drift2Act, a drift-to-action controller that treats monitoring as constrained decision-making with explicit safety. Drift2Act combines a sensing layer that maps unlabeled monitoring signals to a belief over drift types with an active risk certificate that queries a small set of delayed labels from a recent window to produce an anytime-valid upper bound $U_t(δ)$ on current risk. The certificate gates operation: if $U_t(δ) \le τ$, the controller selects low-cost actions (e.g., recalibration or test-time adaptation); if $U_t(δ) > τ$, it activates abstain/handoff and escalates to rollback or retraining under cooldowns. In a realistic streaming protocol with label delay and explicit intervention costs, Drift2Act achieves near-zero safety violations and fast recovery at moderate cost on WILDS Camelyon17, DomainNet, and a controlled synthetic drift stream, outperforming alarm-only monitoring, adapt-always adaptation, schedule-based retraining, selective prediction alone, and an ablation without certification. Overall, online risk certification enables reliable drift response and reframes monitoring as decision-making with safety.
Updated: 2026-03-09 16:34:12
标题: 漂移至行动控制器:具有在线风险证书的预算干预
摘要: 部署的机器学习系统面临分布漂移,然而大多数监控管道仅在警报停止,并未明确指定标签、计算和延迟约束下的响应。我们引入了Drift2Act,一个将监控视为受限决策制定的漂移至行动控制器,具有明确的安全性。Drift2Act结合了一个感知层,将未标记的监控信号映射到漂移类型的信念上,并具有一个主动风险证书,从最近窗口中查询一小组延迟标签,以在当前风险上产生任何时候有效的上界$U_t(δ)$。证书控制操作:如果$U_t(δ) \le τ$,控制器选择低成本的操作(例如,重新校准或测试时间适应);如果$U_t(δ) > τ$,它将激活弃权/移交,并在冷却期内升级为回滚或重新训练。在具有标签延迟和明确干预成本的现实流协议中,Drift2Act在WILDS Camelyon17、DomainNet和一个受控的合成漂移流上实现了接近零的安全违规和中等成本下的快速恢复,优于仅具有警报的监控、适应所有适应、基于时间表的重新训练、仅选择性预测以及没有认证的消融。总的来说,在线风险认证使得可靠漂移响应成为可能,并重新构建监控为具有安全性的决策制定。
更新时间: 2026-03-09 16:34:12
领域: cs.LG,cs.CL
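The certificate gate described in the Drift2Act abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes losses bounded in [0, 1] and substitutes a fixed-sample Hoeffding bound for the paper's anytime-valid certificate, which the abstract does not specify. The function names and action labels are hypothetical.

```python
import math


def hoeffding_upper_bound(losses, delta):
    """Upper confidence bound on the mean of losses in [0, 1].

    Assumption: a fixed-sample Hoeffding bound stands in for the
    anytime-valid certificate U_t(delta) named in the abstract.
    """
    n = len(losses)
    if n == 0:
        return 1.0  # no labels queried yet: fall back to the trivial bound
    mean = sum(losses) / n
    return min(1.0, mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


def gate(upper_bound, tau):
    """Pick an action tier from the certificate (hypothetical labels)."""
    if upper_bound <= tau:
        return "low_cost_adaptation"   # e.g. recalibration, test-time adaptation
    return "abstain_and_escalate"      # e.g. handoff, then rollback or retraining


# Usage: 200 recent delayed labels, all correct, with threshold tau = 0.1
u = hoeffding_upper_bound([0.0] * 200, delta=0.05)
action = gate(u, tau=0.1)
```

A bound that holds simultaneously over all time steps would be wider than this one; the sketch only shows how the threshold τ separates the two action tiers.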
Trust via Reputation of Conviction
The question of \emph{knowledge}, \emph{truth} and \emph{trust} is explored via a mathematical formulation of claims and sources. We define truth as the reproducibly perceived subset of knowledge, formalize sources as having both generative and discriminative roles, and develop a framework for reputation grounded in the \emph{conviction} -- the likelihood that a source's stance is vindicated by independent consensus. We argue that conviction, rather than correctness or faithfulness, is the principled basis for trust: it is regime-independent, rewards genuine contribution, and demands the transparent and self-sufficient perceptions that make external verification possible. We formalize reputation as the expected weighted signed conviction over a realm of claims, characterize its behavior across source-claim regimes, and identify continuous verification as both a theoretical necessity and a practical mechanism through which reputation accrues. The framework is applied to AI agents, which are identified as capable but error-prone sources for whom verifiable conviction and continuously accrued reputation constitute the only robust foundation for trust.
Updated: 2026-03-09 16:30:33
标题: 信任通过信仰的声誉
摘要: 通过对声明和来源的数学形式化,探讨了关于\emph{知识}、\emph{真相}和\emph{信任}的问题。我们将真相定义为可复现的知识子集,将来源形式化为具有生成和判别作用,并开发了一个以\emph{信念}为基础的声誉框架——即来源的立场被独立共识证实的可能性。我们认为,信念而不是正确性或忠实度,是信任的原则基础:它不依赖于制度,奖励真正的贡献,并要求透明和自给自足的感知,从而使外部验证成为可能。我们将声誉形式化为在一系列声明领域中期望的加权签署信念,描述其在来源-声明制度中的行为,并确定连续验证既是理论上的必要性,也是声誉积累的实际机制。该框架应用于AI代理,识别出它们是能力强但容易出错的来源,可验证的信念和持续积累的声誉构成了唯一坚实的信任基础。
更新时间: 2026-03-09 16:30:33
领域: cs.AI,cs.LG
MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation
Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of Specialized Expert Policies (SEPs). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
Updated: 2026-03-09 16:28:26
标题: MetaWorld-X:通过VLM-编排专家进行分层世界建模,用于人形机器人的定位和操纵
摘要: 学习自然、稳定且具有组合泛化能力的全身控制策略,用于执行同时进行 locomotion 和 manipulation(loco-manipulation)的人形机器人仍然是机器人领域的一个基本挑战。现有的强化学习方法通常依赖于单个整合策略来获取多种技能,这往往会导致在自由度较高的系统中跨技能梯度干扰和运动模式冲突。因此,生成的行为经常表现出不自然的动作、有限的稳定性和对复杂任务组合的泛化能力差。为了解决这些限制,我们提出了一个用于人形控制的分层世界模型框架 MetaWorld-X。在分割和征服原则的指导下,我们的方法将复杂的控制问题分解为一组专门的专家策略(Specialized Expert Policies,SEP)。每个专家都是在人类运动先验知识的指导下通过模仿约束强化学习进行训练的,引入了生物力学一致的归纳偏见,确保自然和物理上合理的动作生成。在此基础上,我们进一步开发了一个由视觉-语言模型(VLM)监督的智能路由机制(IRM),实现了语义驱动的专家组合。VLM引导的路由器根据高层任务语义动态集成专家策略,促进了在多阶段 loco-manipulation 任务中的组合泛化和自适应执行。
更新时间: 2026-03-09 16:28:26
领域: cs.RO,cs.AI
GDR-learners: Orthogonal Learning of Generative Models for Potential Outcomes
Various deep generative models have been proposed to estimate potential outcomes distributions from observational data. However, none of them have the favorable theoretical property of general Neyman-orthogonality and, associated with it, quasi-oracle efficiency and double robustness. In this paper, we introduce a general suite of generative Neyman-orthogonal (doubly-robust) learners that estimate the conditional distributions of potential outcomes. Our proposed generative doubly-robust learners (GDR-learners) are flexible and can be instantiated with many state-of-the-art deep generative models. In particular, we develop GDR-learners based on (a) conditional normalizing flows (which we call GDR-CNFs), (b) conditional generative adversarial networks (GDR-CGANs), (c) conditional variational autoencoders (GDR-CVAEs), and (d) conditional diffusion models (GDR-CDMs). Unlike the existing methods, our GDR-learners possess the properties of quasi-oracle efficiency and rate double robustness, and are thus asymptotically optimal. In a series of (semi-)synthetic experiments, we demonstrate that our GDR-learners are very effective and outperform the existing methods in estimating the conditional distributions of potential outcomes.
Updated: 2026-03-09 16:28:06
标题: GDR-学习者:潜在结果的生成模型的正交学习
摘要: 已经提出了各种深度生成模型,用于从观测数据中估计潜在结果分布。然而,它们中没有一个具有一般的奈曼正交性的有利理论特性,以及与之相关的准神谕效率和双重稳健性。在本文中,我们引入了一套通用的生成奈曼正交(双重稳健)学习器,用于估计潜在结果的条件分布。我们提出的生成双重稳健学习器(GDR-learners)灵活多样,并可以实例化为许多最先进的深度生成模型。特别地,我们基于(a)条件归一化流(我们称之为GDR-CNFs)、(b)条件生成对抗网络(GDR-CGANs)、(c)条件变分自动编码器(GDR-CVAEs)和(d)条件扩散模型(GDR-CDMs)开发了GDR-learners。与现有方法不同,我们的GDR-learners具有准神谕效率和速率双重稳健性的特性,因此在渐近上是最优的。在一系列(半)合成实验中,我们证明了我们的GDR-learners非常有效,并在估计潜在结果的条件分布方面优于现有方法。
更新时间: 2026-03-09 16:28:06
领域: cs.LG,stat.ML
OSS-CRS: Liberating AIxCC Cyber Reasoning Systems for Real-World Open-Source Security
DARPA's AI Cyber Challenge (AIxCC) showed that cyber reasoning systems (CRSs) can go beyond vulnerability discovery to autonomously confirm and patch bugs: seven teams built such systems and open-sourced them after the competition. Yet all seven open-sourced CRSs remain largely unusable outside their original teams, each bound to the competition cloud infrastructure that no longer exists. We present OSS-CRS, an open, locally deployable framework for running and combining CRS techniques against real-world open-source projects, with budget-aware resource management. We ported the first-place system (Atlantis) and discovered 10 previously unknown bugs (three of high severity) across 8 OSS-Fuzz projects. OSS-CRS is publicly available.
Updated: 2026-03-09 16:26:33
标题: OSS-CRS:解放AIxCC网络推理系统,用于现实世界的开源安全
摘要: DARPA的AI网络挑战(AIxCC)表明,网络推理系统(CRSs)可以超越漏洞发现,自主确认和修补错误:七个团队建立了这种系统,并在比赛后开源了它们。然而,所有七个开源的CRSs在其原始团队之外仍然基本无法使用,每个系统都与不再存在的比赛云基础设施绑定。我们提出了OSS-CRS,一个用于运行和结合CRS技术对真实世界开源项目进行测试的开放、本地部署框架,具有预算感知的资源管理。我们迁移了第一名系统(Atlantis)并在8个OSS-Fuzz项目中发现了10个之前未知的错误(其中三个严重)。OSS-CRS是公开可用的。
更新时间: 2026-03-09 16:26:33
领域: cs.CR,cs.AI
RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback
Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit within parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that tracks incremental subtask completion relative to prior attempts, rewarding promising explorations, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy balancing relevance, utility, and exploration to effectively leverage past experiences. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods, achieving state-of-the-art results -- e.g., surpassing Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper -- while exhibiting strong test-time adaptation and generalization to out-of-distribution scenarios.
Updated: 2026-03-09 16:23:33
标题: RetroAgent: 通过回顾性双内在反馈从解决到演化
摘要: 基于大型语言模型(LLM)的代理人通过强化学习(RL)训练在复杂的互动任务上展现出强大的潜力。然而,标准的RL范式更偏向于静态问题解决而非持续适应:代理人通常会因为探索不足而收敛到次优策略,同时学到的知识仅仅存在于参数之中而无法显性检索,限制了有效的经验学习。为了解决这些限制,我们引入了RetroAgent,一个在线RL框架,让代理人不仅通过解决问题而且通过演变来掌握复杂的互动环境。具体来说,RetroAgent具有一种事后自我反思机制,产生双重内在反馈:(1)内在数值反馈跟踪相对于先前尝试的增量子任务完成情况,奖励有前途的探索,(2)内在语言反馈将可重复使用的教训提炼到一个内存缓冲区中,通过我们提出的相似性和效用感知上置信界限(SimUtil-UCB)策略平衡相关性、效用和探索,有效利用过去的经验。在四个具有挑战性的代理任务上对两个模型系列进行的大量实验表明,RetroAgent显著优于现有方法,实现了最先进的结果——例如,在ALFWorld上超过了Group Relative Policy Optimization(GRPO)训练的代理人18.3%,在WebShop上超过了15.4%,在Sokoban上超过了27.1%,在MineSweeper上超过了8.9%——同时表现出强大的测试时间适应能力和对分布之外场景的泛化能力。
更新时间: 2026-03-09 16:23:33
领域: cs.AI
The Role of Feature Interactions in Graph-based Tabular Deep Learning
Accurate predictions on tabular data rely on capturing complex, dataset-specific feature interactions. Attention-based methods and graph neural networks, referred to as graph-based tabular deep learning (GTDL), aim to improve predictions by modeling these interactions as a graph. In this work, we analyze how these methods model the feature interactions. Current GTDL approaches primarily focus on optimizing predictive accuracy, often neglecting the accurate modeling of the underlying graph structure. Using synthetic datasets with known ground-truth graph structures, we find that current GTDL methods fail to recover meaningful feature interactions, as their edge recovery is close to random. This suggests that the attention mechanism and message-passing schemes used in GTDL do not effectively capture feature interactions. Furthermore, when we impose the true interaction structure, we find that the predictive accuracy improves. This highlights the need for GTDL methods to prioritize accurate modeling of the graph structure, as it leads to better predictions.
Updated: 2026-03-09 16:22:49
标题: 图形化表格深度学习中特征交互的作用
摘要: 在表格数据上准确预测依赖于捕捉复杂的、特定于数据集的特征交互。基于注意力机制和图神经网络的方法,即图表深度学习(GTDL),旨在通过将这些交互建模为图来提高预测能力。在这项工作中,我们分析了这些方法如何建模特征交互。目前的GTDL方法主要集中在优化预测精度上,通常忽视了对底层图结构的准确建模。通过使用具有已知地面真实图结构的合成数据集,我们发现当前的GTDL方法未能恢复有意义的特征交互,因为它们的边缘恢复接近随机。这表明GTDL中使用的注意力机制和消息传递方案并未有效捕捉特征交互。此外,当我们强加真实的交互结构时,我们发现预测精度有所提高。这突显了GTDL方法需要优先考虑准确建模图结构,因为这将导致更好的预测。
更新时间: 2026-03-09 16:22:49
领域: cs.LG,stat.ML
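The "edge recovery is close to random" finding in the GTDL abstract above suggests a simple check: score every feature pair (for example with an averaged attention weight) and compare the ranking against the known graph. The sketch below computes a ROC AUC for that comparison, where 0.5 corresponds to random recovery. The choice of AUC as the metric and the function name are assumptions for illustration; the abstract does not say which metric the authors use.

```python
import numpy as np


def edge_recovery_auc(scores, true_adj):
    """ROC AUC of pairwise interaction scores against a ground-truth graph.

    scores   : (d, d) array, e.g. attention weights between features (assumed input)
    true_adj : (d, d) boolean adjacency of the known interaction graph
    Diagonal entries (self-interactions) are ignored.
    """
    scores = np.asarray(scores, dtype=float)
    true_adj = np.asarray(true_adj, dtype=bool)
    off_diag = ~np.eye(scores.shape[0], dtype=bool)
    s, y = scores[off_diag], true_adj[off_diag]
    pos, neg = s[y], s[~y]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one true edge and one non-edge")
    # Fraction of (edge, non-edge) pairs ranked correctly; ties count one half.
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Usage: a 3-feature graph with a single undirected interaction 0 <-> 1
adj = np.zeros((3, 3), dtype=bool)
adj[0, 1] = adj[1, 0] = True
perfect = edge_recovery_auc(adj.astype(float), adj)   # scores equal to the truth
uninformative = edge_recovery_auc(np.zeros((3, 3)), adj)  # constant scores
```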
Integrating a Causal Foundation Model into a Prescriptive Maintenance Framework for Optimising Production-Line OEE
The transition to prescriptive maintenance (PsM) in manufacturing is critically constrained by a dependence on predictive models. Such purely predictive models tend to capture statistical associations in the data without identifying the underlying causal drivers of failure, which can lead to costly misdiagnoses and ineffective measures. This fundamental limitation results in a key challenge: while we can predict that a failure may occur, we lack a systematic method to understand why a failure occurs. This paper proposes a model based on causal machine learning to bridge this gap. Our objective is to move beyond diagnosis to active prescription by simulating and evaluating potential fixes to optimise KPIs such as Overall Equipment Effectiveness (OEE). For this purpose, a pre-trained causal foundation model is used as a ``what-if'' simulator to estimate the effects of potential fixes. By estimating the causal effect of each intervention on system-level KPIs, specific actions can be recommended for the production line. This can help identify plausible root causes and quantify their operational impact. The model is evaluated using semi-synthetic manufacturing data and compared with non-causal and causal baseline machine learning models. This paper provides a technical basis for a human-centred approach, allowing engineers to test potential solutions in a causal environment to make more effective operational decisions and reduce costly downtimes.
Updated: 2026-03-09 16:20:36
标题: 将因果基础模型融入预防性维护框架,优化生产线OEE
摘要: 制造业向规定性维护(PsM)的转变受到预测模型的依赖性的严重限制。这种纯粹的预测模型往往捕捉数据中的统计关联,而未识别出故障的潜在原因驱动,这可能导致昂贵的误诊和无效的措施。这种基本限制导致一个关键挑战:虽然我们可以预测故障可能发生,但我们缺乏一种系统方法来了解故障为什么发生。本文提出了一种基于因果机器学习的模型来弥合这一差距。我们的目标是通过模拟和评估潜在修复方案,优化KPI(例如设备综合效率OEE),从而超越诊断而实现积极的处方。为此,使用一个预先训练的因果基础模型作为“假设”模拟器来估计潜在修复方案的影响。通过估计每个干预措施对系统级KPI的因果效应,可以针对生产线推荐具体措施。这有助于识别可信的根本原因并量化其运营影响。该模型使用半合成制造数据进行评估,并与非因果和因果基线机器学习模型进行比较。本文提供了一个以人为中心的技术基础,允许工程师在因果环境中测试潜在解决方案,以做出更有效的运营决策并减少昂贵的停机时间。
更新时间: 2026-03-09 16:20:36
领域: cs.AI,eess.SY
Impact of Connectivity on Laplacian Representations in Reinforcement Learning
Learning compact state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing principled approaches leverage structural priors on the MDP by constructing state representations as linear combinations of the state-graph Laplacian eigenvectors. When the transition graph is unknown or the state space is prohibitively large, the graph spectral features can be estimated directly via sample trajectories. In this work, we prove an upper bound on the approximation error of linear value function approximation under the learned spectral features. We show how this error scales with the algebraic connectivity of the state-graph, grounding the approximation quality in the topological structure of the MDP. We further bound the error introduced by the eigenvector estimation itself, leading to an end-to-end error decomposition across the representation learning pipeline. Additionally, our expression of the Laplacian operator for the RL setting, although equivalent to existing ones, prevents some common misunderstandings, of which we show some examples from the literature. Our results hold for general (non-uniform) policies without any assumptions on the symmetry of the induced transition kernel. We validate our theoretical findings with numerical simulations on gridworld environments.
Updated: 2026-03-09 16:20:31
标题: 连接性对强化学习中拉普拉斯表示的影响
摘要: 在马尔可夫决策过程(MDPs)中学习紧凑的状态表示已被证明对解决大规模强化学习(RL)问题中的维度灾难至关重要。现有的基于原则的方法利用MDP上的结构先验,通过构建状态表示为状态图拉普拉斯特征向量的线性组合。当转移图未知或状态空间过大时,可以通过样本轨迹直接估计图谱特征。在这项工作中,我们证明在学习的谱特征下的线性值函数逼近误差的上界。我们展示了这个错误如何随着状态图的代数连通度而缩放,将逼近质量基于MDP的拓扑结构。我们进一步限制了特征向量估计本身引入的误差,导致在表示学习流程中进行端到端的误差分解。此外,我们在RL设置中的拉普拉斯算子的表达,虽然与现有的表达等价,但避免了一些常见的误解,我们从文献中展示了一些例子。我们的结果适用于一般(非均匀)策略,对诱导转移核的对称性没有任何假设。我们通过在网格环境中进行数值模拟验证了我们的理论发现。
更新时间: 2026-03-09 16:20:31
领域: cs.LG,stat.ML
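To make the quantities in the Laplacian-representation abstract above concrete, the sketch below builds the combinatorial Laplacian of a small state graph and reads off its algebraic connectivity (the second-smallest eigenvalue) together with the leading eigenvectors used as state features. It assumes a symmetric, undirected path graph for simplicity. The paper covers general policies without symmetry assumptions and uses its own expression of the operator, so this is an illustration of the standard objects rather than the authors' construction.

```python
import numpy as np


def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of an undirected path with n states."""
    adjacency = np.zeros((n, n))
    idx = np.arange(n - 1)
    adjacency[idx, idx + 1] = 1.0
    adjacency[idx + 1, idx] = 1.0
    return np.diag(adjacency.sum(axis=1)) - adjacency


def spectral_features(laplacian, k):
    """Return the k smallest eigenvalues and their eigenvectors (as columns)."""
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigh sorts eigenvalues ascending
    return eigvals[:k], eigvecs[:, :k]


# Usage: a 5-state corridor; eigvals[1] is the algebraic connectivity
eigvals, features = spectral_features(path_graph_laplacian(5), k=3)
algebraic_connectivity = eigvals[1]
```

For a path graph the algebraic connectivity is known in closed form, 2 - 2cos(π/n), so it shrinks as the corridor gets longer. Poorly connected state graphs are the regime where the abstract says the approximation error is tied to connectivity.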
Generative Adversarial Regression (GAR): Learning Conditional Risk Scenarios
We propose Generative Adversarial Regression (GAR), a framework for learning conditional risk scenarios through generators aligned with downstream risk objectives. GAR builds on a regression characterization of conditional risk for elicitable functionals, including quantiles, expectiles, and jointly elicitable pairs. We extend this principle from point prediction to generative modeling by training generators whose policy-induced risk matches that of real data under the same context. To ensure robustness across all policies, GAR adopts a minimax formulation in which an adversarial policy identifies worst-case discrepancies in risk evaluation while the generator adapts to eliminate them. This structure preserves alignment with the risk functional across a broad class of policies rather than a fixed, pre-specified set. We illustrate GAR through a tail-risk instantiation based on jointly elicitable $(\mathrm{VaR}, \mathrm{ES})$ objectives. Experiments on S\&P 500 data show that GAR produces scenarios that better preserve downstream risk than unconditional, econometric, and direct predictive baselines while remaining stable under adversarially selected policies.
Updated: 2026-03-09 16:16:59
标题: 生成对抗回归(GAR):学习条件风险场景
摘要: 我们提出生成对抗回归(GAR),这是一个通过与下游风险目标对齐的生成器来学习条件风险场景的框架。GAR基于可引出函数的条件风险的回归特征,包括分位数、期望值和共同引出对。我们将这一原则从点预测扩展到生成建模,通过训练生成器,其策略引发的风险与在相同情境下真实数据的风险相匹配。为了确保所有策略的鲁棒性,GAR采用了一个极小极大的公式,其中对抗性策略识别出风险评估中的最坏情况差异,而生成器则适应消除这些差异。这个结构保持了与风险功能在一个广泛的策略类中的对齐,而不是一个固定的、预先指定的集合。我们通过一个基于共同引出$(\mathrm{VaR}, \mathrm{ES})$目标的尾风险实例来说明GAR。对S\&P 500数据的实验表明,GAR产生的场景比无条件、计量经济和直接预测基准更好地保留了下游风险,同时在对抗性选择的策略下保持稳定。
更新时间: 2026-03-09 16:16:59
领域: stat.ML,cs.LG,math.OC,q-fin.PM,q-fin.RM
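The tail-risk instantiation in the GAR abstract above is built on the pair (VaR, ES). As a reference point, the sketch below computes the plain empirical versions of both from a sample of losses. It assumes the loss convention (larger is worse, upper tail at level α). It does not implement the jointly elicitable scoring objective or the adversarial training loop, neither of which the abstract spells out.

```python
import numpy as np


def empirical_var_es(losses, alpha):
    """Empirical Value-at-Risk and Expected Shortfall at level alpha.

    Assumes `losses` are losses (larger is worse), so VaR is the alpha-quantile
    and ES is the mean loss at or beyond that quantile.
    """
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)   # default linear interpolation
    tail = losses[losses >= var]
    return float(var), float(tail.mean())


# Usage: losses 1, 2, ..., 100 at the 95% level
var_95, es_95 = empirical_var_es(np.arange(1, 101), alpha=0.95)
```

ES is never smaller than VaR at the same level, which gives a quick sanity check when comparing generated scenarios with real data.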
Interactive World Simulator for Robot Policy Training and Evaluation
Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches are often slow and struggle to capture physically consistent interactions over long horizons, limiting their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate-sized robot interaction dataset. Our approach leverages consistency models for both image decoding and latent-space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework enables scalable demonstration collection solely within the world models to train state-of-the-art imitation policies. Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world-model-generated data perform comparably to those trained on the same amount of real-world data. Additionally, we evaluate policies both within the world models and in the real world across diverse tasks, and observe a strong correlation between simulated and real-world performance. Together, these results establish the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.
Updated: 2026-03-09 16:13:32
标题: 交互式世界模拟器用于机器人策略培训和评估
摘要: 行动条件视频预测模型(通常被称为世界模型)已经展现出在机器人应用中的强大潜力,但现有方法往往速度较慢,并且难以捕捉长期时间内的物理一致性交互,从而限制了它们在可扩展机器人策略训练和评估中的有用性。我们提出了交互式世界模拟器,这是一个从中等规模的机器人交互数据集构建交互式世界模型的框架。我们的方法利用一致性模型进行图像解码和潜在空间动态预测,实现了对物理交互的快速和稳定模拟。在我们的实验中,学习到的世界模型产生了交互一致的像素级预测,并且在单个RTX 4090 GPU上以15 FPS的速度支持稳定的长期时间内的交互超过10分钟。我们的框架允许仅在世界模型中进行可扩展的示范数据收集,以训练最先进的模仿策略。通过在涉及刚性物体、可变形物体、物体堆和它们的交互的各种任务中进行广泛的现实世界评估,我们发现在世界模型生成的数据上训练的策略与在相同数量的真实世界数据上训练的策略表现相当。此外,我们在世界模型内和真实世界中跨不同任务评估策略,并观察到模拟和真实世界表现之间存在强烈的相关性。综合这些结果,我们将交互式世界模拟器确立为可扩展机器人数据生成和忠实、可重现策略评估的稳定和物理一致的替代。
更新时间: 2026-03-09 16:13:32
领域: cs.RO,cs.CV,cs.LG
UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
Updated: 2026-03-09 16:12:37
标题: UniWhisper:高效的持续多任务训练,用于稳健的通用音频表示
摘要: 一种通用的音频表示应该能够捕捉细粒度的语音提示和环境声音以及音乐的高级语义,在一个单一的编码器中。现有的编码器通常在一个领域表现出色,但在其他领域表现下降。我们提出了UniWhisper,一个高效的持续多任务训练框架,将异构音频任务转化为统一的指令和答案格式。这使得可以进行标准的下一个令牌训练,而无需特定于任务的头部和损失。我们在38k小时的公共音频上对其进行训练,并使用浅层MLP探针和k最近邻居(kNN)在涵盖语音、环境声音和音乐的20个任务上评估编码器。与Whisper的0.64和0.46相比,UniWhisper在MLP探针上达到了0.81的归一化加权平均值,在kNN上达到了0.61,同时保持了强大的语音表现。
更新时间: 2026-03-09 16:12:37
领域: cs.SD,cs.AI
Overlap-Adaptive Regularization for Conditional Average Treatment Effect Estimation
The conditional average treatment effect (CATE) is widely used in personalized medicine to inform therapeutic decisions. However, state-of-the-art methods for CATE estimation (so-called meta-learners) often perform poorly in the presence of low overlap. In this work, we introduce a new approach to tackle this issue and improve the performance of existing meta-learners in the low-overlap regions. Specifically, we introduce Overlap-Adaptive Regularization (OAR) that regularizes target models proportionally to overlap weights so that, informally, the regularization is higher in regions with low overlap. To the best of our knowledge, our OAR is the first approach to leverage overlap weights in the regularization terms of the meta-learners. Our OAR approach is flexible and works with any existing CATE meta-learner: we demonstrate how OAR can be applied to both parametric and non-parametric second-stage models. Furthermore, we propose debiased versions of our OAR that preserve the Neyman-orthogonality of existing meta-learners and thus ensure more robust inference. Through a series of (semi-)synthetic experiments, we demonstrate that our OAR significantly improves CATE estimation in low-overlap settings in comparison to constant regularization.
Updated: 2026-03-09 16:11:31
标题: 重叠自适应正则化用于条件平均治疗效应估计
摘要: 条件平均处理效应(CATE)被广泛应用于个性化医学,以指导治疗决策。然而,用于CATE估计的最先进方法(所谓的元学习器)在低重叠情况下通常表现不佳。在这项工作中,我们引入了一种新方法来解决这个问题,并改善现有元学习器在低重叠区域的性能。具体来说,我们引入了Overlap-Adaptive Regularization(OAR),根据重叠权重对目标模型进行正则化,因此,在低重叠区域,正则化更高。据我们所知,我们的OAR是第一个利用重叠权重在元学习器的正则化项中的方法。我们的OAR方法灵活,并适用于任何现有的CATE元学习器:我们展示了OAR如何应用于参数化和非参数化二阶段模型。此外,我们提出了我们的OAR的去偏差版本,保持现有元学习器的Neyman正交性,从而确保更稳健的推断。通过一系列(半)合成实验,我们证明我们的OAR相对于常数正则化显著改善了低重叠设置中的CATE估计。
更新时间: 2026-03-09 16:11:31
领域: cs.LG,stat.ML
The Neural Compass: Probabilistic Relative Feature Fields for Robotic Search
Object co-occurrences provide a key cue for finding objects successfully and efficiently in unfamiliar environments. Typically, one looks for cups in kitchens and views fridges as evidence of being in a kitchen. Such priors have also been exploited in artificial agents, but they are typically learned from explicitly labeled data or queried from language models. It is still unclear whether these relations can be learned implicitly from unlabeled observations alone. In this work, we address this problem and propose ProReFF, a feature field model trained to predict relative distributions of features obtained from pre-trained vision language models. In addition, we introduce a learning-based strategy that enables training from unlabeled and potentially contradictory data by aligning inconsistent observations into a coherent relative distribution. For the downstream object search task, we propose an agent that leverages predicted feature distributions as a semantic prior to guide exploration toward regions with a high likelihood of containing the object. We present extensive evaluations demonstrating that ProReFF captures meaningful relative feature distributions in natural scenes and provides insight into the impact of our proposed alignment step. We further evaluate the performance of our search agent in 100 challenges in the Matterport3D simulator, comparing with feature-based baselines and human participants. The proposed agent is 20% more efficient than the strongest baseline and achieves up to 80% of human performance.
Updated: 2026-03-09 16:11:05
标题: 神经罗盘:用于机器人搜索的概率相对特征场
摘要: 对象共现为在陌生环境中成功高效地找到对象提供了关键线索。通常,在厨房里寻找杯子,将冰箱视为在厨房的证据。这些先验也已被利用在人工智能代理中,但通常是从明确标记的数据中学习或从语言模型中查询。目前尚不清楚这些关系是否可以仅从未标记的观察中隐式学习。在这项工作中,我们解决了这个问题,并提出了ProReFF,一个训练用于预测由预先训练的视觉语言模型获得的特征的相对分布的特征场模型。此外,我们引入了一种基于学习的策略,通过将不一致的观察对齐到一个连贯的相对分布中,从未标记和可能矛盾的数据中进行训练。对于下游对象搜索任务,我们提出了一个代理,利用预测的特征分布作为语义先验,引导探索朝向可能包含对象的区域。我们进行了广泛的评估,证明ProReFF在自然场景中捕捉了有意义的相对特征分布,并深入研究了我们提出的对齐步骤的影响。我们进一步在Matterport3D模拟器中的100个挑战中评估了我们搜索代理的性能,与基于特征的基线和人类参与者进行比较。所提出的代理比最强基线高效20%,达到了人类性能的80%。
更新时间: 2026-03-09 16:11:05
领域: cs.RO,cs.LG
Towards Effective and Efficient Graph Alignment without Supervision
Unsupervised graph alignment aims to find the node correspondence across different graphs without any anchor node pairs. Despite the recent efforts utilizing deep learning-based techniques, such as the embedding and optimal transport (OT)-based approaches, we observe their limitations in terms of model accuracy-efficiency tradeoff. By focusing on the exploitation of local and global graph information, we formalize them as the ``local representation, global alignment'' paradigm, and present a new ``global representation and alignment'' paradigm to resolve the mismatch between the two phases in the alignment process. We then propose \underline{Gl}obal representation and \underline{o}ptimal transport-\underline{b}ased \underline{Align}ment (\texttt{GlobAlign}), and its variant, \texttt{GlobAlign-E}, for better \underline{E}fficiency. Our methods are equipped with the global attention mechanism and a hierarchical cross-graph transport cost, able to capture long-range and implicit node dependencies beyond the local graph structure. Furthermore, \texttt{GlobAlign-E} successfully closes the time complexity gap between representative embedding and OT-based methods, reducing OT's cubic complexity to quadratic terms. Through extensive experiments, our methods demonstrate superior performance, with up to a 20\% accuracy improvement over the best competitor. Meanwhile, \texttt{GlobAlign-E} achieves the best efficiency, with an order of magnitude speedup against existing OT-based methods.
Updated: 2026-03-09 16:00:08
标题: 朝向无监督有效高效的图对齐
摘要: 无监督图对齐旨在在不使用任何锚节点对的情况下找到不同图之间的节点对应关系。尽管最近的工作利用了基于深度学习的技术,例如嵌入和基于最优输运(OT)的方法,但我们观察到它们在模型准确性和效率之间存在限制性。通过专注于利用局部和全局图信息,我们将它们形式化为“局部表示,全局对齐”范式,并提出了一个新的“全局表示和对齐”范式来解决对齐过程中两个阶段之间的不匹配问题。然后,我们提出了全局表示和基于最优输运的对齐(GlobAlign)及其变体GlobAlign-E,以提高效率。我们的方法配备了全局注意机制和分层跨图传输成本,能够捕捉超出局部图结构的远程和隐式节点依赖关系。此外,GlobAlign-E成功地缩小了代表性嵌入和基于OT的方法之间的时间复杂度差距,将OT的立方复杂度降低到二次项。通过大量实验,我们的方法表现出优越性能,最高可比最佳竞争对手提高20%的准确性。同时,GlobAlign-E实现了最佳效率,在现有基于OT的方法上提高了一个数量级的速度。
更新时间: 2026-03-09 16:00:08
领域: cs.LG,cs.AI
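For readers unfamiliar with OT-based alignment as mentioned in the GlobAlign abstract above, the sketch below shows the generic idea: given a cross-graph cost matrix between node embeddings, an entropic OT solver (Sinkhorn iterations) returns a soft correspondence matrix, and each node is matched to its highest-mass partner. This is the textbook Sinkhorn algorithm with uniform node weights, used here as an assumption. It is not GlobAlign's hierarchical transport cost or GlobAlign-E's complexity reduction.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic-regularized OT plan between two uniformly weighted node sets."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)            # uniform mass on source nodes
    b = np.full(m, 1.0 / m)            # uniform mass on target nodes
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)         # rescale to match column marginals
        u = a / (kernel @ v)           # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]


# Usage: zero cost marks the true match; node i aligns with node (i + 1) mod 3
cost = np.array([[1.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0],
                 [0.0, 1.0, 1.0]])
plan = sinkhorn_plan(cost)
matches = plan.argmax(axis=1)
```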
SCAFFOLD-CEGIS: Preventing Latent Security Degradation in LLM-Driven Iterative Code Refinement
The application of large language models to code generation has evolved from one-shot generation to iterative refinement, yet the evolution of security throughout iteration remains insufficiently understood. Through comparative experiments on three mainstream LLMs, this paper reveals the iterative refinement paradox: specification drift during multi-objective optimization causes security to degrade gradually over successive iterations. Taking GPT-4o as an example, 43.7% of iteration chains contain more vulnerabilities than the baseline after ten rounds, and cross-model experiments show that this phenomenon is prevalent. Further analysis shows that simply introducing static application security testing (SAST) gating cannot effectively suppress degradation; instead, it increases the latent security degradation rate from 12.5% under the unprotected baseline to 20.8%. The root cause is that static-analysis rules cannot cover structural degradations such as the removal of defensive logic or the weakening of exception handling. To address this problem, we propose the SCAFFOLD-CEGIS framework. Drawing on the counterexample-guided inductive synthesis (CEGIS) paradigm, the framework adopts a multi-agent collaborative architecture that transforms security constraints from implicit prompts into explicit verifiable constraints. It automatically identifies and solidifies security-critical elements as hard constraints through semantic anchoring, enforces safety monotonicity through four-layer gated verification, and continuously assimilates experience from failures. Comparative experiments against six existing defense methods show that the full framework reduces the latent security degradation rate to 2.1% and achieves a safety monotonicity rate of 100%.
Updated: 2026-03-09 15:54:18
标题: SCAFFOLD-CEGIS:在LLM驱动的迭代代码优化中防止潜在的安全降级
摘要: 将大型语言模型应用于代码生成的方法已经从一次性生成发展到迭代细化,然而在迭代过程中安全性的演变仍然不够了解。通过对三种主流LLM进行比较实验,本文揭示了迭代细化悖论:在多目标优化过程中规范漂移导致安全性逐渐在连续迭代中恶化。以GPT-4o为例,43.7%的迭代链在十轮后比基线含有更多的漏洞,并且跨模型实验表明这种现象是普遍存在的。进一步分析表明,简单引入静态应用安全测试(SAST)门控无法有效抑制恶化,反而将潜在安全恶化率从未受保护的基线下的12.5%增加至20.8%。根本原因在于静态分析规则无法覆盖结构性恶化,如防御逻辑的移除或异常处理的削弱。为解决这一问题,我们提出了SCAFFOLD-CEGIS框架。借鉴反例引导归纳合成(CEGIS)范式,该框架采用多代理协作架构,将安全约束从隐式提示转化为显式可验证的约束。通过语义锚定自动识别和巩固安全关键元素作为硬约束,通过四层门控验证强制实施安全单调性,并持续吸收失败经验。与六种现有防御方法的比较实验表明,完整框架将潜在安全恶化率降低至2.1%,并实现了100%的安全单调性率。
更新时间: 2026-03-09 15:54:18
领域: cs.CR,cs.SE
GALACTIC: Global and Local Agnostic Counterfactuals for Time-series Clustering
Time-series clustering is a fundamental tool for pattern discovery, yet existing explainability methods, primarily based on feature attribution or metadata, fail to identify the transitions that move an instance across cluster boundaries. While Counterfactual Explanations (CEs) identify the minimal temporal perturbations required to alter the prediction of a model, they have been mostly confined to supervised settings. This paper introduces GALACTIC, the first unified framework to bridge local and global counterfactual explainability for unsupervised time-series clustering. At instance level (local), GALACTIC generates perturbations via a cluster-aware optimization objective that respects the target and underlying cluster assignments. At cluster level (global), to mitigate cognitive load and enhance interpretability, we formulate a representative CE selection problem. We propose a Minimum Description Length (MDL) objective to extract a non-redundant summary of global explanations that characterize the transitions between clusters. We prove that our MDL objective is supermodular, which allows the corresponding MDL reduction to be framed as a monotone submodular set function. This enables an efficient greedy selection algorithm with provable $(1-1/e)$ approximation guarantees. Extensive experimental evaluation on the UCR Archive demonstrates that GALACTIC produces significantly sparser local CEs and more concise global summaries than state-of-the-art baselines adapted for our problem, offering the first unified approach for interpreting clustered time-series through counterfactuals.
Updated: 2026-03-09 15:52:20
标题: GALACTIC:全球和本地对于时间序列聚类的不可知因果关系
摘要: 时间序列聚类是一种用于模式发现的基本工具,然而现有的可解释性方法主要基于特征归因或元数据,未能识别将一个实例移动到跨越聚类边界的转变。虽然反事实解释(CEs)可以识别改变模型预测所需的最小时间扰动,但它们大多局限于监督设置。本文介绍了GALACTIC,这是第一个统一框架,用于将局部和全局反事实可解释性桥接到无监督时间序列聚类。在实例级别(局部),GALACTIC通过一个考虑目标和基础聚类分配的集群感知优化目标生成扰动。在集群级别(全局),为了减轻认知负荷并增强可解释性,我们制定了一个代表性CE选择问题。我们提出了一个最小描述长度(MDL)目标,以提取描述转换之间的全局解释的非冗余摘要。我们证明我们的MDL目标是超模块化的,这允许相应的MDL缩减被构建为单调次模集函数。这使得我们能够提供一个具有可证明的(1-1/e)近似保证的高效贪婪选择算法。对UCR存档的广泛实验评估表明,GALACTIC产生的局部CEs明显更稀疏,全局总结更简洁,比为我们的问题调整的最先进基线提供了第一个解释聚类时间序列的统一方法。
更新时间: 2026-03-09 15:52:20
领域: cs.LG,cs.AI
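The (1-1/e) guarantee cited in the GALACTIC abstract above is the classical result for greedily maximizing a monotone submodular set function under a cardinality budget. The sketch below shows that greedy loop on a set-coverage objective, a standard monotone submodular function used here as a stand-in. The paper's actual objective is an MDL reduction over counterfactual explanations, which the abstract does not define in enough detail to reproduce, so the candidate names and coverage sets are hypothetical.

```python
def greedy_select(candidates, budget):
    """Greedy maximization of coverage, a monotone submodular objective.

    candidates : dict mapping a candidate explanation id to the set of
                 instances it explains (hypothetical stand-in for MDL gain)
    budget     : maximum number of explanations to keep
    """
    covered, chosen = set(), []
    for _ in range(budget):
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        # Pick the candidate with the largest marginal gain.
        best = max(remaining, key=lambda c: len(candidates[c] - covered))
        if not candidates[best] - covered:
            break  # nothing left to gain
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered


# Usage: three candidate counterfactuals explaining overlapping instances
summary, explained = greedy_select(
    {"ce_a": {1, 2, 3}, "ce_b": {3, 4}, "ce_c": {4, 5, 6, 7}}, budget=2
)
```

Each round re-evaluates marginal gains against what is already covered, which is why the redundant candidate `ce_b` is skipped in the usage example.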
BNEM: A Boltzmann Sampler Based on Bootstrapped Noised Energy Matching
Developing an efficient sampler capable of generating independent and identically distributed (IID) samples from a Boltzmann distribution is a crucial challenge in scientific research, e.g., in molecular dynamics. In this work, we aim to learn neural samplers given energy functions instead of data sampled from the Boltzmann distribution. By learning the energies of the noised data, we propose a diffusion-based sampler, Noised Energy Matching (NEM), which theoretically has lower variance and higher complexity than related work. Furthermore, a novel bootstrapping technique is applied to NEM to balance bias and variance, yielding Bootstrapped NEM (BNEM). We evaluate NEM and BNEM on a 2-dimensional Gaussian Mixture Model (GMM) with 40 components and a 4-particle double-well potential (DW-4). The experimental results demonstrate that BNEM can achieve state-of-the-art performance while being more robust.
Updated: 2026-03-09 15:51:28
Categories: cs.LG, cs.AI, stat.CO, stat.ML
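For the BNEM entry above, one way to picture "the energies of the noised data" is the negative log of the Boltzmann weight averaged over Gaussian perturbations of a noised point. The sketch below assumes that definition and a simple additive Gaussian noise model; neither is spelled out in the abstract, and the function name `noised_energy_mc` is hypothetical. Note that the log of a sample mean is a biased estimator for finite sample counts, which is the kind of bias-variance tension the abstract alludes to.

```python
import math
import random
from typing import Callable


def noised_energy_mc(energy: Callable[[float], float], x_t: float, sigma: float,
                     num_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of -log E_{x0 ~ N(x_t, sigma^2)}[exp(-energy(x0))]."""
    neg_e = [-energy(x_t + sigma * rng.gauss(0.0, 1.0)) for _ in range(num_samples)]
    # log-mean-exp, shifted by the max for numerical stability
    m = max(neg_e)
    log_mean = m + math.log(sum(math.exp(v - m) for v in neg_e) / num_samples)
    return -log_mean


def quadratic(x: float) -> float:
    """Energy of a standard Gaussian, up to an additive constant."""
    return 0.5 * x * x


estimate = noised_energy_mc(quadratic, x_t=1.0, sigma=1.0, num_samples=20000,
                            rng=random.Random(0))
# Closed form for this toy case: x_t^2 / (2 (1 + sigma^2)) + 0.5 * log(1 + sigma^2)
exact = 1.0 / 4.0 + 0.5 * math.log(2.0)
```

The quadratic energy is chosen only because its noised energy has a closed form, which makes the estimator easy to sanity-check.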
Breaking the Bias Barrier in Concave Multi-Objective Reinforcement Learning
While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility $f(J_1^π,\dots,J_M^π)$ over multiple objectives, where each $J_m^π$ denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on $\partial f(J^π)$, while in practice only empirical return estimates $\hat J$ are available. Because $f$ is nonlinear, the plug-in estimator is biased ($\mathbb{E}[\partial f(\hat J)] \neq \partial f(\mathbb{E}[\hat J])$), leading to persistent gradient bias that degrades sample complexity. In this work we identify and overcome this bias barrier in concave-scalarized multi-objective reinforcement learning. We show that existing policy-gradient methods suffer an intrinsic $\widetilde{\mathcal{O}}(ε^{-4})$ sample complexity due to this bias. To address this issue, we develop a Natural Policy Gradient (NPG) algorithm equipped with a multi-level Monte Carlo (MLMC) estimator that controls the bias of the scalarization gradient while maintaining low sampling cost. We prove that this approach achieves the optimal $\widetilde{\mathcal{O}}(ε^{-2})$ sample complexity for computing an $ε$-optimal policy. Furthermore, we show that when the scalarization function is second-order smooth, the first-order bias cancels automatically, allowing vanilla NPG to achieve the same $\widetilde{\mathcal{O}}(ε^{-2})$ rate without MLMC. Our results provide the first optimal sample complexity guarantees for concave multi-objective reinforcement learning under policy-gradient methods.
Updated: 2026-03-09 15:49:10
Categories: cs.LG, stat.ML
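The plug-in bias described in the entry above, $\mathbb{E}[\partial f(\hat J)] \neq \partial f(\mathbb{E}[\hat J])$, can be seen numerically in a toy setting. This sketch assumes a single objective with $f(J) = \log J$ and uniformly distributed returns; both are illustrative choices, not the paper's setup, and `plug_in_gradient_mean` is a hypothetical helper.

```python
import random
from statistics import fmean


def grad_f(j: float) -> float:
    """Derivative of the concave scalarization f(J) = log(J)."""
    return 1.0 / j


def plug_in_gradient_mean(n_rollouts: int, n_trials: int, rng: random.Random) -> float:
    """Average of grad_f(J_hat) over many independent empirical estimates J_hat."""
    values = []
    for _ in range(n_trials):
        # Toy returns: uniform on [0.5, 1.5], so the true J is 1.0.
        j_hat = fmean(rng.uniform(0.5, 1.5) for _ in range(n_rollouts))
        values.append(grad_f(j_hat))
    return fmean(values)


rng = random.Random(0)
true_grad = grad_f(1.0)
few = plug_in_gradient_mean(4, 20000, rng)    # few rollouts per estimate
many = plug_in_gradient_mean(64, 20000, rng)  # bias shrinks as rollouts grow
```

Because $1/J$ is convex, the averaged plug-in gradient overshoots the true value, and the gap narrows only as more rollouts go into each estimate. That per-estimate cost is what a bias-controlling estimator such as MLMC is meant to avoid.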
Remaining-data-free Machine Unlearning by Suppressing Sample Contribution
Machine unlearning (MU) aims to remove the influence of specific training samples from a well-trained model, a task of growing importance due to the "right to be forgotten." The unlearned model should approach the retrained model, in which the forgetting data do not contribute to the training process. Therefore, unlearning should withdraw their contribution from the pre-trained model. However, quantifying and disentangling a sample's contribution to the overall learning process is highly challenging, leading most existing MU approaches to adopt other heuristic strategies such as random labeling or knowledge distillation. These operations inevitably degrade model utility, requiring additional maintenance with the remaining data. To advance MU towards better utility and efficiency for practical deployment, we seek to approximate sample contribution with only the pre-trained model. We theoretically and empirically reveal that a sample's contribution during training manifests in the learned model's increased sensitivity to it. In light of this, we propose MU-Mis (Machine Unlearning by Minimizing input sensitivity), which directly suppresses the contribution of forgetting data. This straightforward suppression enables MU-Mis to successfully unlearn without degrading model utility on the remaining data, thereby eliminating the need for access to the remaining data. To the best of our knowledge, this is the first time that a remaining-data-free method can perform on par with top-performing remaining-data-dependent methods.
Updated: 2026-03-09 15:44:44
Categories: cs.LG
Beyond Hungarian: Match-Free Supervision for End-to-End Object Detection
Recent DEtection TRansformer (DETR) based frameworks have achieved remarkable success in end-to-end object detection. However, the reliance on the Hungarian algorithm for bipartite matching between queries and ground truths introduces computational overhead and complicates the training dynamics. In this paper, we propose a novel matching-free training scheme for DETR-based detectors that eliminates the need for explicit heuristic matching. At the core of our approach is a dedicated Cross-Attention-based Query Selection (CAQS) module. Instead of discrete assignment, we utilize encoded ground-truth information to probe the decoder queries through a cross-attention mechanism. By minimizing the weighted error between the queried results and the ground truths, the model autonomously learns the implicit correspondences between object queries and specific targets. This learned relationship further provides supervision signals for the learning of queries. Experimental results demonstrate that our proposed method bypasses the traditional matching process, significantly enhancing training efficiency and reducing the matching latency by over 50%. It effectively eliminates the discrete matching bottleneck through differentiable correspondence learning and achieves superior performance compared to existing state-of-the-art methods.
Updated: 2026-03-09 15:44:23
Categories: cs.CV, cs.AI
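For the match-free detection entry above, the idea of probing decoder queries with ground-truth information can be sketched as a softmax attention whose weights act as a soft correspondence. This is one plausible reading of "weighted error", not the CAQS module itself; the function `match_free_loss`, the embedding inputs, and the squared-error box loss are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def match_free_loss(gt_keys, gt_boxes, query_keys, query_boxes) -> Tuple[float, List[List[float]]]:
    """Attention-weighted regression error, one attention row per ground truth.

    Each ground-truth embedding attends over all decoder queries; the
    attention weights serve as a soft, differentiable correspondence.
    """
    scale = math.sqrt(len(query_keys[0]))
    loss, weights = 0.0, []
    for g_key, g_box in zip(gt_keys, gt_boxes):
        attn = softmax([dot(g_key, q) / scale for q in query_keys])
        weights.append(attn)
        # Weighted squared error between each query's box and this ground truth.
        for w, q_box in zip(attn, query_boxes):
            loss += w * sum((a - b) ** 2 for a, b in zip(q_box, g_box))
    return loss / len(gt_keys), weights


gt_keys = [[4.0, 0.0], [0.0, 4.0]]
gt_boxes = [[0.2, 0.2, 0.1, 0.1], [0.7, 0.6, 0.2, 0.3]]
query_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_boxes = [[0.21, 0.19, 0.1, 0.1], [0.68, 0.61, 0.2, 0.3], [0.5, 0.5, 0.5, 0.5]]
loss, weights = match_free_loss(gt_keys, gt_boxes, query_keys, query_boxes)
```

No assignment problem is solved here: every query contributes to every ground truth, weighted by attention, so the whole loss stays differentiable.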
Oracle-Guided Soft Shielding for Safe Move Prediction in Chess
In high-stakes environments, agents relying purely on imitation learning or reinforcement learning often struggle to avoid safety-critical errors during exploration. Existing reinforcement learning approaches for environments such as chess require hundreds of thousands of episodes and substantial computational resources to converge. Imitation learning, on the other hand, is more sample efficient but is brittle under distributional shift and lacks mechanisms for proactive risk avoidance. In this work, we propose Oracle-Guided Soft Shielding (OGSS), a simple yet effective framework for safer decision-making, enabling safe exploration by learning a probabilistic safety model from oracle feedback in an imitation learning setting. Focusing on the domain of chess, we train a model to predict strong moves based on past games, and separately learn a blunder prediction model from Stockfish evaluations to estimate the tactical risk of each move. During inference, the agent first generates a set of candidate moves, then uses the blunder model to identify high-risk options, and finally uses a utility function combining the predicted move likelihood from the policy model and the blunder probability to select actions that strike a balance between performance and safety. This enables the agent to explore and play competitively while significantly reducing the chance of tactical mistakes. Across hundreds of games against a strong chess engine, we compare our approach with other methods in the literature, such as action pruning, SafeDAgger, and uncertainty-based sampling. Our results demonstrate that OGSS variants maintain a lower blunder rate even as the agent's exploration ratio is increased severalfold, highlighting its ability to support broader exploration without compromising tactical soundness.
Updated: 2026-03-09 15:40:01
Categories: cs.LG, cs.AI
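The inference step in the OGSS entry above combines a policy likelihood with a blunder probability. A minimal sketch follows, assuming a utility of the form log-likelihood minus a weighted blunder probability; the abstract does not give the exact formula, so `risk_weight`, `hard_limit`, and `select_move` are hypothetical.

```python
import math
from typing import Dict, Optional


def select_move(policy_probs: Dict[str, float], blunder_probs: Dict[str, float],
                risk_weight: float = 2.0, hard_limit: float = 0.9) -> Optional[str]:
    """Pick the candidate maximizing log-likelihood minus a blunder penalty."""
    best_move, best_utility = None, -math.inf
    for move, p in policy_probs.items():
        b = blunder_probs.get(move, 0.0)
        if p <= 0.0 or b >= hard_limit:
            continue  # skip impossible moves and near-certain blunders
        utility = math.log(p) - risk_weight * b  # soft shield: penalize rather than prune
        if utility > best_utility:
            best_move, best_utility = move, utility
    return best_move


policy = {"Qxb7": 0.50, "Nf3": 0.30, "h4": 0.20}
blunder = {"Qxb7": 0.70, "Nf3": 0.05, "h4": 0.10}
choice = select_move(policy, blunder)
```

With `risk_weight=0.0` the rule falls back to the policy's most likely move, which makes the weight a direct dial between imitation and caution.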
Echo2ECG: Enhancing ECG Representations with Cardiac Morphology from Multi-View Echos
Electrocardiography (ECG) is a low-cost, widely used modality for diagnosing electrical abnormalities like atrial fibrillation by capturing the heart's electrical activity. However, it cannot directly measure cardiac morphological phenotypes, such as left ventricular ejection fraction (LVEF), which typically require echocardiography (Echo). Predicting these phenotypes from ECG would enable early, accessible health screening. Existing self-supervised methods suffer from a representational mismatch by aligning ECGs to single-view Echos, which only capture local, spatially restricted anatomical snapshots. To address this, we propose Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with the heart's morphological structure captured in multi-view Echos. We evaluate Echo2ECG as an ECG feature extractor on two clinically relevant tasks that fundamentally require morphological information: (1) classification of structural cardiac phenotypes across three datasets, and (2) retrieval of Echo studies with similar morphological characteristics using ECG queries. Our extracted ECG representations consistently outperform those of state-of-the-art unimodal and multimodal baselines across both tasks, despite being 18x smaller than the largest baseline. These results demonstrate that Echo2ECG is a robust, powerful ECG feature extractor. Our code is accessible at https://github.com/michelleespranita/Echo2ECG.
Updated: 2026-03-09 15:39:57
Categories: cs.LG, cs.AI
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information
As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real-world use. This is particularly critical in the domains of medicine and public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, while there are a number of LLM benchmarks in the medical domain, currently little is known about LLM knowledge within the field of public health. To address this issue, this paper introduces a new benchmark, PubHealthBench, with over 8000 questions for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free-form responses to public health queries. To create PubHealthBench, we extract free text from 687 current UK government guidance documents and implement an automated pipeline for generating MCQA samples. Assessing 24 LLMs on PubHealthBench, we find the latest proprietary LLMs (GPT-4.5, GPT-4.1 and o1) have a high degree of knowledge, achieving >90% accuracy in the MCQA setup, and outperform humans with cursory search engine use. However, in the free-form setup we see lower performance, with no model scoring >75%. Therefore, while there are promising signs that state-of-the-art (SOTA) LLMs are an increasingly accurate source of public health information, additional safeguards or tools may still be needed when providing free-form responses.
Updated: 2026-03-09 15:37:34
Categories: cs.CL, cs.LG
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models
Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm. However, LLMs are advancing so rapidly that static benchmarks quickly become obsolete or prone to overfitting, yielding a misleading picture of model trustworthiness. Here we introduce a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs across four safety-critical axes: robustness, privacy, bias/fairness, and hallucination. Validated against board-certified clinicians with high concordance, a suite of adversarial agents autonomously mutates clinical test cases to uncover vulnerabilities in real time. Applying DAS to 15 proprietary and open-source LLMs revealed a profound gap between high static benchmark performance and low dynamic reliability: the "Benchmarking Gap". Despite median MedQA accuracy exceeding 80%, 94% of previously correct answers failed our dynamic robustness tests. Crucially, this brittleness generalized to the realistic, open-ended HealthBench dataset, where top-tier models exhibited failure rates exceeding 70% and stark shifts in model rankings across evaluations, suggesting that high scores on established static benchmarks may reflect superficial memorization. We observed similarly high failure rates across other domains: privacy leaks were elicited in 86% of scenarios, cognitive-bias priming altered clinical recommendations in 81% of fairness tests, and we identified hallucination rates exceeding 74% in widely used models. By converting medical LLM safety analysis from a static checklist into a dynamic stress-test, DAS provides a foundational, scalable, and living platform to surface the latent risks that must be addressed before the next generation of medical AI can be safely deployed.
Updated: 2026-03-09 15:37:01
Categories: cs.LG
CRAwDAD: Causal Reasoning Augmentation with Dual-Agent Debate
When people reason about cause and effect, they often consider many competing "what if" scenarios before deciding which explanation fits best. Analogously, advanced language models capable of causal inference can consider multiple interventions and counterfactuals to judge the validity of causal claims. Crucially, this type of reasoning is less like a single calculation and more like an internal dialogue between alternative hypotheses. In this paper, we make this dialogue explicit through a dual-agent debate framework where one model provides a structured causal inference, and the other critically examines this reasoning for logical flaws. When disagreements arise, the agents attempt to persuade each other, challenging each other's logic and revising their conclusions until they converge on a mutually agreed answer. To take advantage of this deliberative process, we specifically use reasoning language models, whose strengths in both causal inference and adversarial debate remain under-explored relative to standard large language models. We evaluate our approach on the CLadder dataset, a benchmark linking natural language questions to formally defined causal graphs across all three rungs of Pearl's ladder of causation. With Qwen3 and DeepSeek-R1 as debater agents, we demonstrate that multi-agent debate improves DeepSeek-R1's overall accuracy in causal inference from 78.03% to 87.45%, with the counterfactual category specifically improving from 67.94% to 80.04% accuracy. Similarly, Qwen3's overall accuracy improves from 84.16% to 89.41%, and counterfactual questions from 71.53% to 80.35%, showing that even strong models can still benefit greatly from debate with weaker agents. Our results highlight the potential of reasoning models as building blocks for multi-agent systems in causal inference, and demonstrate the importance of diverse perspectives in causal problem-solving.
Updated: 2026-03-09 15:33:35
Categories: cs.LG, cs.MA
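The dual-agent debate in the CRAwDAD entry above can be outlined as a loop in which a proposer and a critic exchange answers until they agree. The sketch below uses stub agents in place of real language models; the `Agent` signature, the `max_rounds` cap, and the stopping rule are assumptions, since the abstract describes the protocol only informally.

```python
from typing import Callable, List, Tuple

# An agent maps (question, transcript so far) to (answer, rationale).
Agent = Callable[[str, List[str]], Tuple[str, str]]


def debate(question: str, proposer: Agent, critic: Agent,
           max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Alternate proposer and critic turns until their answers agree.

    If no consensus is reached within `max_rounds`, the proposer's latest
    answer is returned.
    """
    transcript: List[str] = []
    answer, why = proposer(question, transcript)
    transcript.append(f"proposer: {answer} ({why})")
    for _ in range(max_rounds):
        rebuttal, why = critic(question, transcript)
        transcript.append(f"critic: {rebuttal} ({why})")
        if rebuttal == answer:
            break  # critic accepts the current answer
        answer, why = proposer(question, transcript)
        transcript.append(f"proposer: {answer} ({why})")
        if answer == rebuttal:
            break  # proposer was persuaded
    return answer, transcript


def stub_proposer(question: str, transcript: List[str]) -> Tuple[str, str]:
    # Hypothetical stand-in for a reasoning model: concedes once challenged.
    return ("no", "persuaded by critique") if transcript else ("yes", "initial inference")


def stub_critic(question: str, transcript: List[str]) -> Tuple[str, str]:
    return ("no", "a confounder was ignored")


final, log = debate("Does X cause Y?", stub_proposer, stub_critic)
```

Passing the full transcript to each agent on every turn is what lets either side revise its answer in light of the other's rationale.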
Efficient Credal Prediction through Decalibration
A reliable representation of uncertainty is essential for the application of modern machine learning methods in safety-critical settings. In this regard, the use of credal sets (i.e., convex sets of probability distributions) has recently been proposed as a suitable approach to representing epistemic uncertainty. However, as with other approaches to epistemic uncertainty, training credal predictors is computationally complex and usually involves (re-)training an ensemble of models. The resulting computational complexity prevents their adoption for complex models such as foundation models and multi-modal systems. To address this problem, we propose an efficient method for credal prediction that is grounded in the notion of relative likelihood and inspired by techniques for the calibration of probabilistic classifiers. For each class label, our method predicts a range of plausible probabilities in the form of an interval. To produce the lower and upper bounds of these intervals, we propose a technique that we refer to as decalibration. Extensive experiments show that our method yields credal sets with strong performance across diverse tasks, including coverage-efficiency evaluation, out-of-distribution detection, and in-context learning. Notably, we demonstrate credal prediction on models such as TabPFN and CLIP -- architectures for which the construction of credal sets was previously infeasible.
Updated: 2026-03-09 15:30:10
Categories: cs.LG, stat.ML
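The credal prediction entry above represents uncertainty as one probability interval per class. A small sketch shows how such intervals can be checked and queried; it does not implement decalibration, and the helper names and example bounds are hypothetical.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]


def is_nonempty(intervals: Dict[str, Interval]) -> bool:
    """Some probability vector fits the intervals iff sum(lower) <= 1 <= sum(upper)."""
    valid = all(0.0 <= lo <= hi <= 1.0 for lo, hi in intervals.values())
    total_lo = sum(lo for lo, _ in intervals.values())
    total_hi = sum(hi for _, hi in intervals.values())
    return valid and total_lo <= 1.0 <= total_hi


def contains(intervals: Dict[str, Interval], probs: Dict[str, float],
             tol: float = 1e-9) -> bool:
    """Check that `probs` is a distribution lying inside every class interval."""
    if abs(sum(probs.values()) - 1.0) > tol:
        return False
    return all(lo - tol <= probs[c] <= hi + tol for c, (lo, hi) in intervals.items())


def interval_width(intervals: Dict[str, Interval]) -> float:
    """Mean interval width: a simple proxy for how much epistemic uncertainty is expressed."""
    return sum(hi - lo for lo, hi in intervals.values()) / len(intervals)


credal = {"cat": (0.5, 0.8), "dog": (0.1, 0.4), "bird": (0.0, 0.2)}
inside = contains(credal, {"cat": 0.6, "dog": 0.3, "bird": 0.1})
outside = contains(credal, {"cat": 0.4, "dog": 0.4, "bird": 0.2})
```

The set of distributions satisfying all interval constraints is convex, which is why interval bounds are a compact way to describe a credal set.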
When AI Levels the Playing Field: Skill Homogenization, Asset Concentration, and Two Regimes of Inequality
Generative AI compresses within-task skill differences while shifting economic value toward concentrated complementary assets, creating an apparent paradox: the technology that equalizes individual performance may widen aggregate inequality. We formalize this tension in a task-based model with endogenous education, employer screening, and heterogeneous firms. The model yields two regimes whose boundary depends on AI's technology structure (proprietary vs. commodity) and labor market institutions (rent-sharing elasticity, asset concentration). A scenario analysis via Method of Simulated Moments, matching six empirical targets, disciplines the model's quantitative magnitudes; a sensitivity decomposition reveals that the five non-$Δ$Gini moments identify mechanism rates but not the aggregate sign, which at the calibrated parameters is pinned by $m_6$ and $ξ$, while AI's technology structure ($η_1$ vs. $η_0$) independently crosses the boundary. The contribution is the mechanism -- not a verdict on the sign. Occupation-level regressions using BLS OEWS data (2019--2023) illustrate why such data cannot test the model's task-level predictions. The predictions are testable with within-occupation, within-task panel data that do not yet exist at scale.
Updated: 2026-03-09 15:29:42
Categories: cs.LG, cs.AI
First-Order Geometry, Spectral Compression, and Structural Compatibility under Bounded Computation
Optimization under structural constraints is typically analyzed through projection or penalty methods, obscuring the geometric mechanism by which constraints shape admissible dynamics. We propose an operator-theoretic formulation in which computational or feasibility limitations are encoded by self-adjoint operators defining locally reachable subspaces. In this setting, the optimal first-order improvement direction emerges as a pseudoinverse-weighted gradient, revealing how constraints induce a distorted ascent geometry. We further demonstrate that effective dynamics concentrate along dominant spectral modes, yielding a principled notion of spectral compression, and establish a compatibility principle that characterizes the existence of common admissible directions across multiple objectives. The resulting framework unifies gradient projection, spectral truncation, and multi-objective feasibility within a single geometric structure.
Updated: 2026-03-09 15:29:41
Categories: math.OC, cs.AI
Pareto-Optimal Anytime Algorithms via Bayesian Racing
Selecting an optimization algorithm requires comparing candidates across problem instances, but the computational budget for deployment is often unknown at benchmarking time. Current methods either collapse anytime performance into a scalar, require manual interpretation of plots, or produce conclusions that change when algorithms are added or removed. Moreover, methods based on raw objective values require normalization, which needs bounds or optima that are often unavailable and breaks coherent aggregation across instances. We propose a framework that formulates anytime algorithm comparison as Pareto optimization over time: an algorithm is non-dominated if no competitor beats it at every timepoint. By using rankings rather than objective values, our approach requires no bounds, no normalization, and aggregates coherently across arbitrary instance distributions without requiring known optima. We introduce PolarBear (Pareto-optimal anytime algorithms via Bayesian racing), a procedure that identifies the anytime Pareto set through adaptive sampling with calibrated uncertainty. Bayesian inference over a temporal Plackett-Luce ranking model provides posterior beliefs about pairwise dominance, enabling early elimination of confidently dominated algorithms. The output Pareto set together with the posterior supports downstream algorithm selection under arbitrary time preferences and risk profiles without additional experiments.
Updated: 2026-03-09 15:28:39
Categories: cs.NE, cs.LG
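The dominance rule stated in the PolarBear entry above (an algorithm is non-dominated if no competitor beats it at every timepoint) can be written down directly. The sketch uses fixed mean ranks as point estimates, whereas the paper works with posterior beliefs from a Plackett-Luce model; the algorithm names and rank values are made up for illustration.

```python
from typing import Dict, List, Sequence


def anytime_pareto_set(mean_ranks: Dict[str, Sequence[float]]) -> List[str]:
    """Return algorithms that no competitor beats at every timepoint.

    `mean_ranks[name][t]` is the average rank of `name` at timepoint t
    (lower is better).
    """
    def beats_everywhere(a: Sequence[float], b: Sequence[float]) -> bool:
        return all(x < y for x, y in zip(a, b))

    return [
        name for name, ranks in mean_ranks.items()
        if not any(beats_everywhere(other, ranks)
                   for rival, other in mean_ranks.items() if rival != name)
    ]


ranks = {
    "fast_start": [1.2, 1.8, 2.6],
    "slow_burn": [2.9, 2.1, 1.1],
    "middling": [2.2, 2.4, 2.7],
    "never_best": [3.7, 3.7, 3.6],
}
pareto = anytime_pareto_set(ranks)
```

Here `fast_start` and `slow_burn` both survive because each is better at some timepoint, which is the situation a single scalar summary of anytime performance would hide.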
NN-OpInf: an operator inference approach using structure-preserving composable neural networks
We propose neural network operator inference (NN-OpInf): a structure-preserving, composable, and minimally restrictive operator inference framework for the non-intrusive reduced-order modeling of dynamical systems. The approach learns latent dynamics from snapshot data, enforcing local operator structure such as skew-symmetry, (semi-)positive definiteness, and gradient preservation, while also reflecting complex dynamics by supporting additive compositions of heterogeneous operators. We present practical training strategies and analyze computational costs relative to linear and quadratic polynomial OpInf (P-OpInf). Numerical experiments across several nonlinear and parametric problems demonstrate improved accuracy, stability, and robustness over P-OpInf and prior NN-ROM formulations, particularly when the dynamics are not well represented by polynomial models. These results suggest that NN-OpInf can serve as an effective drop-in replacement for P-OpInf when the dynamics to be modeled contain non-polynomial nonlinearities, offering potential gains in accuracy and out-of-distribution performance at the expense of higher training computational costs and a more difficult, non-convex learning problem.
Updated: 2026-03-09 15:25:16
Categories: cs.LG, math.DS
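The structural constraints named in the NN-OpInf entry above, such as skew-symmetry and positive semidefiniteness, are often enforced by construction rather than by penalty. A minimal sketch with constant matrices follows; in the paper these operators would be produced by neural networks, and the parametrizations $S = W - W^\top$ and $P = L L^\top$ are standard choices assumed here, not taken from the abstract.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def skew_symmetric(w: Matrix) -> Matrix:
    """S = W - W^T satisfies S^T = -S for any square W."""
    n = len(w)
    return [[w[i][j] - w[j][i] for j in range(n)] for i in range(n)]


def positive_semidefinite(factor: Matrix) -> Matrix:
    """P = L L^T satisfies x^T P x >= 0 for any L."""
    return matmul(factor, transpose(factor))


def quad_form(m: Matrix, x: List[float]) -> float:
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))


def rhs(w: Matrix, factor: Matrix, x: List[float]) -> List[float]:
    """Additive composition dx/dt = (S - P) x: conservative plus dissipative part."""
    s, p = skew_symmetric(w), positive_semidefinite(factor)
    n = len(x)
    return [sum((s[i][j] - p[i][j]) * x[j] for j in range(n)) for i in range(n)]


w = [[0.3, -1.2], [0.7, 0.5]]
factor = [[1.0, 0.0], [0.5, 0.2]]
x = [0.4, -1.5]
energy_rate = sum(xi * di for xi, di in zip(x, rhs(w, factor, x)))
```

Because the skew-symmetric part contributes nothing to $x^\top \dot{x}$ and the semidefinite part can only reduce it, the composed system cannot gain energy regardless of the values in `w` and `factor`.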
Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLM alignment.
Updated: 2026-03-09 15:20:53
Categories: cs.CV, cs.AI
Towards Modeling Cybersecurity Behavior of Humans in Organizations
We undertake a comprehensive and structured synthesis of the drivers of human behavior in cybersecurity, focusing specifically on people within organizations (i.e., especially employees in companies), and integrate key concepts such as awareness, security culture, and usability into a coherent theoretical framework. This model is then compared with several relevant behavioral models that fundamentally represent drivers of human behavior. Additionally, we discuss how this theoretical framework can help the domain of agentic AI security: We argue that as AI systems increasingly act as autonomous agents within organizations and based on natural language processing, they also exhibit vulnerabilities analogous to human behavioral risks. Consequently, we propose that this human-centric model offers a blueprint for developing additional security strategies against manipulation attacks targeting AI agents.
Updated: 2026-03-09 15:19:53
Categories: cs.CR
X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech-motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose X-AVDT, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) an audio-visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful cross-generator evaluation, we further introduce MMDF, a new multimodal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by 13.1%. Our findings highlight the importance of leveraging internal audio-visual consistency cues for robustness to future generators in deepfake detection.
Updated: 2026-03-09 15:18:42
Categories: cs.CV, cs.AI, cs.LG
Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification
Self-supervised masked modeling shows promise for encrypted traffic classification by masking and reconstructing raw bytes. Yet recent work reveals these methods fail to reduce reliance on labeled data despite costly pretraining: under frozen encoder evaluation, accuracy drops from greater than 0.9 to less than 0.47. We argue the root cause is inductive bias mismatch: flattening traffic into byte sequences destroys protocol-defined semantics. We identify three specific issues: 1) field unpredictability, random fields like ip.id are unlearnable yet treated as reconstruction targets; 2) embedding confusion, semantically distinct fields collapse into a unified embedding space; 3) metadata loss, capture-time metadata essential for temporal analysis is discarded. To address this, we propose a protocol-native paradigm that treats protocol-defined field semantics as architectural priors, reformulating the task to align with the data's intrinsic tabular modality rather than incrementally adapting sequence-based architectures. Instantiating this paradigm, we introduce FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs). It features predictability-guided filtering that focuses on learnable FSUs, FSU-specific embeddings to preserve field boundaries, and dual-axis attention to capture intra-packet and temporal patterns. FlowSem-MAE significantly outperforms state-of-the-art across datasets. With only half labeled data, it outperforms most existing methods trained on full data.
Updated: 2026-03-09 15:15:23
Categories: cs.NI, cs.AI, cs.CR, cs.LG
STRIDE: Structured Lagrangian and Stochastic Residual Dynamics via Flow Matching
Robotic systems operating in unstructured environments must cope with significant uncertainty arising from intermittent contacts, frictional variability, and unmodeled compliance. While recent model-free approaches have demonstrated impressive performance, many deployment settings still require predictive models that support planning, constraint handling, and online adaptation. Analytical rigid-body models provide strong physical structure but often fail to capture complex interaction effects, whereas purely data-driven models may violate physical consistency, exhibit data bias, and accumulate long-horizon drift. In this work, we propose STRIDE, a dynamics learning framework that explicitly separates conservative rigid-body mechanics from uncertain, effectively stochastic non-conservative interaction effects. The structured component is modeled using a Lagrangian Neural Network (LNN) to preserve energy-consistent inertial dynamics, while residual interaction forces are represented using Conditional Flow Matching (CFM) to capture multi-modal interaction phenomena. The two components are trained jointly end-to-end, enabling the model to retain physical structure while representing complex stochastic behavior. We evaluate STRIDE on systems of increasing complexity, including a pendulum, the Unitree Go1 quadruped, and the Unitree G1 humanoid. Results show a 20% reduction in long-horizon prediction error and a 30% reduction in contact force prediction error compared to deterministic residual baselines, supporting more reliable model-based control in uncertain robotic environments.
Updated: 2026-03-09 15:15:21
Categories: cs.RO, cs.LG
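One way to read the split described in the STRIDE abstract above is as forced Euler-Lagrange dynamics with a sampled residual force. The equations below are only a sketch: the symbols ($\mathcal{L}_\theta$ for the learned Lagrangian, $\tau$ for actuation, $f_{\mathrm{res}}$ for the residual force, $v_\phi$ for the flow-matching velocity field) are assumed notation, not taken from the paper.

```latex
% Structured part: Euler-Lagrange equations with a non-conservative forcing term
\frac{d}{dt}\frac{\partial \mathcal{L}_\theta}{\partial \dot{q}}
  - \frac{\partial \mathcal{L}_\theta}{\partial q}
  = \tau + f_{\mathrm{res}}

% Stochastic part: residual force sampled by integrating a learned velocity field
\frac{d f_s}{d s} = v_\phi\!\left(f_s, s \mid q, \dot{q}, \tau\right),
\qquad f_0 \sim \mathcal{N}(0, I),
\qquad f_{\mathrm{res}} = f_1
```

Under this reading, the left-hand side carries the energy-consistent inertial dynamics, while repeated sampling of $f_{\mathrm{res}}$ gives a distribution over contact outcomes rather than a single deterministic correction.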
GRADIEND: Feature Learning within Neural Networks Exemplified through Biases
AI systems frequently exhibit and amplify social biases, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information such as gender, race, and religion. We show that our method can not only identify which weights of a model need to be changed to modify a feature, but even demonstrate that this can be used to rewrite models to debias them while maintaining other capabilities. We demonstrate the effectiveness of our approach across various model architectures and highlight its potential for broader applications.
Updated: 2026-03-09 15:13:54
Categories: cs.LG, cs.AI, cs.CL
R2F: Repurposing Ray Frontiers for LLM-free Object Navigation
Zero-shot open-vocabulary object navigation has progressed rapidly with the emergence of large Vision-Language Models (VLMs) and Large Language Models (LLMs), now widely used as high-level decision-makers instead of end-to-end policies. Although effective, such systems often rely on iterative large-model queries at inference time, introducing latency and computational overhead that limit real-time deployment. To address this problem, we repurpose ray frontiers (R2F), a recently proposed frontier-based exploration paradigm, to develop an LLM-free framework for indoor open-vocabulary object navigation. While ray frontiers were originally used to bias exploration using semantic cues carried along rays, we reinterpret frontier regions as explicit, direction-conditioned semantic hypotheses that serve as navigation goals. Language-aligned features accumulated along out-of-range rays are stored sparsely at frontiers, where each region maintains multiple directional embeddings encoding plausible unseen content. In this way, navigation then reduces to embedding-based frontier scoring and goal tracking within a classical mapping and planning pipeline, eliminating iterative large-model reasoning. We further introduce R2F-VLN, a lightweight extension for free-form language instructions using syntactic parsing and relational verification without additional VLM or LLM components. Experiments in Habitat-sim and on a real robotic platform demonstrate competitive state-of-the-art zero-shot performance with real-time execution, achieving up to 6 times faster runtime than VLM-based alternatives.
Updated: 2026-03-09 15:10:10
Categories: cs.RO, cs.AI
Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs
Federated learning (FL) is a popular paradigm for collaborative training which avoids direct data exposure between clients. However, data privacy issues remain: FL-trained large language models are capable of memorizing and completing phrases and sentences contained in training data when given their prefixes. Thus, it is possible for adversarial and honest-but-curious clients to recover training data of other participants simply through targeted prompting. In this work, we demonstrate that a popular and simple fine-tuning strategy, low-rank adaptation (LoRA), reduces memorization during FL by a factor of up to 10 without significant performance cost. We study this effect by performing fine-tuning tasks in high-risk domains such as medicine, law, and finance. We observe a reduction in memorization for a wide variety of model families, from 1B to 70B parameters. We find that LoRA can reduce memorization in centralized learning as well, and we compare how the memorization patterns differ. Furthermore, we study the effect of hyperparameters and show that LoRA can be combined with other privacy-preserving techniques such as gradient clipping and Gaussian noise, secure aggregation, and Goldfish loss to further improve record-level privacy while maintaining performance.
Updated: 2026-03-09 15:09:30
Categories: cs.LG, cs.AI, cs.CL
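For readers unfamiliar with the low-rank adaptation mentioned in the federated learning entry above, the following is a minimal pure-Python sketch of the standard LoRA forward pass, y = W x + (alpha / r) B A x, with the base weight W frozen. The function names and the tiny matrices are illustrative assumptions; the sketch says nothing about the paper's federated setup or its memorization measurements.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Standard LoRA layer: frozen W (d_out x d_in) plus a scaled low-rank update.

    A is r x d_in and B is d_out x r; only A and B would be trained.
    """
    r = len(A)  # rank of the adapter = number of rows of A
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]


def trainable_params(d_out, d_in, r):
    """Number of adapter parameters, versus d_out * d_in for full fine-tuning."""
    return r * (d_in + d_out)


W = [[1, 0], [0, 1]]   # frozen base weight (hypothetical values)
A = [[1, 1]]           # rank-1 down-projection
B_zero = [[0], [0]]    # B starts at zero, so the adapter is initially a no-op
```

With a 4096 x 4096 layer and rank 8, `trainable_params` gives 65,536 adapter parameters against roughly 16.8 million for the full matrix, which is the sense in which each client updates far fewer weights.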
Integrating Lagrangian Neural Networks into the Dyna Framework for Reinforcement Learning
Model-based reinforcement learning (MBRL) is sample-efficient but depends on the accuracy of the learned dynamics, which are often modeled using black-box methods that do not adhere to physical laws. Those methods tend to produce inaccurate predictions when presented with data that differ from the original training set. In this work, we employ Lagrangian neural networks (LNNs), which enforce an underlying Lagrangian structure to train the model within a Dyna-based MBRL framework. Furthermore, we train the LNN using stochastic gradient-based and state-estimation-based optimizers to learn the network's weights. The state-estimation-based method converges faster than the stochastic gradient-based method during neural network training. Simulation results are provided to illustrate the effectiveness of the proposed LNN-based Dyna framework for MBRL.
Updated: 2026-03-09 15:06:10
Categories: eess.SY, cs.LG
MUSA-PINN: Multi-scale Weak-form Physics-Informed Neural Networks for Fluid Flow in Complex Geometries
While Physics-Informed Neural Networks (PINNs) offer a mesh-free approach to solving PDEs, standard point-wise residual minimization suffers from convergence pathologies in topologically complex domains like Triply Periodic Minimal Surfaces (TPMS). The locality bias of point-wise constraints fails to propagate global information through tortuous channels, causing unstable gradients and conservation violations. To address this, we propose the Multi-scale Weak-form PINN (MUSA-PINN), which reformulates PDE constraints as integral conservation laws over hierarchical spherical control volumes. We enforce continuity and momentum conservation via flux-balance residuals on control surfaces. Our method utilizes a three-scale subdomain strategy (comprising large volumes for long-range coupling, skeleton-aware meso-scale volumes aligned with transport pathways, and small volumes for local refinement) alongside a two-stage training schedule prioritizing continuity. Experiments on steady incompressible flow in TPMS geometries show MUSA-PINN outperforms state-of-the-art baselines, reducing relative errors by up to 93% and preserving mass conservation.
Updated: 2026-03-09 15:03:50
Categories: cs.LG
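The "integral conservation laws" in the MUSA-PINN entry above can be made concrete with the textbook control-volume form of the steady incompressible Navier-Stokes equations. This is a sketch under stated assumptions (no body forces, constant density $\rho$ and viscosity $\mu$, control volume $V$ with outward normal $\mathbf{n}$); the paper's exact residual definitions may differ.

```latex
% Continuity: net volume flux through the control surface vanishes
\oint_{\partial V} \mathbf{u}\cdot\mathbf{n}\,dS = 0

% Momentum: convective momentum flux balances pressure and viscous tractions
\oint_{\partial V} \rho\,\mathbf{u}\,(\mathbf{u}\cdot\mathbf{n})\,dS
  = \oint_{\partial V} \left( -p\,\mathbf{n}
    + \mu\left(\nabla\mathbf{u} + \nabla\mathbf{u}^{\top}\right)\mathbf{n} \right) dS
```

A flux-balance residual is then the mismatch between the two sides, evaluated by quadrature on a spherical surface. Because the constraint couples every point on that surface, a large sphere ties together regions that a point-wise residual would treat independently.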
Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck
Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods, which reduce cost via fine-tuning with heuristic length penalties, suppress both essential reasoning and redundant filler. We recast efficient reasoning as a lossy compression problem under the Information Bottleneck (IB) principle, and identify a key theoretical gap when applying naive IB to transformers: attention violates the Markov property between prompt, reasoning trace, and response. To resolve this issue, we model CoT generation under the Conditional Information Bottleneck (CIB) principle, where the reasoning trace Z acts as a computational bridge that contains only the information about the response Y that is not directly accessible from the prompt X. This yields a general Reinforcement Learning objective: maximize task reward while compressing completions under a prior over reasoning traces, subsuming common heuristics (e.g., length penalties) as special cases (e.g., uniform priors). In contrast to naive token-counting-based approaches, we introduce a semantic prior that measures token cost by surprisal under a language model prior. Empirically, our CIB objective prunes cognitive bloat while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop.
Updated: 2026-03-09 14:56:57
Categories: cs.LG
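The claim in the entry above that a length penalty is the special case of a uniform prior can be checked with a few lines of Python. This is a toy sketch: it uses a context-free unigram prior over a four-token vocabulary (both assumptions made here for brevity), whereas the abstract refers to a language model prior.

```python
import math


def surprisal_cost(tokens, prior):
    """Total cost of a trace: the sum of -log p(token) under the given prior."""
    return sum(-math.log(prior[tok]) for tok in tokens)


def uniform_prior(vocab):
    """Every token equally likely, so every token costs log(len(vocab))."""
    p = 1.0 / len(vocab)
    return {tok: p for tok in vocab}


vocab = ["so", "x", "=", "4"]
trace = ["so", "x", "=", "4", "so"]

# Under a uniform prior the cost is proportional to length: a plain length penalty.
uniform_cost = surprisal_cost(trace, uniform_prior(vocab))

# Under a skewed (hypothetical) prior, predictable tokens are cheap and rare ones costly.
skewed = {"so": 0.7, "x": 0.1, "=": 0.1, "4": 0.1}
cheap = surprisal_cost(["so", "so"], skewed)
costly = surprisal_cost(["x", "4"], skewed)
```

Here `uniform_cost` equals `len(trace) * log(len(vocab))`, so minimizing it only shortens the trace, while the skewed prior charges tokens by how unexpected they are.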
Data-Driven Priors for Uncertainty-Aware Deterioration Risk Prediction with Multimodal Data
Safe predictions are a crucial requirement for integrating predictive models into clinical decision support systems. One approach for ensuring trustworthiness is to enable models to express their uncertainty about individual predictions. However, current machine learning models frequently lack reliable uncertainty estimation, hindering real-world deployment. This is further observed in multimodal settings, where the goal is to enable effective information fusion. In this work, we propose $\texttt{MedCertAIn}$, a predictive uncertainty framework that leverages multimodal clinical data for in-hospital risk prediction to improve model performance and reliability. We design data-driven priors over neural network parameters using a hybrid strategy that considers cross-modal similarity in self-supervised latent representations and modality-specific data corruptions. We train and evaluate the models with such priors using clinical time-series and chest X-ray images from the publicly-available datasets MIMIC-IV and MIMIC-CXR. Our results show that $\texttt{MedCertAIn}$ significantly improves predictive performance and uncertainty quantification compared to state-of-the-art deterministic baselines and alternative Bayesian methods. These findings highlight the promise of data-driven priors in advancing robust, uncertainty-aware AI tools for high-stakes clinical applications.
Updated: 2026-03-09 14:54:38
Categories: cs.LG
AltNet: Addressing the Plasticity-Stability Dilemma in Reinforcement Learning
Artificial neural networks have shown remarkable success in supervised learning when trained on a single task using a fixed dataset. However, when neural networks are trained on a reinforcement learning task, their ability to continue learning from new experiences declines over time. This decline in learning ability is known as plasticity loss. To restore plasticity, prior work has explored periodically resetting the parameters of the learning network, a strategy that often improves performance. However, such resets come at the cost of a temporary drop in performance, which can be dangerous in real-world settings. To overcome this instability, we introduce AltNet, a reset-based approach that restores plasticity without performance degradation by leveraging a pair of twin networks. The use of twin networks anchors performance during resets through a mechanism that allows networks to periodically alternate roles: one network learns as it acts in the environment, while the other learns off-policy from the active network's interactions through a replay buffer. At fixed intervals, the active network is reset and the passive network, having learned from prior experience, becomes the new active network. AltNet restores plasticity, improving sample efficiency and achieving higher performance, while avoiding performance drops that pose risks in safety-critical settings. We demonstrate these advantages in several high-dimensional control tasks from the DeepMind Control Suite, where AltNet outperforms various relevant baseline methods, as well as state-of-the-art reset-based techniques.
Updated: 2026-03-09 14:52:08
Categories: cs.LG, cs.AI
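The role-swapping schedule described in the AltNet entry above can be sketched with a toy stand-in for the networks. Everything here is hypothetical scaffolding: `ToyNetwork` only counts the updates it has received, the "transitions" are random numbers, and the assumption that the freshly reset network becomes the passive learner is one reading of "alternate roles", not a detail confirmed by the abstract.

```python
import random


class ToyNetwork:
    """Stand-in for a learner; it only tracks how many updates it has seen."""

    def __init__(self):
        self.reset()

    def reset(self):
        # A freshly initialised network has received no updates.
        self.updates = 0

    def learn(self, batch):
        self.updates += len(batch)


def run_alternating(total_steps, reset_interval, seed=0):
    rng = random.Random(seed)
    active, passive = ToyNetwork(), ToyNetwork()
    replay_buffer = []
    swaps = 0
    for step in range(1, total_steps + 1):
        transition = rng.random()                   # placeholder for (s, a, r, s')
        replay_buffer.append(transition)
        active.learn([transition])                  # learns while acting
        passive.learn([rng.choice(replay_buffer)])  # learns off-policy from replay
        if step % reset_interval == 0:
            active.reset()                          # reset to restore plasticity
            active, passive = passive, active       # experienced network takes over
            swaps += 1
    return active, passive, swaps


active, passive, swaps = run_alternating(total_steps=10, reset_interval=5)
```

In this toy run the network left acting after each swap always has a non-zero update count, which is the property the abstract credits with avoiding the temporary performance drop of a plain reset.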
Adaptive Entropy-Driven Sensor Selection in a Camera-LiDAR Particle Filter for Single-Vessel Tracking
Robust single-vessel tracking from fixed coastal platforms is hindered by modality-specific degradations: cameras suffer from illumination and visual clutter, while LiDAR performance drops with range and intermittent returns. We present a heterogeneous multi-sensor fusion particle-filter tracker that incorporates an information-gain (entropy-reduction) adaptive sensing policy to select the most informative configuration at each fusion time bin. The approach is validated in a real maritime deployment at the CMMI Smart Marina Testbed (Ayia Napa Marina, Cyprus), using a shore-mounted 3D LiDAR and an elevated fixed camera to track a rigid inflatable boat with onboard GNSS ground truth. We compare LiDAR-only, camera-only, all-sensors, and adaptive configurations. Results show LiDAR dominates near-field accuracy, the camera sustains longer-range coverage when LiDAR becomes unavailable, and the adaptive policy achieves a favorable accuracy-continuity trade-off by switching modalities based on information gain. By avoiding continuous multi-stream processing, the adaptive configuration provides a practical baseline for resilient and resource-aware maritime surveillance.
Updated: 2026-03-09 14:52:08
Categories: cs.RO, cs.LG, eess.SP, eess.SY, physics.data-an
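The entropy-reduction selection rule in the vessel-tracking entry above can be illustrated on a four-particle filter. This sketch makes a simplifying assumption: the gain is computed after the fact from realised per-particle likelihoods, rather than as an expectation over possible measurements, and the likelihood values and sensor names are invented for the example.

```python
import math


def entropy(weights):
    """Shannon entropy (in nats) of a normalised particle weight vector."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def posterior(weights, likelihoods):
    """Bayes update of particle weights with per-particle likelihoods."""
    unnorm = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def select_sensor(weights, sensor_likelihoods):
    """Pick the sensor configuration with the largest entropy reduction."""
    h_prior = entropy(weights)
    gains = {
        name: h_prior - entropy(posterior(weights, lik))
        for name, lik in sensor_likelihoods.items()
    }
    best = max(gains, key=gains.get)
    return best, gains


prior_weights = [0.25, 0.25, 0.25, 0.25]
sensor_likelihoods = {
    "lidar": [0.9, 0.05, 0.03, 0.02],  # sharp: strongly favours one particle
    "camera": [0.3, 0.3, 0.2, 0.2],    # flat: barely discriminates
}
best, gains = select_sensor(prior_weights, sensor_likelihoods)
```

In this example the sharper likelihood wins. At long range, where LiDAR returns become uninformative, its likelihood would flatten and the same rule would switch to the camera.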
The Boiling Frog Threshold: Criticality and Blindness in World Model-Based Anomaly Detection Under Gradual Drift
When an RL agent's observations are gradually corrupted, at what drift rate does it "wake up" -- and what determines this boundary? We study world model-based self-monitoring under continuous observation drift across four MuJoCo environments, three detector families (z-score, variance, percentile), and three model capacities. We find that (1) a sharp detection threshold $\varepsilon^*$ exists universally: below it, drift is absorbed as normal variation; above it, detection occurs rapidly. The threshold's existence and sigmoid shape are invariant across all detector families and model capacities, though its position depends on the interaction between detector sensitivity, noise floor structure, and environment dynamics. (2) Sinusoidal drift is completely undetectable by all detector families -- including variance and percentile detectors with no temporal smoothing -- establishing this as a world model property rather than a detector artifact. (3) Within each environment, $\varepsilon^*$ follows a power law in detector parameters ($R^2 = 0.89$-$0.97$), but cross-environment prediction fails ($R^2 = 0.45$), revealing that the missing variable is environment-specific dynamics structure $\partial \mathrm{PE}/\partial\varepsilon$. (4) In fragile environments, agents collapse before any detector can fire ("collapse before awareness"), creating a fundamentally unmonitorable failure mode. Our results reframe $\varepsilon^*$ from an emergent world model property to a three-way interaction between noise floor, detector, and environment dynamics, providing a more defensible and empirically grounded account of self-monitoring boundaries in RL agents.
Updated: 2026-03-09 14:51:53
Categories: cs.AI, cs.LG
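To make the z-score detector family in the entry above concrete, here is a toy detector run on a synthetic prediction-error signal. It is an illustration only: the fixed calibration window, the deterministic ±0.1 wiggle, and the drift rates are all assumptions, and the sketch does not reproduce the paper's environments or its threshold $\varepsilon^*$.

```python
import statistics


def first_alarm(errors, calib_len, z_threshold):
    """Return the first index whose z-score exceeds the threshold, or None.

    The mean and standard deviation come from a fixed calibration window.
    """
    baseline = errors[:calib_len]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    for t in range(calib_len, len(errors)):
        if (errors[t] - mu) / sigma > z_threshold:
            return t
    return None


def drifting_errors(n_steps, calib_len, drift_rate):
    """Toy prediction errors: a +/-0.1 wiggle around 1.0 plus linear drift."""
    out = []
    for t in range(n_steps):
        wiggle = 0.1 if t % 2 == 0 else -0.1
        drift = drift_rate * max(0, t - calib_len)  # drift starts after calibration
        out.append(1.0 + wiggle + drift)
    return out


slow = first_alarm(drifting_errors(100, 20, 0.001), calib_len=20, z_threshold=3.0)
fast = first_alarm(drifting_errors(100, 20, 0.05), calib_len=20, z_threshold=3.0)
```

Within this 100-step episode the slow drift never leaves the calibrated noise band, while the fast drift trips the detector a few steps after it begins. With a fixed baseline and an unbounded horizon any linear drift would eventually fire, so the sketch only shows the finite-horizon contrast.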
CroSTAta: Cross-State Transition Attention Transformer for Robotic Manipulation
Learning robotic manipulation policies through supervised learning from demonstrations remains challenging when policies encounter execution variations not explicitly covered during training. While incorporating historical context through attention mechanisms can improve robustness, standard approaches process all past states in a sequence without explicitly modeling the temporal structure that demonstrations may include, such as failure and recovery patterns. We propose a Cross-State Transition Attention Transformer that employs a novel State Transition Attention (STA) mechanism to modulate standard attention weights based on learned state evolution patterns, enabling policies to better adapt their behavior based on execution history. Our approach combines this structured attention with temporal masking during training, where visual information is randomly removed from recent timesteps to encourage temporal reasoning from historical context. Evaluation in simulation shows that STA consistently outperforms the standard attention approach and temporal modeling methods like TCN and LSTM networks, achieving more than 2x improvement over cross-attention on precision-critical tasks. The source code and data can be accessed at https://github.com/iit-DLSLab/croSTAta
Updated: 2026-03-09 14:51:13
Categories: cs.RO, cs.AI, cs.LG
Rewards as Labels: Revisiting RLVR from a Classification Perspective
Reinforcement Learning with Verifiable Rewards (RLVR) has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO. On the 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%. These gains carry over to the 7B model, where REAL continues to outperform DAPO and GSPO by 6.2% and 1.7%, respectively. Notably, even with a vanilla binary cross-entropy, REAL remains stable and exceeds DAPO by 4.5% on average.
Updated: 2026-03-09 14:50:40
Categories: cs.LG, cs.CL
LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing
The quadratic complexity of the attention mechanism and the substantial memory footprint of the Key-Value (KV) cache present severe computational and memory challenges for Large Language Models (LLMs) processing long contexts. Existing retrieval-based methods often compromise semantic integrity through fixed-size chunking and suffer from inefficient linear scanning. In this paper, we propose LycheeCluster, a novel method for efficient KV cache management. LycheeCluster preserves local semantic coherence via boundary-aware chunking and constructs a recursive hierarchical index rooted in the triangle inequality. This design transforms cache retrieval from a linear scan into a theoretically bounded, logarithmic-time pruning process, while a lazy update strategy supports efficient streaming generation. Experiments demonstrate that LycheeCluster achieves up to a 3.6x end-to-end inference speedup with negligible degradation in model performance, outperforming state-of-the-art KV cache management methods (e.g., Quest, ClusterKV). We will release our code and kernels after publication.
Updated: 2026-03-09 14:50:35
Categories: cs.LG, cs.AI, cs.CL
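The triangle-inequality pruning mentioned in the LycheeCluster entry above can be shown on a flat, one-level index. This is a sketch under loud assumptions: it uses Euclidean nearest-neighbour search over 2-D points, whereas the paper works with KV-cache entries and a recursive hierarchy, and every name below is hypothetical.

```python
import math


def build_clusters(groups):
    """Store, for each group, its centroid and the radius covering all members."""
    clusters = []
    for members in groups:
        dim = len(members[0])
        centroid = tuple(sum(m[i] for m in members) / len(members) for i in range(dim))
        radius = max(math.dist(centroid, m) for m in members)
        clusters.append((centroid, radius, members))
    return clusters


def nearest_with_pruning(query, clusters):
    """Exact nearest neighbour that skips clusters the bound rules out."""
    best, best_d, scanned = None, math.inf, 0
    # Visit the closest centroids first so the bound tightens early.
    for centroid, radius, members in sorted(clusters, key=lambda c: math.dist(query, c[0])):
        # Triangle inequality: every member m satisfies
        # dist(query, m) >= dist(query, centroid) - radius.
        if math.dist(query, centroid) - radius >= best_d:
            continue  # no member of this cluster can beat the current best
        for m in members:
            scanned += 1
            d = math.dist(query, m)
            if d < best_d:
                best, best_d = m, d
    return best, scanned


groups = [
    [(0, 0), (1, 0), (0, 1)],
    [(10, 10), (11, 10), (10, 11)],
    [(-10, 5), (-11, 5), (-10, 6)],
]
query = (0.2, 0.1)
best, scanned = nearest_with_pruning(query, build_clusters(groups))
brute = min((m for g in groups for m in g), key=lambda m: math.dist(query, m))
```

Only the three members of the nearest cluster are scanned here, yet the result matches the brute-force scan over all nine points. Applying the same bound recursively down a tree of clusters is what would turn the linear scan into the pruned search the abstract describes.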
Mem-T: Densifying Rewards for Long-Horizon Memory Agents
Memory agents, which depart from predefined memory-processing pipelines by endogenously managing the processing, storage, and retrieval of memories, have garnered increasing attention for their autonomy and adaptability. However, existing training paradigms remain constrained: agents often traverse long-horizon sequences of memory operations before receiving sparse and delayed rewards, which hinders truly end-to-end optimization of memory management policies. To address this limitation, we introduce Mem-T, an autonomous memory agent that interfaces with a lightweight hierarchical memory database to perform dynamic updates and multi-turn retrieval over streaming inputs. To effectively train long-horizon memory management capabilities, we further propose MoT-GRPO, a tree-guided reinforcement learning framework that transforms sparse terminal feedback into dense, step-wise supervision via memory operation tree backpropagation and hindsight credit assignment, thereby enabling the joint optimization of memory construction and retrieval. Extensive experiments demonstrate that Mem-T is (1) high-performing, surpassing frameworks such as A-Mem and Mem0 by up to $14.92\%$, and (2) economical, operating on a favorable accuracy-efficiency Pareto frontier and reducing inference tokens per query by $\sim24.45\%$ relative to GAM without sacrificing performance.
Updated: 2026-03-09 14:47:04
Categories: cs.LG, cs.CL
Controllable Sequence Editing for Biological and Clinical Trajectories
Conditional generation models for longitudinal sequences can produce new or modified trajectories given a conditioning input. However, they often lack control over when the condition should take effect (timing) and which variables it should influence (scope). Most methods either operate only on univariate sequences or assume that the condition alters all variables and time steps. In scientific and clinical settings, interventions instead begin at a specific moment, such as the time of drug administration or surgery, and influence only a subset of measurements while the rest of the trajectory remains unchanged. CLEF learns temporal concepts that encode how and when a condition alters future sequence evolution. These concepts allow CLEF to apply targeted edits to the affected time steps and variables while preserving the rest of the sequence. We evaluate CLEF on 8 datasets spanning cellular reprogramming, patient health, and sales, comparing against 9 state-of-the-art baselines. CLEF improves immediate sequence editing accuracy by 16.28% (MAE) on average against their non-CLEF counterparts. Unlike prior models, CLEF enables one-step conditional generation at arbitrary future times, outperforming their non-CLEF counterparts in delayed sequence editing by 26.73% (MAE) on average. We test CLEF under counterfactual inference assumptions and show up to 62.84% (MAE) improvement on zero-shot conditional generation of counterfactual trajectories. In a case study of patients with type 1 diabetes mellitus, CLEF identifies clinical interventions that generate realistic counterfactual trajectories shifted toward healthier outcomes.
Updated: 2026-03-09 14:46:46
Categories: cs.LG, q-bio.GN, q-bio.PE
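As a rough illustration of the timing and scope controls described in the CLEF abstract above, the sketch below edits a multivariate trajectory only from a chosen step onward and only for chosen variables. The function name `apply_targeted_edit`, the fixed additive offsets, and the toy data are assumptions made for illustration; the abstract describes learned temporal concepts, not fixed offsets.

```python
from typing import Dict, List


def apply_targeted_edit(
    trajectory: List[List[float]],
    start_step: int,
    deltas: Dict[int, float],
) -> List[List[float]]:
    """Shift only the variables in `deltas`, and only from `start_step` onward.

    Hypothetical stand-in for a learned edit: each delta is a fixed additive
    offset, whereas the abstract describes learned temporal concepts.
    """
    edited = [row[:] for row in trajectory]  # copy so the input stays unchanged
    for t in range(start_step, len(edited)):
        for var_index, delta in deltas.items():
            edited[t][var_index] += delta
    return edited


# Toy trajectory: 4 time steps, 3 variables (values are made up).
trajectory = [[1.0, 5.0, 9.0] for _ in range(4)]
# The intervention takes effect at step 2 and touches variable 1 only.
edited = apply_targeted_edit(trajectory, start_step=2, deltas={1: -2.0})
```

Steps before `start_step` and variables outside `deltas` are left exactly as they were, which is the "preserve the rest of the sequence" property the abstract emphasizes.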
A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
Large language model (LLM)-based AI systems have shown promise for patient-facing diagnostic and management conversations in simulated settings. Translating these systems into clinical practice requires assessment in real-world workflows with rigorous safety oversight. We report a prospective, single-arm feasibility study of an LLM-based conversational AI, the Articulate Medical Intelligence Explorer (AMIE), conducting clinical history taking and presentation of potential diagnoses for patients to discuss with their provider at urgent care appointments at a leading academic medical center. 100 adult patients completed an AMIE text-chat interaction up to 5 days before their appointment. We sought to assess the conversational safety and quality, patient and clinician experience, and clinical reasoning capabilities compared to primary care providers (PCPs). Human safety supervisors monitored all patient-AMIE interactions in real time and did not need to intervene to stop any consultations based on pre-defined criteria. Patients reported high satisfaction and their attitudes towards AI improved after interacting with AMIE (p < 0.001). PCPs found AMIE's output useful with a positive impact on preparedness. AMIE's differential diagnosis (DDx) included the final diagnosis, per chart review 8 weeks post-encounter, in 90% of cases, with 75% top-3 accuracy. Blinded assessment of AMIE and PCP DDx and management (Mx) plans suggested similar overall DDx and Mx plan quality, without significant differences for DDx (p = 0.6) and appropriateness and safety of Mx (p = 0.1 and 1.0, respectively). PCPs outperformed AMIE in the practicality (p = 0.003) and cost effectiveness (p = 0.004) of Mx. While further research is needed, this study demonstrates the initial feasibility, safety, and user acceptance of conversational AI in a real-world setting, representing crucial steps towards clinical translation.
Updated: 2026-03-09 14:43:40
Categories: cs.HC, cs.AI, cs.CL, cs.LG
Efficient Policy Learning with Hybrid Evaluation-Based Genetic Programming for Uncertain Agile Earth Observation Satellite Scheduling
The Uncertain Agile Earth Observation Satellite Scheduling Problem (UAEOSSP) is a novel combinatorial optimization problem and a practical engineering challenge that aligns with the current demands of space technology development. It incorporates uncertainties in profit, resource consumption, and visibility, which may render pre-planned schedules suboptimal or even infeasible. Genetic Programming Hyper-Heuristic (GPHH) shows promise for evolving interpretable scheduling policies; however, their simulation-based evaluation incurs high computational costs. Moreover, the design of the constructive method, denoted as Online Scheduling Algorithm (OSA), directly affects fitness assessment, resulting in evaluation-dependent local optima within the policy space. To address these issues, this paper proposes a Hybrid Evaluation-based Genetic Programming (HE-GP) for effectively solving UAEOSSP. A Hybrid Evaluation (HE) mechanism is integrated into the policy-driven OSA, combining exact and approximate filtering modes: exact mode ensures evaluation accuracy through elaborately designed constraint verification modules, while approximate mode reduces computational overhead via simplified logic. HE-GP dynamically switches between evaluation models based on real-time evolutionary state information. Experiments on 16 simulated instance sets demonstrate that HE-GP significantly outperforms handcrafted heuristics and single-evaluation based GPHH, achieving substantial reductions in computational cost while maintaining excellent scheduling performance across diverse scenarios. Specifically, the average training time of HE-GP was reduced by 17.77% compared to GP employing exclusively exact evaluation, while the optimal policy generated by HE-GP achieved the highest average ranks across all scenarios.
Updated: 2026-03-09 14:43:36
Categories: cs.AI
EasyInsert: A Data-Efficient and Generalizable Insertion Policy
Robotic insertion is a highly challenging task that requires exceptional precision in cluttered environments. Existing methods often have poor generalization capabilities. They typically function in restricted and structured environments, and frequently fail when the plug and socket are far apart, when the scene is densely cluttered, or when handling novel objects. They also rely on strong assumptions such as access to CAD models or a digital twin in simulation. To address these limitations, we propose EasyInsert. Inspired by human intuition, it formulates insertion as a delta-pose regression problem, which unlocks an efficient, highly scalable data collection pipeline with minimal human labor to train an end-to-end visual policy. During execution, the visual policy predicts the relative pose between plug and socket to drive a multi-phase, coarse-to-fine insertion process. EasyInsert demonstrates strong zero-shot generalization capability for unseen objects in cluttered environments, robustly handling cases with significant initial pose deviations. In real-world experiments, by leveraging just 1 hour of human teleoperation data to bootstrap a large-scale automated data collection process, EasyInsert achieves an over 90% success rate in zero-shot insertion for 13 out of 15 unseen novel objects, including challenging objects like Type-C cables, HDMI cables, and Ethernet cables. Furthermore, requiring only a single manual reset, EasyInsert allows for fast adaptation to novel test objects through automated data collection and fine-tuning, achieving an over 90% success rate across all 15 objects.
Updated: 2026-03-09 14:38:12
Categories: cs.RO, cs.AI
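To make the delta-pose formulation in the EasyInsert abstract above more concrete, here is a minimal sketch of the quantity such a policy would regress: the socket pose expressed in the plug's own frame. It assumes a planar pose `(x, y, yaw)` for brevity; the real system presumably works with full 3D poses predicted from images, and the function name `delta_pose` is hypothetical.

```python
import math
from typing import Tuple

Pose = Tuple[float, float, float]  # (x, y, yaw in radians); planar simplification


def delta_pose(plug: Pose, socket: Pose) -> Pose:
    """Return the socket pose relative to the plug frame.

    Assumption: planar poses only. This is the regression target in a
    delta-pose formulation, not the learned visual policy itself.
    """
    px, py, p_yaw = plug
    sx, sy, s_yaw = socket
    dx, dy = sx - px, sy - py
    cos_y, sin_y = math.cos(p_yaw), math.sin(p_yaw)
    # Rotate the world-frame displacement into the plug frame (R^T * d).
    local_x = cos_y * dx + sin_y * dy
    local_y = -sin_y * dx + cos_y * dy
    # Wrap the yaw difference into (-pi, pi].
    d_yaw = math.atan2(math.sin(s_yaw - p_yaw), math.cos(s_yaw - p_yaw))
    return local_x, local_y, d_yaw


# Plug at the origin facing +y; socket one unit further along +y.
ahead = delta_pose((0.0, 0.0, math.pi / 2), (0.0, 1.0, math.pi / 2))
# Yaw difference that crosses the +/- pi boundary.
wrapped = delta_pose((0.0, 0.0, 3.0), (0.0, 0.0, -3.0))
```

In a coarse-to-fine loop, a controller would move the plug by some fraction of this delta, re-observe, and repeat with tighter tolerances; that loop is omitted here because the abstract does not specify its phases.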
Parallel Decoder Transformer: Planner-Seeded Latent Coordination for Synchronized Parallel Decoding
Autoregressive language models can often identify parallel subproblems, but standard decoding exposes only a single left-to-right output interface. External orchestration methods can launch multiple prompts concurrently, yet they provide no model-internal state through which those generations can synchronize, resolve ownership, or wait for missing information. We present the Parallel Decoder Transformer (PDT), a frozen-trunk architecture that augments a decoder with a planner-seeded latent workspace and a synchronized multi-stream output protocol. Before any stream emits tokens, a mandatory prompt-time planner predicts fixed latent plan slots and projects them as snapshot 0 on an embeddings-only Dynamic Notes Bus. During decoding, each stream reads the visible notes window through Speculative Note Conditioning (SNC), emits provisional token blocks and latent summaries, and advances only when agreement logic determines that the current shared state is sufficient for continued parallel generation. Coverage heads track plan-item ownership, while rollback handles incoherent or premature commits. PDT therefore shifts parallel task decomposition from an external prompting strategy to a model-internal coordination mechanism over the output interface of a frozen language model.
Updated: 2026-03-09 14:35:35
Categories: cs.AI, cs.CL
Entropy-Driven Curriculum for Multi-Task Training in Human Mobility Prediction
The increasing availability of big mobility data from ubiquitous portable devices enables human mobility prediction through deep learning approaches. However, the diverse complexity of human mobility data impedes model training, leading to inefficient gradient updates and potential underfitting. Meanwhile, exclusively predicting next locations neglects implicit determinants, including distances and directions, thereby yielding suboptimal prediction results. This paper presents a unified training framework that integrates entropy-driven curriculum and multi-task learning to address these challenges. The proposed entropy-driven curriculum learning strategy quantifies trajectory predictability based on Lempel-Ziv compression and organizes training from simple to complex for faster convergence and enhanced performance. The multi-task training simultaneously optimizes the primary location prediction alongside auxiliary estimation of movement distance and direction to learn realistic mobility patterns, and improves prediction accuracy through complementary supervision signals. Extensive experiments conducted in accordance with the HuMob Challenge demonstrate that our approach achieves state-of-the-art performance on GEO-BLEU (0.354) and DTW (26.15) metrics, with up to 2.92-fold faster convergence than training without curriculum learning.
Updated: 2026-03-09 14:31:38
Categories: cs.LG, cs.AI
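The curriculum idea in the abstract above (score each trajectory's predictability with Lempel-Ziv compression, then train from simple to complex) can be sketched as follows. This is an assumption-laden illustration: it uses an LZ78-style phrase count normalized by sequence length as the complexity score, which may differ from the estimator used in the paper, and the helper names are hypothetical.

```python
from typing import Hashable, List, Sequence


def lz78_phrase_count(sequence: Sequence[Hashable]) -> int:
    """Count the phrases in an LZ78-style parse of `sequence`.

    Fewer phrases means more repetition, i.e. a more predictable trajectory.
    """
    seen = set()
    phrase = ()
    count = 0
    for symbol in sequence:
        phrase += (symbol,)
        if phrase not in seen:  # a new phrase ends here
            seen.add(phrase)
            count += 1
            phrase = ()
    return count + (1 if phrase else 0)  # count a trailing partial phrase


def curriculum_order(trajectories: List[Sequence[Hashable]]) -> List[Sequence[Hashable]]:
    """Sort non-empty trajectories from most to least predictable (assumed score)."""
    return sorted(trajectories, key=lambda t: lz78_phrase_count(t) / len(t))


# Toy location-ID sequences written as strings (one character per visit).
ordered = curriculum_order(["ABCDEFGH", "ABABABAB", "AAAAAAAA"])
```

A training loop would then feed batches in this order, or gradually widen the pool of admitted trajectories as training progresses.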
Bridging Domains through Subspace-Aware Model Merging
Model merging integrates multiple task-specific models into a single consolidated one. Recent research has made progress in improving merging performance for in-distribution or multi-task scenarios, but domain generalization in model merging remains underexplored. We investigate how merging models fine-tuned on distinct domains affects generalization to unseen domains. Through an analysis of parameter competition in the task matrix using singular value decomposition, we show that merging models trained under different distribution shifts induces stronger conflicts between their subspaces compared to traditional multi-task settings. To mitigate this issue, we propose SCORE (Subspace COnflict-Resolving mErging), a method designed to alleviate such singular subspace conflicts. SCORE finds a shared orthogonal basis by computing the principal components of the concatenated leading singular vectors of all models. It then projects each task matrix into the shared basis, pruning off-diagonal components to remove conflicting singular directions. SCORE consistently outperforms, on average, existing model merging approaches in domain generalization settings across a variety of architectures and model scales, demonstrating its effectiveness and scalability.
Updated: 2026-03-09 14:31:33
Categories: cs.LG, cs.AI, cs.CV
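The steps named in the SCORE abstract above (shared orthogonal basis from concatenated leading singular vectors, projection of each task matrix, pruning of off-diagonal components) might look roughly like the NumPy sketch below. Several details are assumptions not stated in the abstract: the principal components are taken without centering, separate left and right bases are built, the pruned diagonals are summed across tasks, and the names `shared_basis` and `merge_task_matrices` are hypothetical.

```python
import numpy as np


def shared_basis(vectors_per_model, rank):
    """Orthonormal basis from the concatenated leading singular vectors."""
    stacked = np.concatenate(vectors_per_model, axis=1)
    # Left singular vectors of the stack act as (uncentered) principal directions.
    basis, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return basis[:, :rank]


def merge_task_matrices(task_matrices, k):
    """Merge task matrices in a shared basis, keeping only diagonal terms.

    Assumption: diagonals are summed across tasks; the paper may weight
    or average them differently.
    """
    lefts, rights = [], []
    for matrix in task_matrices:
        u, _, vt = np.linalg.svd(matrix, full_matrices=False)
        lefts.append(u[:, :k])    # leading left singular vectors
        rights.append(vt[:k].T)   # leading right singular vectors
    rank = min(k * len(task_matrices), *task_matrices[0].shape)
    u_shared = shared_basis(lefts, rank)
    v_shared = shared_basis(rights, rank)
    merged_diag = np.zeros(rank)
    for matrix in task_matrices:
        coords = u_shared.T @ matrix @ v_shared  # task matrix in the shared basis
        merged_diag += np.diag(coords)           # off-diagonal entries are pruned
    merged = u_shared @ np.diag(merged_diag) @ v_shared.T
    return merged, u_shared, v_shared


rng = np.random.default_rng(0)
tasks = [rng.standard_normal((6, 5)) for _ in range(3)]  # made-up task matrices
merged, u_shared, v_shared = merge_task_matrices(tasks, k=2)
```

In practice the merged matrix would be added back to the pretrained weights of the corresponding layer; that step is outside what the abstract describes and is left out.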
One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States
LLM agents that retrieve external knowledge typically generate a search query as text, then run a separate embedding model to encode it into a vector. This two-model pipeline adds infrastructure complexity and latency, yet is redundant: the LLM already encodes the full conversational context in its hidden states. We propose equipping LLM agents with native retrieval capability by adding a lightweight projection head that maps hidden states directly into the embedding space, eliminating the need for a separate embedding model. Trained with a combination of alignment, contrastive, and rank distillation losses, our method retains 97% of baseline retrieval quality while enabling the LLM agent to search with its own representations. Experiments on the QReCC conversational search benchmark show competitive Recall@10 and MRR@10 compared to the standard generate-then-encode pipeline, with systematic ablations confirming the contribution of each loss component.
Updated: 2026-03-09 14:25:35
Categories: cs.CL, cs.AI, cs.IR
Adaptive Multi-view Graph Contrastive Learning via Fractional-order Neural Diffusion Networks
Graph contrastive learning (GCL) learns node and graph representations by contrasting multiple views of the same graph. Existing methods typically rely on fixed, handcrafted views, usually one local and one global perspective, which limits their ability to capture multi-scale structural patterns. We present an augmentation-free, multi-view GCL framework grounded in fractional-order continuous dynamics. By varying the fractional derivative order $α\in (0,1]$, our encoders produce a continuous spectrum of views: small $α$ yields localized features, while large $α$ induces broader, global aggregation. We treat $α$ as a learnable parameter so the model can adapt diffusion scales to the data and automatically discover informative views. This principled approach generates diverse, complementary representations without manual augmentations. Extensive experiments on standard benchmarks demonstrate that our method produces more robust and expressive embeddings and outperforms state-of-the-art GCL baselines.
Updated: 2026-03-09 14:21:31
Categories: cs.LG
Grow, Assess, Compress: Adaptive Backbone Scaling for Memory-Efficient Class Incremental Learning
Class Incremental Learning (CIL) poses a fundamental challenge: maintaining a balance between the plasticity required to learn new tasks and the stability needed to prevent catastrophic forgetting. While expansion-based methods effectively mitigate forgetting by adding task-specific parameters, they suffer from uncontrolled architectural growth and memory overhead. In this paper, we propose a novel dynamic scaling framework that adaptively manages model capacity through a cyclic "GRow, Assess, ComprEss" (GRACE) strategy. Crucially, we supplement backbone expansion with a novel saturation assessment phase that evaluates the utilization of the model's capacity. This assessment allows the framework to make informed decisions to either expand the architecture or compress the backbones into a streamlined representation, preventing parameter explosion. Experimental results demonstrate that our approach achieves state-of-the-art performance across multiple CIL benchmarks, while reducing memory footprint by up to 73% compared to purely expansionist models.
Updated: 2026-03-09 14:21:18
Categories: cs.LG, cs.CV
IronEngine: Towards General AI Assistant
This paper presents IronEngine, a general AI assistant platform organized around a unified orchestration core that connects a desktop user interface, REST and WebSocket APIs, Python clients, local and cloud model backends, persistent memory, task scheduling, reusable skills, 24-category tool execution, MCP-compatible extensibility, and hardware-facing integration. IronEngine introduces a three-phase pipeline -- Discussion (Planner--Reviewer collaboration), Model Switch (VRAM-aware transition), and Execution (tool-augmented action loop) -- that separates planning quality from execution capability. The system features a hierarchical memory architecture with multi-level consolidation, a vectorized skill repository backed by ChromaDB, an adaptive model management layer supporting 92 model profiles with VRAM-aware context budgeting, and an intelligent tool routing system with 130+ alias normalization and automatic error correction. We present experimental results on file operation benchmarks achieving 100% task completion with a mean total time of 1541 seconds across four heterogeneous tasks, and provide detailed comparisons with representative AI assistant systems including ChatGPT, Claude Desktop, Cursor, Windsurf, and open-source agent frameworks. Without disclosing proprietary prompts or core algorithms, this paper analyzes the platform's architectural decomposition, subsystem design, experimental performance, safety boundaries, and comparative engineering advantages. The resulting study positions IronEngine as a system-oriented foundation for general-purpose personal assistants, automation frameworks, and future human-centered agent platforms.
Updated: 2026-03-09 14:18:50
Categories: cs.AI, cs.HC, cs.LG, cs.MA, eess.SY
SYNAPSE: Framework for Neuron Analysis and Perturbation in Sequence Encoding
In recent years, Artificial Intelligence has become a powerful partner for complex tasks such as data analysis, prediction, and problem-solving, yet its lack of transparency raises concerns about its reliability. In sensitive domains such as healthcare or cybersecurity, ensuring transparency, trustworthiness, and robustness is essential, since the consequences of wrong decisions or successful attacks can be severe. Prior neuron-level interpretability approaches are primarily descriptive, task-dependent, or require retraining, which limits their use as systematic, reusable tools for evaluating internal robustness across architectures and domains. To overcome these limitations, this work proposes SYNAPSE, a systematic, training-free framework for understanding and stress-testing the internal behavior of Transformer models across domains. It extracts per-layer [CLS] representations, trains a lightweight linear probe to obtain global and per-class neuron rankings, and applies forward-hook interventions during inference. This design enables controlled experiments on internal representations without altering the original model, thereby allowing weaknesses, stability patterns, and label-specific sensitivities to be measured and compared directly across tasks and architectures. Across all experiments, SYNAPSE reveals a consistent, domain-independent organization of internal representations, in which task-relevant information is encoded in broad, overlapping neuron subsets. This redundancy provides a strong degree of functional stability, while class-wise asymmetries expose heterogeneous specialization patterns and enable label-aware analysis. In contrast, small structured manipulations in weight or logit space are sufficient to redirect predictions, highlighting complementary vulnerability profiles and illustrating how SYNAPSE can guide the development of more robust Transformer models.
Updated: 2026-03-09 14:18:47
Categories: cs.LG, cs.AI
Client-Cooperative Split Learning
Model training is increasingly offered as a service for resource-constrained data owners to build customized models. Split Learning (SL) enables such services by offloading training computation under privacy constraints, and evolves toward serverless and multi-client settings where model segments are distributed across training clients. This cooperative mode assumes partial trust: data owners hide labels and data from trainer clients, while trainer clients produce verifiable training artifacts and ownership proofs. We present CliCooper, a multi-client cooperative SL framework tailored for cooperative model training services in heterogeneous and partially trusted environments, where one client contributes data, while others collectively act as SL trainers. CliCooper bridges the privacy and trust gaps through two new designs. First, differential privacy-based activation protection and secret label obfuscation safeguard data owners' privacy without degrading model performance. Second, a dynamic chained watermarking scheme cryptographically links training stages on model segments across trainers, ensuring verifiable training integrity, robust model provenance, and copyright protection. Experiments show that CliCooper preserves model accuracy while enhancing resilience to privacy and ownership attacks. It reduces the success rate of clustering attacks (which infer label groups from intermediate activation) to 0%, decreases inversion-reconstruction (which recovers training data) similarity from 0.50 to 0.03, and limits model-extraction-based surrogates to about 1% accuracy, comparable to random guessing.
Updated: 2026-03-09 14:17:26
Categories: cs.CR
Human-Aware Robot Behaviour in Self-Driving Labs
Self-driving laboratories (SDLs) are rapidly transforming research in chemistry and materials science to accelerate new discoveries. Mobile robot chemists (MRCs) play a pivotal role by autonomously navigating the lab to transport samples, effectively connecting synthesis, analysis, and characterisation equipment. The instruments within an SDL are typically designed or retrofitted to be accessed by both human and robotic chemists, ensuring operational flexibility and integration between manual and automated workflows. In many scenarios, human and robotic chemists may need to use the same equipment simultaneously. Currently, MRCs rely on simple LiDAR-based obstruction detection, which forces the robot to passively wait if a human is present. This lack of situational awareness leads to unnecessary delays and inefficient coordination in time-critical automated workflows in human-robot shared labs. To address this, we present an initial study of an embodied, AI-driven perception method that facilitates proactive human-robot interaction in shared-access scenarios. Our method features a hierarchical human intention prediction model that allows the robot to distinguish between preparatory actions (waiting) and transient interactions (accessing the instrument). Our results demonstrate that the proposed approach enhances efficiency by enabling proactive human-robot interaction, streamlining coordination, and potentially increasing the efficiency of autonomous scientific labs.
Updated: 2026-03-09 14:17:18
Categories: cs.RO, cs.AI, cs.HC
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes
Updated: 2026-03-09 14:16:16
Categories: cs.CV, cs.LG, cs.RO
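The two-stage positional injection described in the ViTaPEs abstract above can be pictured with a toy example: a local encoding is added inside each modality stream, then a global encoding is added to the joint token sequence right before attention. The sketch uses plain lists and made-up encoding values; the function name `inject_positions` is hypothetical, and the encoders, the token-wise nonlinearity, and the attention layers of the actual architecture are all omitted.

```python
from typing import List

Vector = List[float]


def add_vectors(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def inject_positions(
    vision_tokens: List[Vector],
    tactile_tokens: List[Vector],
    vision_pe: List[Vector],
    tactile_pe: List[Vector],
    global_pe: List[Vector],
) -> List[Vector]:
    """Apply local encodings per stream, then a global encoding on the joint sequence."""
    # Stage 1: modality-specific (local) positional encodings.
    vision = [add_vectors(t, p) for t, p in zip(vision_tokens, vision_pe)]
    tactile = [add_vectors(t, p) for t, p in zip(tactile_tokens, tactile_pe)]
    # Stage 2: shared (global) encoding over the concatenated tokens,
    # applied immediately before attention would run.
    joint = vision + tactile
    return [add_vectors(t, p) for t, p in zip(joint, global_pe)]


# Two vision tokens and one tactile token, each of width 2 (toy values).
joint_tokens = inject_positions(
    vision_tokens=[[0.0, 0.0], [0.0, 0.0]],
    tactile_tokens=[[0.0, 0.0]],
    vision_pe=[[1.0, 0.0], [2.0, 0.0]],
    tactile_pe=[[0.0, 1.0]],
    global_pe=[[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]],
)
```

The point of the second stage is that both modalities end up indexed by one shared positional vocabulary at the moment cross-modal interaction happens.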
Meta-RL with Shared Representations Enables Fast Adaptation in Energy Systems
Meta-Reinforcement Learning addresses the critical limitations of conventional Reinforcement Learning in multi-task and non-stationary environments by enabling fast policy adaptation and improved generalization. We introduce a novel Meta-RL framework that integrates a bi-level optimization scheme with a hybrid actor-critic architecture specially designed to enhance sample efficiency and inter-task adaptability. To improve knowledge transfer, we meta-learn a shared state feature extractor jointly optimized across actor and critic networks, providing efficient representation learning and limiting overfitting to individual tasks or dominant profiles. Additionally, we propose a parameter-sharing mechanism between the outer- and inner-loop actor networks, to reduce redundant learning and accelerate adaptation during task revisitation. The approach is validated on a real-world Building Energy Management Systems dataset covering nearly a decade of temporal and structural variability, for which we propose a task preparation method to promote generalization. Experiments demonstrate effective task adaptation and better performance compared to conventional RL and Meta-RL methods.
Updated: 2026-03-09 14:15:51
Categories: cs.LG
Noisy PDE Training Requires Bigger PINNs
Physics-Informed Neural Networks (PINNs) are increasingly used to approximate solutions of partial differential equations (PDEs), particularly in high dimensions. In real-world settings, data are often noisy, making it crucial to understand when a predictor can still achieve low empirical risk. Yet, little is known about the conditions under which a PINN can do so effectively. We analyse PINNs applied to the Hamilton-Jacobi-Bellman (HJB) PDE and establish a lower bound on the network size required for the supervised PINN empirical risk to fall below the variance of noisy supervision labels. Specifically, if a predictor achieves empirical risk $O(η)$ below $σ^2$ (the variance of the supervision data), then necessarily $d_N\log d_N\gtrsim N_s η^2$, where $N_s$ is the number of samples and $d_N$ the number of trainable parameters. A similar constraint holds in the fully unsupervised PINN setting when boundary labels are noisy. Thus, simply increasing the number of noisy supervision labels does not offer a "free lunch" in reducing empirical risk. We also give empirical studies on the HJB PDE, the Poisson PDE, and the Navier-Stokes PDE set up to produce the Taylor-Green solutions. In these experiments we demonstrate that PINNs indeed need to exceed a threshold model size in order to train to errors below $σ^2$. These results provide a quantitative foundation for understanding parameter requirements when training PINNs in the presence of noisy data.
Updated: 2026-03-09 14:12:45
Categories: cs.LG,cs.AI
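To get a feel for the PINN size bound $d_N\log d_N\gtrsim N_s η^2$ stated in the abstract above, the sketch below finds the smallest integer $d_N$ that satisfies $d_N\log d_N \ge c\,N_s η^2$. The constant `c` is hidden by the $\gtrsim$ in the abstract; setting it to 1 is an assumption made purely for illustration, and `min_params` is a hypothetical helper, not code from the paper.

```python
import math


def min_params(num_samples: int, eta: float, c: float = 1.0) -> int:
    """Smallest integer d >= 2 with d * log(d) >= c * num_samples * eta**2.

    The constant c stands in for the unspecified constant behind the
    asymptotic inequality; c = 1 is an illustrative assumption.
    """
    target = c * num_samples * eta ** 2
    lo, hi = 2, 2
    # Grow the upper bracket until it satisfies the inequality.
    while hi * math.log(hi) < target:
        hi *= 2
    # d * log(d) is increasing for d >= 2, so binary search applies.
    while lo < hi:
        mid = (lo + hi) // 2
        if mid * math.log(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    for n in (10**4, 10**6, 10**8):
        print(n, min_params(n, eta=0.1))
```

Under this reading, the required parameter count grows almost linearly with the number of noisy samples for a fixed risk gap η, which is the sense in which more noisy labels are not free.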
Geometrically Constrained Outlier Synthesis
Deep neural networks for image classification often exhibit overconfidence on out-of-distribution (OOD) samples. To address this, we introduce Geometrically Constrained Outlier Synthesis (GCOS), a training-time regularization framework aimed at improving OOD robustness during inference. GCOS addresses a limitation of prior synthesis methods by generating virtual outliers in the hidden feature space that respect the learned manifold structure of in-distribution (ID) data. The synthesis proceeds in two stages: (i) a dominant-variance subspace extracted from the training features identifies geometrically informed, off-manifold directions; (ii) a conformally-inspired shell, defined by the empirical quantiles of a nonconformity score from a calibration set, adaptively controls the synthesis magnitude to produce boundary samples. The shell ensures that generated outliers are neither trivially detectable nor indistinguishable from in-distribution data, facilitating smoother learning of robust features. This is combined with a contrastive regularization objective that promotes separability of ID and OOD samples in a chosen score space, such as Mahalanobis or energy-based. Experiments demonstrate that GCOS outperforms state-of-the-art methods using standard energy-based inference on near-OOD benchmarks, defined as tasks where outliers share the same semantic domain as in-distribution data. As an exploratory extension, the framework naturally transitions to conformal OOD inference, which translates uncertainty scores into statistically valid p-values and enables thresholds with formal error guarantees, providing a pathway toward more predictable and reliable OOD detection.
Updated: 2026-03-09 14:11:47
Categories: cs.LG,cs.AI
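The two-stage synthesis described in the GCOS abstract above can be illustrated with a toy 2D example. This is only a sketch under strong assumptions, not the authors' algorithm: features are 2D points, the "off-manifold" direction is taken to be the direction orthogonal to the principal axis, the nonconformity score is the absolute offset along that direction, and the shell is the band between the 95th and 99th percentiles of calibration scores. All function names and percentile choices are hypothetical.

```python
import math
import random
import statistics


def principal_axis(points):
    """Return the mean and the unit dominant-variance direction of 2D points."""
    mx = statistics.fmean(p[0] for p in points)
    my = statistics.fmean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # major-axis angle
    return (mx, my), (math.cos(theta), math.sin(theta))


def synthesize(points, calib, n_out, lo_pct=95, hi_pct=99, seed=0):
    """Push in-distribution points off the principal axis into a quantile shell."""
    rng = random.Random(seed)
    mean, axis = principal_axis(points)
    normal = (-axis[1], axis[0])  # assumed off-manifold direction

    def score(p):
        # Nonconformity score: absolute offset along the normal direction.
        return abs((p[0] - mean[0]) * normal[0] + (p[1] - mean[1]) * normal[1])

    cuts = statistics.quantiles([score(p) for p in calib], n=100)
    r_lo, r_hi = cuts[lo_pct - 1], cuts[hi_pct - 1]
    outliers = []
    for _ in range(n_out):
        p = rng.choice(points)
        # Keep the on-manifold coordinate, replace the off-manifold one.
        t = (p[0] - mean[0]) * axis[0] + (p[1] - mean[1]) * axis[1]
        r = rng.uniform(r_lo, r_hi) * rng.choice((-1, 1))
        outliers.append((mean[0] + t * axis[0] + r * normal[0],
                         mean[1] + t * axis[1] + r * normal[1]))
    return outliers, score, (r_lo, r_hi)
```

By construction every synthetic point has a score inside the shell, so it sits near the boundary of the in-distribution data: neither trivially far away nor buried inside it.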
Aligning to Illusions: Choice Blindness in Human and AI Feedback
Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
Updated: 2026-03-09 14:10:36
Categories: cs.CL,cs.AI
Topological Spatial Graph Coarsening
Spatial graphs are graphs whose nodes are localized in space (e.g., public transport networks, molecules, branching biological structures). In this work, we consider the problem of spatial graph reduction, which aims to find a smaller spatial graph (i.e., with fewer nodes) with the same overall structure as the initial one. In this context, performing the graph reduction while preserving the main topological features of the initial graph is particularly relevant, due to the additional spatial information. Thus, we propose a topological spatial graph coarsening approach based on a new framework that finds a trade-off between graph reduction and the preservation of topological characteristics. The coarsening is realized by collapsing short edges. To capture the topological information required to calibrate the reduction level, we adapt the construction of classical topological descriptors made for point clouds (the so-called persistence diagrams) to spatial graphs. This construction relies on the introduction of a new filtration called the triangle-aware graph filtration. Our coarsening approach is parameter-free, and we prove that it is equivariant under rotations, translations, and scaling of the initial spatial graph. We evaluate the performance of our method on synthetic and real spatial graphs, and show that it significantly reduces graph sizes while preserving the relevant topological information.
Updated: 2026-03-09 14:09:32
Categories: stat.ML,cs.CG,cs.LG
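The basic coarsening move named in the abstract above, collapsing short edges, can be sketched as follows. This is a minimal illustration, not the paper's method: the paper calibrates the reduction level from topological descriptors and is parameter-free, whereas the fixed `threshold` here is a hypothetical stand-in, and merging the two endpoints at their midpoint is an assumption.

```python
import math


def collapse_short_edges(pos, edges, threshold):
    """Contract every edge shorter than `threshold`, shortest first.

    pos: dict mapping node -> coordinate tuple; edges: iterable of node pairs.
    The merged node keeps the smaller label and moves to the midpoint.
    Returns new (pos, edges) without modifying the inputs.
    """
    pos = dict(pos)
    edges = {frozenset(e) for e in edges if e[0] != e[1]}

    def length(e):
        u, v = tuple(e)
        return math.dist(pos[u], pos[v])

    while True:
        short = [e for e in edges if length(e) < threshold]
        if not short:
            return pos, edges
        # Deterministic choice: shortest edge, ties broken by node labels.
        u, v = sorted(min(short, key=lambda e: (length(e), sorted(e))))
        pos[u] = tuple((a + b) / 2 for a, b in zip(pos[u], pos[v]))
        del pos[v]
        # Redirect edges from v to u and drop the resulting self-loops.
        edges = {g for g in (frozenset(u if w == v else w for w in f)
                             for f in edges) if len(g) == 2}


if __name__ == "__main__":
    pos = {0: (0.0, 0.0), 1: (0.1, 0.0), 2: (1.0, 0.0), 3: (2.0, 0.0)}
    print(collapse_short_edges(pos, [(0, 1), (1, 2), (2, 3)], 0.5))
```

Midpoint merging commutes with rotations and translations, but a fixed absolute threshold does not commute with scaling; that gap is one reason a data-driven, scale-aware calibration such as the one the abstract describes is needed for full equivariance.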
Latent Equivariant Operators for Robust Object Recognition: Promise and Challenges
Despite the successes of deep learning in computer vision, difficulties persist in recognizing objects that have undergone group-symmetric transformations rarely seen during training, for example objects seen in unusual poses, scales, positions, or combinations thereof. Equivariant neural networks are a solution to the problem of generalizing across symmetric transformations, but require knowledge of the transformations a priori. An alternative family of architectures proposes to learn equivariant operators in a latent space from examples of symmetric transformations. Here, using simple datasets of rotated and translated noisy MNIST, we illustrate how such architectures can successfully be harnessed for out-of-distribution classification, thus overcoming the limitations of both traditional and equivariant networks. While these architectures are conceptually enticing, we discuss the challenges that lie ahead in scaling them to more complex datasets. Our code is available at https://github.com/BRAIN-Aalto/equivariant_operator.
Updated: 2026-03-09 14:09:29
Categories: cs.CV,cs.LG
RoboLayout: Differentiable 3D Scene Generation for Embodied Agents
Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.
Updated: 2026-03-09 14:05:21
Categories: cs.AI,cs.CV,cs.LG,cs.RO
The Illusion of Collusion
Algorithmic agents are used in a variety of competitive decision-making settings, including pricing contexts that range from online retail to residential home rental. We study the emergence of algorithmic collusion when competing agents employ multi-armed bandit algorithms and competition is modeled as a repeated Prisoner's Dilemma game. Notably, agents in our setting perform online learning with no prior model of game structure and have no direct knowledge of competitor states or actions, so they cannot learn strategies that depend on these factors. These context-free bandits nonetheless frequently learn seemingly collusive behavior, a phenomenon we term naive collusion. Our results reveal that whether naive collusion emerges depends starkly on the choice of behavior policy employed by bandit learners. The mechanism underpinning the emergence of collusive outcomes is synchronicity in agent action plays, where synchronicity captures how often agents play the same action. We show that in the long run, naive algorithmic collusion never emerges when both agents use a broad class of persistently random algorithms, including the epsilon-greedy algorithm without epsilon decay; sometimes emerges when both agents use greedy-in-the-limit algorithms, which feature randomness during exploration but are asymptotically deterministic; and always emerges when both agents use deterministic bandit learning algorithms like those in the well-known upper confidence bound (UCB) family. We highlight market and algorithmic conditions under which one can and cannot predict a priori whether collusion will occur. Our findings have several policy implications: preventing pricing algorithms from conditioning their actions on competitor prices may not preclude algorithmic collusion, symmetry in algorithms may increase collusion potential, and the emergence of algorithmic collusion is path dependent.
Updated: 2026-03-09 14:00:18
Categories: econ.GN,cs.AI,cs.GT,cs.MA
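The setting in the abstract above, two context-free bandit learners in a repeated Prisoner's Dilemma, can be simulated in a few lines. The sketch below uses epsilon-greedy agents with a constant epsilon and reports synchronicity (the fraction of rounds in which both agents play the same action) alongside the cooperation rate. The payoff values (5, 3, 1, 0), the sample-mean value estimates, and the names `EpsGreedy` and `simulate` are illustrative assumptions, not details taken from the paper.

```python
import random

# (row action, column action) -> (row payoff, column payoff); assumed values.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


class EpsGreedy:
    """Context-free bandit: sees only its own actions and rewards."""

    def __init__(self, eps, rng):
        self.eps, self.rng = eps, rng
        self.count = {"C": 0, "D": 0}
        self.mean = {"C": 0.0, "D": 0.0}

    def act(self):
        if self.rng.random() < self.eps:
            return self.rng.choice("CD")  # explore uniformly
        return max("CD", key=lambda a: self.mean[a])  # exploit

    def update(self, action, reward):
        self.count[action] += 1
        self.mean[action] += (reward - self.mean[action]) / self.count[action]


def simulate(rounds=10000, eps=0.1, seed=0):
    """Return (synchronicity, cooperation rate) over the whole run."""
    rng = random.Random(seed)
    a1, a2 = EpsGreedy(eps, rng), EpsGreedy(eps, rng)
    same = coop = 0
    for _ in range(rounds):
        x, y = a1.act(), a2.act()
        r1, r2 = PAYOFF[(x, y)]
        a1.update(x, r1)
        a2.update(y, r2)
        same += (x == y)
        coop += (x == "C") + (y == "C")
    return same / rounds, coop / (2 * rounds)


if __name__ == "__main__":
    print(simulate())
```

Swapping `EpsGreedy` for a decaying-epsilon or UCB learner is the natural way to probe the three regimes the abstract distinguishes; the sketch makes no claim about which outcome a given run produces.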
Trust Nothing: RTOS Security without Run-Time Software TCB (Extended Version)
Embedded devices face an ever-expanding threat landscape: vulnerabilities in application software, operating system kernels, and peripherals threaten the embedded device integrity. Existing computer-architectural defenses fully consider at most two of these threat vectors in their security model. This paper aims at addressing this gap using a novel capability architecture. To this end, we combine a token capability approach suitable for building an untrusted operating system with protection against malicious devices without requiring hardware changes to peripherals. First, we develop and evaluate a full FPGA implementation of our capability architecture around legacy hardware components. Further, we present a soft real-time operating system based on Zephyr that has no run-time software TCB. To this end, we disaggregate Zephyr's subsystems into small, mutually isolated components. All subsystems that exist at run time, including scheduler, allocator and DMA drivers, and all peripherals are fully untrusted. We believe that our work offers a foundation for more rigorous security-by-design in tomorrow's security-critical embedded devices.
Updated: 2026-03-09 13:59:27
Categories: cs.CR,cs.AR,cs.OS
A Recipe for Stable Offline Multi-agent Reinforcement Learning
Despite remarkable achievements in single-agent offline reinforcement learning (RL), multi-agent RL (MARL) has struggled to adopt this paradigm, largely persisting with on-policy training and self-play from scratch. One reason for this gap is the instability of non-linear value decomposition, which has led prior works to avoid complex mixing networks in favor of linear value decomposition (e.g., VDN) combined with the value regularization used in single-agent setups. In this work, we analyze the source of instability in non-linear value decomposition within the offline MARL setting. Our observations confirm that such decompositions induce value-scale amplification and unstable optimization. To alleviate this, we propose a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training without altering the Bellman fixed point. Empirically, we examine the interaction among key components of offline MARL (e.g., value decomposition, value learning, and policy extraction) and derive a practical recipe that unlocks its full potential.
Updated: 2026-03-09 13:57:08
Categories: cs.LG,cs.AI,cs.RO
Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective
In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity, akin to chameleons adapting their coloration to environmental cues, that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Based on this insight, we propose Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable and learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation and keeps enhancing exploitation, enabling the emergence of appropriate behaviors. Extensive experiments show that ToCoRL enables precise behavioral control without capability degradation. Notably, we show that large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, a capability previously hindered by their step-by-step reasoning patterns.
Updated: 2026-03-09 13:56:53
Categories: cs.CL,cs.AI,cs.LG
Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients
As AI becomes more personal, e.g., Agentic AI, there is an increasing need to personalize models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we go beyond these restrictive assumptions by addressing both data and model heterogeneity. We propose a task-relevance-aware model aggregation strategy to reduce parameter interference under heterogeneous data. Moreover, we introduce Co-LoRA, a dimension-invariant module that enables knowledge sharing across heterogeneous architectures. To mimic real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. Extensive experiments show that our proposed method significantly outperforms state-of-the-art PFL methods under heterogeneous scenarios.
Updated: 2026-03-09 13:50:42
Categories: cs.LG,cs.AI,cs.DC
Radial Müntz-Szász Networks: Neural Architectures with Learnable Power Bases for Multidimensional Singularities
Radial singular fields, such as $1/r$, $\log r$, and crack-tip profiles, are difficult to model with current coordinate-separable neural architectures. We formally establish this result: any $C^2$ function that is both radial and additively separable must be quadratic, establishing a fundamental obstruction for coordinate-wise power-law models. Motivated by this result, we introduce Radial Müntz-Szász Networks (RMN), which represent fields as linear combinations of learnable radial powers $r^μ$, including negative exponents, together with a limit-stable log-primitive for exact $\log r$ behavior. RMN admits closed-form spatial gradients and Laplacians, enabling physics-informed learning on punctured domains. Across ten 2D and 3D benchmarks, RMN achieves between 1.5 and 51 times lower RMSE than MLPs and between 10 and 100 times lower RMSE than SIREN, while using only 27 parameters, compared with 33,537 for MLPs and 8,577 for SIREN. We extend RMN to incorporate angular dependence (RMN-Angular) and to handle multiple sources with learnable centers (RMN-MC), whose source-center recovery errors fall below $10^{-4}$. We also report controlled failures on smooth, strongly non-radial targets to delineate RMN's operating regime.
Updated: 2026-03-09 13:46:57
Categories: cs.LG,math.NA
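The obstruction stated in the abstract above (a $C^2$ function that is both radial and additively separable must be quadratic) can be checked by a short calculation. The following is a sketch in two dimensions under the assumption that the argument is carried out on $r>0$; the paper's own proof may proceed differently.

```latex
% Assume f(x,y) = g(r) = a(x) + b(y), with r = \sqrt{x^2 + y^2} and g \in C^2.
% Additive separability kills the mixed partial derivative:
\frac{\partial^2 f}{\partial x\,\partial y}
  = \frac{\partial^2}{\partial x\,\partial y}\bigl(a(x) + b(y)\bigr) = 0 .
% Radial symmetry gives, by the chain rule,
\frac{\partial f}{\partial x} = g'(r)\,\frac{x}{r},
\qquad
\frac{\partial^2 f}{\partial y\,\partial x}
  = x\,\frac{y}{r}\,\frac{d}{dr}\!\left(\frac{g'(r)}{r}\right)
  = \frac{xy}{r^2}\left(g''(r) - \frac{g'(r)}{r}\right).
% Equating the two for xy \neq 0:
g''(r) = \frac{g'(r)}{r}
\;\Longrightarrow\;
\frac{d}{dr}\!\left(\frac{g'(r)}{r}\right) = 0
\;\Longrightarrow\;
g'(r) = c\,r
\;\Longrightarrow\;
g(r) = \tfrac{c}{2}\,r^2 + d .
```

So $f(x,y)=\tfrac{c}{2}(x^2+y^2)+d$: no sum of one-dimensional functions can reproduce $1/r$ or $\log r$, which is the motivation for using radial powers $r^μ$ directly.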
A Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation
We propose a Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation (HECG), which incorporates three core innovations: (1) Multi-Dimensional Transferable Strategy (MDTS): by integrating task quality metrics (Q), confidence/cost metrics (C), reward metrics (R), and LLM-based semantic reasoning scores (LLM-Score), MDTS achieves multi-dimensional alignment between quantitative performance and semantic context, enabling more precise selection of high-quality candidate strategies and effectively reducing the risk of negative transfer. (2) Error Matrix Classification (EMC): unlike simple confusion matrices or overall performance metrics, EMC provides structured attribution of task failures by categorizing errors into ten types, such as Strategy Errors (Strategy Whe) and Script Parsing Errors (Script-Parsing-Error), and decomposing them according to severity, typical actions, error descriptions, and recoverability. This allows precise analysis of the root causes of task failures, offering clear guidance for subsequent error correction and strategy optimization rather than relying solely on overall success rates or single performance metrics. (3) Causal-Context Graph Retrieval (CCGR): to enhance agent retrieval capabilities in dynamic task environments, we construct graphs from historical states, actions, and event sequences, where nodes store executed actions, next-step actions, execution states, transferable strategies, and other relevant information, and edges represent causal dependencies such as preconditions for transitions between nodes. CCGR identifies subgraphs most relevant to the current task context, effectively capturing structural relationships beyond vector similarity, allowing agents to fully leverage contextual information, accelerate strategy adaptation, and improve execution reliability in complex, multi-step tasks.
Updated: 2026-03-09 13:46:00
Categories: cs.AI
InFusionLayer: a CFA-based ensemble tool to generate new classifiers for learning and modeling
Ensemble learning is a well-established body of machine learning methods that enhance predictive performance by combining multiple algorithms or models. Combinatorial Fusion Analysis (CFA) has provided methods and practice for combining multiple scoring systems, using the rank-score characteristic (RSC) function and cognitive diversity (CD), including ensemble methods and model fusion. However, there is no general-purpose Python tool available that incorporates these techniques. In this paper we introduce InFusionLayer, a machine learning architecture inspired by CFA at the system fusion level that uses a moderate set of base models to optimize unsupervised and supervised multiclass classification problems. We demonstrate InFusionLayer's ease of use for PyTorch, TensorFlow, and Scikit-learn workflows by validating its performance on various computer vision datasets. Our results highlight the practical advantages of incorporating distinctive features of the RSC function and CD, paving the way for more sophisticated ensemble learning applications in machine learning. We have open-sourced our code on GitHub to encourage continued development and community use of CFA: https://github.com/ewroginek/Infusion
Updated: 2026-03-09 13:43:48
Categories: cs.LG,cs.AI
Beyond the Markovian Assumption: Robust Optimization via Fractional Weyl Integrals in Imbalanced Data
Standard Gradient Descent and its modern variants assume local, Markovian weight updates, making them highly susceptible to noise and overfitting. This limitation becomes critically severe in extremely imbalanced datasets such as financial fraud detection where dominant class gradients systematically overwrite the subtle signals of the minority class. In this paper, we introduce a novel optimization algorithm grounded in Fractional Calculus. By isolating the core memory engine of the generalized fractional derivative, the Weighted Fractional Weyl Integral, we replace the instantaneous gradient with a dynamically weighted historical sequence. This fractional memory operator acts as a natural regularizer. Empirical evaluations demonstrate that our method prevents overfitting in medical diagnostics and achieves an approximately 40 percent improvement in PR-AUC over classical optimizers in financial fraud detection, establishing a robust bridge between pure fractional topology and applied Machine Learning.
Updated: 2026-03-09 13:38:45
Categories: cs.LG,stat.ML
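The core idea in the abstract above, replacing the instantaneous gradient with a weighted sequence of past gradients, can be sketched on a one-dimensional problem. The power-law weights $(k+1)^{α-1}$ below are modeled on the kernel of a fractional integral of order $α$; that choice, the finite `window`, the normalization, and the function names are all assumptions for illustration rather than the paper's Weighted Fractional Weyl Integral.

```python
from collections import deque


def memory_weights(alpha, window):
    """Normalized power-law weights; index 0 is the most recent gradient."""
    raw = [(k + 1) ** (alpha - 1) for k in range(window)]
    total = sum(raw)
    return [w / total for w in raw]


def fractional_gd(grad, x0, lr=0.05, alpha=0.5, window=5, steps=500):
    """Gradient descent driven by a weighted average of recent gradients."""
    weights = memory_weights(alpha, window)
    history = deque(maxlen=window)  # newest gradient first
    x = x0
    for _ in range(steps):
        history.appendleft(grad(x))
        # Renormalize while the history is still shorter than the window.
        used = weights[:len(history)]
        g = sum(w * h for w, h in zip(used, history)) / sum(used)
        x -= lr * g
    return x


if __name__ == "__main__":
    # Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
    print(fractional_gd(lambda x: 2 * (x - 3), x0=0.0))
```

With `window=1` the update reduces to plain gradient descent; larger windows smooth each step over past gradients, which is the regularizing effect the abstract attributes to the memory operator.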
Leaderboard Incentives: Model Rankings under Strategic Post-Training
Influential benchmarks incentivize competing model developers to strategically allocate post-training resources toward improvements on the leaderboard, a phenomenon dubbed benchmaxxing or training on the test task. In this work, we initiate a principled study of the incentive structure that benchmarks induce. We model benchmarking as a Stackelberg game between a benchmark designer who chooses an evaluation protocol and multiple model developers who compete simultaneously in a subgame given by the designer's choice. Each competitor has a model of unknown latent quality and can inflate its observed score by allocating resources to benchmark-specific improvements. First, we prove that current benchmarks induce games for which no Nash equilibrium between model developers exists. This result suggests one explanation for why current practice leads to misaligned incentives, prompting model developers to strategize in opaque ways. However, we prove that under mild conditions, a recently proposed evaluation protocol, called tune-before-test, induces a benchmark with a unique Nash equilibrium that ranks models by latent quality. This positive result demonstrates that benchmarks need not set bad incentives, even if current evaluations do.
Updated: 2026-03-09 13:33:20
Categories: cs.GT,cs.LG
BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data
Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.
Updated: 2026-03-09 13:33:08
Categories: cs.CV,cs.AI
Unifying On- and Off-Policy Variance Reduction Methods
Continuous and efficient experimentation is key to the practical success of user-facing applications on the web, both through online A/B-tests and off-policy evaluation. Despite their shared objective -- estimating the incremental value of a treatment -- these domains often operate in isolation, utilising distinct terminologies and statistical toolkits. This paper bridges that divide by establishing a formal equivalence between their canonical variance reduction methods. We prove that the standard online Difference-in-Means estimator is mathematically identical to an off-policy Inverse Propensity Scoring estimator equipped with an optimal (variance-minimising) additive control variate. Extending this unification, we demonstrate that widespread regression adjustment methods (such as CUPED, CUPAC, and ML-RATE) are structurally equivalent to Doubly Robust estimation. This unified view extends our understanding of commonly used approaches, and can guide practitioners and researchers working on either class of problems.
Updated: 2026-03-09 13:32:39
标题: 统一的在线和离线策略方差缩减方法
摘要: 持续和高效的实验是网络用户界面应用的实际成功的关键,无论是通过在线A/B测试还是离线评估。尽管它们共同的目标是估计处理的增量价值,但这些领域经常独立运作,利用不同的术语和统计工具包。本文通过建立它们的经典方差缩减方法之间的形式等价来弥合这一分歧。 我们证明了标准在线均值差估计器在数学上与一个配备了最佳(方差最小化)加性控制变量的离线倾向得分估计器是相同的。扩展这一统一,我们证明了广泛使用的回归调整方法(如CUPED、CUPAC和ML-RATE)在结构上等同于双重稳健估计。这种统一的观点拓展了我们对常用方法的理解,并可以指导从事任一类问题研究的从业者和研究人员。
更新时间: 2026-03-09 13:32:39
领域: stat.ML,cs.IR,cs.LG,stat.ME
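The equivalence claimed in the abstract above (Difference-in-Means versus Inverse Propensity Scoring with an additive control variate) can be checked numerically on a toy A/B test. The sketch below is only an illustration under stated assumptions: a known, constant treatment propensity `p`, simulated outcomes, and a plug-in baseline `beta = (1 - p) * mean_treated + p * mean_control`. That baseline is derived for this sketch by minimising the variance of the weighted term and is not quoted from the paper, whose exact formulation may differ. All function names are hypothetical.

```python
import random


def arm_means(y, a):
    """Return (mean outcome of treated units, mean outcome of control units)."""
    treated = [yi for yi, ai in zip(y, a) if ai == 1]
    control = [yi for yi, ai in zip(y, a) if ai == 0]
    return sum(treated) / len(treated), sum(control) / len(control)


def difference_in_means(y, a):
    """Standard online A/B estimator: mean(treated) - mean(control)."""
    mean_t, mean_c = arm_means(y, a)
    return mean_t - mean_c


def plug_in_beta(y, a, p):
    """Sample version of the variance-minimising additive baseline (assumption of this sketch)."""
    mean_t, mean_c = arm_means(y, a)
    return (1.0 - p) * mean_t + p * mean_c


def ips_with_control_variate(y, a, p, beta):
    """IPS estimate of the treatment effect after subtracting a constant baseline beta."""
    total = 0.0
    for yi, ai in zip(y, a):
        weight = 1.0 / p if ai == 1 else -1.0 / (1.0 - p)
        total += weight * (yi - beta)
    return total / len(y)


def simulate(n, p, seed):
    """Toy randomised experiment with a true effect of 0.5 (illustrative data only)."""
    rng = random.Random(seed)
    a = [1 if rng.random() < p else 0 for _ in range(n)]
    y = [1.0 + 0.5 * ai + rng.gauss(0.0, 1.0) for ai in a]
    return y, a


if __name__ == "__main__":
    y, a = simulate(n=2000, p=0.3, seed=0)
    beta = plug_in_beta(y, a, 0.3)
    print("difference in means :", difference_in_means(y, a))
    print("IPS, beta = 0       :", ips_with_control_variate(y, a, 0.3, 0.0))
    print("IPS, plug-in beta   :", ips_with_control_variate(y, a, 0.3, beta))
```

With the plug-in baseline the two estimates agree up to floating-point error, while plain IPS (`beta = 0`) generally differs because the realised treatment share is not exactly `p`.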
M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that most failures originate from incorrect or incomplete visual evidence extraction rather than deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M3-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically maintaining a shared context centered on visual evidence lists. Multiple agents collaboratively contribute complementary observations, enabling the system to expose inconsistencies and recover missing perceptual information. To support stable multi-turn collaboration, we further introduce two lightweight tools: a Summary Tool that organizes evidence from different agents into consistent, complementary, and conflicting components, and a Refine Tool that filters unreliable samples and guides iterative correction. Extensive experiments demonstrate that M3-ACE substantially improves visual mathematical reasoning performance across multiple benchmarks. Our method establishes a new state-of-the-art result of 89.1 on the MathVision benchmark and achieves consistent improvements on other related datasets, including MathVista and MathVerse. These results highlight the importance of perception-centric multi-agent collaboration for advancing multimodal reasoning systems.
Updated: 2026-03-09 13:32:25
标题: M$^3$-ACE:通过多智能环境工程纠正多模态数学推理中的视觉感知
摘要: 多模态大型语言模型最近在视觉数学推理方面显示出了令人期待的进展。然而,它们的性能通常受到一个关键但尚未被充分探讨的瓶颈的限制:不准确的视觉感知。通过系统分析,我们发现大多数失败源于不正确或不完整的视觉证据提取,而非推理能力的不足。此外,模型往往会对其初始感知保持过度自信,使得标准策略如提示工程、多轮自我反思或后验引导等无法可靠地纠正错误。 为了解决这一限制,我们提出了M3-ACE,一个旨在纠正多模态数学推理中的视觉感知的多主体上下文工程框架。我们的方法不是直接聚合最终答案,而是通过动态维护以视觉证据列表为中心的共享上下文,将感知和推理分离。多个主体共同贡献互补观察,使系统能够暴露不一致性并恢复缺失的感知信息。为了支持稳定的多轮协作,我们进一步引入了两个轻量级工具:一个总结工具,将不同主体的证据组织成一致、互补和冲突的组件,以及一个细化工具,过滤不可靠样本并引导迭代校正。 大量实验证明,M3-ACE显著改善了多个基准测试中的视觉数学推理性能。我们的方法在MathVision基准测试中建立了新的最先进结果89.1,并在其他相关数据集,包括MathVista和MathVerse上实现了一致的改进。这些结果突显了以感知为中心的多主体协作对推进多模态推理系统的重要性。
更新时间: 2026-03-09 13:32:25
领域: cs.AI
Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors
Learning to understand speech appears almost effortless for typically developing infants, yet from an information-processing perspective, acquiring a language from acoustic speech is an enormous challenge. This chapter reviews recent developments in using computational models to understand early language acquisition from speech and audiovisual input. The focus is on self-supervised and visually grounded models of perceptual learning. We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning principles that are broadly compatible with multiple theories of language acquisition and human cognition. We also discuss how modern learning simulations are gradually becoming more realistic, both in terms of input data and in linking model behavior to empirical findings on infant language development.
Updated: 2026-03-09 13:24:03
标题: 计算模型对早期语言学习的建模:基于声学语音和视听输入,而无需语言先验知识
摘要: 学习理解语言对于正常发育的婴儿来说似乎几乎是毫不费力的,然而从信息处理的角度来看,从声音语言中获得一门语言是一个巨大的挑战。本章回顾了使用计算模型来理解从语音和视听输入中早期语言习得的最新进展。重点是自我监督和基于视觉的感知学习模型。我们展示了这些模型如何越来越强大地学习语音的各个方面,而无需强大的语言先验知识,并且如何可以通过一组共享的学习原则来解释早期语言发展的许多特征-这些原则与多种语言习得理论和人类认知广泛兼容。我们还讨论了现代学习模拟如何逐渐变得更加现实,无论是在输入数据方面,还是在将模型行为与婴儿语言发展的实证发现联系起来方面。
更新时间: 2026-03-09 13:24:03
领域: cs.CL,cs.AI,eess.AS
"That's another doom I haven't thought about": A User Study on AI Labels as a Safeguard Against Image-Based Misinformation
As generative AI is increasingly contributing to the spread of deceptively realistic misinformation, lawmakers have introduced regulations requiring the disclosure of AI-generated content. However, it is unclear if labels reduce the risk of users falling for AI-generated misinformation. To address this research gap, we study the effect of labels on users' perception and the implications of mislabeling, focusing on AI-generated images. We first explored users' opinions and expectations of labels using five focus groups. Although participants were wary of practical implementations, they considered labeling helpful in identifying AI-generated images and avoiding deception. Second, we conducted a survey with 1354 participants to assess how labels affect users' ability to recognize misinformation. While labels reduced participants' belief in false claims supported by AI-generated images, we found evidence of overreliance, leading to unintended side effects: Participants were more susceptible to false claims accompanied by human-made images, and were more hesitant to believe true claims illustrated with labeled AI-generated images.
Updated: 2026-03-09 13:14:22
标题: "这是另一个我没有考虑过的厄运:关于AI标签作为防范图像误信息的用户研究"
摘要: 随着生成式人工智能日益对欺骗性逼真的不实信息的传播做出贡献,立法者已经引入了要求披露人工智能生成内容的法规。然而,目前尚不清楚标签是否能降低用户陷入人工智能生成的虚假信息的风险。为了填补这一研究空白,我们研究了标签对用户感知的影响以及误标的影响,重点关注人工智能生成的图像。我们首先通过五个焦点小组探讨了用户对标签的看法和期望。尽管参与者对实际实施持保留态度,但他们认为标签有助于识别人工智能生成的图像并避免被欺骗。其次,我们通过对1354名参与者进行调查,评估了标签对用户识别虚假信息能力的影响。虽然标签降低了参与者对由人工智能生成的图像支持的虚假主张的信念,但我们发现了过度依赖的证据,导致了意想不到的副作用:参与者更容易受到伴随人类制作图像的虚假主张的影响,并更加犹豫地相信用标记的人工智能生成的图像说明的真实主张。
更新时间: 2026-03-09 13:14:22
领域: cs.CR,cs.AI,cs.CY,cs.SI
Towards plausibility in time series counterfactual explanations
We present a new method for generating plausible counterfactual explanations for time series classification problems. The approach performs gradient-based optimization directly in the input space. To enforce plausibility, we integrate soft-DTW (dynamic time warping) alignment with $k$-nearest neighbors from the target class, which effectively encourages the generated counterfactuals to adopt a realistic temporal structure. The overall optimization objective is a multi-faceted loss function that balances key counterfactual properties. It incorporates losses for validity, sparsity, and proximity, alongside the novel soft-DTW-based plausibility component. We conduct an evaluation of our method against several strong reference approaches, measuring the key properties of the generated counterfactuals across multiple dimensions. The results demonstrate that our method achieves competitive performance in validity while significantly outperforming existing approaches in distributional alignment with the target class, indicating superior temporal realism. Furthermore, a qualitative analysis highlights the critical limitations of existing methods in preserving realistic temporal structure. This work shows that the proposed method consistently generates counterfactual explanations for time series classifiers that are not only valid but also highly plausible and consistent with temporal patterns.
Updated: 2026-03-09 13:13:52
标题: 朝向时间序列反事实解释的可信性
摘要: 我们提出了一种新方法,用于为时间序列分类问题生成可信的反事实解释。该方法直接在输入空间中执行基于梯度的优化。为了强制可信性,我们将软动态时间规整(soft-DTW)对齐与目标类别的$k$个最近邻集成在一起,这有效地鼓励生成的反事实采用真实的时间结构。总体优化目标是一个多方面的损失函数,平衡了关键的反事实属性。它包括有效性、稀疏性和接近性损失,以及新颖的基于soft-DTW的可信性组件。我们对我们的方法进行了评估,与几种强参考方法进行对比,衡量了生成的反事实在多个维度上的关键属性。结果表明,我们的方法在有效性方面表现出竞争性能,同时在与目标类别的分布对齐方面明显优于现有方法,表明具有卓越的时间真实性。此外,定性分析突显了现有方法在保持真实时间结构方面的关键局限性。这项工作表明,所提出的方法持续地生成时间序列分类器的反事实解释,不仅有效,而且高度可信,并与时间模式一致。
更新时间: 2026-03-09 13:13:52
领域: cs.LG,cs.AI,stat.ML
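The plausibility term described in the abstract above combines soft-DTW with $k$-nearest neighbors from the target class. A minimal sketch of that idea follows, for univariate series and in plain Python. The soft-DTW recursion is the standard one (a soft-minimum with temperature `gamma` over the three predecessor cells). How the paper selects and aggregates neighbors is not stated in the abstract, so averaging over the `k` closest target-class series is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two univariate series with squared-error cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min((r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma)
    return r[n][m]


def plausibility_loss(candidate, target_class_series, k=2, gamma=0.1):
    """Mean soft-DTW to the k closest target-class series (aggregation is an assumption)."""
    distances = sorted(soft_dtw(candidate, s, gamma) for s in target_class_series)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


if __name__ == "__main__":
    targets = [[0.0, 1.0, 2.0, 1.0, 0.0], [0.0, 0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0, 0.0]]
    realistic = [0.0, 1.0, 2.0, 1.0, 0.0]
    oscillating = [2.0, -2.0, 2.0, -2.0, 2.0]
    print("realistic   :", plausibility_loss(realistic, targets))
    print("oscillating :", plausibility_loss(oscillating, targets))
```

Because soft-DTW is differentiable in the candidate series, a term of this kind can be added to a gradient-based counterfactual objective alongside validity, sparsity, and proximity losses. Note that soft-DTW of a series with itself is slightly negative rather than zero, which is a known property of the soft-minimum.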
Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers
The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform followed by a lightweight learnable affine rescaling, eliminating approximately 25 percent of attention parameters per block while preserving global cross-head interaction through an orthogonal, norm-preserving transformation. Across different model sizes, we demonstrate that this structured substitution maintains comparable or slightly superior downstream task performance on standard benchmarks, while achieving up to 7 percent aggregate parameter reduction, 8.9 percent peak memory savings, and 6.6 percent throughput improvement at scale, with efficiency gains growing monotonically with model size, batch size, and sequence length. Interestingly, we observe that structured Hadamard-based models exhibit a steeper validation loss curve relative to training FLOPs compared to their dense counterparts, suggesting more favorable compute utilization during training.
Updated: 2026-03-09 13:05:08
标题: 重新思考注意力输出投影:用于高效Transformer的结构化Hadamard变换
摘要: 多头注意力中的密集输出投影与模型维度呈二次比例关系,显著影响参数数量、内存占用和推理成本。我们提议用固定的、无参数的Walsh Hadamard变换替代这种投影,然后跟随一个轻量级可学习的仿射重缩放,这样可以在每个块中消除大约25%的注意力参数,同时通过一个正交的、保持范数不变的转换保持全局跨头交互。在不同的模型大小上,我们展示了这种结构化替代在标准基准测试中保持了可比较或略优的下游任务性能,同时在规模上实现了高达7%的参数减少,8.9%的内存节省,以及6.6%的吞吐量提高,而效率收益随着模型大小、批处理大小和序列长度的增加而单调增加。有趣的是,我们观察到,基于结构化Hadamard的模型相对于密集对应模型在验证损失曲线上呈现出更陡的曲线,相对于训练FLOP,这表明在训练过程中计算利用更为有利。
更新时间: 2026-03-09 13:05:08
领域: cs.LG,cs.CL
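The substitution described in the abstract above can be illustrated on a single token vector. The sketch below applies an orthonormal fast Walsh-Hadamard transform to the concatenated head outputs and then a per-channel gain and bias. Treating the "lightweight learnable affine rescaling" as one gain and one bias per channel is an assumption of this sketch, as are the function names. The parameter count ignores biases in the dense projections and assumes four equally sized projection matrices per attention block.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [v * scale for v in out]


def hadamard_output_projection(concat_heads, gain, bias):
    """Fixed mixing across all head channels, then a learnable per-channel rescale (assumed form)."""
    mixed = fwht(concat_heads)
    return [g * m + b for g, m, b in zip(gain, mixed, bias)]


def attention_param_counts(d_model):
    """(dense, structured) attention parameters per block, biases of dense layers ignored."""
    dense = 4 * d_model * d_model                       # W_Q, W_K, W_V, W_O
    structured = 3 * d_model * d_model + 2 * d_model    # W_O replaced by gain and bias
    return dense, structured


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = fwht(x)
    print("transform       :", y)
    print("norm preserved  :", sum(v * v for v in x), sum(v * v for v in y))
    dense, structured = attention_param_counts(512)
    print("parameter saving:", 1.0 - structured / dense)
```

Each output channel depends on every input channel, which is how a fixed transform can still mix information across heads. Under these counting assumptions the saving at `d_model = 512` is just under 25 percent, in line with the figure in the abstract.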
Finite Sample Bounds for Non-Parametric Regression: Optimal Sample Efficiency and Space Complexity
We address the problem of learning an unknown smooth function and its derivatives from noisy pointwise evaluations under the supremum norm. While classical nonparametric regression provides a strong theoretical foundation, traditional kernel-based estimators often incur high computational costs and memory requirements that scale with the sample size, limiting their utility in real-time applications such as reinforcement learning. To overcome these challenges, we propose a parametric approach based on a finite-dimensional representation that achieves minimax-optimal uniform convergence rates. Our method enables lightweight inference without storing all samples in memory. We provide sharp finite-sample bounds under sub-Gaussian noise, derive second-order Bernstein-type guarantees, and prove matching lower bounds, thereby confirming the optimality of our approach in both estimation error and memory efficiency.
Updated: 2026-03-09 13:03:24
标题: 非参数回归的有限样本界限:最佳样本效率和空间复杂性
摘要: 我们解决了从带有噪声的点评估中学习未知平滑函数及其导数的问题,使用的是最大范数。虽然经典的非参数回归提供了强大的理论基础,但传统的基于核的估计器通常会产生与样本量成比例的高计算成本和内存需求,限制了它们在强化学习等实时应用中的实用性。为了克服这些挑战,我们提出了一种基于有限维表示的参数化方法,实现了极小-最优的均匀收敛速率。我们的方法实现了轻量级推断,无需将所有样本存储在内存中。我们提供了在次高斯噪声下的尖锐有限样本界限,推导了二阶Bernstein型保证,并证明了匹配的下界,从而确认了我们方法在估计误差和内存效率方面的最优性。
更新时间: 2026-03-09 13:03:24
领域: cs.LG
Electrocardiogram Classification with Transformers Using Koopman and Wavelet Features
Electrocardiogram (ECG) analysis is vital for detecting cardiac abnormalities, yet robust automated classification is challenging due to the complexity and variability of physiological signals. In this work, we investigate transformer-based ECG classification using features derived from the Koopman operator and wavelet transforms. Two tasks are studied: (1) binary classification (Normal vs. Non-normal), and (2) four-class classification (Normal, Atrial Fibrillation, Ventricular Arrhythmia, Block). We use Extended Dynamic Mode Decomposition (EDMD) to approximate the Koopman operator. Our results show that wavelet features excel in binary classification, while Koopman features, when paired with transformers, achieve superior performance in the four-class setting. A simple hybrid of Koopman and wavelet features does not improve accuracy. However, selecting an appropriate EDMD dictionary -- specifically a radial basis function dictionary with tuned parameters -- yields significant gains, surpassing the wavelet-only baseline and the hybrid wavelet-Koopman system. We also present a Koopman-based reconstruction analysis for interpretable insights into the learned dynamics and compare against a recurrent neural network baseline. Overall, our findings demonstrate the effectiveness of Koopman-based feature learning with transformers and highlight promising directions for integrating dynamical systems theory into time-series classification.
Updated: 2026-03-09 12:59:19
标题: 使用Koopman和小波特征的变压器进行心电图分类
摘要: 心电图(ECG)分析对于检测心脏异常至关重要,然而由于生理信号的复杂性和变异性,强大的自动化分类具有挑战性。在这项工作中,我们使用从Koopman算子和小波变换中派生的特征来研究基于变压器的心电图分类。我们研究了两项任务:(1)二元分类(正常 vs. 非正常),以及(2)四类分类(正常、房颤、室性心律失常、阻塞)。我们使用扩展动态模式分解(EDMD)来近似Koopman算子。我们的结果显示,小波特征在二元分类中表现出色,而Koopman特征,当与变压器配对时,在四类设置中实现了优越性能。Koopman和小波特征的简单混合并不能提高准确性。然而,选择适当的EDMD字典--特别是调整参数的径向基函数字典--可以获得显著的收益,超越小波基线和混合小波-Koopman系统。我们还提出了一种基于Koopman的重建分析,以获得关于学习动态的可解释见解,并与循环神经网络基线进行比较。总体而言,我们的研究结果表明,基于Koopman的特征学习与变压器的结合是有效的,并突出了将动力系统理论整合到时间序列分类中的有希望的方向。
更新时间: 2026-03-09 12:59:19
领域: eess.SP,cs.AI,cs.LG
Fast reconstruction of degenerate populations of conductance-based neuron models from spike times
Inferring the biophysical parameters of conductance-based models (CBMs) from experimentally accessible recordings remains a central challenge in computational neuroscience. Spike times are the most widely available data, yet they reveal little about which combinations of ion channel conductances generate the observed activity. This inverse problem is further complicated by neuronal degeneracy, where multiple distinct conductance sets yield similar spiking patterns. We introduce a method that addresses this challenge by combining deep learning with Dynamic Input Conductances (DICs), a theoretical framework that reduces complex CBMs to three interpretable feedback components governing excitability and firing patterns. Our approach first maps spike times to DIC densities at threshold using a neural network that learns a low-dimensional representation of neuronal activity. The predicted DIC values are then used to generate degenerate CBM populations via an iterative compensation algorithm, ensuring compatibility with the intermediate target DICs, and thereby reproducing the corresponding firing patterns, even in high-dimensional models. Applied to two models, this algorithmic pipeline reconstructs spiking and bursting regimes with high accuracy and robustness to variability, including spike trains generated under noisy current injection mimicking physiological stochasticity. It produces diverse degenerate populations within milliseconds on standard hardware, enabling scalable and efficient inference from spike recordings alone. Together, this work positions DICs as a practical and interpretable link between experimentally observed activity and mechanistic models. By enabling fast and scalable reconstruction of degenerate populations directly from spike times, our approach provides a powerful way to investigate how neurons exploit conductance variability to achieve reliable computation.
Updated: 2026-03-09 12:57:51
标题: 快速重建基于导纳的神经元模型退化群体的方法
摘要: 从实验可获取的记录中推断基于传导的模型(CBMs)的生物物理参数仍然是计算神经科学中的一个核心挑战。尖峰时间是最广泛可获取的数据,但它们很少揭示哪些离子通道导电率的组合产生了观察到的活动。这个逆问题进一步复杂化了神经元的退化,其中多个不同的导电率集合产生类似的尖峰模式。我们引入了一种方法,通过将深度学习与动态输入导电率(DICs)结合起来,解决了这个挑战,DICs是一种将复杂的CBMs简化为控制兴奋性和射频模式的三个可解释反馈组件的理论框架。我们的方法首先利用神经网络将尖峰时间映射到阈值处的DIC密度,学习神经元活动的低维表示。然后使用预测的DIC值通过迭代补偿算法生成退化的CBM种群,确保与中间目标DIC兼容,从而在高维模型中再现相应的射频模式,即使在高维模型中也是如此。应用于两个模型,这种算法管道以高准确性和对变异性的鲁棒性重建尖峰和爆发模式,包括在模拟生理随机性的噪声电流注入下生成的尖峰火车。它在标准硬件上在毫秒内产生多样的退化种群,实现了仅从尖峰记录中可扩展和高效的推断。总之,这项工作将DICs定位为实验观察到的活动和机械模型之间的实用和可解释的联系。通过使退化种群直接从尖峰时间中快速和可扩展地重建,我们的方法提供了一种强大的方式来研究神经元如何利用导电变异性来实现可靠的计算。
更新时间: 2026-03-09 12:57:51
领域: q-bio.NC,cs.LG,math.DS,stat.ML
Autoregressive Visual Decoding from EEG Signals
Electroencephalogram (EEG) signals have become a popular medium for decoding visual information due to their cost-effectiveness and high temporal resolution. However, current approaches face significant challenges in bridging the modality gap between EEG and image data. These methods typically rely on complex adaptation processes involving multiple stages, making it hard to maintain consistency and manage compounding errors. Furthermore, the computational overhead imposed by large-scale diffusion models limits their practicality in real-world brain-computer interface (BCI) applications. In this work, we present AVDE, a lightweight and efficient framework for visual decoding from EEG signals. First, we leverage LaBraM, a pre-trained EEG model, and fine-tune it via contrastive learning to align EEG and image representations. Second, we adopt an autoregressive generative framework based on a "next-scale prediction" strategy: images are encoded into multi-scale token maps using a pre-trained VQ-VAE, and a transformer is trained to autoregressively predict finer-scale tokens starting from EEG embeddings as the coarsest representation. This design enables coherent generation while preserving a direct connection between the input EEG signals and the reconstructed images. Experiments on two datasets show that AVDE outperforms previous state-of-the-art methods in both image retrieval and reconstruction tasks, while using only 10% of the parameters. In addition, visualization of intermediate outputs shows that the generative process of AVDE reflects the hierarchical nature of human visual perception. These results highlight the potential of autoregressive models as efficient and interpretable tools for practical BCI applications.
Updated: 2026-03-09 12:57:06
标题: 基于脑电信号的自回归视觉解码
摘要: 脑电图(EEG)信号已经成为解码视觉信息的流行媒介,因为其具有经济效益和高时间分辨率。然而,当前的方法在弥合EEG和图像数据之间的模态差距方面面临着重大挑战。这些方法通常依赖于涉及多个阶段的复杂适应过程,这使得难以保持一致性和管理复合误差。此外,大规模扩散模型所施加的计算开销限制了它们在实际世界的脑机接口(BCI)应用中的实用性。在这项工作中,我们提出了AVDE,一种用于从EEG信号解码视觉的轻量级高效框架。首先,我们利用LaBraM,一个预训练的EEG模型,并通过对比学习对其进行微调以对齐EEG和图像表示。其次,我们采用基于“下一尺度预测”策略的自回归生成框架:图像通过使用预训练的VQ-VAE编码为多尺度令牌图,然后训练一个变压器以自回归地从EEG嵌入开始预测更精细的令牌,作为最粗糙的表示。这种设计在保持输入EEG信号和重建图像之间直接连接的同时实现了连贯的生成。对两个数据集的实验表明,AVDE在图像检索和重建任务中优于先前的最先进的方法,同时仅使用了10%的参数。此外,中间输出的可视化显示AVDE的生成过程反映了人类视觉知觉的分层特性。这些结果突显了自回归模型作为实际BCI应用的高效且可解释的工具的潜力。
更新时间: 2026-03-09 12:57:06
领域: cs.LG,cs.AI
Detecting Fake Reviewer Groups in Dynamic Networks: An Adaptive Graph Learning Method
The proliferation of fake reviews, often produced by organized groups, undermines consumer trust and fair competition on online platforms. These groups employ sophisticated strategies that evade traditional detection methods, particularly in cold-start scenarios involving newly launched products with sparse data. To address this, we propose the Diversity- and Similarity-aware Dynamic Graph Attention-enhanced Graph Convolutional Network (DS-DGA-GCN), a new graph learning model for detecting fake reviewer groups. DS-DGA-GCN achieves robust detection since it focuses on the joint relationships among products, reviews, and reviewers by modeling product-review-reviewer networks. DS-DGA-GCN also achieves adaptive detection by integrating a Network Feature Scoring (NFS) system and a new dynamic graph attention mechanism. The NFS system quantifies network attributes, including neighbor diversity and network self-similarity, as a unified feature score. The dynamic graph attention mechanism improves adaptability and computational efficiency by capturing features related to temporal information, node importance, and global network structure. Extensive experiments conducted on two real-world datasets derived from Amazon and Xiaohongshu demonstrate that DS-DGA-GCN significantly outperforms state-of-the-art baselines, achieving accuracies of up to 89.8% and 88.3%, respectively.
Updated: 2026-03-09 12:49:17
标题: 在动态网络中检测虚假评论者群体:一种自适应图学习方法
摘要: 伪造评论的泛滥,通常由有组织的团体生产,破坏了消费者对在线平台的信任和公平竞争。这些团体采用复杂的策略,逃避传统的检测方法,特别是在涉及新推出的产品和稀疏数据的冷启动场景中。为了解决这个问题,我们提出了面向多样性和相似性的动态图注意增强图卷积网络(DS-DGA-GCN),这是一种新的图学习模型,用于检测假评论团体。DS-DGA-GCN实现了强大的检测能力,因为它专注于建模产品-评论-评论者网络之间的联合关系。DS-DGA-GCN还通过整合网络特征评分(NFS)系统和新的动态图注意机制实现了自适应检测。NFS系统量化网络属性,包括邻居多样性、网络自相似性,作为统一的特征评分。动态图注意机制通过捕捉与时间信息、节点重要性和全局网络结构有关的特征,提高了适应性和计算效率。在从亚马逊和小红书派生的两个真实数据集上进行的大量实验表明,DS-DGA-GCN明显优于最先进的基线方法,分别实现了高达89.8%和88.3%的准确率。
更新时间: 2026-03-09 12:49:17
领域: cs.SI,cs.AI
An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes
Predicting individualized potential outcomes in sequential decision-making is central for optimizing therapeutic decisions in personalized medicine (e.g., which dosing sequence to give to a cancer patient). However, predicting potential outcomes over long horizons is notoriously difficult. Existing methods that break the curse of the horizon typically lack strong theoretical guarantees such as orthogonality and quasi-oracle efficiency. In this paper, we revisit the problem of predicting individualized potential outcomes in sequential decision-making (i.e., estimating Q-functions in Markov decision processes with observational data) through a causal inference lens. In particular, we develop a comprehensive theoretical foundation for meta-learners in this setting with a focus on beneficial theoretical properties. As a result, we obtain a novel meta-learner called DRQ-learner and establish that it is: (1) doubly robust (i.e., valid inference under the misspecification of one of the nuisances), (2) Neyman-orthogonal (i.e., insensitive to first-order estimation errors in the nuisance functions), and (3) quasi-oracle efficient (i.e., behaves asymptotically as if the ground-truth nuisance functions were known). Our DRQ-learner is applicable to settings with both discrete and continuous state spaces. Further, our DRQ-learner is flexible and can be used together with arbitrary machine learning models (e.g., neural networks). We validate our theoretical results through numerical experiments, thereby showing that our meta-learner outperforms state-of-the-art baselines.
Updated: 2026-03-09 12:48:01
标题: 一个用于马尔可夫决策过程中个性化结果的正交学习者
摘要: 在个性化医学中,预测序贯决策中个体化潜在结果对于优化治疗决策至关重要(例如,给癌症患者哪种剂量顺序)。然而,在长期范围内预测潜在结果通常很困难。现有的突破时间限制的方法通常缺乏强大的理论保证,如正交性和准神谕效率。在本文中,我们通过因果推断的视角重新审视了预测序贯决策中个体化潜在结果的问题(即,在马尔可夫决策过程中使用观测数据估计Q函数)。特别是,我们在这个设置中发展了一个关于元学习器的全面理论基础,重点放在有益的理论特性上。因此,我们提出了一种新颖的元学习器称为DRQ-learner,并确定它具有以下特性:(1)双重稳健(即,在一个干扰因素错误规范的情况下有效推断),(2)Neyman-正交(即,对干扰函数的一阶估计误差不敏感),以及(3)实现准神谕效率(即,在渐近情况下表现得就像地面真实的干扰函数被知道一样)。我们的DRQ-learner适用于具有离散和连续状态空间的设置。此外,我们的DRQ-learner灵活,并且可以与任意机器学习模型(例如神经网络)一起使用。我们通过数值实验验证了我们的理论结果,从而显示我们的元学习器优于最先进的基准线。
更新时间: 2026-03-09 12:48:01
领域: stat.ML,cs.LG
SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation
Answering complex, real-world queries often requires synthesizing facts scattered across vast document corpora. In these settings, standard retrieval-augmented generation (RAG) pipelines suffer from incomplete evidence coverage, while long-context large language models (LLMs) struggle to reason reliably over massive inputs. We introduce SPD-RAG, a hierarchical multi-agent framework for exhaustive cross-document question answering that decomposes the problem along the document axis. Each document is processed by a dedicated document-level agent operating only on its own content, enabling focused retrieval, while a coordinator dispatches tasks to relevant agents and aggregates their partial answers. Agent outputs are synthesized by merging partial answers through a token-bounded synthesis layer (which supports recursive map-reduce for massive corpora). This document-level specialization with centralized fusion improves scalability and answer quality in heterogeneous multidocument settings while yielding a modular, extensible retrieval pipeline. On the LOONG benchmark (EMNLP 2024) for long-context multi-document QA, SPD-RAG achieves an Avg Score of 58.1 (GPT-5 evaluation), outperforming Normal RAG (33.0) and Agentic RAG (32.8) while using only 38% of the API cost of a full-context baseline (68.0).
Updated: 2026-03-09 12:46:32
标题: SPD-RAG:文档检索增强生成的子代理
摘要: 回答复杂的现实世界查询通常需要综合分散在广泛文档语料库中的事实。在这些情境下,标准的检索增强生成(RAG)管道受到不完整证据覆盖的困扰,而长上下文大型语言模型(LLMs)在大规模输入上难以可靠推理。我们引入了SPD-RAG,这是一个用于详尽跨文档问题回答的分层多代理框架,沿文档轴分解问题。每个文档由专门处理其自身内容的专用文档级代理处理,实现了有针对性的检索,同时协调员将任务分派给相关代理并聚合它们的部分答案。代理输出通过一个支持递归映射-减少大规模语料库的令牌限制综合层进行合成。这种具有集中融合的文档级专门化提高了异构多文档设置中的可扩展性和答案质量,同时产生了一个模块化、可扩展的检索管道。在长上下文多文档QA的LOONG基准上(EMNLP 2024年),SPD-RAG实现了58.1的平均分数(GPT-5评估),优于Normal RAG(33.0)和Agentic RAG(32.8),同时仅使用完整上下文基线(68.0)API成本的38%。
更新时间: 2026-03-09 12:46:32
领域: cs.CL,cs.AI,cs.IR
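The token-bounded synthesis layer with recursive map-reduce mentioned in the SPD-RAG abstract above can be sketched as follows. This is not the authors' implementation: `count_tokens`, `pack_by_budget`, `synthesize`, and `merge_unique_words` are hypothetical names, whitespace splitting stands in for a real tokenizer, and the stub merger stands in for what would presumably be an LLM call. The sketch only shows the control flow: pack partial answers into groups that fit a token budget, merge each group, and repeat until one answer remains.

```python
def count_tokens(text):
    """Crude whitespace token count; a real system would use the model's tokenizer."""
    return len(text.split())


def pack_by_budget(parts, budget):
    """Greedily group consecutive partial answers so each group fits the token budget."""
    batches, current, used = [], [], 0
    for part in parts:
        size = count_tokens(part)
        if current and used + size > budget:
            batches.append(current)
            current, used = [], 0
        current.append(part)
        used += size
    if current:
        batches.append(current)
    return batches


def synthesize(parts, merge_fn, budget):
    """Recursive map-reduce: merge groups of partial answers until one answer remains."""
    parts = list(parts)
    if not parts:
        return ""
    while len(parts) > 1:
        batches = pack_by_budget(parts, budget)
        if len(batches) == len(parts):
            # No two answers fit together, so another round cannot make progress.
            raise ValueError("token budget too small to merge any two partial answers")
        parts = [merge_fn(batch) if len(batch) > 1 else batch[0] for batch in batches]
    return parts[0]


def merge_unique_words(batch):
    """Stub merger (sorted union of words); stands in for an LLM synthesis call."""
    return " ".join(sorted(set(" ".join(batch).split())))


if __name__ == "__main__":
    partial_answers = ["alpha beta", "beta gamma", "gamma delta", "delta alpha"]
    print(synthesize(partial_answers, merge_unique_words, budget=6))
```

Each round strictly reduces the number of partial answers or raises an error, so the loop always terminates. A production merger would also need to keep its own output within the budget for later rounds to succeed.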
Beyond Attention Heatmaps: How to Get Better Explanations for Multiple Instance Learning Models in Histopathology
Multiple instance learning (MIL) has enabled substantial progress in computational histopathology, where a large number of patches from gigapixel whole slide images are aggregated into slide-level predictions. Heatmaps are widely used to validate MIL models and to discover tissue biomarkers. Yet, the validity of these heatmaps has barely been investigated. In this work, we introduce a general framework for evaluating the quality of MIL heatmaps without requiring additional labels. We conduct a large-scale benchmark experiment to assess six explanation methods across histopathology task types (classification, regression, survival), MIL model architectures (Attention-, Transformer-, Mamba-based), and patch encoder backbones (UNI2, Virchow2). Our results show that explanation quality mostly depends on MIL model architecture and task type, with perturbation ("Single"), layer-wise relevance propagation (LRP), and integrated gradients (IG) consistently outperforming attention-based and gradient-based saliency heatmaps, which often fail to reflect model decision mechanisms. We further demonstrate the advanced capabilities of the best-performing explanation methods: (i) We provide a proof-of-concept that MIL heatmaps of a bulk gene expression prediction model can be correlated with spatial transcriptomics for biological validation, and (ii) showcase the discovery of distinct model strategies for predicting human papillomavirus (HPV) infection from head and neck cancer slides. Our work highlights the importance of validating MIL heatmaps and establishes that improved explainability can enable more reliable model validation and yield biological insights, making a case for a broader adoption of explainable AI in digital pathology. Our code is provided in a public GitHub repository: https://github.com/bifold-pathomics/xMIL/tree/xmil-journal
Updated: 2026-03-09 12:45:48
标题: 超越关注热图:如何为组织病理学中的多实例学习模型获得更好的解释
摘要: 多实例学习(MIL)已经在计算组织病理学方面取得了重大进展,其中来自巨量全切片图像的补丁被汇总为切片级别的预测。热图被广泛用于验证MIL模型并发现组织生物标志物。然而,这些热图的有效性几乎没有被调查。在这项工作中,我们引入了一个评估MIL热图质量的通用框架,无需额外标签。我们进行了一项大规模基准实验,评估了六种解释方法在组织病理学任务类型(分类、回归、生存)、MIL模型架构(Attention-、Transformer-、Mamba-based)和补丁编码器骨干(UNI2、Virchow2)之间的差异。我们的结果显示,解释质量主要取决于MIL模型架构和任务类型,扰动(“Single”)、逐层相关传播(LRP)和集成梯度(IG)始终优于基于注意力和梯度的显著性热图,后者通常无法反映模型的决策机制。我们进一步展示了表现最佳的解释方法的先进能力:(i)我们提供了一个概念验证,即基因表达预测模型的MIL热图可以与空间转录组学相关联,用于生物验证,(ii)展示了发现区分模型策略以预测人乳头瘤病毒(HPV)感染的头颈癌切片。我们的工作强调了验证MIL热图的重要性,并建立了更可靠的模型验证和生物洞察力的改进可解释性,为数字病理学中更广泛采用可解释AI提供了依据。我们的代码已在公共GitHub存储库中提供:https://github.com/bifold-pathomics/xMIL/tree/xmil-journal
更新时间: 2026-03-09 12:45:48
领域: cs.CV,cs.LG
EndoSERV: A Vision-based Endoluminal Robot Navigation System
Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts and a lack of distinctive landmarks for consistent localization. This paper presents a novel EndoSERV localization method to address these challenges. It includes two main parts, i.e., SEgment-to-structure and Real-to-Virtual mapping, and hence the name. For long-range and complex luminal structures, we divide them into smaller sub-segments and estimate the odometry independently. To cater for label insufficiency, an efficient transfer technique maps real image features to the virtual domain to use virtual pose ground truth. The training phases of EndoSERV include an offline pretraining to extract texture-agnostic features, and an online phase that adapts to real-world conditions. Extensive experiments based on both public and clinical datasets have been performed to demonstrate the effectiveness of the method even without any real pose labels.
Updated: 2026-03-09 12:44:32
标题: EndoSERV:基于视觉的内镜机器人导航系统
摘要: 机器人辅助内腔手术越来越被用于早期癌症干预。然而,内腔解剖结构内复杂、狭窄和曲折的通道给机器人导航带来了重大困难。基于视觉的导航提供了一个有前途的解决方案,但由于组织变形、体内伪影以及缺乏明显的地标以进行一致的定位,现有的定位方法存在误差。本文提出了一种新颖的EndoSERV定位方法来解决这些挑战。它包括两个主要部分,即分段结构和真实到虚拟映射,因此得名。对于长距离和复杂的内腔结构,我们将其分为较小的子段,并独立地估计里程表。为了应对标签不足,一种有效的转移技术将真实图像特征映射到虚拟领域,以使用虚拟姿势地面真相。EndoSERV的训练阶段包括离线预训练以提取与纹理无关的特征,以及适应真实世界条件的在线阶段。基于公共和临床数据集进行了广泛的实验,以展示即使没有任何真实姿势标签,该方法的有效性。
更新时间: 2026-03-09 12:44:32
领域: cs.RO,cs.AI
Agentic Neurosymbolic Collaboration for Mathematical Discovery: A Case Study in Combinatorial Design
We study mathematical discovery through the lens of neurosymbolic reasoning, where an AI agent powered by a large language model (LLM), symbolic computation tools, and human strategic direction jointly produced a new result in combinatorial design theory. The main result of this human-AI collaboration is a tight lower bound on the imbalance of Latin squares for the notoriously difficult case $n \equiv 1 \pmod{3}$. We reconstruct the discovery process from detailed interaction logs spanning multiple sessions over several days and identify the distinct cognitive contributions of each component. The AI agent proved effective at uncovering hidden structure and generating hypotheses. The symbolic component consists of computer algebra, constraint solvers, and simulated annealing, which provide rigorous verification and exhaustive enumeration. Human steering supplied the critical research pivot that transformed a dead end into a productive inquiry. Our analysis reveals that multi-model deliberation among frontier LLMs proved reliable for criticism and error detection but unreliable for constructive claims. The resulting human-AI mathematical contribution, a tight lower bound of $4n(n{-}1)/9$, is achieved via a novel class of near-perfect permutations. The bound was formally verified in Lean 4. Our experiments show that neurosymbolic systems can indeed produce genuine discoveries in pure mathematics.
Updated: 2026-03-09 12:42:56
标题: 代理神经符号协作用于数学发现:组合设计案例研究
摘要: 我们通过神经符号推理的视角研究数学发现,一个由大型语言模型(LLM)驱动的人工智能代理,结合符号计算工具和人类战略指导,共同在组合设计理论中产生了一个新结果。这次人工智能与人类的合作的主要成果是对拉丁方阵不平衡性的严格下界,特别是对于$ n \equiv 1 \pmod{3}$这一极具挑战性的情况。 我们通过跨越数天的多次会话的详细交互日志重建了发现过程,并确定了每个组件的明显认知贡献。人工智能代理在揭示隐藏结构和生成假设方面表现出效果。符号组件包括计算机代数、约束求解器和模拟退火,提供了严格验证和详尽枚举。人类引导提供了关键的研究转折,将死胡同转化为富有成果的探究。我们的分析表明,在前沿LLM之间进行多模式审议对于批评和错误检测是可靠的,但对于建设性主张是不可靠的。 由此产生的人工智能数学贡献,一个严格的下界为$4n(n{-}1)/9$,是通过一种新颖的近乎完美排列类别实现的。该下界在Lean 4中得到了正式验证。我们的实验表明,神经符号系统确实能够在纯数学领域产生真正的发现。
更新时间: 2026-03-09 12:42:56
领域: cs.AI,cs.HC,math.CO
CORE-Acu: Structured Reasoning Traces and Knowledge Graph Safety Verification for Acupuncture Clinical Decision Support
Large language models (LLMs) show significant potential for clinical decision support (CDS), yet their black-box nature -- characterized by untraceable reasoning and probabilistic hallucinations -- poses severe challenges in acupuncture, a field demanding rigorous interpretability and safety. To address this, we propose CORE-Acu, a neuro-symbolic framework for acupuncture clinical decision support that integrates Structured Chain-of-Thought (S-CoT) with knowledge graph (KG) safety verification. First, we construct the first acupuncture Structured Reasoning Trace dataset and a schema-constrained fine-tuning framework. By enforcing an explicit causal chain from pattern identification to treatment principles, treatment plans, and acupoint selection, we transform implicit Traditional Chinese Medicine (TCM) reasoning into interpretable generation constraints, mitigating the opacity of LLM-based CDS. Furthermore, we construct a TCM safety knowledge graph and establish a "Generate-Verify-Revise" closed-loop inference system based on a Symbolic Veto Mechanism, employing deterministic rules to intercept hallucinations and enforce hard safety boundaries. Finally, we introduce the Lexicon-Matched Entity-Reweighted Loss (LMERL), which corrects terminology drift caused by the frequency-importance mismatch in general optimization by adaptively amplifying gradient contributions of high-risk entities during fine-tuning. Experiments on 1,000 held-out cases demonstrate CORE-Acu's superior entity fidelity and reasoning quality. Crucially, CORE-Acu achieved 0/1,000 observed safety violations (95% CI: 0-0.37%), whereas GPT-4o exhibited an 8.5% violation rate under identical rules. These results establish CORE-Acu as a robust neuro-symbolic framework for acupuncture clinical decision support, guaranteeing both reasoning auditability and strict safety compliance.
Updated: 2026-03-09 12:42:23
Categories: cs.AI
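The reported interval for 0 violations in 1,000 cases can be checked by hand. The sketch below assumes the exact (Clopper-Pearson) two-sided binomial interval, which the abstract does not name; the function name is ours, not the authors'.

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Upper limit of the exact two-sided binomial CI when 0 events occur in n trials.

    Assumption: Clopper-Pearson interval. With zero observed events the upper
    limit p solves (1 - p)^n = alpha / 2, so p = 1 - (alpha / 2)^(1 / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


upper = zero_event_upper_bound(1000)
print(f"{upper:.2%}")  # prints 0.37%
```

Under this assumption the bound matches the 0--0.37\% interval quoted in the abstract; a different interval method (for example the one-sided "rule of three") would give a slightly different number.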
Reconsidering the energy efficiency of spiking neural networks
Spiking Neural Networks (SNNs) promise higher energy efficiency than conventional Quantized Artificial Neural Networks (QNNs) due to their event-driven, spike-based computation. However, prevailing energy evaluations often oversimplify, focusing on computational aspects while neglecting critical overheads like comprehensive data movement and memory access. Such simplifications can lead to misleading conclusions regarding the true energy benefits of SNNs. This paper presents a rigorous re-evaluation. We establish a fair baseline by mapping rate-encoded SNNs with $T$ timesteps to functionally equivalent QNNs with $\lceil \log_2(T+1) \rceil$ bits. This ensures both models have comparable representational capacities, as well as similar hardware requirements, enabling meaningful energy comparisons. We introduce a detailed analytical energy model encompassing core computation and data movement. Using this model, we systematically explore a wide parameter space, including intrinsic network characteristics ($T$, spike rate $\SR$, QNN sparsity $\gamma$, model size $N$, weight bit-level) and hardware characteristics (memory system and network-on-chip). Our analysis identifies specific operational regimes where SNNs genuinely offer superior energy efficiency. For example, under typical neuromorphic hardware conditions, SNNs with moderate time windows ($T \in [5,10]$) require an average spike rate ($\SR$) below 6.4\% to outperform equivalent QNNs. Furthermore, to illustrate the real-world implications of our findings, we analyze the operational lifetime of a typical smartwatch, showing that an optimized SNN can nearly double its battery life compared to a QNN. These insights guide the design of truly energy-efficient neural network solutions.
Updated: 2026-03-09 12:39:37
Categories: cs.NE,cs.AI,cs.LG
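The bit-width mapping in the abstract, $\lceil \log_2(T+1) \rceil$, follows from a rate-coded neuron emitting between 0 and $T$ spikes, which gives $T+1$ distinct values. A small sketch of that arithmetic; the function name is illustrative and not from the paper.

```python
def equivalent_qnn_bits(timesteps: int) -> int:
    """Bits needed to represent the T + 1 possible spike counts (0..T)."""
    if timesteps < 1:
        raise ValueError("timesteps must be at least 1")
    # For T >= 1, T.bit_length() equals ceil(log2(T + 1)) and avoids
    # floating-point rounding near exact powers of two.
    return timesteps.bit_length()


for t in (1, 3, 5, 7, 10, 15, 16):
    print(t, equivalent_qnn_bits(t))
# prints: 1 1, 3 2, 5 3, 7 3, 10 4, 15 4, 16 5 (one pair per line)
```

For the moderate time windows mentioned in the abstract ($T \in [5,10]$), the equivalent QNN therefore uses 3 or 4 bits.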
Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations
Humans consistently outperform state-of-the-art AI models in action recognition, particularly in challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We use our previously introduced Epic ReduAct, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated using over 3,000 human participants and the Side4Video model. Our analysis combines quantitative metrics, Average Reduction Rate and Recognition Gap, with qualitative analyses of spatial (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). Results show that human performance exhibits sharp declines when transitioning from MIRCs to subMIRCs, reflecting a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually and often relies on contextual and mid- to low-level features, sometimes even exhibiting increased confidence under spatial reduction. Temporally, humans remain robust to scrambling when key spatial cues are preserved, whereas the model often shows insensitivity to temporal disruption, revealing class-dependent temporal sensitivities.
Updated: 2026-03-09 12:38:56
Categories: cs.CV,cs.AI
SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
Modern vision-language-model (VLM) based graphical user interface (GUI) agents are expected not only to execute actions accurately but also to respond to user instructions with low latency. While existing research on GUI-agent security mainly focuses on manipulating action correctness, the security risks related to response efficiency remain largely unexplored. In this paper, we introduce SlowBA, a novel backdoor attack that targets the responsiveness of VLM-based GUI agents. The key idea is to manipulate response latency by inducing excessively long reasoning chains under specific trigger patterns. To achieve this, we propose a two-stage reward-level backdoor injection (RBI) strategy that first aligns the long-response format and then learns trigger-aware activation through reinforcement learning. In addition, we design realistic pop-up windows as triggers that naturally appear in GUI environments, improving the stealthiness of the attack. Extensive experiments across multiple datasets and baselines demonstrate that SlowBA can significantly increase response length and latency while largely preserving task accuracy. The attack remains effective even with a small poisoning ratio and under several defense settings. These findings reveal a previously overlooked security vulnerability in GUI agents and highlight the need for defenses that consider both action correctness and response efficiency. Code can be found in https://github.com/tu-tuing/SlowBA.
Updated: 2026-03-09 12:38:28
Categories: cs.CR,cs.CL,cs.CV
From Mice to Trains: Amortized Bayesian Inference on Graph Data
Graphs arise across diverse domains, from biology and chemistry to social and information networks, as well as in transportation and logistics. Inference on graph-structured data requires methods that are permutation-invariant, scalable across varying sizes and sparsities, and capable of capturing complex long-range dependencies, making posterior estimation on graph parameters particularly challenging. Amortized Bayesian Inference (ABI) is a simulation-based framework that employs generative neural networks to enable fast, likelihood-free posterior inference. We adapt ABI to graph data to address these challenges and perform inference on node-, edge-, and graph-level parameters. Our approach couples permutation-invariant graph encoders with flexible neural posterior estimators in a two-module pipeline: a summary network maps attributed graphs to fixed-length representations, and an inference network approximates the posterior over parameters. In this setting, several neural architectures can serve as the summary network. In this work, we evaluate multiple architectures and assess their performance on controlled synthetic settings and two real-world domains - biology and logistics - in terms of recovery and calibration.
Updated: 2026-03-09 12:37:58
Categories: stat.ML,cs.LG
End-to-end Differentiable Calibration and Reconstruction for Optical Particle Detectors
Large-scale homogeneous detectors with optical readouts are widely used in particle detection, with Cherenkov and scintillator neutrino detectors as prominent examples. Analyses in experimental physics rely on high-fidelity simulators to translate sensor-level information into physical quantities of interest. This task critically depends on accurate calibration, which aligns simulation behavior with real detector data, and on tracking, which infers particle properties from optical signals. We present the first end-to-end differentiable optical particle detector simulator, enabling simultaneous calibration and reconstruction through gradient-based optimization. Our approach unifies simulation, calibration, and tracking, which are traditionally treated as separate problems, within a single differentiable framework. We demonstrate that it achieves smooth and physically meaningful gradients across all key stages of light generation, propagation, and detection while maintaining computational efficiency. We show that gradient-based calibration and reconstruction greatly simplify existing analysis pipelines while matching or surpassing the performance of conventional non-differentiable methods in both accuracy and speed. Moreover, the framework's modularity allows straightforward adaptation to diverse detector geometries and target materials, providing a flexible foundation for experiment design and optimization. The results demonstrate the readiness of this technique for adoption in current and future optical detector experiments, establishing a new paradigm for simulation and reconstruction in particle physics.
Updated: 2026-03-09 12:36:42
Categories: hep-ex,cs.LG,physics.ins-det
Sign Identifiability of Causal Effects in Stationary Stochastic Dynamical Systems
We study identifiability in continuous-time linear stationary stochastic differential equations with known causal structure. Unlike existing approaches, we relax the assumption of a known diffusion matrix, thereby respecting the model's intrinsic scale invariance. Rather than recovering drift coefficients themselves, we introduce edge-sign identifiability: for a given causal structure, we ask whether the sign of a given drift entry is uniquely determined across all observational covariance matrices induced by parametrizations compatible with that structure. Under a notion of faithfulness, we derive criteria for characterising identifiability, non-identifiability, and partial identifiability for general graphs. Applying our criteria to specific causal structures, both analogous to classical causal settings (e.g., instrumental variables) and novel cyclic settings, we determine their edge-sign identifiability and, in some cases, obtain explicit expressions for the sign of a target edge in terms of the observational covariance matrix.
Updated: 2026-03-09 12:33:51
Categories: math.ST,cs.LG
OCN: Effectively Utilizing Higher-Order Common Neighbors for Better Link Prediction
Common Neighbors (CNs) and their higher-order variants are important pairwise features widely used in state-of-the-art link prediction methods. However, existing methods often struggle with the repetition across different orders of CNs and fail to fully leverage their potential. We identify that these limitations stem from two key issues: redundancy and over-smoothing in high-order common neighbors. To address these challenges, we design orthogonalization to eliminate redundancy between different-order CNs and normalization to mitigate over-smoothing. By combining these two techniques, we propose Orthogonal Common Neighbor (OCN), a novel approach that significantly outperforms the strongest baselines by an average of 7.7\% on popular link prediction benchmarks. A thorough theoretical analysis is provided to support our method. Ablation studies also verify the effectiveness of our orthogonalization and normalization techniques. Code is available at: https://github.com/qingpingmo/OCN.
Updated: 2026-03-09 12:33:02
Categories: cs.LG,cs.AI
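To make the idea of removing redundancy between different-order CN features concrete, here is a generic Gram-Schmidt sketch. It is an illustration under stated assumptions, not the OCN implementation: the abstract does not specify how the orthogonalization or normalization is carried out, and the feature vectors `cn1` and `cn2` are made-up counts for four candidate node pairs.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def orthogonalize(features):
    """Modified Gram-Schmidt over a list of equal-length feature vectors.

    Each vector has its components along all earlier (lower-order) vectors
    removed and is then scaled to unit length. A vector that becomes
    numerically zero is returned as a zero vector.
    """
    basis = []
    for vec in features:
        residual = list(vec)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * x for r, x in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > 1e-12:
            basis.append([r / norm for r in residual])
        else:
            basis.append([0.0] * len(residual))
    return basis


# Hypothetical first- and second-order CN counts for four candidate pairs.
cn1 = [2.0, 0.0, 1.0, 3.0]
cn2 = [5.0, 1.0, 2.0, 7.0]
q1, q2 = orthogonalize([cn1, cn2])
print(abs(dot(q1, q2)) < 1e-9)  # the two processed features no longer overlap
```

In this toy setting the second-order vector keeps only the part not already explained by the first-order one, and unit scaling stops the larger higher-order counts from dominating.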
Latent Sculpting for Zero-Shot Generalization: A Manifold Learning Approach to Out-of-Distribution Anomaly Detection
A critical vulnerability of supervised deep learning in high-dimensional tabular domains is "generalization collapse": models form precise decision boundaries around known training distributions but fail catastrophically when encountering Out-of-Distribution (OOD) data. To overcome this, we propose Latent Sculpting, a hierarchical, two-stage representation learning architecture designed to enforce explicit structural boundaries prior to density estimation. In the first stage, a Transformer-based tabular encoder is trained using our novel Binary Latent Sculpting loss. This objective explicitly condenses benign network traffic into a dense, low-entropy hypersphere while enforcing a strict geometric minimum-distance margin for anomalous patterns. In the second stage, a Masked Autoregressive Flow (MAF) maps this structurally optimized manifold to calculate exact, probabilistic anomaly thresholds. We evaluate this methodology on the CIC-IDS-2017 benchmark under a rigorous zero-shot protocol, deliberately withholding complex attack classes during training to test true OOD generalization. Averaged across three random initialization seeds to ensure statistical robustness, our framework maintains near-perfect classification on known signatures (F1 = 0.980 +/- 0.000) while achieving an overall zero-shot OOD F1-Score of 0.867 +/- 0.021 and an AUROC of 0.913 +/- 0.010 at an 85th-percentile threshold. Most notably, the model achieves an average recall of 78.7% (peaking at 97.2%) on stealthy "Infiltration" attacks and over 94% on low-volume DoS variations - complex distributional shifts where standard supervised and unsupervised baselines historically suffer near-total detection failure. These empirical results demonstrate that explicitly decoupling topological manifold structuring from probabilistic density estimation establishes a highly stable and scalable defense against zero-day cyber threats.
Updated: 2026-03-09 12:32:04
Categories: cs.LG,cs.CR
Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically rely on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., ``long beak'' and ``wings'' for a ``bird''). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method, and then segmented using a VLM. The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.
Updated: 2026-03-09 12:31:14
Categories: cs.CV,cs.AI,cs.LG
Wasserstein Gradient Flows for Scalable and Regularized Barycenter Computation
Wasserstein barycenters provide a principled approach for aggregating probability measures, while preserving the geometry of their ambient space. Existing discrete methods are not scalable as they assume access to the complete set of samples from the input measures. Meanwhile, neural network approaches do scale well, but rely on complex optimization problems and cannot easily incorporate label information. We address these limitations through gradient flows in the space of probability measures. Through time discretization, we achieve a scalable algorithm that i) relies on mini-batch optimal transport, ii) accepts modular regularization through task-aware functions, and iii) seamlessly integrates supervised information into the ground-cost. We empirically validate our approach on domain adaptation benchmarks that span computer vision, neuroscience, and chemical engineering. Our method establishes a new state-of-the-art barycenter solver, with labeled barycenters consistently outperforming unlabeled ones.
Updated: 2026-03-09 12:27:23
Categories: stat.ML,cs.AI,cs.LG
Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation
Text-conditioned generative models for volumetric medical imaging provide semantic control but lack explicit anatomical guidance, often resulting in outputs that are spatially ambiguous or anatomically inconsistent. In contrast, structure-driven methods ensure strong anatomical consistency but typically assume access to ground-truth annotations, which are unavailable when the target image is to be synthesized. We propose a retrieval-augmented approach for Text-to-CT generation that integrates semantic and anatomical information under a realistic inference setting. Given a radiology report, our method retrieves a semantically related clinical case using a 3D vision-language encoder and leverages its associated anatomical annotation as a structural proxy. This proxy is injected into a text-conditioned latent diffusion model via a ControlNet branch, providing coarse anatomical guidance while maintaining semantic flexibility. Experiments on the CT-RATE dataset show that retrieval-augmented generation improves image fidelity and clinical consistency compared to text-only baselines, while additionally enabling explicit spatial controllability, a capability inherently absent in such approaches. Further analysis highlights the importance of retrieval quality, with semantically aligned proxies yielding consistent gains across all evaluation axes. This work introduces a principled and scalable mechanism to bridge semantic conditioning and anatomical plausibility in volumetric medical image synthesis. Code will be released.
Updated: 2026-03-09 12:27:17
Categories: cs.CV,cs.AI
Graph-Instructed Neural Networks for parametric problems with varying boundary conditions
This work addresses the accurate and efficient simulation of physical phenomena governed by parametric Partial Differential Equations (PDEs) characterized by varying boundary conditions, where parametric instances modify not only the physics of the problem but also the imposition of boundary constraints on the computational domain. In such scenarios, classical Galerkin projection-based reduced order techniques encounter a fundamental bottleneck. Parametric boundaries typically necessitate a re-formulation of the discrete problem for each new configuration, and often, these approaches are unsuitable for real-time applications. To overcome these limitations, we propose a novel methodology based on Graph-Instructed Neural Networks (GINNs). The GINN framework effectively learns the mapping between the parametric description of the computational domain and the corresponding PDE solution. Our results demonstrate that the proposed GINN-based models can efficiently represent highly complex parametric PDEs, serving as a robust and scalable asset for several application-oriented settings when compared with fully connected architectures.
Updated: 2026-03-09 12:26:53
Categories: math.NA,cs.AI,cs.LG
OTESGN: Optimal Transport-Enhanced Syntactic-Semantic Graph Networks for Aspect-Based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) aims to identify aspect terms and determine their sentiment polarity. While dependency trees combined with contextual semantics provide structural cues, existing approaches often rely on dot-product similarity and fixed graphs, which limit their ability to capture nonlinear associations and adapt to noisy contexts. To address these limitations, we propose the Optimal Transport-Enhanced Syntactic-Semantic Graph Network (OTESGN), a model that jointly integrates structural and distributional signals. Specifically, a Syntactic Graph-Aware Attention module models global dependencies with syntax-guided masking, while a Semantic Optimal Transport Attention module formulates aspect-opinion association as a distribution matching problem solved via the Sinkhorn algorithm. An Adaptive Attention Fusion mechanism balances heterogeneous features, and contrastive regularization enhances robustness. Extensive experiments on three benchmark datasets (Rest14, Laptop14, and Twitter) demonstrate that OTESGN delivers state-of-the-art performance. Notably, it surpasses competitive baselines by up to +1.30 Macro-F1 on Laptop14 and +1.01 on Twitter. Ablation studies and visualization analyses further highlight OTESGN's ability to capture fine-grained sentiment associations and suppress noise from irrelevant context.
Updated: 2026-03-09 12:23:37
Categories: cs.CL,cs.AI
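The Sinkhorn algorithm named in the abstract is a standard solver for entropy-regularised optimal transport. A minimal pure-Python sketch on a toy problem follows; the cost matrix, the marginals, the regularisation strength `eps`, and the function name are illustrative assumptions and are not taken from OTESGN.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropy-regularised transport plan between histograms a and b.

    cost[i][j] is the cost of moving mass from source i to target j.
    Alternately rescales rows and columns of the Gibbs kernel exp(-cost / eps)
    so that the plan's marginals approach a and b.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 2 "aspect" tokens, 3 "opinion" tokens (hypothetical costs).
cost = [[0.1, 0.9, 0.5],
        [0.8, 0.2, 0.4]]
a = [0.5, 0.5]        # mass on each aspect token
b = [0.3, 0.3, 0.4]   # mass on each opinion token
plan = sinkhorn(cost, a, b)
for row in plan:
    print([round(p, 3) for p in row])
```

Each row of `plan` can be read as a soft assignment of one source item over the targets; low-cost pairs receive more mass, and a smaller `eps` makes the plan sharper at the price of slower convergence.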
LaVCa: LLM-assisted Visual Cortex Captioning
Understanding the properties of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that uses large language models (LLMs) to generate natural-language captions for images to which voxels are selective. By applying LaVCa to image-evoked brain activity, we demonstrate that LaVCa generates captions that describe voxel selectivity more accurately than the previously proposed method. Furthermore, the captions generated by LaVCa quantitatively capture more detailed properties than the existing method at both the inter-voxel and intra-voxel levels. In addition, a more detailed analysis of the voxel-specific properties generated by LaVCa reveals fine-grained functional differentiation within regions of interest (ROIs) in the visual cortex and voxels that simultaneously represent multiple distinct concepts. These findings offer profound insights into human visual representations by assigning detailed captions throughout the visual cortex while highlighting the potential of LLM-based methods in understanding brain representations.
Updated: 2026-03-09 12:19:23
Categories: q-bio.NC,cs.AI,cs.CL,cs.CV,cs.LG
SoK: Harmonizing Attack Graphs and Intrusion Detection Systems
Detecting and responding to cyber attacks is increasingly difficult as high-volume, complex network traffic allows threats to remain concealed. While Intrusion Detection Systems (IDSs) identify anomalous behavior, Attack Graphs (AGs) serve as the primary threat model for analyzing attacker strategies and informing any response. Despite the conceptual connection being recognized in early research, the field of AG and IDS integration lacks a common structure. This paper presents the first systematic analysis of AG-IDS integration, comprehensively reviewing 73 works in the literature. We introduce a novel taxonomy revealing that current research is dominated by specialized, single-purpose integrations, such as using AGs to filter IDS false positives or using IDS alerts to prune AGs. Our analysis highlights a critical gap: the absence of a unifying framework that treats IDSs and AGs as a cohesive, integrated system. To address this gap, we propose a formal AG-IDS lifecycle. This framework establishes a continuous feedback loop where IDSs refine the accuracy of AG models, and those updated models, in turn, enhance IDS detection capabilities. We provide a proof-of-concept implementation demonstrating the practical advantages of this lifecycle for threat detection and incident response. Finally, we conclude by elaborating on significant opportunities for future development within the AG-IDS domain.
Updated: 2026-03-09 12:16:51
Categories: cs.CR
Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm
Multimodal Mathematical Reasoning (MMR) has recently attracted increasing attention for its capability to solve mathematical problems that involve both textual and visual modalities. However, current models still face significant challenges in real-world visual math tasks. They often misinterpret diagrams, fail to align mathematical symbols with visual evidence, and produce inconsistent reasoning steps. Moreover, existing evaluations mainly focus on checking final answers rather than verifying the correctness or executability of each intermediate step. To address these limitations, a growing body of recent research integrates structured perception, explicit alignment, and verifiable reasoning within unified frameworks. To establish a clear roadmap for understanding and comparing different MMR approaches, we systematically study them around four fundamental questions: (1) What to extract from multimodal inputs, (2) How to represent and align textual and visual information, (3) How to perform the reasoning, and (4) How to evaluate the correctness of the overall reasoning process. Finally, we discuss open challenges and offer perspectives on promising directions for future research.
Updated: 2026-03-09 12:11:00
标题: 对多模态数学推理的解构:走向统一的感知-对齐-推理范式
摘要: 多模数学推理(MMR)近年来引起了越来越多的关注,因为它能够解决涉及文本和视觉两种形式的数学问题。然而,当前的模型在现实世界的视觉数学任务中仍然面临重大挑战。它们经常错误解释图表,无法将数学符号与视觉证据对齐,并产生不一致的推理步骤。此外,现有的评估主要集中在检查最终答案,而不是验证每个中间步骤的正确性或可执行性。为了解决这些限制,最近的研究越来越多地解决这些问题,通过在统一框架内整合结构化感知、明确对齐和可验证推理。为了建立一个清晰的路线图,以便理解和比较不同的MMR方法,我们系统地围绕四个基本问题对它们进行了研究:(1)从多模输入中提取什么,(2)如何表示和对齐文本和视觉信息,(3)如何进行推理,以及(4)如何评估整个推理过程的正确性。最后,我们讨论了开放性挑战,并提供了未来研究的有前景的方向。
更新时间: 2026-03-09 12:11:00
领域: cs.AI
Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion
Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a RAG framework that simulates human cognitive memory processes. The core of this framework lies in the extraction and evolution of the Semantic Gist. During the offline indexing stage, CogitoRAG first deduces unstructured corpora into gist memory corpora, which are then transformed into a multi-dimensional knowledge graph integrating entities, relational facts, and memory nodes. In the online retrieval stage, the framework handles complex queries via a Query Decomposition Module that breaks them into comprehensive sub-queries, mimicking the cognitive decomposition humans employ for complex information. Subsequently, an Entity Diffusion Module performs associative retrieval across the graph, guided by structural relevance and an entity-frequency reward mechanism. Furthermore, we propose the CogniRank algorithm, which precisely reranks candidate passages by fusing diffusion-derived scores with semantic similarity. The final evidence is delivered to the generator in a passage-memory pairing format, providing high-density information support. Experimental results across five mainstream QA benchmarks and multi-task generation on GraphBench demonstrate that CogitoRAG significantly outperforms state-of-the-art RAG methods, showcasing superior capabilities in complex knowledge integration and reasoning.
Updated: 2026-03-09 12:10:47
标题: 理解然后记忆:一个以认知要点驱动的全局语义扩散的RAG框架
摘要: 检索增强生成(RAG)通过整合外部知识有效地减轻了LLM中的幻觉。然而,现有框架中文本的固有离散表示通常导致语义完整性的丧失,从而导致检索偏差。受人类情节记忆机制的启发,我们提出了CogitoRAG,这是一个模拟人类认知记忆过程的RAG框架。该框架的核心在于提取和演化语义要旨。在离线索引阶段,CogitoRAG首先将非结构化语料库演绎成要旨记忆语料库,然后将其转换为一个多维知识图,集成了实体、关系事实和记忆节点。在在线检索阶段,该框架通过查询分解模块处理复杂查询,将其分解为全面的子查询,模拟人类处理复杂信息时采用的认知分解。随后,实体扩散模块通过结构相关性和一个实体频率奖励机制指导下,在图中进行联想检索。此外,我们提出了CogniRank算法,通过将扩散派生的分数与语义相似性融合,精确地重新排列候选段落。最终的证据以段落-记忆配对格式传递给生成器,提供高密度信息支持。在五个主流QA基准和GraphBench上的多任务生成的实验结果表明,CogitoRAG在复杂知识整合和推理方面明显优于最先进的RAG方法,展示了其在复杂知识整合和推理方面的优越能力。
更新时间: 2026-03-09 12:10:47
领域: cs.CL,cs.AI
Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization
We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.
Updated: 2026-03-09 12:09:14
标题: 小事先行,大事后行:一种深度诱发的对锐度感知最小化的隐性偏见
摘要: 我们研究了在线性可分二元分类问题上训练$L$层线性对角网络时,Sharpness-Aware Minimization (SAM)的隐式偏差。对于线性模型($L=1$),$\ell_\infty$-SAM和$\ell_2$-SAM都恢复了$\ell_2$最大间隔分类器,与梯度下降(GD)相匹配。然而,对于深度$L=2$,即使在单个示例数据集上,行为也发生了剧烈的变化。对于$\ell_\infty$-SAM,极限方向关键取决于初始化,可能会收敛到$\mathbf{0}$或任何标准基向量,这与GD形成了鲜明对比,GD的极限与主要数据坐标的基向量对齐。对于$\ell_2$-SAM,我们表明尽管它的极限方向与$\ell_1$最大间隔解匹配,就像在GD的情况下一样,但其有限时间动态展现出一种我们称为“顺序特征放大”的现象,其中预测器最初依赖于次要坐标,随着训练的进行或初始化的增加逐渐转移到较大的坐标。我们的理论分析将这一现象归因于$\ell_2$-SAM中应用于其扰动的梯度归一化因子,该因子早期放大次要坐标,允许主要坐标在后期占主导地位,这提供了一个具体的例子,说明无限时间的隐式偏差分析是不充分的。合成和真实数据实验证实了我们的发现。
更新时间: 2026-03-09 12:09:14
领域: cs.LG,cs.AI
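The gradient normalization factor that the SAM abstract above refers to can be seen in a generic $\ell_2$-SAM update. The following is a minimal sketch of the standard two-step SAM rule (perturb along the normalized gradient, then descend using the gradient at the perturbed point), not the paper's diagonal-network setup; the function name `sam_l2_step` and the toy quadratic loss are illustrative assumptions.

```python
import math


def sam_l2_step(w, grad_fn, lr, rho):
    """One l2-SAM update on a parameter list `w`.

    grad_fn(w) must return the loss gradient at w as a list of floats.
    `rho` is the perturbation radius and `lr` the learning rate.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # At a stationary point the perturbation is undefined; leave w unchanged.
        return list(w)
    # Ascent step: move distance rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient evaluated at the perturbed point.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * ||w||^2 (an assumption for illustration)."""
    return list(w)


new_w = sam_l2_step([3.0, 4.0], quadratic_grad, lr=0.1, rho=0.5)
```

Dividing by the gradient norm fixes the perturbation at length `rho` regardless of the gradient's magnitude; this is the normalization that the abstract credits with amplifying minor coordinates early in training.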
Explainable classification of astronomical uncertain time series
Exploring the expansion history of the universe, understanding its evolutionary stages, and predicting its future evolution are important goals in astrophysics. Today, machine learning tools are used to help achieve these goals by analyzing transient sources, which are modeled as uncertain time series. Although black-box methods achieve appreciable performance, existing interpretable time series methods fail to obtain acceptable performance for this type of data. Furthermore, data uncertainty is rarely taken into account in these methods. In this work, we propose an uncertainty-aware, subsequence-based model which achieves classification performance comparable to that of state-of-the-art methods. Unlike conformal learning, which estimates model uncertainty on predictions, our method takes data uncertainty as additional input. Moreover, our approach is explainable-by-design, giving domain experts the ability to inspect the model and explain its predictions. The explainability of the proposed method also has the potential to inspire new developments in theoretical astrophysics modeling by suggesting important subsequences which depict details of light curve shapes. The dataset, the source code of our experiment, and the results are made available on a public repository.
Updated: 2026-03-09 12:08:36
标题: 可解释的天文不确定时间序列分类
摘要: 探索宇宙的扩张历史,了解其演化阶段,并预测其未来演化是天体物理学中的重要目标。如今,机器学习工具被用来帮助实现这些目标,通过分析被建模为不确定时间序列的瞬变源。虽然黑盒方法取得了可观的性能,但现有的可解释时间序列方法未能在这种数据类型上获得可接受的性能。此外,在这些方法中很少考虑数据不确定性。在这项工作中,我们提出了一种基于子序列的不确定性感知模型,其分类性能可与最先进的方法相媲美。与估计预测模型不确定性的一致性学习不同,我们的方法将数据不确定性作为额外输入。此外,我们的方法是通过设计可解释的,使领域专家能够检查模型并解释其预测。所提出的方法的可解释性还有潜力激发理论天体物理建模的新发展,通过提出描绘光曲线形状细节的重要子序列。我们在公共存储库上提供了数据集、实验的源代码和结果。
更新时间: 2026-03-09 12:08:36
领域: cs.LG,astro-ph.GA,cs.AI
A Blockchain-based Traceability System for AI-Driven Engine Blade Inspection
Aircraft engine blade maintenance relies on inspection records shared across manufacturers, airlines, maintenance organizations, and regulators. Yet current systems are fragmented, difficult to audit, and vulnerable to tampering. This paper presents BladeChain, a blockchain-based system providing immutable traceability for blade inspections throughout the component life cycle. BladeChain is the first system to integrate multi-stakeholder endorsement, automated inspection scheduling, AI model provenance, and cryptographic evidence binding, delivering auditable maintenance traceability for aerospace deployments. Built on a four-stakeholder Hyperledger Fabric network (OEM, Airline, MRO, Regulator), BladeChain captures every life-cycle event in a tamper-evident ledger. A chaincode-enforced state machine governs blade status transitions and automatically triggers inspections when configurable flight hour, cycle, or calendar thresholds are exceeded, eliminating manual scheduling errors. Inspection artifacts are stored off-chain in IPFS and linked to on-chain records via SHA-256 hashes, with each inspection record capturing the AI model name and version used for defect detection. This enables regulators to audit both what defects were found and how they were found. The detection module is pluggable, allowing organizations to adopt or upgrade inspection models without modifying the ledger or workflows. We built a prototype and evaluated it on workloads of up to 100 blades, demonstrating 100% life cycle completion with consistent throughput of 26 operations per minute. A centralized SQL baseline quantifies the consensus overhead and highlights the security trade-off. Security validation confirms tamper detection within 17 ms through hash verification.
Updated: 2026-03-09 12:06:56
标题: 一个基于区块链的追溯系统,用于人工智能驱动的发动机叶片检验
摘要: 飞机发动机叶片维护依赖于制造商、航空公司、维修组织和监管机构之间共享的检验记录。然而,当前系统存在碎片化、难以审计和易受篡改的问题。本文介绍了BladeChain,这是一个基于区块链的系统,为叶片检验提供了不可变的追溯性,覆盖了整个组件的生命周期。BladeChain是第一个整合了多利益相关方认可、自动检验调度、AI模型溯源和加密证据绑定的系统,为航空部署提供了可审计的维护追溯性。基于四方利益相关者Hyperledger Fabric网络(OEM、航空公司、MRO、监管机构)构建的BladeChain在一个防篡改的账本中记录了每个生命周期事件。一个通过链码强制执行的状态机管理叶片状态转换,并在超过可配置的飞行小时数、循环数或日历阈值时自动触发检验,消除了手动调度错误。检验文档存储在IPFS中,通过SHA-256哈希与链上记录相连,每个检验记录都包含用于缺陷检测的AI模型名称和版本。这使监管机构能够审计发现的缺陷以及发现方式。检测模块是可插拔的,允许组织在不修改账本或工作流程的情况下采用或升级检验模型。我们建立了一个原型,并在多达100个叶片的工作负载下进行了评估,展示了100%的生命周期完成,每分钟26次操作的一致吞吐量。一个集中式SQL基线量化了共识开销,并突显了安全权衡。安全验证确认通过哈希验证在17毫秒内检测到篡改。
更新时间: 2026-03-09 12:06:56
领域: cs.CR,cs.AI,cs.DC
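The evidence binding described in the BladeChain abstract above (off-chain artifacts linked to on-chain records through SHA-256 hashes, with the AI model name and version recorded alongside) can be illustrated with a small sketch. This is not BladeChain's chaincode: the record layout and the names `make_inspection_record` and `is_untampered` are hypothetical, and a plain dictionary stands in for the ledger entry.

```python
import hashlib
import hmac


def fingerprint(artifact: bytes) -> str:
    """Return the SHA-256 digest of an inspection artifact as a hex string."""
    return hashlib.sha256(artifact).hexdigest()


def make_inspection_record(blade_id, artifact, model_name, model_version):
    """Build a hypothetical ledger record that binds an artifact by its hash.

    Only the digest is stored in the record; the artifact itself would live
    off-chain (the abstract names IPFS for that role).
    """
    return {
        "blade_id": blade_id,
        "artifact_sha256": fingerprint(artifact),
        "model_name": model_name,
        "model_version": model_version,
    }


def is_untampered(record, artifact: bytes) -> bool:
    """Recompute the digest and compare it with the one stored in the record."""
    return hmac.compare_digest(record["artifact_sha256"], fingerprint(artifact))


record = make_inspection_record("B-001", b"inspection image bytes", "detector", "1.0")
```

Because any change to the artifact changes its digest, an auditor holding only the record can detect a modified artifact by recomputing the hash.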
PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection
Phishing websites remain a major cybersecurity threat, exploiting deceptive structures, brand impersonation, and social engineering to evade detection. Recent advances in large language models (LLMs) have improved phishing detection through contextual understanding, yet most existing approaches rely on single-agent classification, which is prone to hallucination and often lacks interpretability and robustness. To address these limitations, we propose PhishDebate, a modular multi-agent LLM-based debate framework for phishing website detection. Four specialized agents independently analyze webpage aspects, including URL structure, HTML composition, semantic content, and brand impersonation, under the coordination of a Moderator and final Judge. Through structured debate and divergent reasoning, the framework achieves more accurate and interpretable decisions. By reducing uncertain predictions and providing transparent reasoning, PhishDebate functions as an analyst-augmentation system that lowers cognitive load and supports early, left-of-exploit detection of phishing threats. Evaluations on commercial LLMs show that PhishDebate achieves 98.2% recall on a real-world phishing dataset and outperforms single-agent and Chain-of-Thought (CoT) baselines. Its modular design enables agent-level configurability, allowing adaptation to varying resource and application requirements, and offers scalability to high-velocity, large-scale security data environments.
Updated: 2026-03-09 12:04:06
标题: PhishDebate:一种基于LLM的多Agent框架用于钓鱼网站检测
摘要: 网络钓鱼网站仍然是一种主要的网络安全威胁,利用欺骗性结构、品牌冒充和社会工程学来规避检测。最近大型语言模型(LLMs)的进展通过上下文理解改善了网络钓鱼检测,然而大多数现有方法依赖于单一代理分类,容易产生幻觉,通常缺乏可解释性和鲁棒性。为了解决这些限制,我们提出了PhishDebate,一个基于模块化多代理LLM的辩论框架,用于检测网络钓鱼网站。四个专门的代理独立分析网页方面,包括URL结构、HTML组成、语义内容和品牌冒充,由调解员和最终裁判协调。通过结构化辩论和分歧推理,该框架实现了更准确和可解释的决策。通过减少不确定的预测并提供透明的推理,PhishDebate作为分析员增强系统,降低认知负担,支持网络钓鱼威胁的早期、未被利用的检测。商业LLMs的评估表明,PhishDebate在真实世界的网络钓鱼数据集上实现了98.2%的召回率,并优于单一代理和Chain-of-Thought(CoT)基线。其模块化设计实现了代理级别的可配置性,可以适应不同的资源和应用需求,并提供可扩展性到高速、大规模的安全数据环境。
更新时间: 2026-03-09 12:04:06
领域: cs.CR
Posterior Sampling Reinforcement Learning with Gaussian Processes for Continuous Control: Sublinear Regret Bounds for Unbounded State Spaces
We analyze the Bayesian regret of the Gaussian process posterior sampling reinforcement learning (GP-PSRL) algorithm. Posterior sampling is an effective heuristic for decision-making under uncertainty that has been used to develop successful algorithms for a variety of continuous control problems. However, theoretical work on GP-PSRL is limited. All known regret bounds either fail to achieve a tight dependence on a kernel-dependent quantity called the maximum information gain or fail to properly account for the fact that the set of possible system states is unbounded. Through a recursive application of the Borell-Tsirelson-Ibragimov-Sudakov inequality, we show that, with high probability, the states actually visited by the algorithm are contained within a ball of near-constant radius. To obtain tight dependence on the maximum information gain, we use the chaining method to control the regret suffered by GP-PSRL. Our main result is a Bayesian regret bound of the order $\widetilde{\mathcal{O}}(H^{3/2}\sqrt{\gamma_{T/H} T})$, where $H$ is the horizon, $T$ is the number of time steps and $\gamma_{T/H}$ is the maximum information gain. With this result, we resolve the limitations of prior theoretical work on PSRL, and provide the theoretical foundation and tools for analyzing PSRL in complex settings.
Updated: 2026-03-09 12:03:25
标题: 使用高斯过程的后验抽样强化学习在连续控制中的应用:对无界状态空间的次线性后悔界
摘要: 我们分析了高斯过程后验抽样强化学习(GP-PSRL)算法的贝叶斯遗憾。后验抽样是一种在不确定性下进行决策的有效启发式方法,已被用来开发成功的连续控制问题算法。然而,关于GP-PSRL的理论研究有限。所有已知的遗憾界要么无法实现对依赖于称为最大信息增益的核相关量的紧密依赖,要么无法正确考虑可能系统状态集合不受限制的事实。通过递归应用Borell-Tsirelson-Ibragimov-Sudakov不等式,我们展示了算法实际访问的状态高概率地包含在一个近乎恒定半径的球内。为了获得对最大信息增益的紧密依赖,我们使用链接方法来控制GP-PSRL所承受的遗憾。我们的主要结果是一个贝叶斯遗憾界,其阶数为$\widetilde{\mathcal{O}}(H^{3/2}\sqrt{γ_{T/H} T})$,其中$H$是视野,$T$是时间步数,$γ_{T/H}$是最大信息增益。通过这一结果,我们解决了先前关于PSRL的理论工作的局限性,并为在复杂环境中分析PSRL提供了理论基础和工具。
更新时间: 2026-03-09 12:03:25
领域: stat.ML,cs.LG
PolyFormer: learning efficient reformulations for scalable optimization under complex physical constraints
Real-world optimization problems are often constrained by complex physical laws that limit computational scalability. These constraints are inherently tied to complex regions, and thus learning models that incorporate physical and geometric knowledge, i.e., physics-informed machine learning (PIML), offer a promising pathway for efficient solution. Here, we introduce PolyFormer, which opens a new direction for PIML in prescriptive optimization tasks, where physical and geometric knowledge is not merely used to regularize learning models, but to simplify the problems themselves. PolyFormer captures geometric structures behind constraints and transforms them into efficient polytopic reformulations, thereby decoupling problem complexity from solution difficulty and enabling off-the-shelf optimization solvers to efficiently produce feasible solutions with acceptable optimality loss. Through evaluations across three important problems (large-scale resource aggregation, network-constrained optimization, and optimization under uncertainty), PolyFormer achieves computational speedups up to 6,400-fold and memory reductions up to 99.87%, while maintaining solution quality competitive with or superior to state-of-the-art methods. These results demonstrate that PolyFormer provides an efficient and reliable solution for scalable constrained optimization, expanding the scope of PIML to prescriptive tasks in scientific discovery and engineering applications.
Updated: 2026-03-09 11:59:39
标题: PolyFormer:学习高效的重构,在复杂物理约束下可扩展优化
摘要: 现实世界中的优化问题通常受到限制计算可扩展性的复杂物理定律的约束。这些约束与复杂区域密切相关,因此结合物理和几何知识的学习模型,即物理信息机器学习(PIML),为高效解决提供了一个有前途的途径。在这里,我们介绍了PolyFormer,它为PIML在规定性优化任务中开辟了一条新方向,其中物理和几何知识不仅仅用于规范学习模型,而是用于简化问题本身。PolyFormer捕获了约束背后的几何结构,并将其转化为高效的多面体重构,从而将问题复杂性与解决难度分离,并使得现成的优化求解器能够高效地生成可行解,且具有可接受的优化损失。通过对三个重要问题(大规模资源聚合、网络约束优化和不确定性优化)的评估,PolyFormer实现了高达6,400倍的计算加速和高达99.87%的内存减少,同时保持了与最先进方法相竞争或更优的解决方案质量。这些结果表明,PolyFormer为可扩展的受限优化提供了一种高效可靠的解决方案,将PIML的范围扩展到科学发现和工程应用中的规定性任务。
更新时间: 2026-03-09 11:59:39
领域: cs.LG,eess.SY,math.OC
Evaluating LLM-Based Grant Proposal Review via Structured Perturbations
As AI-assisted grant proposals outpace manual review capacity in a kind of "Malthusian trap" for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation. Using six EPSRC proposals, we develop a perturbation-based framework probing LLM sensitivity across six quality axes: funding, timeline, competency, alignment, clarity, and impact. We compare three review architectures: single-pass review, section-by-section analysis, and a "Council of Personas" ensemble emulating expert panels. The section-level approach significantly outperforms alternatives in both detection rate and scoring reliability, while the computationally expensive council method performs no better than baseline. Detection varies substantially by perturbation type, with alignment issues readily identified but clarity flaws largely missed by all systems. Human evaluation shows LLM feedback is largely valid but skewed toward compliance checking over holistic assessment. We conclude that current LLMs may provide supplementary value within EPSRC review but exhibit high variability and misaligned review priorities. We release our code and any non-protected data.
Updated: 2026-03-09 11:53:50
标题: 通过结构性扰动评估基于LLM的资助提案审查
摘要: 随着人工智能辅助的资助提案超过了研究生态系统中手动审查的能力,使得研究生态系统陷入了一种“马尔萨斯陷阱”,本文调查了基于LLM的高风险评估资助审查的能力和局限性。我们使用了六个EPSRC提案,开发了一个基于扰动的框架,探究LLM在六个质量轴上的敏感性:资金、时间表、能力、对齐性、清晰度和影响力。我们比较了三种审查架构:单次审查、分节分析和模拟专家小组的“角色委员会”集合。分节级别的方法在检测率和评分可靠性方面明显优于其他选择,而计算密集型的委员会方法并未比基准表现更好。检测结果根据扰动类型差异很大,对齐问题很容易被识别,但清晰度问题几乎被所有系统忽略。人类评估显示,LLM反馈在很大程度上是有效的,但偏向于合规性检查而不是全面评估。我们得出结论,当前的LLM可能在EPSRC审查中提供补充价值,但表现出高度变化和不符合审查优先事项。我们将发布我们的代码和任何非受保护的数据。
更新时间: 2026-03-09 11:53:50
领域: cs.CL,cs.AI,cs.CY
TA-RNN-Medical-Hybrid: A Time-Aware and Interpretable Framework for Mortality Risk Prediction
Accurate and interpretable mortality risk prediction in intensive care units (ICUs) remains a critical challenge due to the irregular temporal structure of electronic health records (EHRs), the complexity of longitudinal disease trajectories, and the lack of clinically grounded explanations in many data-driven models. To address these challenges, we propose TA-RNN-Medical-Hybrid, a time-aware and knowledge-enriched deep learning framework that jointly models longitudinal clinical sequences and irregular temporal dynamics through explicit continuous-time encoding, along with standardized medical concept representations. The proposed framework extends time-aware recurrent modeling by integrating explicit continuous-time embeddings that operate independently of visit indexing, SNOMED-based disease representations, and a hierarchical dual-level attention mechanism that captures both visit-level temporal importance and feature/concept-level clinical relevance. This design enables accurate mortality risk estimation while providing transparent and clinically meaningful explanations aligned with established medical knowledge. We evaluate the proposed approach on the MIMIC-III critical care dataset and compare it against strong time-aware and sequential baselines. Experimental results demonstrate that TA-RNN-Medical-Hybrid consistently improves predictive performance in terms of AUC, accuracy, and recall-oriented F$_2$-score. Moreover, qualitative analysis shows that the model effectively decomposes mortality risk across time and clinical concepts, yielding interpretable insights into disease severity, chronicity, and temporal progression. Overall, the proposed framework bridges the gap between predictive accuracy and clinical interpretability, offering a scalable and transparent solution for high-stakes ICU decision support systems.
Updated: 2026-03-09 11:49:42
标题: TA-RNN-Medical-Hybrid:一种用于死亡风险预测的时间感知和可解释的框架
摘要: 在重症监护单位(ICUs)中准确且可解释的死亡风险预测仍然是一个关键挑战,这是由于电子健康记录(EHRs)的不规则时间结构、纵向疾病轨迹的复杂性,以及许多基于数据的模型缺乏临床基础解释。为了解决这些挑战,我们提出了一种名为\textit{TA-RNN-Medical-Hybrid}的时间感知和知识丰富的深度学习框架,通过显式的连续时间编码以及标准化的医学概念表示共同建模纵向临床序列和不规则时间动态。所提出的框架通过整合独立于访问索引的显式连续时间嵌入、基于SNOMED的疾病表示以及捕捉访问级时间重要性和特征/概念级临床相关性的层次双级注意机制,扩展了时间感知的递归建模。这种设计能够准确估计死亡风险,同时提供与已建立的医学知识一致的透明且临床意义重大的解释。我们在MIMIC-III重症护理数据集上评估了所提出的方法,并将其与强时间感知和顺序基线进行了比较。实验结果表明,TA-RNN-Medical-Hybrid在AUC、准确率和召回导向的F$_2$-分数方面始终提高了预测性能。此外,定性分析显示,该模型有效地将死亡风险分解为时间和临床概念,提供了关于疾病严重程度、慢性程度和时间进展的可解释见解。总的来说,所提出的框架弥合了预测准确性和临床可解释性之间的差距,为高风险ICU决策支持系统提供了一个可扩展且透明的解决方案。
更新时间: 2026-03-09 11:49:42
领域: cs.LG,cs.AI,cs.DC,cs.ET
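The TA-RNN-Medical-Hybrid abstract above reports a recall-oriented F$_2$-score. As a reminder of what that metric weighs, here is a minimal sketch of the standard F$_\beta$ formula, $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$; the function name `f_beta` and the example precision and recall values are illustrative and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    denom = beta * beta * precision + recall
    if denom == 0.0:
        # Both precision and recall are zero: define the score as zero.
        return 0.0
    return (1.0 + beta * beta) * precision * recall / denom


# Illustrative values only: the same precision/recall pair under F1 and F2.
f1 = f_beta(0.5, 1.0, beta=1.0)
f2 = f_beta(0.5, 1.0, beta=2.0)
```

With precision 0.5 and recall 1.0, F$_2$ comes out higher than F$_1$, which is why the metric suits settings such as mortality prediction where missed positives are costlier than false alarms.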
MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
A key scalability challenge in neural solvers for industrial-scale physics simulations is efficiently capturing both fine-grained local interactions and long-range global dependencies across millions of spatial elements. We introduce the Multi-Scale Patch Transformer (MSPT), an architecture that combines local point attention within patches with global attention to coarse patch-level representations. To partition the input domain into spatially-coherent patches, we employ ball trees, which handle irregular geometries efficiently. This dual-scale design enables MSPT to scale to millions of points on a single GPU. We validate our method on standard PDE benchmarks (elasticity, plasticity, fluid dynamics, porous flow) and large-scale aerodynamic datasets (ShapeNet-Car, Ahmed-ML), achieving state-of-the-art accuracy with substantially lower memory footprint and computational cost.
Updated: 2026-03-09 11:47:56
标题: MSPT:通过并行化多尺度注意力实现高效的大规模物理建模
摘要: 神经求解器在工业规模物理模拟中面临的一个关键可扩展性挑战是有效地捕捉数百万空间元素之间的细粒度局部交互和长程全局依赖关系。我们引入了多尺度补丁变换器(MSPT),这是一种结合了补丁内的局部点关注和全局关注粗糙补丁级表示的架构。为了将输入域划分为空间连贯的补丁,我们采用球树,这可以有效处理不规则几何形状。这种双尺度设计使得MSPT可以在单个GPU上扩展到数百万点。我们在标准PDE基准测试(弹性、塑性、流体动力学、多孔流)和大规模空气动力学数据集(ShapeNet-Car、Ahmed-ML)上验证了我们的方法,实现了最先进的精度,并且内存占用和计算成本大幅降低。
更新时间: 2026-03-09 11:47:56
领域: cs.LG
Privately Estimating Black-Box Statistics
Standard techniques for differentially private estimation, such as Laplace or Gaussian noise addition, require guaranteed bounds on the sensitivity of the estimator in question. But such sensitivity bounds are often large or simply unknown. Thus we seek differentially private methods that can be applied to arbitrary black-box functions. A handful of such techniques exist, but all are either inefficient in their use of data or require evaluating the function on exponentially many inputs. In this work we present a scheme that trades off between statistical efficiency (i.e., how much data is needed) and oracle efficiency (i.e., the number of evaluations). We also present lower bounds showing the near-optimality of our scheme.
Updated: 2026-03-09 11:47:55
标题: 私下估算黑匣子统计量
摘要: 传统的差分隐私估计技术,如拉普拉斯或高斯噪声添加,需要对所涉及的估计器的敏感性有保证的边界。但是这种敏感性边界往往很大,或者干脆未知。因此,我们寻求可以应用于任意黑盒函数的差分隐私方法。目前存在少数这样的技术,但它们要么在数据使用效率上低效,要么需要在指数级的输入上评估函数。在这项工作中,我们提出了一种在统计效率(即需要多少数据)和预测效率(即评估次数)之间进行权衡的方案。我们还提出了证明我们方案接近最优的下界。
更新时间: 2026-03-09 11:47:55
领域: cs.CR,cs.CC,cs.DS,cs.LG
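The abstract above contrasts black-box estimation with standard noise addition, which needs a known sensitivity bound. The sketch below shows that standard baseline, the Laplace mechanism applied to a clipped mean, and not the paper's scheme. The function names, the clipping bounds, and the replace-one neighbouring relation used for the sensitivity are assumptions made for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Epsilon-DP mean of values clipped to [lower, upper], via the Laplace mechanism.

    Assumes the dataset size is public and that neighbouring datasets differ by
    replacing one record, so the clipped mean has sensitivity (upper - lower) / n.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


estimate = private_mean([0.0, 1.0, 0.5, 0.5], 0.0, 1.0, epsilon=1.0, rng=random.Random(0))
```

The noise scale is the sensitivity divided by epsilon, so this approach is only usable when such a bound is known and small, which is the limitation that motivates methods for arbitrary black-box functions.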
AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models
With the widespread adoption of Large Language Models (LLMs), respecting indigenous cultures becomes essential for models' cultural safety and responsible global applications. Existing studies consider cultural safety and cultural knowledge separately and overlook that the former should be grounded in the latter. This severely limits the ability of LLMs to yield culture-specific, respectful responses. Consequently, adaptive cultural safety remains a formidable task. In this work, we propose to jointly model cultural safety and knowledge. First and foremost, paired cultural-safety and knowledge data are the key prerequisite for this research. However, the cultural diversity across regions and the subtlety of cultural differences pose significant challenges to the creation of such paired evaluation data. To address this issue, we propose a novel framework that integrates curation of authoritative cultural knowledge descriptions, LLM-automated query generation, and heavy manual verification. Accordingly, we obtain a dataset named AdaCultureSafe containing 4.8K manually decomposed fine-grained cultural descriptions and the corresponding 48K manually verified safety- and knowledge-oriented queries. On the constructed dataset, we evaluate three families of popular LLMs on their cultural safety and knowledge proficiency, through which we make a critical discovery: no significant correlation exists between their cultural safety and knowledge proficiency. We then delve into the utility-related neuron activations within LLMs to investigate the potential cause of the absence of correlation, which can be attributed to the difference between the objectives of pre-training and post-alignment. We finally present a knowledge-grounded method, which significantly enhances cultural safety by enforcing the integration of knowledge into the LLM response generation process.
Updated: 2026-03-09 11:44:37
标题: AdaCultureSafe:以大型语言模型为基础的自适应文化安全,以文化知识为基础
摘要: 随着大型语言模型(LLMs)的广泛应用,尊重土著文化对于模型的文化安全性和全球应用的责任变得至关重要。现有研究分别考虑文化安全和文化知识,忽视了前者应该建立在后者基础之上的事实。这严重阻碍了LLMs产生特定于文化的尊重回应。因此,适应性文化安全仍然是一项艰巨的任务。在这项工作中,我们提出了共同建模文化安全和知识的方法。首先,文化安全和知识配对数据是进行这项研究的关键先决条件。然而,不同地区的文化多样性和文化差异的微妙性给创建这种配对评估数据带来了重大挑战。为了解决这个问题,我们提出了一个新颖的框架,该框架整合了权威文化知识描述的策划、LLM自动生成查询和大量手动验证。因此,我们获得了一个名为AdaCultureSafe的数据集,其中包含4.8K个手动分解的细粒度文化描述和相应的48K个手动验证的安全和知识导向查询。在构建的数据集上,我们评估了三类流行的LLMs在文化安全和知识熟练度上的表现,通过这一过程我们做出了一个重要的发现:它们的文化安全和知识熟练度之间并不存在显著的相关性。然后,我们深入研究LLMs内部与实用性相关的神经元激活,以探讨相关性缺失的潜在原因,这可以归因于预训练和后期对齐的目标的不同。最后,我们提出了一种基于知识的方法,通过将知识融入LLM响应生成过程,显著增强了文化安全性。
更新时间: 2026-03-09 11:44:37
领域: cs.CL,cs.AI
How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
How much do large language models actually hallucinate when answering questions grounded in provided documents? Despite the critical importance of this question for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, LLM-based judges with documented biases, or evaluation scales too small for statistical confidence. We address this gap using RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), we conducted over 172 billion tokens of evaluation - an order of magnitude beyond prior work. Our findings reveal that: (1) even the best-performing models fabricate answers at a non-trivial rate - 1.19% at best at 32K, with top-tier models at 5-7% - and fabrication rises steeply with context length, nearly tripling at 128K and exceeding 10% for all models at 200K; (2) model selection dominates all other factors, with overall accuracy spanning a 72-percentage-point range and model family predicting fabrication resistance better than model size; (3) temperature effects are nuanced - T=0.0 yields the best overall accuracy in roughly 60% of cases, but higher temperatures reduce fabrication for the majority of models and dramatically reduce coherence loss (infinite generation loops), which can reach 48x higher rates at T=0.0 versus T=1.0; (4) grounding ability and fabrication resistance are distinct capabilities - models that excel at finding facts may still fabricate facts that do not exist; and (5) results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.
Updated: 2026-03-09 11:44:06
标题: LLM在文档问答场景中产生幻觉的程度有多大?一项跨温度、上下文长度和硬件平台的1720亿令牌研究
摘要: 大型语言模型在回答基于提供文档的问题时实际上会产生多少幻觉?尽管这个问题对企业AI部署至关重要,但可靠的测量受到静态数据集易受污染、基于LLM的评委存在已记录的偏见或评估规模过小以致统计置信度不足的基准的阻碍。我们利用RIKER解决了这一差距,这是一种以地面真实为先的评估方法,可以在没有人工注释的情况下实现确定性评分。在35个开放权重模型、三种上下文长度(32K、128K和200K标记)、四种温度设置和三种硬件平台(NVIDIA H200、AMD MI300X和Intel Gaudi 3)上,我们进行了超过1720亿标记的评估 - 较之前的工作大出一个数量级。我们的研究发现:(1)即使表现最佳的模型也会以非微不足道的速度制造答案 - 在32K时最高达到1.19%,顶尖模型在5-7%,制造率随着上下文长度的增加而急剧上升,在128K时几乎增加了三倍,并且在200K时所有模型的制造率超过10%;(2)模型选择主导所有其他因素,总体准确率跨度达到72个百分点,模型系列预测制造抵抗力比模型大小更好;(3)温度效应微妙 - T=0.0在大约60%的情况下提供了最佳的总体准确率,但更高的温度会减少大多数模型的制造,显著减少连贯性丢失(无限生成循环),在T=0.0与T=1.0时可以达到48倍的高速率;(4)基础能力和制造抵抗力是不同的能力 - 擅长查找事实的模型仍可能制造不存在的事实;(5)结果在硬件平台上保持一致,证实部署决策不需要依赖于硬件。
更新时间: 2026-03-09 11:44:06
领域: cs.CL,cs.AI
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos using models trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
Updated: 2026-03-09 11:43:26
标题: SAIL:基于相似度感知引导和跨标题增强学习的弱监督密集视频字幕生成
摘要: 弱监督密集视频字幕旨在定位和描述视频中的事件,仅基于字幕注释进行训练,而不考虑时间边界。先前的工作引入了基于高斯掩模和互补字幕的隐式监督范式。然而,现有方法仅专注于生成不重叠的掩模,而不考虑其与相应事件的语义关系,导致生成简单的、均匀分布的掩模,无法捕获语义上有意义的区域。此外,仅依赖于地面真实字幕会导致性能次优,因为现有数据集固有地稀疏。在这项工作中,我们提出了SAIL,通过跨模态对齐构建具有语义意识的掩模。我们的相似度感知训练目标指导掩模强调与其对应事件字幕相似度高的视频区域。此外,为了在稀疏注释设置下引导更准确的掩模生成,我们引入了基于LLM的增强策略,生成合成字幕以提供额外的对齐信号。这些合成字幕通过一种掩模之间的机制进行整合,为精确的时间定位提供辅助指导,而不会降低主要目标。在ActivityNet Captions和YouCook2上的实验表明,在字幕和定位指标上均表现出最先进的性能。
更新时间: 2026-03-09 11:43:26
领域: cs.CV,cs.AI
To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). RLVR can achieve expert-level performance in specific domains such as coding or math. When a general multi-domain expert-level model is required, we need to carefully consider how RLVR across different domains interacts. Current state-of-the-art models mainly employ two different training paradigms for multi-domain RLVR: mixed multi-task RLVR and separate RLVR followed by model merging. However, most works do not provide a detailed comparison and analysis of these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, instruction following, and agent) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and that reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, model prediction behavior, information constraints and self-verification. This project is named M2RL, which stands for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and the homepage is at https://github.com/mosAI25/M2RL.
Updated: 2026-03-09 11:43:08
标题: 混合还是融合:面向大型语言模型的多领域强化学习
摘要: 具有可验证奖励的强化学习(RLVR)在激发大型语言模型(LLMs)的显式推理能力方面发挥着关键作用。通过RLVR,我们可以在某些特定领域实现专家级表现,如编码或数学。当需要一个通用的多领域专家级模型时,我们需要仔细考虑在不同领域之间RLVR的协作。目前最先进的模型主要采用两种不同的训练范式进行多领域RLVR:混合多任务RLVR和分开RLVR后模型合并。然而,大多数作品没有提供关于这些范式的详细比较和分析。为此,我们选择多个常用的高级任务(如数学、编码、科学、指令遵循和代理)作为我们的目标领域,并使用开源数据集设计了广泛的定性和定量实验。我们发现跨领域的RLVR几乎没有相互干扰,而推理密集型领域展现出相互协同作用。此外,我们从权重空间几何、模型预测行为、信息约束和自我验证的角度分析了相互收益的内部机制。这个项目被命名为M2RL,意思是混合多任务训练或分开训练后模型合并用于强化学习,主页位于https://github.com/mosAI25/M2RL。
更新时间: 2026-03-09 11:43:08
领域: cs.AI
SCL-GNN: Towards Generalizable Graph Neural Networks via Spurious Correlation Learning
Graph Neural Networks (GNNs) have demonstrated remarkable success across diverse tasks. However, their generalization capability is often hindered by spurious correlations between node features and labels in the graph. Our analysis reveals that GNNs tend to exploit imperceptible statistical correlations in training data, even when such correlations are unreliable for prediction. To address this challenge, we propose the Spurious Correlation Learning Graph Neural Network (SCL-GNN), a novel framework designed to enhance generalization on both Independent and Identically Distributed (IID) and Out-of-Distribution (OOD) graphs. SCL-GNN incorporates a principled spurious correlation learning mechanism, leveraging the Hilbert-Schmidt Independence Criterion (HSIC) to quantify correlations between node representations and class scores. This enables the model to identify and mitigate irrelevant but influential spurious correlations effectively. Additionally, we introduce an efficient bi-level optimization strategy to jointly optimize modules and GNN parameters, preventing overfitting. Extensive experiments on real-world and synthetic datasets demonstrate that SCL-GNN consistently outperforms state-of-the-art baselines under various distribution shifts, highlighting its robustness and generalization capabilities.
Updated: 2026-03-09 11:42:06
Categories: cs.LG,cs.AI
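The SCL-GNN abstract relies on the Hilbert-Schmidt Independence Criterion (HSIC) to quantify dependence between node representations and class scores. The following is only a minimal sketch of the standard biased empirical HSIC estimator with Gaussian kernels, not the authors' implementation; the function names and the `sigma` bandwidth are assumptions made for illustration.

```python
import math

def gaussian_kernel(xs, sigma=1.0):
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(xs)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(xs[i], xs[j]))
                      / (2 * sigma ** 2))
             for j in range(n)] for i in range(n)]

def center(K):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(gaussian_kernel(xs, sigma))
    Lc = center(gaussian_kernel(ys, sigma))
    # trace(Kc @ Lc) equals trace(K H L H) because H is idempotent.
    trace = sum(Kc[i][j] * Lc[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2

# Hypothetical toy data: four 1-D "representations" and two candidate score sets.
reps = [[0.0], [1.0], [2.0], [3.0]]
dependent_scores = [[0.0], [1.0], [2.0], [3.0]]
constant_scores = [[5.0], [5.0], [5.0], [5.0]]
print(hsic(reps, dependent_scores), hsic(reps, constant_scores))
```

A larger value indicates stronger statistical dependence; a constant second variable yields zero because its centered Gram matrix vanishes.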
SAIL: Test-Time Scaling for In-Context Imitation Learning with VLM
In-context imitation learning allows robots to acquire skills from demonstrations, yet one-shot trajectory generation remains fragile under environmental variation. We propose SAIL, a framework that reframes robot imitation as an iterative refinement problem capable of scaling with test-time compute. SAIL utilizes Monte Carlo Tree Search, where each node is a complete trajectory and edges correspond to trajectory refinements. The process is guided by three core components: an automated archive of successful trajectories for contextually relevant retrieval, a vision language model-based scoring mechanism for trajectory evaluation, and a step-level feedback that provides trajectory-aligned scores for iterative refinement. Experiments across six diverse manipulation tasks in simulation and real-world validation clearly demonstrate that increasing test-time compute consistently improves success rates, achieving up to 95% on complex tasks. Our results suggest that trajectory-level test-time scaling is a robust path toward more generalizable robotic agents.
Updated: 2026-03-09 11:39:40
Categories: cs.RO,cs.AI
Towards a more efficient bias detection in financial language models
Bias in financial language models constitutes a major obstacle to their adoption in real-world applications. Detecting such bias is challenging, as it requires identifying inputs whose predictions change when varying properties unrelated to the decision, such as demographic attributes. Existing approaches typically rely on exhaustive mutation and pairwise prediction analysis over large corpora, which is effective but computationally expensive, particularly for large language models, and can become impractical in continuous retraining and release processes. Aiming to reduce this cost, we conduct a large-scale study of bias in five financial language models, examining similarities in their bias tendencies across protected attributes and exploring cross-model-guided bias detection to identify bias-revealing inputs earlier. Our study uses approximately 17k real financial news sentences, mutated to construct over 125k original-mutant pairs. Results show that all models exhibit bias under both atomic (0.58%-6.05%) and intersectional (0.75%-5.97%) settings. Moreover, we observe consistent patterns in bias-revealing inputs across models, enabling substantial reuse and cost reduction in bias detection. For example, up to 73% of FinMA's biased behaviours can be uncovered using only 20% of the input pairs when guided by properties derived from DistilRoBERTa outputs.
Updated: 2026-03-09 11:38:53
Categories: cs.AI,cs.CE,cs.LG
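The pairwise analysis described in the bias-detection abstract above can be illustrated with a short sketch: a pair is bias-revealing when the prediction for the original sentence differs from the prediction for its mutant. The helper names, the toy classifier, and the example sentences are hypothetical and stand in for a real financial language model.

```python
def bias_revealing_pairs(pairs, predict):
    """Return the (original, mutant) pairs whose predicted label changes."""
    return [(orig, mut) for orig, mut in pairs if predict(orig) != predict(mut)]

def bias_rate(pairs, predict):
    """Fraction of pairs that reveal a prediction change (0.0 for no pairs)."""
    if not pairs:
        return 0.0
    return len(bias_revealing_pairs(pairs, predict)) / len(pairs)

def toy_predict(sentence):
    """Hypothetical, deliberately biased classifier used only for this demo."""
    return "negative" if sentence.lower().startswith("she") else "positive"

# Hypothetical original-mutant pairs; the first mutates a demographic attribute.
toy_pairs = [
    ("He sold the shares", "She sold the shares"),
    ("Profits rose", "Profits rose sharply"),
]
print(bias_rate(toy_pairs, toy_predict))
```

In a cross-model setting, one could presumably run this check with a cheap model first and prioritise the pairs it flags when testing a more expensive model.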
Airborne Magnetic Anomaly Navigation with Neural-Network-Augmented Online Calibration
Airborne Magnetic Anomaly Navigation (MagNav) provides a jamming-resistant and robust alternative to satellite navigation but requires the real-time compensation of the aircraft platform's large and dynamic magnetic interference. State-of-the-art solutions often rely on extensive offline calibration flights or pre-training, creating a logistical barrier to operational deployment. We present a fully adaptive MagNav architecture featuring a "cold-start" capability that identifies and compensates for the aircraft's magnetic signature entirely in-flight. The proposed method utilizes an extended Kalman filter with an augmented state vector that simultaneously estimates the aircraft's kinematic states as well as the coefficients of the physics-based Tolles-Lawson calibration model and the parameters of a Neural Network to model aircraft interferences. The Kalman filter update is mathematically equivalent to an online Natural Gradient descent, integrating superior convergence and data efficiency of state-of-the-art second-order optimization directly into the navigation filter. To enhance operational robustness, the neural network is constrained to a residual learning role, modeling only the nonlinearities uncorrected by the explainable physics-based calibration baseline. Validated on the MagNav Challenge dataset, our framework effectively bounds inertial drift using a magnetometer-only feature set. The results demonstrate navigation accuracy comparable to state-of-the-art models trained offline, without requiring prior calibration flights or dedicated maneuvers.
Updated: 2026-03-09 11:35:04
Categories: cs.LG
FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use
The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.
Updated: 2026-03-09 11:33:05
Categories: cs.AI
Beyond ReinMax: Low-Variance Gradient Estimators for Discrete Latent Variables
Machine learning models involving discrete latent variables require gradient estimators to facilitate backpropagation in a computationally efficient manner. The most recent addition to the Straight-Through family of estimators, ReinMax, can be viewed from a numerical ODE perspective as incorporating an approximation via Heun's method to reduce bias, but at the cost of high variance. In this work, we introduce the ReinMax-Rao and ReinMax-CV estimators which incorporate Rao-Blackwellisation and control variate techniques into ReinMax to reduce its variance. Our estimators demonstrate superior performance on training variational autoencoders with discrete latent spaces. Furthermore, we investigate the possibility of leveraging alternative numerical methods for constructing more accurate gradient approximations and present an alternative view of ReinMax from a simpler numerical integration perspective.
Updated: 2026-03-09 11:27:20
Categories: stat.ML,cs.LG
SDFed: Bridging Local Global Discrepancy via Subspace Refinement and Divergence Control in Federated Prompt Learning
Vision-language pretrained models offer strong transferable representations, yet adapting them in privacy-sensitive multi-party settings is challenging due to the high communication cost of federated optimization and the limited local data on clients. Federated prompt learning mitigates this issue by keeping the VLPM backbone frozen and collaboratively training lightweight prompt parameters. However, existing approaches typically enforce a unified prompt structure and length across clients, which is inadequate under practical client heterogeneity in both data distributions and system resources, and may further introduce conflicts between globally shared and locally optimal knowledge. To address these challenges, we propose SDFed, a heterogeneous federated prompt learning framework that bridges Local-Global Discrepancy via Subspace Refinement and Divergence Control. SDFed maintains a fixed-length global prompt for efficient aggregation while allowing each client to learn a variable-length local prompt to better match its data characteristics and capacity. To mitigate local-global conflicts and facilitate effective knowledge transfer, SDFed introduces a subspace refinement method for local prompts and an information retention and divergence control strategy that preserves key local information while maintaining appropriate separability between global and local representations. Extensive experiments on several datasets demonstrate that SDFed consistently improves performance and robustness in heterogeneous federated settings.
Updated: 2026-03-09 11:26:17
Categories: cs.LG,cs.DB
ScaleGNN: Towards Scalable Graph Neural Networks via Adaptive High-order Neighboring Feature Fusion
Graph Neural Networks (GNNs) have demonstrated impressive performance across diverse graph-based tasks by leveraging message passing to capture complex node relationships. However, on large-scale real-world graphs, GNNs face two major challenges: (1) GNNs struggle to ensure scalability and efficiency as repeated aggregation of large neighborhoods incurs significant computational overhead; (2) GNNs suffer from over-smoothing, where excessive propagation makes node representations indistinguishable, hindering model expressiveness. To tackle these, we propose ScaleGNN, which adaptively fuses multi-hop node features for scalable and effective graph learning. We first compute per-hop pure-neighbor matrices to isolate exclusive structural signals, then apply lightweight fusion to balance low- and high-order information, preserving both local detail and global correlations. To curb redundancy and over-smoothing, we introduce Local Contribution Score (LCS)-based masking to prune low-relevance high-order neighbors, and impose learnable sparsity to selectively integrate valuable multi-hop features. Extensive experiments on real-world datasets show that ScaleGNN consistently outperforms state-of-the-art GNNs in both predictive accuracy and computational efficiency.
Updated: 2026-03-09 11:25:44
Categories: cs.LG
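The ScaleGNN abstract mentions per-hop "pure-neighbor" matrices that isolate exclusive structural signals. One plausible reading, assumed here and not confirmed by the abstract, is that the k-hop pure neighbors of a node are those whose shortest-path distance is exactly k. The sketch below computes these sets with breadth-first search; the function name and the adjacency-list format are illustrative choices.

```python
from collections import deque

def pure_neighbors(adj, max_hop):
    """Map each node to {hop: nodes at shortest-path distance exactly hop}."""
    result = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hop:
                continue  # do not expand beyond the requested hop count
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        hops = {k: set() for k in range(1, max_hop + 1)}
        for node, d in dist.items():
            if d >= 1:
                hops[d].add(node)
        result[src] = hops
    return result

# Hypothetical toy graph: a path 0 - 1 - 2 - 3.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(pure_neighbors(path_graph, max_hop=2))
```

Because each node appears in at most one hop set per source, features aggregated per hop do not double-count neighbors that are already reachable in fewer steps.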
SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation
Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and often overlook a key cue: novel categories frequently appear as unlabelled background in base-training scenes. We introduce SCOPE (Scene-COntextualised Prototype Enrichment), a plug-and-play background-guided prototype enrichment framework that integrates with any prototype-based 3D segmentation method. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When novel classes arrive with few labelled samples, relevant background prototypes are retrieved and fused with few-shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS show that SCOPE achieves SOTA performance, improving novel-class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Code is available at https://github.com/Surrey-UP-Lab/SCOPE.
Updated: 2026-03-09 11:24:46
Categories: cs.CV,cs.LG
FlowTouch: View-Invariant Visuo-Tactile Prediction
Tactile sensation is essential for contact-rich manipulation tasks. It provides direct feedback on object geometry, surface properties, and interaction forces, enhancing perception and enabling fine-grained control. An inherent limitation of tactile sensors is that readings are available only when an object is touched. This precludes their use during planning and the initial execution phase of a task. Predicting tactile information from visual information can bridge this gap. A common approach is to learn a direct mapping from camera images to the output of vision-based tactile sensors. However, the resulting model will depend strongly on the specific setup and on how well the camera can capture the area where an object is touched. In this work, we introduce FlowTouch, a novel model for view-invariant visuo-tactile prediction. Our key idea is to use an object's local 3D mesh to encode rich information for predicting tactile patterns while abstracting away from scene-dependent details. FlowTouch integrates scene reconstruction and Flow Matching-based models for image generation. Our results show that FlowTouch is able to bridge the sim-to-real gap and generalize to new sensor instances. We further show that the resulting tactile images can be used for downstream grasp stability prediction. Our code, datasets and videos are available at https://flowtouch.github.io/
Updated: 2026-03-09 11:24:29
Categories: cs.RO,cs.LG
FedPrism: Adaptive Personalized Federated Learning under Non-IID Data
Federated Learning (FL) suffers significant performance degradation in real-world deployments characterized by moderate to extreme statistical heterogeneity (non-IID client data). While global aggregation strategies promote broad generalization, they often fail to capture the diversity of local data distributions, leading to suboptimal personalization. We address this problem with FedPrism, a framework that uses two main strategies. First, it uses a Prism Decomposition method that builds each client's model from three parts: a global foundation, a shared group part for similar clients, and a private part for unique local data. This allows the system to group similar users together automatically and adapt if their data changes. Second, we include a Dual-Stream design that runs a general model alongside a local specialist. The system routes predictions between the general model and the local specialist based on the specialist's confidence. Through systematic experiments on non-IID data partitions, we demonstrate that FedPrism exceeds static aggregation and hard-clustering baselines, achieving significant accuracy gains under high heterogeneity. These results establish FedPrism as a robust and flexible solution for federated learning in heterogeneous environments, effectively balancing generalizable knowledge with adaptive personalization.
Updated: 2026-03-09 11:23:32
Categories: cs.LG
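The Dual-Stream design in the FedPrism abstract routes each prediction between a general model and a local specialist according to the specialist's confidence. A minimal sketch of such a rule follows, assuming confidence is the maximum softmax probability and that a fixed threshold decides the route; both the threshold value and the function names are assumptions, since the abstract does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_prediction(general_logits, specialist_logits, threshold=0.8):
    """Use the specialist when it is confident, else fall back to the general model.

    Returns (predicted_class_index, name_of_model_used).
    The 0.8 threshold is an illustrative assumption.
    """
    specialist_probs = softmax(specialist_logits)
    confidence = max(specialist_probs)
    if confidence >= threshold:
        return specialist_probs.index(confidence), "specialist"
    general_probs = softmax(general_logits)
    return general_probs.index(max(general_probs)), "general"

# Hypothetical logits: a confident specialist, then an unsure one.
print(route_prediction([0.0, 5.0, 0.0], [10.0, 0.0, 0.0]))
print(route_prediction([0.0, 5.0, 0.0], [0.1, 0.0, 0.0]))
```

A hard threshold keeps the routing easy to inspect; a soft mixture of the two probability vectors would be another reasonable design under the same description.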
Input-to-State Stable Coupled Oscillator Networks for Closed-form Model-based Control in Latent Space
Even though a variety of methods have been proposed in the literature, efficient and effective latent-space control (i.e., control in a learned low-dimensional space) of physical systems remains an open challenge. We argue that a promising avenue is to leverage powerful and well-understood closed-form strategies from control theory literature in combination with learned dynamics, such as potential-energy shaping. We identify three fundamental shortcomings in existing latent-space models that have so far prevented this powerful combination: (i) they lack the mathematical structure of a physical system, (ii) they do not inherently conserve the stability properties of the real systems, (iii) these methods do not have an invertible mapping between input and latent-space forcing. This work proposes a novel Coupled Oscillator Network (CON) model that simultaneously tackles all these issues. More specifically, (i) we show analytically that CON is a Lagrangian system - i.e., it possesses well-defined potential and kinetic energy terms. Then, (ii) we provide formal proof of global Input-to-State stability using Lyapunov arguments. Moving to the experimental side, we demonstrate that CON reaches SoA performance when learning complex nonlinear dynamics of mechanical systems directly from images. An additional methodological innovation contributing to achieving this third goal is an approximated closed-form solution for efficient integration of network dynamics, which eases efficient training. We tackle (iii) by approximating the forcing-to-input mapping with a decoder that is trained to reconstruct the input based on the encoded latent space force. Finally, we show how these properties enable latent-space control. We use an integral-saturated PID with potential force compensation and demonstrate high-quality performance on a soft robot using raw pixels as the only feedback information.
Updated: 2026-03-09 11:18:21
Categories: cs.RO,cs.AI,cs.LG,eess.SY
Optimising antibiotic switching via forecasting of patient physiology
Timely transition from intravenous (IV) to oral antibiotic therapy shortens hospital stays, reduces catheter-related infections, and lowers healthcare costs, yet one in five patients in England remain on IV antibiotics despite meeting switching criteria. Clinical decision support systems can improve switching rates, but approaches that learn from historical decisions reproduce the delays and inconsistencies of routine practice. We propose using neural processes to model vital sign trajectories probabilistically, predicting switch-readiness by comparing forecasts against clinical guidelines rather than learning from past actions, and ranking patients to prioritise clinical review. The design yields interpretable outputs, adapts to updated guidelines without retraining, and preserves clinical judgement. Validated on MIMIC-IV (US intensive care, 6,333 encounters) and UCLH (a large urban academic UK hospital group, 10,584 encounters), the system selects 2.2-3.2x more relevant patients than random. Our results demonstrate that forecasting patient physiology offers a principled foundation for decision support in antibiotic stewardship.
Updated: 2026-03-09 11:15:49
Categories: cs.LG,stat.AP
Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, understanding how Video DiTs convert motion words into video remains insufficient. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.
Updated: 2026-03-09 11:13:35
Categories: cs.CV,cs.AI,cs.LG
Fibration Policy Optimization
Large language models are increasingly trained as heterogeneous systems spanning multiple domains, expert partitions, and agentic pipelines, yet prevalent proximal objectives operate at a single scale and lack a principled mechanism for coupling token-level, trajectory-level, and higher-level hierarchical stability control. To bridge this gap, we derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO, establishing that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem. Building on this foundation, we develop Fiber Bundle Gating (FBG), an algebraic framework that organizes sampled RL data as a fiber bundle and decomposes ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals, with provable first-order agreement with the true RL objective near on-policy. From APC-Obj and FBG we derive Fibration Policy Optimization (or simply, FiberPO), a concrete objective whose Jacobian is block-diagonal over trajectories, reduces to identity at on-policy, and provides better update direction thus improving token efficiency. The compositional nature of the framework extends beyond the trajectory-token case: fibrations compose algebraically into a Fibration Gating Hierarchy (FGH) that scales the same gating mechanism to arbitrary hierarchical depth without new primitives, as demonstrated by FiberPO-Domain, a four-level instantiation with independent trust-region budgets at the domain, prompt group, trajectory, and token levels. Together, these results connect the trust-region theory, a compositional algebraic structure, and practical multi-scale stability control into a unified framework for LLM policy optimization.
Updated: 2026-03-09 11:09:25
Categories: cs.LG,cs.AI,cs.CL
Detecting AI-Generated Images via Contextual Anomaly Estimation in Masked AutoEncoders
Context-based detection methods such as DetectGPT achieve strong generalization in identifying AI-generated text by evaluating content compatibility with a model's learned distribution. In contrast, existing image detectors rely on discriminative features from pretrained backbones such as CLIP, which implicitly capture generator-specific artifacts. However, as modern generative models rapidly advance in visual fidelity, the artifacts these detectors depend on are becoming increasingly subtle or absent, undermining their reliability. Masked AutoEncoders (MAE) are inherently trained to reconstruct masked patches from visible context, naturally modeling patch-level contextual plausibility akin to conditional probability estimation, while also serving as a powerful semantic feature extractor through its encoder. We propose CINEMAE, a novel architecture that exploits both capabilities of MAE for AI-generated image detection: we derive per-patch anomaly signals from the reconstruction mechanism and extract global semantic features from the encoder, fusing both context-based and feature-based cues for robust detection. CINEMAE achieves highly competitive mean accuracies of 96.63% on GenImage and 93.96% on AIGCDetectBenchmark, maintaining over 93% accuracy even under JPEG compression at QF=50.
Updated: 2026-03-09 11:06:21
Categories: cs.CV,cs.AI,cs.CY
Network Traffic Analysis with Process Mining: The UPSIDE Case Study
Online gaming is a popular activity involving the adoption of complex systems and network infrastructures. The relevance of gaming, which generates large amounts of market revenue, drove research in modeling network devices' behavior to evaluate bandwidth consumption, predict and sustain high loads, and detect malicious activity. In this context, process mining appears promising due to its ability to combine data-driven analyses with model-based insights. In this paper, we propose a process mining-based method that analyzes gaming network traffic, allowing: unsupervised characterization of different states from gaming network data; encoding such states through process mining into interpretable Petri nets; and classification of gaming network traffic data to identify different video games being played. We apply the method to the UPSIDE case study, involving gaming network data of several devices interacting with two video games: Clash Royale and Rocket League. Results demonstrate that the gaming network behavior can be effectively and interpretably modeled through states represented as Petri nets with sufficient coherence and specificity while maintaining a good classification accuracy of the two different video games.
Updated: 2026-03-09 11:04:18
Categories: cs.LG,cs.NI
Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema
Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.
Updated: 2026-03-09 11:04:01
标题: 探索深度学习和超广角成像在糖尿病视网膜病变和黄斑水肿中的应用
摘要: 糖尿病视网膜病变(DR)和糖尿病黄斑水肿(DME)是工作年龄成人中可预防失明的主要原因。文献中传统的方法主要集中在使用标准彩色眼底照相(CFP)来检测这些病症。然而,最近的超广视野成像(UWF)相比于CFP提供了显著更宽的视野。受此激励,本研究探讨了最先进的深度学习(DL)方法和UWF成像在三个临床相关任务上的应用:i)UWF图像质量评估、ii)可参考糖尿病视网膜病变(RDR)的识别、以及iii)DME的识别。使用公开可用的UWF4DR挑战数据集,该数据集作为MICCAI 2024会议的一部分发布,我们在空间(RGB)和频率域内对DL模型进行了基准测试,包括流行的卷积神经网络(CNNs)以及最新的视觉转换器(ViTs)和基础模型。此外,我们探讨了最终的特征级融合以增加鲁棒性。最后,我们还使用Grad-CAM分析了DL模型的决策,增加了可解释性。我们的提案在所有体系结构上均取得了一致强劲的表现,突出了新兴ViTs和基础模型的竞争力,以及特征级融合和频率域表示对UWF分析的潜力。
更新时间: 2026-03-09 11:04:01
领域: cs.CV,cs.AI
The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs
With the rapid advancement of large language models (LLMs), the safety of LLMs has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreaking attacks. However, the root causes of such vulnerabilities are still poorly understood, necessitating a rigorous investigation into jailbreak mechanisms across both academic and industrial communities. In this work, we focus on a continuation-triggered jailbreak phenomenon, whereby simply relocating a continuation-triggered instruction suffix can substantially increase jailbreak success rates. To uncover the intrinsic mechanisms of this phenomenon, we conduct a comprehensive mechanistic interpretability analysis at the level of attention heads. Through causal interventions and activation scaling, we show that this jailbreak behavior primarily arises from an inherent competition between the model's intrinsic continuation drive and the safety defenses acquired through alignment training. Furthermore, we perform a detailed behavioral analysis of the identified safety-critical attention heads, revealing notable differences in the functions and behaviors of safety heads across different model architectures. These findings provide a novel mechanistic perspective for understanding and interpreting jailbreak behaviors in LLMs, offering both theoretical insights and practical implications for improving model safety.
Updated: 2026-03-09 11:03:45
标题: 《继续与拒绝之间的斗争:LLMs中由继续触发的越狱的机制分析》
摘要: 随着大型语言模型(LLMs)的快速发展,LLMs的安全性已成为一个关键问题。尽管在安全对齐方面已做出了重大努力,但当前的LLMs仍然容易受到越狱攻击的影响。然而,这种漏洞的根本原因仍然不甚了解,需要在学术界和工业界开展对越狱机制的深入调查。在这项工作中,我们关注一个由延续触发的越狱现象,即简单地重新定位一个延续触发指令后缀可以大幅提高越狱成功率。为了揭示这一现象的内在机制,我们在注意力头的级别上进行了全面的机制可解释性分析。通过因果干预和激活缩放,我们展示了这种越狱行为主要源自模型固有的延续驱动力和通过对齐训练获得的安全防御之间的竞争。此外,我们对确定的安全关键注意力头进行了详细的行为分析,揭示了不同模型架构下安全头的功能和行为之间的明显差异。这些发现为理解和解释LLMs中的越狱行为提供了一种新颖的机制视角,为改进模型安全性提供了理论见解和实际意义。
更新时间: 2026-03-09 11:03:45
领域: cs.AI,cs.LG
Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction
Speech emotion recognition plays an important role in various applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambiguous nature of human emotional expression. Recent large audio-language models show promise in generating richer outputs, but their reasoning ability for ambiguous emotional understanding remains limited. In this work, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two complementary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and a structured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and GRPO training strategies.
Updated: 2026-03-09 11:02:34
标题: 在大型音频语言模型中解开模糊情绪预测中的推理
摘要: 语音情感识别在各种应用中起着重要作用。然而,大多数现有方法预测单一情绪标签,过于简化了人类情感表达固有的模糊性质。最近的大型音频语言模型显示出生成更丰富输出的潜力,但它们对于模糊情感理解的推理能力仍然有限。在这项工作中,我们将模糊情感识别重新定义为分布推理问题,并首次系统研究了LALM中的模糊感知推理。我们的框架包括两个互补组件:一个模糊感知目标,将预测与人类感知分布对齐,以及一个结构化的模糊感知思维链监督,指导对情绪线索的推理。在IEMOCAP和CREMA-D上的实验表明,在SFT、DPO和GRPO训练策略中持续改进。
更新时间: 2026-03-09 11:02:34
领域: cs.SD,cs.AI,eess.AS
CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent states. To address this limitation, we propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, respectively. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules. Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.
Updated: 2026-03-09 10:57:17
标题: CrystaL:MLLMs中视觉潜变量的自发出现
摘要: 多模态大型语言模型(MLLMs)通过将强大的语言骨干与大规模视觉编码器集成,取得了显著的性能。在这些模型中,潜在的Chain-of-Thought(CoT)方法通过在连续隐藏状态中进行隐式推理,促进了无缝的视觉-语言集成和更快的推理。然而,现有的启发式预定义的监督信号在潜在的CoT中提供了有限的指导,以保留中间潜在状态中的关键视觉信息。为了解决这一限制,我们提出了CrystaL(Crystallized Latent Reasoning),这是一个单阶段的框架,具有分别处理完整和受损图像的两个路径。通过明确地对齐两条路径上的注意力模式和预测分布,CrystaL将潜在表示结晶为任务相关的视觉语义,而不依赖于辅助注释或外部模块。在感知密集基准测试上进行的大量实验表明,CrystaL始终优于最先进的基线模型,在细粒度视觉理解方面取得了实质性的收益,同时保持了强大的推理能力。
更新时间: 2026-03-09 10:57:17
领域: cs.CV,cs.AI
RedSage: A Cybersecurity Generalist LLM
Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.
Updated: 2026-03-09 10:56:02
标题: RedSage:网络安全通才LLM
摘要: 网络安全运营需要辅助的LLM,支持各种工作流程,同时不暴露敏感数据。现有解决方案要么依赖具有隐私风险的专有API,要么依赖缺乏领域适应性的开放模型。为了弥补这一差距,我们通过大规模网络过滤和高质量资源的手动收集,筛选了118亿个令牌的网络安全专注持续预训练数据,涵盖了框架、攻击技术和安全工具等28.6K份文档。在此基础上,我们设计了一种模拟专家工作流程的主动增强管道,生成了266K个多轮网络安全样本,用于监督微调。结合通用开源LLM数据,这些资源使我们能够训练RedSage,这是一个具有领域感知的预训练和后训练的开源、本地可部署的网络安全助手。为了严格评估模型,我们引入了RedSage-Bench,一个基准测试,涵盖了30K个多项选择题和240个开放式问答题,涵盖了网络安全知识、技能和工具专业知识。RedSage进一步在已建立的网络安全基准测试(例如CTI-Bench、CyberMetric、SECURE)和通用LLM基准测试上进行评估,以评估更广泛的泛化能力。在8B规模上,RedSage始终取得更好的结果,在网络安全基准测试中超过基线模型最多+5.59个点,在Open LLM排行榜任务中超过基线模型最多+5.05个点。这些发现表明,领域感知的主动增强和预/后训练不仅可以提高网络安全专业知识,还可以帮助改善一般推理和指令遵循。所有模型、数据集和代码均公开可用。
更新时间: 2026-03-09 10:56:02
领域: cs.CR,cs.AI,cs.CL
Practical Type Inference: High-Throughput Recovery of Real-World Structures and Function Signatures
The recovery of types from stripped binaries is a key to exact decompilation, yet its practical realization suffers. For composite structures in particular, both layout and semantic fidelity are required to enable end-to-end reconstruction. Many existing approaches either synthesize layouts or infer names post-hoc, which weakens downstream usability. This is further aggravated by an excessive runtime overhead that is especially prohibitive in automated environments. We present XTRIDE, an improved n-gram-based approach that focuses on practicality: highly optimized throughput and actionable confidence scores allow for deployment in automated pipelines. When compared to the state of the art in struct recovery, our method achieves comparable performance while being between 70 and 2300 times faster. As our inference is grounded in real-world types, we achieve the highest ratio of fully-correct struct layouts. With an optimized training regimen, our model outperforms the current state of the art on the DIRT dataset by 5.09 percentage points, achieving 90.15% type inference accuracy overall. Furthermore, we show that n-gram-based type prediction generalizes to function signature recovery: conducting a case study on embedded firmware, we show that this efficient approach to function similarity can assist in typical reverse engineering tasks.
Updated: 2026-03-09 10:54:48
标题: 实用类型推断:高吞吐量的恢复现实世界结构和函数签名
摘要: 从剥离二进制文件中恢复类型是精确反编译的关键,但其实际实现存在困难。对于复合结构而言,需要同时具备布局和语义保真度,才能实现端到端的重建。许多现有方法要么合成布局,要么事后推断名称,这会削弱下游的可用性。这一问题又被运行时开销过高所加剧,尤其在自动化环境中是不可接受的。我们提出了XTRIDE,一种基于改进的n-gram方法,专注于实用性:高度优化的吞吐量和可操作的置信度评分使得可以在自动化流水线中部署。与结构恢复领域的最新技术相比,我们的方法在性能上取得了可比较的表现,同时速度快了70到2300倍。由于我们的推断基于真实类型,我们实现了最高的完全正确结构布局比例。通过优化训练方案,我们的模型在DIRT数据集上的表现超越了当前最新技术5.09个百分点,整体达到了90.15%的类型推断准确率。此外,我们展示了基于n-gram的类型预测可以推广到函数签名恢复:通过对嵌入式固件进行案例研究,我们展示了这种高效的函数相似性方法可以协助典型的逆向工程任务。
更新时间: 2026-03-09 10:54:48
领域: cs.CR
SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration
Enterprise adoption of cloud-based AI agents faces a fundamental privacy dilemma: leveraging powerful cloud models requires sharing sensitive data, while local processing limits capability. Current agent frameworks like MCP and A2A assume complete data sharing, making them unsuitable for enterprise environments with confidential information. We present SplitAgent, a novel distributed architecture that enables privacy-preserving collaboration between enterprise-side privacy agents and cloud-side reasoning agents. Our key innovation is context-aware dynamic sanitization that adapts privacy protection based on task semantics: contract review requires different sanitization than code review or financial analysis. SplitAgent extends existing agent protocols with differential privacy guarantees, zero-knowledge tool verification, and privacy budget management. Through comprehensive experiments on enterprise scenarios, we demonstrate that SplitAgent achieves 83.8% task accuracy while maintaining 90.1% privacy protection, significantly outperforming static approaches (73.2% accuracy, 79.7% privacy). Context-aware sanitization improves task utility by 24.1% over static methods while reducing privacy leakage by 67%. Our architecture provides a practical path for enterprise AI adoption without compromising sensitive data.
Updated: 2026-03-09 10:51:31
标题: SplitAgent:一种用于企业云代理协作的隐私保护分布式架构
摘要: 企业采用基于云的人工智能代理面临着一个基本的隐私困境:利用强大的云模型需要共享敏感数据,而本地处理限制了功能。当前的代理框架如MCP和A2A假设完全数据共享,使它们不适用于包含机密信息的企业环境。我们提出了SplitAgent,这是一种新颖的分布式架构,可以在企业端隐私代理和云端推理代理之间实现保护隐私的协作。我们的关键创新是上下文感知动态清理,根据任务语义调整隐私保护:合同审查需要不同的清理方法,与代码审查或财务分析不同。SplitAgent通过差分隐私保证、零知识工具验证和隐私预算管理扩展了现有的代理协议。通过对企业场景的全面实验,我们证明SplitAgent在保持90.1%的隐私保护的同时实现了83.8%的任务准确性,明显优于静态方法(73.2%准确性,79.7%隐私)。上下文感知清理方法提高了24.1%的任务效用,同时减少了67%的隐私泄漏。我们的架构为企业采用人工智能提供了一条实用的道路,而不会牺牲敏感数据。
更新时间: 2026-03-09 10:51:31
领域: cs.CR,cs.AI
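The SplitAgent abstract mentions differential privacy guarantees and privacy budget management without giving details. As a rough illustration only, the sketch below shows the textbook Laplace mechanism combined with a simple sequential-composition budget tracker. The names `PrivacyBudget` and `laplace_release` are hypothetical and are not taken from the paper.

```python
import numpy as np


class PrivacyBudget:
    """Tracks a total epsilon under basic sequential composition (assumed accounting)."""

    def __init__(self, total_epsilon):
        self.remaining = float(total_epsilon)

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon


def laplace_release(value, sensitivity, epsilon, budget, rng):
    """Release value + Laplace(sensitivity / epsilon) noise, charging the budget first."""
    budget.spend(epsilon)
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)


rng = np.random.default_rng(0)
budget = PrivacyBudget(total_epsilon=1.0)
first = laplace_release(42.0, sensitivity=1.0, epsilon=0.4, budget=budget, rng=rng)
second = laplace_release(42.0, sensitivity=1.0, epsilon=0.4, budget=budget, rng=rng)

# A third query at epsilon=0.4 would exceed the remaining budget of about 0.2.
try:
    laplace_release(42.0, sensitivity=1.0, epsilon=0.4, budget=budget, rng=rng)
    third_allowed = True
except RuntimeError:
    third_allowed = False
```

A context-aware scheme of the kind the abstract describes would presumably choose `epsilon` per task type. That policy is not specified in the abstract and is left out here.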
Wiener Chaos Expansion based Neural Operator for Singular Stochastic Partial Differential Equations
In this paper, we explore how our recently developed Wiener Chaos Expansion (WCE)-based neural operator (NO) can be applied to singular stochastic partial differential equations, e.g., the dynamic $\boldsymbol{\Phi}^4_2$ model simulated in recent works. Unlike the previous WCE-NO, which solves SPDEs by simply inserting Wick-Hermite features into the backbone NO model, we leverage feature-wise linear modulation (FiLM) to appropriately capture the dependency between the solution of a singular SPDE and its smooth remainder. The resulting WCE-FiLM-NO shows excellent performance on $\boldsymbol{\Phi}^4_2$, as measured by relative $L_2$ loss, out-of-distribution $L_2$ loss, and autocorrelation score, all without the help of a renormalisation factor. In addition, we also show the potential of simulating $\boldsymbol{\Phi}^4_3$ data, which is more aligned with real scientific practice in statistical quantum field theory. To the best of our knowledge, this is among the first works to develop an efficient data-driven surrogate for the dynamical $\boldsymbol{\Phi}^4_3$ model.
Updated: 2026-03-09 10:50:30
标题: 基于Wiener混沌展开的神经算子用于奇异随机偏微分方程
摘要: 在本文中,我们探讨了我们最近开发的基于Wiener Chaos Expansion(WCE)的神经算子(NO)如何应用于奇异随机偏微分方程,例如最近作品中模拟的动态$\boldsymbol{\Phi}^4_2$模型。与以往的WCE-NO不同,以往的WCE-NO通过简单地将Wick-Hermite特征插入骨干NO模型来解决SPDEs,我们利用特征逐层线性调制(FiLM)来适当捕获奇异SPDE的解与其平滑余项之间的依赖关系。产生的WCE-FiLM-NO在$\boldsymbol{\Phi}^4_2$上表现出色,通过相对$L_2$损失,分布外$L_2$损失和自相关得分来衡量;所有这些都没有使用重整化因子的帮助。此外,我们还展示了模拟$\boldsymbol{\Phi}^4_3$数据的潜力,这与统计量子场论中的真实科学实践更加一致。据我们所知,这是首批为动态$\boldsymbol{\Phi}^4_3$模型开发有效的数据驱动替代方案的研究之一。
更新时间: 2026-03-09 10:50:30
领域: cs.LG
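For readers unfamiliar with feature-wise linear modulation (FiLM), which the WCE-FiLM-NO abstract relies on, here is a minimal generic sketch: a conditioning vector is mapped to a per-channel scale and shift that are applied to the features. The linear conditioning maps and all shapes are assumptions for illustration; the abstract does not describe how the paper wires FiLM into its neural operator.

```python
import numpy as np


def film(features, conditioning, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: out = gamma(c) * features + beta(c), channel by channel."""
    gamma = conditioning @ w_gamma + b_gamma  # per-channel scale, shape (batch, channels)
    beta = conditioning @ w_beta + b_beta     # per-channel shift, shape (batch, channels)
    return gamma * features + beta


rng = np.random.default_rng(0)
batch, cond_dim, channels = 4, 3, 8
features = rng.normal(size=(batch, channels))
conditioning = rng.normal(size=(batch, cond_dim))

# Random conditioning weights: every sample gets its own scale and shift.
modulated = film(
    features, conditioning,
    rng.normal(size=(cond_dim, channels)), np.ones(channels),
    rng.normal(size=(cond_dim, channels)), np.zeros(channels),
)

# Zero conditioning weights with gamma bias 1 and beta bias 0 reduce FiLM to the identity.
identity = film(
    features, conditioning,
    np.zeros((cond_dim, channels)), np.ones(channels),
    np.zeros((cond_dim, channels)), np.zeros(channels),
)
```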
A Comparative Study of Recent Advances in Internet of Intrusion Detection Things
The Internet of Things (IoT) has revolutionized the way devices communicate and interact with each other, but it has also created new challenges in terms of security. In this context, intrusion detection has become a crucial mechanism to ensure the safety of IoT systems. To address this issue, a comprehensive comparative study of advanced techniques and types of IoT intrusion detection systems (IDS) has been conducted. The study delves into various architectures, classifications, and evaluation methodologies of IoT IDS. This paper provides a valuable resource for researchers and practitioners interested in IoT security and intrusion detection.
Updated: 2026-03-09 10:49:20
标题: 一项关于物联网入侵检测最新进展的比较研究
摘要: 物联网(IoT)已经彻底改变了设备之间的通信和互动方式,但也在安全方面带来了新的挑战。在这种情况下,入侵检测已成为确保IoT系统安全的关键机制。为了解决这个问题,我们进行了一项关于先进技术和IoT入侵检测系统(IDS)类型的全面比较研究。该研究深入探讨了各种IoT IDS的架构、分类和评估方法。本文为对IoT安全和入侵检测感兴趣的研究人员和从业者提供了宝贵的资源。
更新时间: 2026-03-09 10:49:20
领域: cs.CR,cs.NI
Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
We show that gating mechanisms in recurrent neural networks (RNNs) induce lag-dependent and direction-dependent effective learning rates, even when training uses a fixed, global step size. This behavior arises from a coupling between state-space time-scales (parametrized by the gates) and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs and applying a first-order expansion, we make explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates act not only as filters of information flow, but also as data-driven preconditioners of optimization, with formal connections to learning-rate schedules, momentum, and adaptive methods such as Adam. Empirical simulations corroborate these predictions: across several sequence tasks, gates produce lag-dependent effective learning rates and concentrate gradient flow into low-dimensional subspaces, matching or exceeding the anisotropic structure induced by Adam. Notably, gating and optimizer-driven adaptivity shape complementary aspects of credit assignment: gates align state-space transport with loss-relevant directions, while optimizers rescale parameter-space updates. Overall, this work provides a unified dynamical systems perspective on how gating couples state evolution with parameter updates, clarifying why gated architectures achieve robust trainability in practice.
Updated: 2026-03-09 10:48:51
标题: 《循环神经网络中状态和参数之间的时间尺度耦合》
摘要: 我们展示了递归神经网络(RNNs)中的门控机制引入了依赖滞后和方向的有效学习率,即使训练使用固定的全局步长。这种行为是由于状态空间时间尺度(由门控制)与梯度下降过程中的参数空间动态之间的耦合造成的。通过为漏积分器和门控RNNs导出精确的雅可比矩阵并应用一阶展开,我们清晰地说明了常数、标量和多维门如何重塑梯度传播,调节有效步长,并引入参数更新的各向异性。这些发现揭示了门不仅作为信息流的过滤器,还作为优化的数据驱动的预处理器,与学习率调度、动量和Adam等自适应方法形式上有连接。实证模拟证实了这些预测:在多个序列任务中,门产生了依赖滞后的有效学习率,并将梯度流集中到低维子空间中,与Adam引入的各向异性结构相匹配或超过。值得注意的是,门控和优化器驱动的适应性塑造了信用分配的互补方面:门将状态空间传输与损失相关方向对齐,而优化器重新调整参数空间更新。总的来说,这项工作提供了一个统一的动力学系统视角,说明了门控如何将状态演化与参数更新耦合在一起,阐明了为什么门控架构在实践中实现了强大的可训练性。
更新时间: 2026-03-09 10:48:51
领域: cs.LG,math.DS
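The abstract above refers to exact Jacobians for leaky-integrator RNNs. As a small sketch under an assumed standard form, h_t = (1 - alpha) h_{t-1} + alpha tanh(W h_{t-1} + U x_t), the state Jacobian is (1 - alpha) I + alpha diag(1 - tanh^2(.)) W. The code checks that expression against finite differences. The update rule and the constant scalar gate `alpha` are assumptions, since the abstract does not state the paper's exact parametrization.

```python
import numpy as np


def step(h, x, W, U, alpha):
    """One leaky-integrator update with a constant scalar gate alpha."""
    return (1.0 - alpha) * h + alpha * np.tanh(W @ h + U @ x)


def jacobian(h, x, W, U, alpha):
    """Analytic d h_t / d h_{t-1} for the update in `step`."""
    pre = W @ h + U @ x
    # (1 - tanh^2(pre))[:, None] * W equals diag(1 - tanh^2(pre)) @ W.
    return (1.0 - alpha) * np.eye(h.size) + alpha * (1.0 - np.tanh(pre) ** 2)[:, None] * W


def numerical_jacobian(h, x, W, U, alpha, eps=1e-6):
    """Central finite-difference estimate of the same Jacobian."""
    n = h.size
    J = np.zeros((n, n))
    for j in range(n):
        d = np.zeros(n)
        d[j] = eps
        J[:, j] = (step(h + d, x, W, U, alpha) - step(h - d, x, W, U, alpha)) / (2 * eps)
    return J


rng = np.random.default_rng(0)
n, m, alpha = 5, 3, 0.2
W = 0.5 * rng.normal(size=(n, n))
U = rng.normal(size=(n, m))
h = rng.normal(size=n)
x = rng.normal(size=m)

J_analytic = jacobian(h, x, W, U, alpha)
J_numeric = numerical_jacobian(h, x, W, U, alpha)
```

With a small `alpha`, the (1 - alpha) I term dominates, so products of Jacobians across a lag k decay roughly like (1 - alpha)^k. This gives one intuition for why gradient contributions, and hence effective step sizes, depend on lag.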
M4Diffuser: Multi-View Diffusion Policy with Manipulability-Aware Control for Robust Mobile Manipulation
Mobile manipulation requires the coordinated control of a mobile base and a robotic arm while simultaneously perceiving both global scene context and fine-grained object details. Existing single-view approaches often fail in unstructured environments due to limited fields of view, exploration, and generalization abilities. Moreover, classical controllers, although stable, struggle with efficiency and manipulability near singularities. To address these challenges, we propose M4Diffuser, a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP (ReM-QP) controller for mobile manipulation. The diffusion policy leverages proprioceptive states and complementary camera perspectives with both close-range object details and global scene context to generate task-relevant end-effector goals in the world frame. These high-level goals are then executed by the ReM-QP controller, which eliminates slack variables for computational efficiency and incorporates manipulability-aware preferences for robustness near singularities. Comprehensive experiments in simulation and real-world environments show that M4Diffuser achieves 7 to 56 percent higher success rates and reduces collisions by 3 to 31 percent over baselines. Our approach demonstrates robust performance for smooth whole-body coordination, and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/m4diffuser.
Updated: 2026-03-09 10:47:55
标题: M4Diffuser:具有可操控性感知控制的多视角扩散策略,用于强大的移动操作
摘要: 移动操作需要同时协调移动基座和机器人臂的控制,同时感知全局场景背景和细粒度对象细节。现有的单视角方法在非结构化环境中经常失败,原因是视野受限,探索能力和泛化能力有限。此外,尽管经典控制器稳定,但在奇点附近的效率和可操作性方面存在困难。为了解决这些挑战,我们提出了M4Diffuser,这是一个混合框架,将多视角扩散策略与一种新颖的简化和可操作性感知QP(ReM-QP)控制器集成在一起,用于移动操作。扩散策略利用本体感知状态和互补的摄像机视角,同时具有近距离对象细节和全局场景背景,以在世界坐标系中生成任务相关的末端执行器目标。然后,这些高层目标由ReM-QP控制器执行,该控制器消除了计算效率的松弛变量,并在奇点附近的鲁棒性中融入可操作性感知偏好。在模拟和真实环境中进行的全面实验表明,M4Diffuser相对于基线实现了7至56%的更高成功率,并将碰撞减少了3至31%。我们的方法展示了对平滑全身协调的稳健性能,并对未见任务具有强大的泛化能力,为在非结构化环境中可靠进行移动操作铺平了道路。演示和补充材料的详细信息可在我们的项目网站https://sites.google.com/view/m4diffuser 上找到。
更新时间: 2026-03-09 10:47:55
领域: cs.RO,cs.AI,cs.CV
LatentMem: Customizing Latent Memory for Multi-Agent Systems
Large language model (LLM)-powered multi-agent systems (MAS) demonstrate remarkable collective intelligence, wherein multi-agent memory serves as a pivotal mechanism for continual adaptation. However, existing multi-agent memory designs remain constrained by two fundamental bottlenecks: (i) memory homogenization arising from the absence of role-aware customization, and (ii) information overload induced by excessively fine-grained memory entries. To address these limitations, we propose LatentMem, a learnable multi-agent memory framework designed to customize agent-specific memories in a token-efficient manner. Specifically, LatentMem comprises an experience bank that stores raw interaction trajectories in a lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent-specific contexts. Further, we introduce Latent Memory Policy Optimization (LMPO), which propagates task-level optimization signals through latent memories to the composer, encouraging it to produce compact and high-utility representations. Extensive experiments across diverse benchmarks and mainstream MAS frameworks show that LatentMem achieves a performance gain of up to 19.36% over vanilla settings and consistently outperforms existing memory architectures, without requiring any modifications to the underlying frameworks.
Updated: 2026-03-09 10:47:31
标题: LatentMem:为多智能体系统定制潜在记忆
摘要: 大型语言模型(LLM)驱动的多智能体系统(MAS)展示了卓越的集体智能,其中多智能体记忆作为持续适应的关键机制。然而,现有的多智能体记忆设计仍受到两个基本瓶颈的限制:(i)由于缺乏角色感知定制而引起的记忆同质化,以及(ii)由于过于精细的记忆条目而引起的信息过载。为了解决这些限制,我们提出了LatentMem,一个可学习的多智能体记忆框架,旨在以令牌高效的方式定制代理特定记忆。具体而言,LatentMem包括一个经验库,以轻量级形式存储原始交互轨迹,以及一个记忆合成器,根据检索到的经验和代理特定上下文合成紧凑的潜在记忆。此外,我们引入了潜在记忆策略优化(LMPO),通过潜在记忆将任务级别的优化信号传播到合成器,鼓励其产生紧凑且高效的表示。在各种基准测试和主流MAS框架上进行的大量实验表明,LatentMem在不需要对底层框架进行任何修改的情况下,相对于基本设置实现了高达19.36%的性能增益,并始终优于现有的记忆架构。
更新时间: 2026-03-09 10:47:31
领域: cs.CL,cs.LG,cs.MA
Distilled Circuits: A Mechanistic Study of Internal Restructuring in Knowledge Distillation
Knowledge distillation compresses a larger neural model (teacher) into smaller, faster student models by training them to match teacher outputs. However, the internal computational transformations that occur during this process remain poorly understood. We apply techniques from mechanistic interpretability to analyze how internal circuits, representations, and activation patterns differ between teachers and students. Focusing on GPT2 and its distilled counterpart DistilGPT2, and generalizing our findings to both bidirectional architectures and larger model pairs, we find that student models can reorganize, compress, and discard teacher components, often resulting in a stronger reliance on fewer individual components. To quantify functional alignment beyond output similarity, we introduce an alignment metric based on influence-weighted component similarity, validated across multiple tasks. Our findings reveal that while knowledge distillation preserves broad functional behaviors, it also causes significant shifts in internal computation, with important implications for the robustness and generalization capacity of distilled models.
Updated: 2026-03-09 10:43:26
标题: 蒸馏电路:知识蒸馏中内部重构的机制研究
摘要: 知识蒸馏通过训练较大的神经模型(教师)以匹配教师输出来将其压缩为较小、更快的学生模型。然而,在这个过程中发生的内部计算转换仍然知之甚少。我们应用机制可解释性技术来分析教师和学生之间的内部电路、表示和激活模式的差异。以GPT2及其蒸馏对应物DistilGPT2为重点,并将我们的发现推广到双向架构和更大模型对,我们发现学生模型可以重新组织、压缩和丢弃教师组件,通常导致对更少的单个组件更强烈依赖。为了量化功能对齐性超出输出相似性,我们引入了一个基于影响加权组件相似度的对齐度量,通过多个任务验证。我们的发现表明,尽管知识蒸馏保留了广泛的功能行为,但也导致内部计算发生显著变化,对蒸馏模型的鲁棒性和泛化能力具有重要影响。
更新时间: 2026-03-09 10:43:26
领域: cs.LG
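The Distilled Circuits abstract describes students trained to match teacher outputs. A common way to express that objective is a temperature-softened KL divergence between teacher and student distributions. The sketch below shows this generic formulation. It is an assumption for illustration and is not claimed to be the exact loss used to train DistilGPT2 or the models studied in the paper.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = np.sum(np.exp(log_p_teacher) * (log_p_teacher - log_p_student), axis=-1)
    return float(np.mean(kl) * temperature ** 2)


rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))
student_logits = rng.normal(size=(4, 10))

loss_mismatched = distillation_loss(student_logits, teacher_logits)
loss_matched = distillation_loss(teacher_logits, teacher_logits)
```

A loss of this kind constrains only the outputs. That is consistent with the abstract's point that internal circuits are free to reorganize while output behavior is preserved.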
Pretraining in Actor-Critic Reinforcement Learning for Robot Locomotion
The pretraining-finetuning paradigm has facilitated numerous transformative advancements in artificial intelligence research in recent years. However, in the domain of reinforcement learning (RL) for robot locomotion, individual skills are often learned from scratch despite the high likelihood that some generalizable knowledge is shared across all task-specific policies belonging to the same robot embodiment. This work aims to define a paradigm for pretraining neural network models that encapsulate such knowledge and can subsequently serve as a basis for warm-starting the RL process in classic actor-critic algorithms, such as Proximal Policy Optimization (PPO). We begin with a task-agnostic exploration-based data collection algorithm to gather diverse, dynamic transition data, which is then used to train a Proprioceptive Inverse Dynamics Model (PIDM) through supervised learning. The pretrained weights are then loaded into both the actor and critic networks to warm-start the policy optimization of actual tasks. We systematically validated our proposed method with 9 distinct robot locomotion RL environments comprising 3 different robot embodiments, showing significant benefits of this initialization strategy. Our proposed approach on average improves sample efficiency by 36.9% and task performance by 7.3% compared to random initialization. We further present key ablation studies and empirical analyses that shed light on the mechanisms behind the effectiveness of this method.
Updated: 2026-03-09 10:43:25
标题: 机器人运动的演员-评论者强化学习预训练
摘要: 最近几年,预训练-微调范式在人工智能研究中促进了许多变革性进展。然而,在机器人运动的强化学习领域,尽管同一机器人实体的所有特定任务策略之间可能存在一些可泛化的知识,但个体技能通常是从零开始学习的。本文旨在为预训练神经网络模型定义一种范式,以封装这种知识,并随后作为热启动经典的演员-评论家算法(如PPO)中的RL过程的基础。我们首先使用一种与任务无关的、基于探索的数据收集算法来收集多样化、动态的转换数据,然后使用这些数据通过监督学习来训练一个本体感知逆动力学模型(PIDM)。预训练的权重然后加载到演员和评论家网络中,以热启动实际任务的策略优化。我们通过9个不同机器人运动RL环境验证了我们提出的方法,包括3种不同的机器人实体,展示了该初始化策略的显著优势。与随机初始化相比,我们提出的方法平均提高了样本效率36.9%,任务性能提高了7.3%。我们进一步提出了关键的消融研究和经验分析,以阐明该方法有效性背后的机制。
更新时间: 2026-03-09 10:43:25
领域: cs.RO,cs.LG
Impact of LLMs news Sentiment Analysis on Stock Price Movement Prediction
This paper addresses stock price movement prediction by leveraging LLM-based news sentiment analysis. Earlier works have largely focused on proposing and assessing sentiment analysis models and stock movement prediction methods, but separately. Although promising results have been achieved, a clear and in-depth understanding of the benefit of news sentiment to this task, as well as a comprehensive assessment of different architecture types in this context, is still lacking. Herein, we conduct an evaluation study that compares three LLMs, namely DeBERTa, RoBERTa, and FinBERT, for sentiment-driven stock prediction. Our results suggest that DeBERTa outperforms the other two models with an accuracy of 75% and that an ensemble model that combines the three models can increase the accuracy to about 80%. We also see that news sentiment features can (slightly) benefit some stock market prediction models, i.e., LSTM-, PatchTST-, and tPatchGNN-based classifiers and PatchTST- and TimesNet-based regression models.
Updated: 2026-03-09 10:42:38
标题: LLMs新闻情感分析对股价走势预测的影响
摘要: 本文通过利用基于LLM的新闻情感分析来预测股价走势。先前的研究主要集中在提出和评估情感分析模型和股票走势预测方法,但是这两者是分开的。尽管取得了一些有希望的结果,但对新闻情感对这一任务的益处的清晰和深入理解,以及在这种情况下对不同架构类型的综合评估仍然缺乏。在这里,我们进行了一项评估研究,比较了3种不同的LLM,即DeBERTa、RoBERTa和FinBERT,用于情感驱动的股票预测。我们的结果表明,DeBERTa的准确率为75%,优于另外两种模型,而将这三种模型结合起来的集成模型可以将准确率提高到约80%。此外,我们发现情感新闻特征可以在一定程度上有益于某些股市预测模型,即基于LSTM、PatchTST和tPatchGNN的分类器以及基于PatchTST和TimesNet的回归任务模型。
更新时间: 2026-03-09 10:42:38
领域: q-fin.ST,cs.AI,cs.CE
Revisiting Gradient Staleness: Evaluating Distance Metrics for Asynchronous Federated Learning Aggregation
In asynchronous federated learning (FL), client devices send updates to a central server at varying times based on their computational speed, often using stale versions of the global model. This staleness can degrade the convergence and accuracy of the global model. Previous work, such as AsyncFedED, proposed an adaptive aggregation method using Euclidean distance to measure staleness. In this paper, we extend this approach by exploring alternative distance metrics to more accurately capture the effect of gradient staleness. We integrate these metrics into the aggregation process and evaluate their impact on convergence speed, model performance, and training stability under heterogeneous clients and non-IID data settings. Our results demonstrate that certain metrics lead to more robust and efficient asynchronous FL training, offering a stronger foundation for practical deployment.
Updated: 2026-03-09 10:40:25
标题: 重新审视梯度陈旧性:评估异步联邦学习聚合的距离度量
摘要: 在异步联邦学习(FL)中,客户端设备根据其计算速度在不同时间向中央服务器发送更新,通常使用全局模型的陈旧版本。这种陈旧可能会降低全局模型的收敛性和准确性。先前的工作,如AsyncFedED,提出了一种使用欧几里德距离来衡量陈旧的自适应聚合方法。在本文中,我们通过探索替代距离度量标准来扩展这种方法,以更准确地捕捉梯度陈旧的影响。我们将这些度量标准整合到聚合过程中,并评估它们对收敛速度、模型性能和在异构客户端和非IID数据设置下的训练稳定性的影响。我们的结果表明,某些度量标准可以导致更稳健和高效的异步FL训练,为实际部署提供了更坚实的基础。
更新时间: 2026-03-09 10:40:25
领域: cs.LG,cs.AI
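To make the idea of distance-based staleness concrete, the sketch below measures how far the current global model has drifted from the stale copy a client trained on, then down-weights that client's update accordingly. The weighting rule `1 / (1 + staleness)` and the function names are hypothetical choices for illustration. They are not the AsyncFedED rule nor the metrics evaluated in the paper.

```python
import numpy as np


def euclidean(a, b):
    return float(np.linalg.norm(a - b))


def manhattan(a, b):
    return float(np.abs(a - b).sum())


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    # Clamp at zero to guard against tiny negative values from rounding.
    return max(0.0, float(1.0 - (a @ b) / denom))


def aggregate(global_w, stale_w, client_update, distance):
    """Apply a client update scaled down by the drift between global and stale weights."""
    staleness = distance(global_w, stale_w)
    weight = 1.0 / (1.0 + staleness)  # assumed rule: more drift means a smaller step
    return global_w + weight * client_update, weight


global_w = np.array([1.0, 2.0, 3.0])
fresh_copy = global_w.copy()                  # client trained on the current model
stale_copy = np.array([0.0, 0.0, 0.0])        # client trained on a much older model
update = np.array([0.1, 0.1, 0.1])

_, weight_fresh = aggregate(global_w, fresh_copy, update, euclidean)
_, weight_stale_l2 = aggregate(global_w, stale_copy, update, euclidean)
_, weight_stale_l1 = aggregate(global_w, stale_copy, update, manhattan)
```

Swapping the `distance` argument changes how strongly a given drift is penalized. The abstract presents that choice as the subject of the paper's comparison.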
Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors
Reliable unmanned aerial vehicle (UAV) detection is critical for autonomous airspace monitoring but remains challenging when integrating sensor streams that differ substantially in resolution, perspective, and field of view. Conventional fusion methods, such as wavelet-, Laplacian-, and decision-level approaches, often fail to preserve spatial correspondence across modalities and suffer from annotation inconsistencies, limiting their robustness in real-world settings. This study introduces two fusion strategies, Registration-aware Guided Image Fusion (RGIF) and Reliability-Gated Modality-Attention Fusion (RGMAF), designed to overcome these limitations. RGIF employs Enhanced Correlation Coefficient (ECC)-based affine registration combined with guided filtering to maintain thermal saliency while enhancing structural detail. RGMAF integrates affine and optical-flow registration with a reliability-weighted attention mechanism that adaptively balances thermal contrast and visual sharpness. Experiments were conducted on the Multi-Sensor and Multi-View Fixed-Wing (MMFW)-UAV dataset comprising 147,417 annotated air-to-air frames collected from infrared, wide-angle, and zoom sensors. Among single-modality detectors, YOLOv10x demonstrated the most stable cross-domain performance and was selected as the detection backbone for evaluating fused imagery. RGIF improved the visual baseline by 2.13% mAP@50 (achieving 97.65%), while RGMAF attained the highest recall of 98.64%. These findings show that registration-aware and reliability-adaptive fusion provides a robust framework for integrating heterogeneous modalities, substantially enhancing UAV detection performance in multimodal environments.
Updated: 2026-03-09 10:39:26
标题: 《面向对齐和可靠性门控的多模态融合技术用于异构热-视觉传感器下的无人机检测》
摘要: 可靠的无人机(UAV)检测对于自主空域监测至关重要,但在整合分辨率、透视和视野大不相同的传感器流时仍然具有挑战性。传统的融合方法,如小波、拉普拉斯和决策级方法,通常无法保持不同模态之间的空间对应关系,并且受到不一致性标注的限制,在真实世界环境中的鲁棒性有限。本研究引入了两种融合策略,即Registration-aware Guided Image Fusion(RGIF)和Reliability-Gated Modality-Attention Fusion(RGMAF),旨在克服这些限制。RGIF采用基于增强相关系数(ECC)的仿射配准结合引导滤波,以保持热视觉显著性同时增强结构细节。RGMAF集成了仿射和光流配准,并结合了可靠性加权的注意机制,自适应平衡热对比度和视觉清晰度。实验在包含147,417个标注的红外、广角和变焦传感器采集的空对空帧的多传感器和多视角固定翼(MMFW)-UAV数据集上进行。在单模态检测器中,YOLOv10x表现出最稳定的跨领域性能,并被选为评估融合图像的检测骨干。RGIF将视觉基准提高了2.13% mAP@50(达到97.65%),而RGMAF获得了最高的召回率98.64%。这些发现表明,基于配准和可靠性自适应的融合提供了一个强大的框架,用于整合异质模态,在多模式环境中显著提高了UAV检测性能。
更新时间: 2026-03-09 10:39:26
领域: cs.CV,cs.AI
Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules
Prior-Data Fitted Networks (PFNs), such as TabPFN and TabICL, have revolutionized tabular deep learning by leveraging in-context learning for tabular data. These models are meant as foundation models for classification and regression settings and promise to greatly simplify deployment in practical settings because their performance is unprecedented (in terms of mean squared error or $R^2$, when measured on common benchmarks like TabArena or TALENT). However, we see an important weakness of current benchmarks for the regression setting: they focus on evaluating win rates and performance using metrics like (root) mean squared error or $R^2$. Therefore, these leaderboards (implicitly and explicitly) push researchers to optimize for machine learning pipelines that elicit a good mean value estimate. The main problem is that this approach only evaluates a point estimate (namely the mean estimator, which is the Bayes estimator associated with the mean squared error loss). In this article we discuss the application of proper scoring rules for evaluating the goodness of probabilistic forecasts in distributional regression. We also propose to enhance common machine learning benchmarks with metrics for probabilistic regression. To improve the status quo and make the machine learning community aware of scoring rules for probabilistic regression, we advocate using the continuous ranked probability score (CRPS) in benchmarks for probabilistic regression. However, we also illustrate that the choice of the scoring rule changes the inductive bias of the trained model. We, therefore, advocate for finetuning or promptable tabular foundation models.
Updated: 2026-03-09 10:38:01
标题: 基于表格基础模型的分布回归:通过适当的评分规则评估概率预测
摘要: Prior-Data Fitted Networks (PFNs), 例如TabPFN和TabICL,通过利用上下文学习来革新表格深度学习。这些模型旨在作为分类和回归设置的基础模型,并承诺在实际环境中大大简化部署,因为它们的性能是前所未有的(在像TabArena或TALENT这样的常见基准上以均方误差或$R^2$来衡量)。 然而,我们发现当前回归设置的基准存在一个重要的弱点:当前基准侧重于评估胜率和性能,使用类似(根)均方误差或$R^2$这样的指标。 因此,这些排行榜(隐式和显式地)推动研究人员优化机器学习管道,以获得良好的均值估计。 主要问题在于这种方法只评估一个点估计(即与均方误差损失相关的贝叶斯估计器的均值估计)。 在本文中,我们讨论了应用适当的评分规则来评估分布回归中概率预测的优劣。 我们还提出通过概率回归的指标来增强常见的机器学习基准。 为了改善现状并让机器学习社区了解概率回归的评分规则,我们主张在概率回归的基准中使用连续排名概率分数(CRPS)。 然而,我们也阐明评分规则的选择会改变训练模型的归纳偏差。因此,我们主张对表格基础模型进行微调或提示。
更新时间: 2026-03-09 10:38:01
领域: cs.LG,cs.AI
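The CRPS recommended in the abstract above can be estimated directly from samples of a predictive distribution. The following is a minimal sketch, assuming the forecast is available as a list of draws; the function name `crps_from_samples` is hypothetical and is not taken from the paper. It uses the standard identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, with X and X' independent draws from F.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one observation y (lower is better).

    Uses CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, where the expectations
    are replaced by averages over the supplied samples. The name and the
    interface are illustrative assumptions, not the paper's code.
    """
    n = len(samples)
    # Average distance between the forecast draws and the observation.
    accuracy_term = sum(abs(x - y) for x in samples) / n
    # Average pairwise distance between forecast draws (forecast spread).
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy_term - 0.5 * spread_term


# A degenerate (point) forecast reduces to the absolute error.
point_forecast = [2.0] * 5
print(crps_from_samples(point_forecast, 3.0))

# A two-point forecast that brackets the observation.
print(crps_from_samples([0.0, 2.0], 1.0))
```

Because the score depends on the whole set of draws rather than on their mean alone, two forecasts with the same mean can receive different scores, which is the property the abstract argues that mean-squared-error leaderboards miss.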
Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis
Artificial Intelligence (AI)-driven code generation tools are increasingly used throughout the software development lifecycle to accelerate coding tasks. However, the security of AI-generated code using Large Language Models (LLMs) remains underexplored, with studies revealing various risks and weaknesses. This paper analyzes the security of code generated by LLMs across different programming languages. We introduce a dataset of 200 tasks grouped into six categories to evaluate the performance of LLMs in generating secure and maintainable code. Our research shows that while LLMs can automate code creation, their security effectiveness varies by language. Many models fail to utilize modern security features in recent compiler and toolkit updates, such as Java 17. Moreover, outdated methods are still commonly used, particularly in C++. This highlights the need for advancing LLMs to enhance security and quality while incorporating emerging best practices in programming languages.
Updated: 2026-03-09 10:34:05
标题: LLM生成代码中的安全性和质量:多语言,多模型分析
摘要: 人工智能(AI)驱动的代码生成工具越来越被广泛应用于软件开发生命周期中,以加速编码任务。然而,使用大型语言模型(LLMs)生成的AI代码的安全性仍未得到充分探讨,研究揭示了各种风险和弱点。本文分析了LLMs生成的代码在不同编程语言中的安全性。我们引入了一个包含200个任务分成六类的数据集,以评估LLMs在生成安全和可维护代码方面的性能。我们的研究表明,虽然LLMs可以自动化代码创建,但它们的安全性效果因语言而异。许多模型未能利用最新编译器和工具包更新中的现代安全功能,例如Java 17。此外,过时的方法仍然常用,特别是在C++中。这突显了需要推进LLMs以增强安全性和质量,同时纳入编程语言中新兴的最佳实践。
更新时间: 2026-03-09 10:34:05
领域: cs.CR,cs.LG,cs.SE
A Bipartite Quantum Key Distribution Protocol Based on Indefinite Causal Order
We propose a bipartite quantum key distribution (QKD) protocol based on causal nonseparability: the presence of a resource -- a process matrix -- that does not correspond to any definite causal order between two parties. In our protocol, Alice and Bob perform local operations arranged in a ``causal-order guessing game,'' whereby each round yields an 85.35\% probability of matching bits when the communication is undisturbed. This raw matching probability (or equivalently, a $\sim14.65\%$ error rate) is amenable to standard forward error-correction strategies. We further discuss the practical construction of the QKD protocol using indefinite causal order, where several different scenarios are deeply analyzed.
Updated: 2026-03-09 10:32:56
标题: 基于不确定因果顺序的双分量量子密钥分发协议
摘要: 我们提出了一种基于因果非可分性的双分区量子密钥分发(QKD)协议:存在一种资源——一个过程矩阵——它不对应于两方之间的任何确定因果顺序。在我们的协议中,Alice和Bob执行本地操作,安排在一个“因果顺序猜测游戏”中,每一轮在通信未受干扰时都会产生85.35%的比特匹配概率。这个原始匹配概率(或者等效地,约14.65%的错误率)适用于标准的前向纠错策略。我们进一步讨论了使用不确定因果顺序构建QKD协议的实际方法,深入分析了几种不同的情景。
更新时间: 2026-03-09 10:32:56
领域: quant-ph,cs.CR
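As a rough numerical companion to the figures quoted in the abstract above, the sketch below checks that the stated 85.35% matching probability coincides numerically with cos^2(pi/8) = (2 + sqrt(2))/4, and computes the capacity of a binary symmetric channel at the corresponding error rate. Modelling the raw key as a binary symmetric channel with no eavesdropper is an assumption made here for illustration only; it is not the paper's security or key-rate analysis.

```python
import math


def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(error_rate):
    """Capacity 1 - h(p) of a binary symmetric channel (illustrative model)."""
    return 1.0 - binary_entropy(error_rate)


# Numerical observation: the quoted 85.35% matches cos^2(pi/8).
p_match = math.cos(math.pi / 8) ** 2
error_rate = 1.0 - p_match

print(f"matching probability: {p_match:.4f}")
print(f"raw error rate:       {error_rate:.4f}")
print(f"BSC capacity (bits):  {bsc_capacity(error_rate):.4f}")
```

Under this simplified model the capacity bounds the fraction of raw bits that any forward error-correction scheme could turn into reliably shared bits, before any privacy amplification is considered.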
MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
Contrastive learning has become a fundamental approach in both uni-modal and multi-modal frameworks. This learning paradigm pulls positive pairs of samples closer while pushing negatives apart. In the uni-modal setting (e.g., image-based learning), previous research has shown that the strength of these forces can be controlled through the temperature parameter. In this work, we propose Multi-Modal Temperature and Margin Schedules (MM-TS), extending the concept of uni-modal temperature scheduling to multi-modal contrastive learning. Our method dynamically adjusts the temperature in the contrastive loss during training, modulating the attraction and repulsion forces in the multi-modal setting. Additionally, recognizing that standard multi-modal datasets often follow imbalanced, long-tail distributions, we adapt the temperature based on the local distribution of each training sample. Specifically, samples from dense clusters are assigned a higher temperature to better preserve their semantic structure. Furthermore, we demonstrate that temperature scheduling can be effectively integrated within a max-margin framework, thereby unifying the two predominant approaches in multi-modal contrastive learning: InfoNCE loss and max-margin objective. We evaluate our approach on four widely used image- and video-language datasets, Flickr30K, MSCOCO, EPIC-KITCHENS-100, and YouCook2, and show that our dynamic temperature and margin schedules improve performance and lead to new state-of-the-art results in the field.
Updated: 2026-03-09 10:29:50
标题: MM-TS: 多模式温度和边缘调度用于长尾数据的对比学习
摘要: 对比学习已成为单模态和多模态框架中的基本方法。这种学习范式拉近正样本对,同时将负样本分开。在单模态设置中(例如,基于图像的学习),先前的研究表明,这些力的强度可以通过温度参数来控制。在这项工作中,我们提出了多模态温度和边缘调度(MM-TS),将单模态温度调度的概念扩展到多模态对比学习中。我们的方法在训练过程中动态调整对比损失中的温度,调节多模态设置中的吸引和排斥力。此外,我们认识到标准的多模态数据集通常遵循不平衡、长尾分布,我们根据每个训练样本的局部分布调整温度。具体来说,稠密聚类中的样本被分配更高的温度,以更好地保留它们的语义结构。此外,我们证明了温度调度可以有效地集成到最大间隔框架中,从而统一多模态对比学习中的两种主要方法:InfoNCE损失和最大间隔目标。我们在四个广泛使用的图像和视频语言数据集(Flickr30K,MSCOCO,EPIC-KITCHENS-100和YouCook2)上评估了我们的方法,并展示了我们的动态温度和边缘调度提高了性能,并在该领域取得了新的最先进结果。
更新时间: 2026-03-09 10:29:50
领域: cs.CV,cs.AI
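The role of the temperature described in the abstract above can be seen in a small InfoNCE computation. This is a minimal sketch, assuming a precomputed similarity matrix with positives on the diagonal and one temperature per anchor sample; the function name `info_nce` is hypothetical, only the image-to-text direction is shown, and the way MM-TS derives each temperature from the schedule or from local density is not reproduced here.

```python
import math


def info_nce(sim, temperatures):
    """One-directional InfoNCE loss with a per-sample temperature.

    sim[i][j] is the similarity between anchor i and candidate j, with the
    positive pair on the diagonal. temperatures[i] scales row i. Both the
    interface and the per-sample temperature input are illustrative
    assumptions, not the MM-TS implementation.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [sim[i][j] / temperatures[i] for j in range(n)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


similarities = [[0.9, 0.1],
                [0.2, 0.8]]

sharp = info_nce(similarities, [0.07, 0.07])   # low temperature
smooth = info_nce(similarities, [1.0, 1.0])    # high temperature
print(sharp, smooth)
```

With correctly ranked pairs, a low temperature sharpens the softmax and drives the loss toward zero, while a high temperature keeps the negatives influential, which is the attraction and repulsion trade-off that a temperature schedule modulates.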
Experience on Automatically Converting a C++ Monolith to Java EE
Converting a large C++ code base (800k lines of code) into Java is challenging in itself. Changing the architecture from a monolith into an application that adheres to the Java application server standard and runs on WildFly is a different matter altogether. This report describes the experience gained during the C++ to Java conversion, the techniques used, and the path to running the Java code on the application server for the first time. The approaches to solving the usual C++ to Java culprits, such as multiple inheritance, enum handling, and scoped objects, are described. A clang-tool-based software was developed to continuously regenerate the Java code, because development on the C++ code base continued.
Updated: 2026-03-09 10:28:08
标题: 将C++单片机自动转换为Java EE的经验
摘要: 将一个庞大的C++代码库(80万行代码)单独转换成Java是具有挑战性的。将架构从单体转变为符合Java应用服务器标准的应用,并在WildFly上运行它则是另一回事。本报告描述了在C++转换为Java过程中所经历的经验,所采用的技术,以及第一次在应用服务器上运行Java代码成功的方式。描述了解决常见的C++到Java问题的方法,如多重继承,枚举处理和作用域对象。开发了基于clang工具的软件,以持续重新生成Java代码,因为对C++代码库的开发仍在继续。
更新时间: 2026-03-09 10:28:08
领域: cs.CR
Empirical PAC-Bayes bounds for Markov chains
The core of generalization theory was developed for independent observations. Some PAC and PAC-Bayes bounds are available for data that exhibit a temporal dependence. However, there are constants in these bounds that depend on properties of the data-generating process: mixing coefficients, mixing time, spectral gap... Such constants are unknown in practice. In this paper, we prove a new PAC-Bayes bound for Markov chains. This bound depends on a quantity called the pseudo-spectral gap. The main novelty is that we can provide an empirical bound on the pseudo-spectral gap when the state space is finite. Thus, we obtain the first fully empirical PAC-Bayes bound for Markov chains. This extends beyond the finite case, although this requires additional assumptions. On simulated experiments, the empirical version of the bound is essentially as tight as the non-empirical one.
Updated: 2026-03-09 10:28:08
标题: 马尔可夫链的经验性PAC-Bayes界限
摘要: 概括理论的核心是针对独立观察值开发的。对于展现时间依赖性的数据,一些PAC和PAC-Bayes界限是可用的。然而,这些界限中有一些常数取决于数据生成过程的属性:混合系数、混合时间、谱间隙等。这些常数在实践中是未知的。在本文中,我们证明了马尔可夫链的一种新的PAC-Bayes界限。这个界限取决于一个称为伪谱间隙的量。主要的创新点在于,当状态空间是有限的时候,我们可以提供关于伪谱间隙的经验界限。因此,我们获得了马尔可夫链的第一个完全经验的PAC-Bayes界限。这超出了有限情况,尽管这需要额外的假设。在模拟实验中,经验版本的界限实际上与非经验版本几乎一样紧凑。
更新时间: 2026-03-09 10:28:08
领域: stat.ML,cs.LG
Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model
Spiking Neural Networks (SNNs) are considered to have enormous potential in the future development of Artificial Intelligence due to their brain-inspired and energy-efficient properties. Compared to vanilla Spatial-Temporal Back-propagation (STBP) training methods, online training can effectively avoid the risk of GPU memory explosion. However, current online learning frameworks cannot tackle the gradient discrepancy problem between the forward and backward processes; they merely aim to optimize GPU memory, resulting in no performance advantage over STBP-based models in the inference stage. To address these challenges, we propose the Hybrid-Driven Leaky Integrate-and-Fire (HD-LIF) model family for efficient online learning, which adopts different spiking calculation mechanisms in the upper and lower regions of the firing threshold. We show theoretically that our learning framework can effectively separate temporal gradients and address the misalignment problem of surrogate gradients, while achieving full-stage optimization of learning precision, memory complexity, and power consumption. Experimental results demonstrate that our scheme is able to achieve state-of-the-art performance on multiple evaluation metrics, breaking through the traditional paradigm of SNN online training and deployment. Code is available at https://github.com/hzc1208/HD_LIF.
Updated: 2026-03-09 10:21:02
标题: 重新思考SNN在线训练和部署:通过混合驱动的LIF模型实现梯度一致学习
摘要: 尖峰神经网络(SNN)由于其类似大脑且高效能的特性,在人工智能未来发展中被认为具有巨大潜力。与普通的空间-时间反向传播(STBP)训练方法相比,在线训练可以有效避免GPU内存爆炸的风险。然而,当前的在线学习框架无法解决前向和后向过程之间的梯度差异问题,仅旨在优化GPU内存,在推理阶段与基于STBP的模型相比没有性能优势。为了解决上述挑战,我们提出了用于高效在线学习的混合驱动漏电积分-发放(HD-LIF)模型系列,分别在发放阈值的上部和下部采用不同的尖峰计算机制。我们理论上指出,我们的学习框架可以有效分离时间梯度,并解决替代梯度的不对齐问题,同时实现对学习精度、内存复杂度和功耗的全阶段优化。实验结果表明,我们的方案能够实现多个评估指标的最新性能,突破了传统SNN在线训练和部署范式。您可以在\href{https://github.com/hzc1208/HD_LIF}{这里}找到代码。
更新时间: 2026-03-09 10:21:02
领域: cs.NE,cs.AI
Step2Motion: Locomotion Reconstruction from Pressure Sensing Insoles
Human motion is fundamentally driven by continuous physical interaction with the environment. Whether walking, running, or simply standing, the forces exchanged between our feet and the ground provide crucial insights for understanding and reconstructing human movement. Recent advances in wearable insole devices offer a compelling solution for capturing these forces in diverse, real-world scenarios. Sensor insoles pose no constraint on the users' motion (unlike mocap suits) and are unaffected by line-of-sight limitations (in contrast to optical systems). These qualities make sensor insoles an ideal choice for robust, unconstrained motion capture, particularly in outdoor environments. Surprisingly, leveraging these devices with recent motion reconstruction methods remains largely unexplored. Aiming to fill this gap, we present Step2Motion, the first approach to reconstruct human locomotion from multi-modal insole sensors. Our method utilizes pressure and inertial data (accelerations and angular rates) captured by the insoles to reconstruct human motion. We evaluate the effectiveness of our approach across a range of experiments to show its versatility for diverse locomotion styles, from simple ones like walking or jogging up to moving sideways, on tiptoes, slightly crouching, or dancing.
Updated: 2026-03-09 10:16:26
标题: Step2Motion:通过压力感应鞋垫重建运动路径
摘要: 人类运动基本上是由与环境的持续物理互动驱动的。无论是行走、奔跑,还是简单站立,我们的双脚与地面之间交换的力量提供了理解和重建人类运动的重要见解。可穿戴鞋垫设备的最新进展提供了一个引人注目的解决方案,可以在多样化的现实场景中捕捉这些力量。传感器鞋垫不会对用户的运动造成限制(与动作捕捉服相反),也不受视线限制的影响(与光学系统相反)。这些特点使传感器鞋垫成为在户外环境中进行稳健、不受限制的运动捕捉的理想选择。令人惊讶的是,利用这些设备与最新的运动重建方法仍然在很大程度上未被探索。为了填补这一空白,我们提出了Step2Motion,这是第一个从多模式鞋垫传感器重建人类运动的方法。我们的方法利用鞋垫捕捉的压力和惯性数据-加速度和角速率-来重建人类运动。我们通过一系列实验评估了我们方法的有效性,展示了其对各种运动风格的多样性,从简单的步行或慢跑到侧身移动、踮起脚尖、轻微蹲下或跳舞。
更新时间: 2026-03-09 10:16:26
领域: cs.GR,cs.AI
Sequential Service Region Design with Capacity-Constrained Investment and Spillover Effect
Service region design determines the geographic coverage of service networks, shaping long-term operational performance. Capital and operational constraints preclude simultaneous large-scale deployment, requiring expansion to proceed sequentially. The resulting challenge is to determine when and where to invest under demand uncertainty, balancing intertemporal trade-offs between early and delayed investment and accounting for network effects whereby each deployment reshapes future demand through inter-regional connectivity. This study addresses a sequential service region design (SSRD) problem incorporating two practical yet underexplored factors: a $k$-region constraint that limits the number of regions investable per period and a stochastic spillover effect linking investment decisions to demand evolution. The resulting problem requires sequencing regional portfolios under uncertainty, leading to a combinatorial explosion in feasible investment sequences. To address this challenge, we propose a solution framework that integrates real options analysis (ROA) with a Transformer-based Proximal Policy Optimization (TPPO) algorithm. ROA evaluates the intertemporal option value of investment sequences, while TPPO learns sequential policies that directly generate high option-value sequences without exhaustive enumeration. Numerical experiments on realistic multi-region settings demonstrate that TPPO converges faster than benchmark DRL methods and consistently identifies sequences with superior option value. Case studies and sensitivity analyses further confirm robustness and provide insights on investment concurrency, regional prioritization, and the increasing benefits of adaptive expansion via our approach under stronger spillovers and dynamic market conditions.
Updated: 2026-03-09 10:09:59
标题: 具有容量限制投资和溢出效应的序贯服务区域设计
摘要: 服务区域设计决定了服务网络的地理覆盖范围,塑造了长期运营绩效。资本和运营限制使得大规模部署无法同时进行,需要按顺序进行扩展。由此产生的挑战是在需求不确定性下确定何时何地投资,平衡早期和延迟投资之间的时间跨度折衷,并考虑到网络效应,即每次部署通过区域间连接重新塑造未来需求。本研究解决了一个包含两个实践但未充分探讨因素的顺序服务区域设计(SSRD)问题:一个限制每期可投资区域数量的$k$-区域约束,以及将投资决策与需求演变联系起来的随机溢出效应。由此产生的问题需要在不确定性下对区域组合进行排序,导致可行投资序列的组合性爆炸。为了解决这一挑战,我们提出了一个解决方案框架,该框架将实物期权分析(ROA)与基于Transformer的Proximal Policy Optimization(TPPO)算法相结合。ROA评估投资序列的时间期权价值,而TPPO学习直接生成高期权价值序列的顺序策略,而无需穷举。在现实多区域环境中进行的数值实验表明,TPPO比基准DRL方法收敛更快,并始终确定具有优越期权价值的序列。案例研究和敏感性分析进一步证实了鲁棒性,并提供关于投资并发性、区域优先级和通过我们的方法在更强的溢出效应和动态市场条件下实现适应性扩张的增加好处的见解。
更新时间: 2026-03-09 10:09:59
领域: cs.LG
Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective
The escalating scale and cost of Large Language Model (LLM) training necessitate accurate pre-training prediction of downstream task performance for a comprehensive understanding of scaling properties. This is challenged by: 1) the emergence phenomenon, where unpredictable capabilities appear suddenly at critical model scales; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby constructing a more stable and predictable task subset that exhibits well-behaved scaling characteristics as the compute budget increases. We adopt a performance scaling law to predict cluster-wise performance with theoretical support. Predictable subset performance acts as an intermediate predictor for the full evaluation set. We further derive a mapping function to accurately extrapolate the performance of the subset to the full set. Applied to an LLM with 70B parameters, COD achieved a 1.55\% average prediction error across eight key LLM benchmarks, thus providing actionable insights for scaling properties and training monitoring during LLM pre-training.
Updated: 2026-03-09 10:06:10
标题: 揭示LLMs的下游性能扩展:基于聚类的视角
摘要: 随着大型语言模型(LLMs)训练规模和成本不断增加,需要准确地预测预训练对下游任务性能的影响,以全面了解其规模特性。这一挑战包括:1)新兴现象,即在关键的模型规模出现无法预测的功能;2)任务难度不均和性能规模特征不一致,导致指标的高度可变性。当前的预测方法缺乏准确性和可靠性。我们提出了一个基于困难度聚类(COD)框架用于预测下游性能。COD框架通过任务的难度规模特征对任务进行聚类,从而构建一个更稳定和可预测的任务子集,随着计算预算的增加,该子集表现出良好的性能规模特性。我们采用性能规模定律来预测簇内性能,并提供理论支持。可预测的子集性能作为完整评估集的中间预测器。我们进一步推导出一个映射函数,精确地将子集的性能推广到完整集。应用于具有70B参数的LLM,COD在八个关键LLM基准测试中实现了1.55\%的平均预测误差,从而为LLM预训练期间的规模特性和训练监测提供了可行的见解。
更新时间: 2026-03-09 10:06:10
领域: cs.CL,cs.AI,cs.LG
Neural delay differential equations: learning non-Markovian closures for partially known dynamical systems
Recent advances in learning dynamical systems from data have shown significant promise. However, many existing methods assume access to the full state of the system -- an assumption that is rarely satisfied in practice, where systems are typically monitored through a limited number of sensors, leading to partial observability. To address this challenge, we draw inspiration from the Mori-Zwanzig formalism, which provides a theoretical connection between hidden variables and memory terms. Motivated by this perspective, we introduce a constant-lag Neural Delay Differential Equations (NDDEs) framework, providing a continuous-time approach for learning non-Markovian dynamics directly from data. These memory effects are captured using a finite set of time delays, which are identified via the adjoint method. We validate the proposed approach on a range of datasets, including synthetic systems, chaotic dynamics, and experimental measurements, such as the Kuramoto-Sivashinsky equation and cavity-flow experiments. Results demonstrate that NDDEs compare favourably with existing approaches for partially observed systems, including long short-term memory (LSTM) networks and augmented neural ordinary differential equations (ANODEs). Overall, NDDEs offer a principled and data-efficient framework for modelling non-Markovian dynamics under partial observability. An open-source implementation accompanies this article.
Updated: 2026-03-09 10:05:32
标题: 神经延迟微分方程:学习部分已知动态系统的非马尔科夫闭合形式
摘要: 最近关于从数据中学习动态系统的研究取得了显著进展。然而,许多现有方法假设可以访问系统的完整状态 -- 这种假设在实践中很少得到满足,实际上系统通常通过有限数量的传感器监测,导致部分可观测性。为了解决这一挑战,我们从Mori-Zwanzig形式主义中汲取灵感,该形式主义提供了隐藏变量和记忆项之间的理论联系。受到这一视角的启发,我们引入了一个恒定滞后的神经延迟微分方程(NDDEs)框架,为直接从数据中学习非马尔科夫动态提供了一种连续时间方法。这些记忆效应通过一组有限的时间延迟来捕获,这些时间延迟是通过共轭方法确定的。我们在一系列数据集上验证了所提出的方法,包括合成系统、混沌动态和实验测量,如Kuramoto-Sivashinsky方程和腔流实验。结果表明,与现有的针对部分可观测系统的方法(包括长短期记忆网络(LSTM)和增强神经常微分方程(ANODEs))相比,NDDEs表现出色。总的来说,NDDEs为在部分可观测性下建模非马尔科夫动态提供了一种原则性和数据高效的框架。本文附带一个开源实现。
更新时间: 2026-03-09 10:05:32
领域: cs.LG,cs.AI,physics.comp-ph
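To make the idea of a constant-lag delay differential equation from the abstract above concrete, the sketch below integrates dx/dt = f(x(t), x(t - tau)) with a forward Euler scheme and a history buffer. It is a minimal illustration under stated assumptions: the function name `integrate_dde`, the hand-written right-hand side, and the fixed delay are all hypothetical. In the NDDE framework the right-hand side would be a neural network and the delays would be identified from data, which this sketch does not attempt.

```python
def integrate_dde(rhs, history_value, tau, dt, n_steps):
    """Forward Euler for dx/dt = rhs(x(t), x(t - tau)) with one constant lag.

    The state on [-tau, 0] is assumed constant and equal to history_value.
    Returns the trajectory sampled at t = 0, dt, 2*dt, ..., n_steps*dt.
    Names and interface are illustrative assumptions.
    """
    lag = round(tau / dt)                 # delay expressed in time steps
    xs = [history_value] * (lag + 1)      # samples covering t in [-tau, 0]
    for _ in range(n_steps):
        x_now = xs[-1]
        x_delayed = xs[-1 - lag]          # value tau time units in the past
        xs.append(x_now + dt * rhs(x_now, x_delayed))
    return xs[lag:]


# Toy memory-driven dynamics: the rate of change depends only on the past.
trajectory = integrate_dde(
    rhs=lambda x_now, x_delayed: -x_delayed,
    history_value=1.0,
    tau=1.0,
    dt=0.01,
    n_steps=300,
)
print(trajectory[0], trajectory[100], trajectory[-1])
```

During the first delay interval the delayed state is still the constant history, so the solution falls linearly from 1 toward 0 at t = tau. The dependence on x(t - tau) rather than on x(t) alone is what makes the dynamics non-Markovian in the observed state.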
Physics-Aware Neural Operators for Direct Inversion in 3D Photoacoustic Tomography
Learning physics-constrained inverse operators, rather than post-processing physics-based reconstructions, is a broadly applicable strategy for problems with expensive forward models. We demonstrate this principle in three-dimensional photoacoustic computed tomography (3D PACT), where current systems demand dense transducer arrays and prolonged scans, restricting clinical translation. We introduce PANO (PACT imaging neural operator), an end-to-end physics-aware neural operator (a deep learning architecture that generalizes across input sampling densities without retraining) that directly learns the inverse mapping from raw sensor measurements to a 3D volumetric image. Unlike two-step methods that reconstruct then denoise, PANO performs direct inversion in a single pass, jointly embedding physics and data priors. It employs spherical discrete-continuous convolutions to respect hemispherical sensor geometry and Helmholtz equation constraints to ensure physical consistency. PANO reconstructs high-quality images from both simulated and real data across diverse sparse acquisition settings, achieves real-time inference and outperforms the widely-used UBP algorithm by approximately 33 percentage points in cosine similarity on simulated data and 14 percentage points on real phantom data. These results establish a pathway toward more accessible 3D PACT systems for preclinical research, and motivate future in-vivo validation for clinical translation.
Updated: 2026-03-09 10:04:18
标题: 物理感知神经操作符用于3D光声断层成像中的直接反演
摘要: 学习受物理约束的逆算子-而不是后处理基于物理的重建-是解决具有昂贵前向模型问题的广泛适用策略。我们在三维光声计算断层扫描(3D PACT)中展示了这一原则,当前系统需要密集的传感器阵列和长时间扫描,限制了临床转化。我们引入了PANO(PACT成像神经算子),这是一个端到端的物理感知神经算子-一个深度学习架构,可以跨输入采样密度进行泛化,无需重新训练,直接学习从原始传感器测量到三维体积图像的逆映射。与重建然后去噪的两步方法不同,PANO在单次传递中执行直接反演,同时嵌入物理和数据先验。它利用球形离散连续卷积来尊重半球形传感器几何形状和Helmholtz方程约束以确保物理一致性。PANO可以从模拟数据和真实数据中重建高质量图像,跨越多种稀疏采集设置,实现实时推断,并在模拟数据上的余弦相似性上优于广泛使用的UBP算法约33个百分点,在真实幻影数据上优于14个百分点。这些结果为预临床研究提供了更可访问的3D PACT系统的途径,并激励未来进行体内验证,以进行临床转化。
更新时间: 2026-03-09 10:04:18
领域: eess.IV,cs.LG
SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization
Post-training quantization (PTQ) has emerged as a prevailing technique for deploying large language models (LLMs) efficiently in terms of both memory and computation, across edge devices and server platforms. Existing PTQ methods primarily aim to reduce precision in weights and activations by mitigating quantization errors caused by channel-wise outlier activations (e.g., pre-quantization scaling, online transformations, or low-rank error reconstruction). Among these approaches, error reconstruction with low-rank adaptation (LoRA) has proven particularly effective, as it introduces a lightweight auxiliary computation path without requiring heavy optimization or additional online layers. However, prior studies reveal severe accuracy degradation under W4A4 settings, and conventional low-rank adaptations rely on two sequential factors, necessitating intermediate quantization during inference and thereby limiting low-precision efficiency. In this work, we propose SERQ, a saliency-aware error reconstruction method for low-bit LLM inference that employs a single low-rank compensation matrix. SERQ preserves efficient 4-bit matrix multiplication in linear layers by jointly mitigating quantization errors arising from both activation and weight saliency through three stages: (1) static activation flattening, (2) saliency-aware error reconstruction, and (3) offline weight permutation. The method incurs additional computation only for low-rank error reconstruction via a single decomposition, while all other operations are performed offline, thereby keeping latency overhead minimal. Empirically, SERQ outperforms prior error reconstruction methods under both W4A8 and W4A4 settings, and achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches, while substantially reducing calibration complexity.
Updated: 2026-03-09 10:04:12
标题: SERQ:基于显著性的低秩误差重建用于LLM量化
摘要: 后训练量化(PTQ)已经成为一种主流技术,可以高效地在边缘设备和服务器平台上部署大型语言模型(LLMs),在内存和计算方面都非常有效。现有的PTQ方法主要旨在通过减少权重和激活的精度来缓解通道异常激活引起的量化误差(例如,预量化缩放,在线转换或低秩误差重构)。在这些方法中,低秩适应(LoRA)的误差重构已被证明特别有效,因为它引入了一个轻量级的辅助计算路径,而无需进行繁重的优化或添加额外的在线层。然而,先前的研究表明,在W4A4设置下存在严重的精度降级,传统的低秩适应依赖于两个连续因素,需要在推理过程中进行中间量化,从而限制了低精度的效率。在这项工作中,我们提出了SERQ,一种适用于低位LLM推理的基于显著性的误差重构方法,它采用单个低秩补偿矩阵。SERQ通过三个阶段(1)静态激活展平,(2)显著性感知误差重构和(3)离线权重排列,同时减少来自激活和权重显著性的量化误差,从而保留了线性层中高效的4位矩阵乘法。该方法只通过单一分解产生额外的计算,用于低秩误差重构,而所有其他操作均在离线进行,从而使延迟开销最小化。从经验上看,SERQ在W4A8和W4A4设置下优于先前的误差重构方法,并且比最先进的基于旋转的W4A4方法获得更高的准确性,同时大大减少了校准复杂性。
更新时间: 2026-03-09 10:04:12
领域: cs.LG
TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundational model trained for 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model performs favorably compared to other multilingual LLMs despite being trained with significantly fewer computing resources. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at huggingface.co/TildeAI/TildeOpen-30b. These outcomes demonstrate that careful data curation and balanced training strategies can substantially enhance multilingual model quality without increasing model size or training volume.
Updated: 2026-03-09 10:03:17
标题: TildeOpen LLM:利用课程学习实现公平语言表达
摘要: 大型语言模型在许多欧洲语言中通常表现不佳,这是由于训练数据中英语和少数高资源语言的主导地位。本文介绍了TildeOpen LLM,这是一个30亿参数的开放权重基础模型,为34种欧洲语言进行训练,以促进语言公平和提高低资源语言的性能。为了解决数据不平衡问题,我们结合了数据集上采样和基于课程的训练计划,交替使用均匀和自然语言分布。结果表明,与其他多语言LLM相比,该模型表现良好,尽管使用的计算资源明显较少。在多个多语言基准测试中的评估显示,TildeOpen在文本生成和理解方面超过了现有的开放权重模型,特别是波罗的海、芬诺乌戈尔和斯拉夫语言。人工评估证实,相对于主要基线,语言错误减少了高达十倍。该模型及相关资源完全开放权重,可在huggingface.co/TildeAI/TildeOpen-30b上公开获取。这些结果表明,仔细的数据筛选和平衡的训练策略可以显著提高多语言模型的质量,而无需增加模型大小或训练量。
更新时间: 2026-03-09 10:03:17
领域: cs.CL,cs.AI
AutoAdapt: An Automated Domain Adaptation Framework for LLMs
Large language models (LLMs) excel in open domains but struggle in specialized settings with limited data and evolving knowledge. Existing domain adaptation practices rely heavily on manual trial-and-error processes, incur significant hyperparameter complexity, and are highly sensitive to data and user preferences, all under the high cost of LLM training. Moreover, the interactions and transferability of hyperparameter choices across models/domains remain poorly understood, making adaptation gains uncertain even with substantial effort. To solve these challenges, we present AutoAdapt, a novel end-to-end automated framework for efficient and reliable LLM domain adaptation. AutoAdapt leverages curated knowledge bases from literature and open-source resources to reduce expert intervention. To narrow the search space, we design a novel multi-agent debating system in which proposal and critic agents iteratively interact to align user intent and incorporate data signals and best practices into the planning process. To optimize hyperparameters under tight budgets, we propose AutoRefine, a novel LLM-based surrogate that replaces costly black-box search. Across 10 tasks, AutoAdapt achieves a 25% average relative accuracy improvement over state-of-the-art Automated Machine Learning baselines with minimal overhead.
Updated: 2026-03-09 10:03:16
标题: AutoAdapt:针对LLMs的自动化领域适应框架
摘要: 大型语言模型(LLMs)在开放领域表现出色,但在数据有限和知识不断更新的专业设置中表现不佳。现有的领域适应做法严重依赖于手动试错过程,具有显著的超参数复杂性,并且对数据和用户偏好非常敏感,而且LLM培训成本高昂。此外,超参数选择在模型/领域之间的相互作用和可传递性仍不为人了解,即使付出大量努力,适应增益也是不确定的。为了解决这些挑战,我们提出了AutoAdapt,这是一个新颖的端到端自动化框架,用于高效可靠的LLM领域适应。AutoAdapt利用文献和开源资源中的精选知识库来减少专家介入。为了缩小搜索空间,我们设计了一个新颖的多智能体辩论系统,其中提案和批评智能体迭代交互,以对齐用户意图并将数据信号和最佳实践纳入规划过程。为了在紧张的预算下优化超参数,我们提出了AutoRefine,这是一个新颖的基于LLM的替代品,取代昂贵的黑匣子搜索。在10个任务中,AutoAdapt相对于最先进的自动机器学习基线实现了25%的平均相对精度改进,开销最小。
更新时间: 2026-03-09 10:03:16
领域: cs.LG
Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates
Open-weight language models are increasingly used in production settings, raising new security challenges. One prominent threat in this context is backdoor attacks, in which adversaries embed hidden behaviors in language models that activate under specific conditions. Previous work has assumed that adversaries have access to training pipelines or deployment infrastructure. We propose a novel attack surface requiring neither, which utilizes the chat template. Chat templates are executable Jinja2 programs invoked at every inference call, occupying a privileged position between user input and model processing. We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. We evaluated this attack vector by constructing template backdoors targeting two objectives: degrading factual accuracy and inducing emission of attacker-controlled URLs, and applied them across eighteen models spanning seven families and four inference engines. Under triggered conditions, factual accuracy drops from 90% to 15% on average while attacker-controlled URLs are emitted with success rates exceeding 80%; benign inputs show no measurable degradation. Backdoors generalize across inference runtimes and evade all automated security scans applied by the largest open-weight distribution platform. These results establish chat templates as a reliable and currently undefended attack surface in the LLM supply chain.
Updated: 2026-03-09 10:02:47
标题: 在LLM聊天模板中通过隐藏指令进行推理时间后门攻击
摘要: 开放权重语言模型越来越多地被用于生产环境,这引发了新的安全挑战。在这种情况下,一个突出的威胁是后门攻击,即对手在语言模型中嵌入隐藏行为,在特定条件下激活。先前的工作假设对手可以访问训练管道或部署基础设施。我们提出了一个新颖的攻击面,既不需要训练管道,也不需要部署基础设施,它利用了聊天模板。聊天模板是可执行的Jinja2程序,在每次推理调用时被调用,占据了用户输入和模型处理之间的特权位置。我们展示了一个对手可以在分发带有恶意修改模板的模型时,可以在推理时插入后门,而无需修改模型权重、毒化训练数据或控制运行时基础设施。我们通过构建目标为降低事实准确性和诱导发射受攻击者控制的URL的模板后门来评估这种攻击向量,并将它们应用于跨七个家族和四个推理引擎的十八个模型。在触发条件下,事实准确性平均下降到15%,而受攻击者控制的URL的发射成功率超过80%;良性输入没有可测量的降级。后门攻击可以泛化到推理运行时,并且可以规避最大的开放权重分发平台应用的所有自动化安全扫描。这些结果将聊天模板确认为LLM供应链中一个可靠且目前未受到防御的攻击面。
更新时间: 2026-03-09 10:02:47
领域: cs.CR,cs.LG
ALOOD: Exploiting Language Representations for LiDAR-based Out-of-Distribution Object Detection
LiDAR-based 3D object detection plays a critical role in reliable and safe autonomous driving systems. However, existing detectors often produce overly confident predictions for objects not belonging to known categories, posing significant safety risks. This is caused by so-called out-of-distribution (OOD) objects, which were not part of the training data, resulting in incorrect predictions. To address this challenge, we propose ALOOD (Aligned LiDAR representations for Out-Of-Distribution Detection), a novel approach that incorporates language representations from a vision-language model (VLM). By aligning the object features from the object detector to the feature space of the VLM, we can treat the detection of OOD objects as a zero-shot classification task. We demonstrate competitive performance on the nuScenes OOD benchmark, establishing a novel approach to OOD object detection in LiDAR using language representations. The source code is available at https://github.com/uulm-mrm/mmood3d.
Updated: 2026-03-09 10:02:45
Categories: cs.CV,cs.LG
Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states of SALM-Duplex and Moshi leak substantial speaker identity across all transformer layers. Layer-wise and turn-wise analyses reveal that leakage persists across all layers, with SALM-Duplex showing stronger leakage in early layers while Moshi leaks uniformly, and that Linkability rises sharply within the first few turns. We propose two streaming anonymization setups using Stream-Voice-Anon: a waveform-level front-end (Anon-W2W) and a feature-domain replacement (Anon-W2F). Anon-W2F raises EER by over 3.5x relative to the discrete encoder baseline (11.2% to 41.0%), approaching the 50% random-chance ceiling, while Anon-W2W retains 78-93% of baseline sBERT across setups with sub-second response latency (FRL under 0.8 s).
Updated: 2026-03-09 10:01:24
Categories: eess.AS,cs.AI,eess.SP
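The equal error rate (EER) quoted in the abstract above is the operating point at which the false-acceptance and false-rejection rates of a speaker verifier coincide; an EER near 50% means the attacker does no better than chance. The sketch below is a minimal illustration of that metric on raw score lists, not the VoicePrivacy 2024 evaluation code. The function name and the threshold sweep are assumptions made for the example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER from similarity scores.

    target_scores: scores of same-speaker trials (should be high).
    nontarget_scores: scores of different-speaker trials (should be low).
    Hypothetical helper for illustration only.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Relative EER increase reported in the abstract (11.2% -> 41.0%).
relative_increase = 41.0 / 11.2
```

With perfectly separated scores the function returns 0, and with identical score distributions it returns 0.5. The ratio 41.0 / 11.2 is about 3.66, consistent with the "over 3.5x" figure in the abstract.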
Embedding Ontologies via Incorporating Extensional and Intensional Knowledge
Ontologies contain rich domain knowledge, which can be divided into two categories: extensional knowledge and intensional knowledge. Extensional knowledge provides information about the concrete instances that belong to specific concepts in the ontology, while intensional knowledge details inherent properties, characteristics, and semantic associations among concepts. However, existing ontology embedding approaches fail to account for extensional and intensional knowledge simultaneously and in fine detail. In this paper, we propose a novel ontology embedding approach named EIKE (Extensional and Intensional Knowledge Embedding), which represents ontologies in two spaces, called the extensional space and the intensional space. EIKE presents a unified framework for embedding instances, concepts, and their relations in an ontology, applying a geometry-based method to model extensional knowledge and a pretrained language model to model intensional knowledge, which together capture both structural and textual information. Experimental results show that EIKE significantly outperforms state-of-the-art methods on three datasets for both triple classification and link prediction, indicating that EIKE provides a more comprehensive and representative perspective of the domain.
Updated: 2026-03-09 10:01:02
Categories: cs.AI,cs.CL
Is continuous CoT better suited for multi-lingual reasoning?
We investigate whether performing reasoning in a continuous latent space leads to more robust multilingual capabilities. We compare Continuous Chain-of-Thought (using the CODI framework) against standard supervised fine-tuning across five typologically diverse languages: English, Chinese, German, French, and Urdu. Our experiments on GSM8k and CommonsenseQA demonstrate that continuous reasoning significantly outperforms explicit reasoning on low-resource languages, particularly in zero-shot settings where the target language was not seen during training. Additionally, this approach achieves extreme efficiency, compressing reasoning traces by approximately $29\times$ to $50\times$. These findings indicate that continuous latent representations naturally exhibit greater language invariance, offering a scalable solution for cross-lingual reasoning.
Updated: 2026-03-09 09:57:08
Categories: cs.CL,cs.AI,cs.LG
Double projection for reconstructing dynamical systems: between stochastic and deterministic regimes
Learning stochastic models of dynamical systems from observed data is of interest in many scientific fields. Here, we propose a new method for this task within the family of dynamical variational autoencoders. The proposed double projection method estimates both the system state trajectories and the noise time series from data. This approach naturally allows us to perform multi-step system evolution and to learn models with a comparatively low-dimensional state space. We evaluate the performance of the method on six benchmark problems, including both simulated and experimental data. We further illustrate the effects of the teacher forcing interval of the multi-step scheme on the nature of the internal dynamics and compare the resulting behavior to that of deterministic models of equivalent architecture.
Updated: 2026-03-09 09:55:10
Categories: cs.LG,q-bio.QM
Evolution Strategy-Based Calibration for Low-Bit Quantization of Speech Models
Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In particular, we show that audio activations can exhibit large calibration ranges, leading to significant information loss when standard calibration techniques are applied. To address this, we propose ESC, an Evolution Strategy-based Calibration method that formulates activation scaling as an optimization problem and solves it using a two-step local-global scheme driven by an evolution strategy. ESC enables unaltered performance under full INT8 quantization and is the first calibration method to achieve near-lossless performance for full INT4 quantization across multiple speech tasks. Integrating ESC with PTQ methods further reduces performance loss, achieving a 1% relative accuracy degradation on the AST model.
Updated: 2026-03-09 09:53:05
Categories: cs.SD,cs.AI
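The ESC abstract above frames activation scaling as an optimization problem solved with an evolution strategy. The sketch below is a heavily simplified illustration of that idea: a (1+λ) evolution strategy searches for a single symmetric quantization scale that minimizes mean squared quantization error, starting from the max-abs scale. It is not the paper's two-step local-global scheme; all function names, the objective, and the hyperparameters are assumptions chosen for the example.

```python
import math
import random


def quantize(x, scale, bits):
    """Symmetric uniform quantization of one value, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale


def quant_mse(acts, scale, bits):
    """Mean squared error introduced by quantizing the activations."""
    return sum((a - quantize(a, scale, bits)) ** 2 for a in acts) / len(acts)


def es_calibrate(acts, bits=4, generations=50, offspring=8, sigma=0.2, seed=0):
    """(1+lambda) evolution strategy over a single scale (illustrative only)."""
    rng = random.Random(seed)
    qmax = 2 ** (bits - 1) - 1
    # Start from max-abs calibration, which large outliers can inflate.
    parent = max(abs(a) for a in acts) / qmax
    parent_err = quant_mse(acts, parent, bits)
    for _ in range(generations):
        # Log-normal mutation keeps every candidate scale positive.
        children = [parent * math.exp(sigma * rng.gauss(0.0, 1.0))
                    for _ in range(offspring)]
        child = min(children, key=lambda s: quant_mse(acts, s, bits))
        child_err = quant_mse(acts, child, bits)
        # Elitist selection: the parent is replaced only by a better child.
        if child_err < parent_err:
            parent, parent_err = child, child_err
    return parent, parent_err
```

Because selection is elitist, the returned error is never worse than the max-abs starting point. On activations with a few large outliers, the search typically settles on a smaller scale that clips the outliers but resolves the bulk of the distribution more finely.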
Evidence-Driven Reasoning for Industrial Maintenance Using Heterogeneous Data
Industrial maintenance platforms contain rich but fragmented evidence, including free-text work orders, heterogeneous operational sensors or indicators, and structured failure knowledge. These sources are often analyzed in isolation, producing alerts or forecasts that do not support conditional decision-making: given this asset history and behavior, what is happening and what action is warranted? We present Condition Insight Agent, a deployed decision-support framework that integrates maintenance language, behavioral abstractions of operational data, and engineering failure semantics to produce evidence-grounded explanations and advisory actions. The system constrains reasoning through deterministic evidence construction and structured failure knowledge, and applies a rule-based verification loop to suppress unsupported conclusions. Case studies from production CMMS deployments show that this verification-first design operates reliably under heterogeneous and incomplete data while preserving human oversight. Our results demonstrate how constrained LLM-based reasoning can function as a governed decision-support layer for industrial maintenance.
Updated: 2026-03-09 09:51:50
Categories: cs.AI
An explainable hybrid deep learning-enabled intelligent fault detection and diagnosis approach for automotive software systems validation
Advancements in data-driven machine learning have emerged as a pivotal element in supporting automotive software systems (ASSs) engineering across various levels of the V-development process. During system verification and validation, the integration of an intelligent fault detection and diagnosis (FDD) model with the test recordings analysis process serves as a powerful tool for efficiently ensuring functional safety. However, the lack of interpretability of the black-box FDD models developed not only hinders understanding of the cause underlying the prediction, but also prevents the model from being adapted based on the prediction result. This, in turn, increases the computational cost required for developing a complex FDD model and limits confidence in real-time safety-critical applications. To address this challenge, a novel explainable method for fault detection, identification, and localization is proposed in this article with the aim of providing a clear understanding of the logic behind the prediction outcome. To this end, a hybrid 1dCNN-GRU-based intelligent model was developed to analyze the recordings from the real-time validation process of ASSs. The employment of explainable AI techniques, i.e., IGs, DeepLIFT, Gradient SHAP, and DeepLIFT SHAP, was instrumental in enabling model adaptation and facilitating the root cause analysis (RCA). The proposed approach is applied to the real-time dataset collected during a virtual test drive performed by the user on a hardware-in-the-loop system.
Updated: 2026-03-09 09:46:28
Categories: cs.SE,cs.AI
Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet
Recently, there has been increased interest in globally distributed training, which has the promise to both reduce training costs and democratize participation in building large-scale foundation models. However, existing models trained in a globally distributed manner are relatively small in scale and have only been trained with whitelisted participants. Therefore, they do not yet realize the full promise of democratized participation. In this report, we describe Covenant-72B, an LLM produced by the largest collaborative globally distributed pre-training run (in terms of both compute and model scale), which simultaneously allowed open, permissionless participation supported by a live blockchain protocol. We utilized a state-of-the-art communication-efficient optimizer, SparseLoCo, supporting dynamic participation with peers joining and leaving freely. Our model, pre-trained on approximately 1.1T tokens, performs competitively with fully centralized models pre-trained on similar or higher compute budgets, demonstrating that fully democratized, non-whitelisted participation is not only feasible, but can be achieved at unprecedented scale for a globally distributed pre-training run.
Updated: 2026-03-09 09:44:13
Categories: cs.DC,cs.LG
A Unified Framework for Zero-Shot Reinforcement Learning
Zero-shot reinforcement learning (RL) has emerged as a setting for developing general agents, capable of solving downstream tasks without additional training or planning at test-time. While conventional RL optimizes policies for fixed rewards, zero-shot RL requires learning representations that enable immediate adaptation to arbitrary reward functions. As the field matures, the growing diversity of approaches demands a foundational framework reconciling different perspectives under a common unifying structure. In this work, we introduce a formal, unified framework for zero-shot RL, allowing for rigorous comparisons across methods. We propose a taxonomy organizing the algorithmic landscape along two levels: representation, distinguishing between compositional and direct methods based on their exploitation of action-value function decompositions; and learning paradigm, differentiating between reward-free and pseudo reward-free training. Additionally, we propose a unified view of existing error bounds, decomposing the total error into three primary contributing components: inference, reward, and approximation, serving as a foundation for more grounded comparisons of zero-shot methods.
Updated: 2026-03-09 09:43:06
Categories: cs.LG
Learning Hierarchical Knowledge in Text-Rich Networks with Taxonomy-Informed Representation Learning
Hierarchical knowledge structures are ubiquitous across real-world domains and play a vital role in organizing information from coarse to fine semantic levels. While such structures have been widely used in taxonomy systems, biomedical ontologies, and retrieval-augmented generation, their potential remains underexplored in the context of Text-Rich Networks (TRNs), where each node contains rich textual content and edges encode semantic relationships. Existing methods for learning on TRNs often focus on flat semantic modeling, overlooking the inherent hierarchical semantics embedded in textual documents. To this end, we propose TIER (Hierarchical Taxonomy-Informed Representation Learning on Text-Rich Networks), which first constructs an implicit hierarchical taxonomy and then integrates it into the learned node representations. Specifically, TIER employs similarity-guided contrastive learning to build a clustering-friendly embedding space, upon which it performs hierarchical K-Means followed by LLM-powered clustering refinement to enable semantically coherent taxonomy construction. Leveraging the resulting taxonomy, TIER introduces a cophenetic correlation coefficient-based regularization loss to align the learned embeddings with the hierarchical structure. By learning representations that respect both fine-grained and coarse-grained semantics, TIER enables more interpretable and structured modeling of real-world TRNs. We demonstrate that our approach significantly outperforms existing methods on multiple datasets across diverse domains, highlighting the importance of hierarchical knowledge learning for TRNs.
Updated: 2026-03-09 09:40:18
Categories: cs.LG
Outlier-robust Autocovariance Least Square Estimation via Iteratively Reweighted Least Square
The autocovariance least squares (ALS) method is a computationally efficient approach for estimating noise covariances in Kalman filters without requiring specific noise models. However, conventional ALS and its variants rely on the classic least mean squares (LMS) criterion, making them highly sensitive to measurement outliers and prone to severe performance degradation. To overcome this limitation, this paper proposes a novel outlier-robust ALS algorithm, termed ALS-IRLS, based on the iteratively reweighted least squares (IRLS) framework. Specifically, the proposed approach introduces a two-tier robustification strategy. First, an innovation-level adaptive thresholding mechanism is employed to filter out heavily contaminated data. Second, the outlier-contaminated autocovariance is formulated using an $ε$-contamination model, where the standard LMS criterion is replaced by the Huber cost function. The IRLS method is then utilized to iteratively adjust data weights based on estimation deviations, effectively mitigating the influence of residual outliers. Comparative simulations demonstrate that ALS-IRLS reduces the root-mean-square error (RMSE) of noise covariance estimates by over two orders of magnitude compared to standard ALS. Furthermore, it significantly enhances downstream state estimation accuracy, outperforming existing outlier-robust Kalman filters and achieving performance nearly equivalent to the ideal Oracle lower bound in the presence of noisy and anomalous data.
Updated: 2026-03-09 09:40:11
Categories: math.OC,cs.LG,eess.SP
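The ALS-IRLS abstract above replaces the least-squares criterion with the Huber cost and uses iteratively reweighted least squares (IRLS) to down-weight outliers. The sketch below shows that mechanism on the simplest possible problem, estimating a scalar location from contaminated samples. It is not the paper's autocovariance formulation; the function names and the threshold `delta` are assumptions for illustration.

```python
def huber_weight(residual, delta):
    """IRLS weight implied by the Huber cost: 1 inside delta, delta/|r| outside."""
    r = abs(residual)
    return 1.0 if r <= delta else delta / r


def irls_location(samples, delta=1.0, iters=50, tol=1e-9):
    """Robust location estimate via IRLS (illustrative scalar problem)."""
    est = sum(samples) / len(samples)  # ordinary least-squares start
    for _ in range(iters):
        # Re-weight every sample by its current residual.
        weights = [huber_weight(x - est, delta) for x in samples]
        new = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
        if abs(new - est) < tol:
            return new
        est = new
    return est
```

For the samples `[1.0, 1.1, 0.9, 1.05, 0.95, 100.0]` the plain mean is about 17.5, whereas the IRLS estimate converges to roughly 1.2: the single outlier contributes only a bounded pull of `delta` instead of its full residual.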
Are We Winning the Wrong Game? Revisiting Evaluation Practices for Long-Term Time Series Forecasting
Long-term time series forecasting (LTSF) is widely recognized as a central challenge in data mining and machine learning. LTSF has increasingly evolved into a benchmark-driven "GAME," where models are ranked, compared, and declared state-of-the-art based primarily on marginal reductions in aggregated pointwise error metrics such as MSE and MAE. Across a small set of canonical datasets and fixed forecasting horizons, progress is communicated through leaderboard-style tables in which lower numerical scores define success. In this GAME, what is measured becomes what is optimized, and incremental error reduction becomes the dominant currency of advancement. We argue that this metric-centric regime is not merely incomplete, but structurally misaligned with the broader objectives of forecasting. In real-world settings, forecasting often prioritizes preserving temporal structure, trend stability, seasonal coherence, robustness to regime shifts, and supporting downstream decision processes. Optimizing aggregate pointwise error does not necessarily imply modeling these structural properties. As a result, leaderboard improvement may increasingly reflect specialization in benchmark configurations rather than a deeper understanding of temporal dynamics. This paper revisits LTSF evaluation as a foundational question in data science: what does it mean to measure forecasting progress? We propose a multi-dimensional evaluation perspective that integrates statistical fidelity, structural coherence, and decision-level relevance. By challenging the current metric monoculture, we aim to redirect attention from winning benchmark tables toward advancing meaningful, context-aware forecasting.
Updated: 2026-03-09 09:37:46
Categories: cs.LG,stat.ML
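The claim above that low pointwise error need not reflect structural fidelity can be illustrated with a toy case. In the sketch below, a flat forecast that ignores seasonality entirely scores a lower MSE than a forecast with the correct period and amplitude but a quarter-period lag. The series and both forecasts are made up for the example and do not come from the paper.

```python
import math


def mse(truth, forecast):
    """Mean squared error between two equal-length series."""
    return sum((t - f) ** 2 for t, f in zip(truth, forecast)) / len(truth)


def mae(truth, forecast):
    """Mean absolute error between two equal-length series."""
    return sum(abs(t - f) for t, f in zip(truth, forecast)) / len(truth)


period, steps = 8, 64
truth = [math.sin(2 * math.pi * k / period) for k in range(steps)]
# Forecast A: flat line at the series mean, with no seasonal structure at all.
flat = [0.0] * steps
# Forecast B: correct period and amplitude, but lagging by a quarter period.
lagged = [math.sin(2 * math.pi * (k - 2) / period) for k in range(steps)]

flat_mse, lagged_mse = mse(truth, flat), mse(truth, lagged)
```

Here `flat_mse` is about 0.5 while `lagged_mse` is about 1.0, so a leaderboard ranked by MSE prefers the structureless forecast.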
C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on fixed or heuristic dynamic guidance weights is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process. This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce Control Classifier-Free Guidance (C$^2$FG), a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C$^2$FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.
Updated: 2026-03-09 09:37:17
Categories: cs.LG
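To make the idea of time-dependent guidance in the C$^2$FG abstract above concrete, the sketch below combines the standard classifier-free guidance rule, `eps = eps_uncond + w * (eps_cond - eps_uncond)`, with a guidance weight that decays exponentially over sampling progress. The abstract does not give the actual control function, so the schedule, its direction, and the parameters `w_max`, `w_min`, and `rate` are assumptions for illustration only.

```python
import math


def decayed_guidance_weight(progress, w_max=7.5, w_min=1.0, rate=3.0):
    """Hypothetical schedule: exponential decay from w_max toward w_min.

    progress is a normalized position in [0, 1] along the sampling trajectory.
    The functional form is an assumption, not the paper's control function.
    """
    return w_min + (w_max - w_min) * math.exp(-rate * progress)


def cfg_combine(eps_uncond, eps_cond, weight):
    """Standard classifier-free guidance applied elementwise."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Guidance weights at a few points along the trajectory.
schedule = [decayed_guidance_weight(s / 4) for s in range(5)]
```

A fixed-weight strategy corresponds to `rate=0`, which keeps the weight at `w_max` throughout. The decayed variant applies strong guidance at one end of the trajectory and relaxes toward `w_min` at the other.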
CauKer: Classification Time Series Foundation Models Can Be Pretrained on Synthetic Data
Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pre-training on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pre-training of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pre-training of state-of-the-art classification TSFMs having different architectures and following different pre-training approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior. The source code is publicly available at https://github.com/ShifengXIE/CauKer.
Updated: 2026-03-09 09:33:29
Categories: cs.LG,cs.AI
iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding
Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
Updated: 2026-03-09 09:29:46
Categories: cs.CV,cs.AI
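The affine feature modulation via Adaptive Layer Normalization (AdaLN) mentioned in the iGVLM abstract above can be sketched in a few lines. The version below uses one common formulation, in which a normalized feature vector is rescaled by `(1 + scale)` and offset by `shift`, with both vectors derived from a conditioning signal such as an instruction embedding. How iGVLM computes these vectors is not stated in the abstract, so they are passed in directly here, and the function names are assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(x, scale, shift):
    """Affine modulation of normalized features (one common AdaLN form).

    scale and shift would normally be predicted from the instruction;
    here they are plain inputs for illustration.
    """
    normed = layer_norm(x)
    return [(1.0 + g) * n + b for n, g, b in zip(normed, scale, shift)]
```

With `scale` and `shift` set to zero the modulation reduces to plain layer normalization, so the frozen features pass through unchanged. That identity case is one reason this parameterization is a convenient way to attach conditioning to a pretrained encoder.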
Gradually Excavating External Knowledge for Implicit Complex Question Answering
Recently, large language models (LLMs) have gained much attention for the emergence of human-comparable capabilities and huge potential. However, for open-domain implicit question-answering problems, LLMs may not be the ultimate solution for two reasons: 1) uncovered or out-of-date domain knowledge, and 2) one-shot generation and hence restricted comprehensiveness. To this end, this work proposes a gradual knowledge excavation framework for open-domain complex question answering, where LLMs iteratively and actively acquire external information, and then reason based on acquired historical knowledge. Specifically, during each step of the solving process, the model selects an action to execute, such as querying external knowledge or performing a single logical reasoning step, to gradually progress toward a final answer. Our method can effectively leverage plug-and-play external knowledge and dynamically adjust the strategy for solving complex questions. Evaluated on the StrategyQA dataset, our method achieves 78.17% accuracy with less than 6% of the parameters of its competitors, setting a new SOTA for ~10B-scale LLMs.
Updated: 2026-03-09 09:28:42
Categories: cs.CL,cs.AI
ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer, eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
Updated: 2026-03-09 09:27:09
Categories: cs.CV,cs.AI
Training event-based neural networks with exact gradients via Differentiable ODE Solving in JAX
Existing frameworks for gradient-based training of spiking neural networks face a trade-off: discrete-time methods using surrogate gradients support arbitrary neuron models but introduce gradient bias and constrain spike-time resolution, while continuous-time methods that compute exact gradients require analytical expressions for spike times and state evolution, restricting them to simple neuron types such as leaky integrate-and-fire (LIF). We introduce the Eventax framework, which resolves this trade-off by combining differentiable numerical ODE solvers with event-based spike handling. Built in JAX, our framework uses Diffrax ODE solvers to compute gradients that are exact with respect to the forward simulation for any neuron model defined by ODEs. It also provides a simple API where users can specify just the neuron dynamics, spike conditions, and reset rules. Eventax prioritises modelling flexibility, supporting a wide range of neuron models, loss functions, and network architectures, which can be easily extended. We demonstrate Eventax on multiple benchmarks, including Yin-Yang and MNIST, using diverse neuron models such as leaky integrate-and-fire (LIF), quadratic integrate-and-fire (QIF), exponential integrate-and-fire (EIF), Izhikevich, and Event-based Gated Recurrent Unit (EGRU), with both time-to-first-spike and state-based loss functions, demonstrating its utility for prototyping and testing event-based architectures trained with exact gradients. We also demonstrate the application of this framework to more complex neuron types by implementing a multi-compartment neuron that uses a model of dendritic spikes in human layer 2/3 cortical pyramidal neurons for computation. Code available at https://github.com/efficient-scalable-machine-learning/eventax.
Updated: 2026-03-09 09:25:52
Categories: cs.LG
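The event-based forward pass described in the Eventax abstract (integrate the neuron ODE numerically, locate the threshold crossing, apply the reset) can be illustrated with a minimal sketch. This is not the Eventax or Diffrax API: the function names, the fixed-step RK4 integrator, the bisection root finder, and all parameter values are assumptions chosen for illustration, and the sketch covers only the forward simulation, not gradient computation.

```python
import math


def lif_rhs(v, current, tau):
    """Leaky integrate-and-fire dynamics: dv/dt = (-v + I) / tau."""
    return (-v + current) / tau


def rk4_step(v, dt, current, tau):
    """One classical Runge-Kutta step of size dt for the membrane potential."""
    k1 = lif_rhs(v, current, tau)
    k2 = lif_rhs(v + 0.5 * dt * k1, current, tau)
    k3 = lif_rhs(v + 0.5 * dt * k2, current, tau)
    k4 = lif_rhs(v + dt * k3, current, tau)
    return v + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def simulate_lif(current=1.5, tau=10.0, v_th=1.0, v_reset=0.0,
                 t_end=50.0, dt=0.5, tol=1e-10):
    """Integrate numerically and handle spikes as events.

    When a step would cross the threshold, bisection on the step size
    locates the crossing time, then the reset rule is applied.
    """
    t, v, spikes = 0.0, v_reset, []
    while t < t_end:
        step = min(dt, t_end - t)
        v_next = rk4_step(v, step, current, tau)
        if v_next >= v_th:
            lo, hi = 0.0, step
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if rk4_step(v, mid, current, tau) >= v_th:
                    hi = mid
                else:
                    lo = mid
            t += hi                # spike time, resolved below the grid step
            spikes.append(t)
            v = v_reset            # reset rule
        else:
            t += step
            v = v_next
    return spikes


if __name__ == "__main__":
    print(simulate_lif())
```

For a constant input current above threshold, the LIF model has a closed-form first spike time, tau * ln(I / (I - v_th)), which gives a simple check that the numerically located event is not tied to the step grid. Models without such closed forms (QIF with inputs, Izhikevich, multi-compartment neurons) are the case where a numerical solver with event handling becomes necessary.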
Autoassociative Learning of Structural Representations for Modeling and Classification in Medical Imaging
Deep learning architectures based on convolutional neural networks tend to rely on continuous, smooth features. While this characteristic provides significant robustness and proves useful in many real-world tasks, it is strikingly incompatible with the physical character of the world, which, at the scale at which humans operate, comprises crisp objects, typically representing well-defined categories. This study proposes a class of neurosymbolic systems that learn by reconstructing images in terms of visual primitives and are thus forced to form high-level, structural explanations of them. When applied to the task of diagnosing abnormalities in histological imaging, the method proved superior to a conventional deep learning architecture in terms of classification accuracy, while being more transparent.
Updated: 2026-03-09 09:22:57
Categories: cs.CV,cs.LG
An Embedding-based Approach to Inconsistency-tolerant Reasoning with Inconsistent Ontologies
Inconsistency handling is an important issue in knowledge management. Especially in ontology engineering, logical inconsistencies may occur during ontology construction. A natural way to reason with an inconsistent ontology is to utilize the maximal consistent subsets of the ontology. However, previous studies on selecting maximum consistent subsets have rarely considered the semantics of the axioms, which may result in irrational inference. In this paper, we propose a novel approach to reasoning with inconsistent ontologies in description logics based on the embeddings of axioms. We first give a method for turning axioms into distributed semantic vectors to compute the semantic connections between the axioms. We then define an embedding-based method for selecting the maximum consistent subsets and use it to define an inconsistency-tolerant inference relation. We show the rationality of our inference relation by considering some logical properties. Finally, we conduct experiments on several ontologies to evaluate the reasoning power of our inference relation. The experimental results show that our embedding-based method can outperform existing inconsistency-tolerant reasoning methods based on maximal consistent subsets.
Updated: 2026-03-09 09:22:30
Categories: cs.AI
DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding
Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
Updated: 2026-03-09 09:21:29
Categories: cs.LG,cs.AI
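The reranking rule in the DARC abstract can be sketched under one common reading of "KL-robust (entropic) satisfaction": for a candidate with preference samples r_1..r_n, the entropic value is -(1/beta) * log(mean(exp(-beta * r_i))), which equals the worst-case expected score under a KL-penalized shift of the annotator distribution. The function names, the `beta` and `penalty` parameters, and the toy scores below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def entropic_value(scores, beta):
    """Entropic (KL-robust) value per candidate.

    scores: array of shape (n_candidates, n_samples), one row of preference
    samples per candidate. Uses a shifted log-mean-exp for numerical stability.
    """
    scores = np.asarray(scores, dtype=float)
    z = -beta * scores
    shift = z.max(axis=-1, keepdims=True)
    log_mean_exp = shift.squeeze(-1) + np.log(np.mean(np.exp(z - shift), axis=-1))
    return -log_mean_exp / beta


def rerank(scores, beta=2.0, penalty=0.0):
    """Pick the candidate maximizing the entropic value.

    The optional penalty subtracts a multiple of the entropic risk premium
    (mean minus entropic value), one possible form of the "penalize" control.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=-1)
    robust = entropic_value(scores, beta)
    premium = mean - robust          # non-negative by Jensen's inequality
    objective = robust - penalty * premium
    return int(np.argmax(objective)), mean, robust, premium


if __name__ == "__main__":
    # Candidate 0 has the higher mean but one strongly dissatisfied annotator;
    # candidate 1 is uniformly acceptable.
    samples = [[1.0, 1.0, 1.0, -0.6],
               [0.5, 0.5, 0.5, 0.5]]
    best, mean, robust, premium = rerank(samples)
    print(best, mean, robust, premium)
```

In this toy case, mean-reward selection and entropic selection disagree: the candidate that splits annotators carries a large risk premium, while the uniformly acceptable candidate has none. Larger `beta` weights the dissatisfied tail more heavily; as `beta` approaches zero the entropic value approaches the plain mean.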
Mitigating Homophily Disparity in Graph Anomaly Detection: A Scalable and Adaptive Approach
Graph anomaly detection (GAD) aims to identify nodes that deviate from normal patterns in structure or features. While recent GNN-based approaches have advanced this task, they struggle with two major challenges: 1) homophily disparity, where nodes exhibit varying homophily at both class and node levels; and 2) limited scalability, as many methods rely on costly whole-graph operations. To address them, we propose SAGAD, a Scalable and Adaptive framework for GAD. SAGAD precomputes multi-hop embeddings and applies reparameterized Chebyshev filters to extract low- and high-frequency information, enabling efficient training and capturing both homophilic and heterophilic patterns. To mitigate node-level homophily disparity, we introduce an Anomaly Context-Aware Adaptive Fusion, which adaptively fuses low- and high-pass embeddings using fusion coefficients conditioned on Rayleigh Quotient-guided anomalous subgraph structures for each node. To alleviate class-level disparity, we design a Frequency Preference Guidance Loss, which encourages anomalies to preserve more high-frequency information than normal nodes. SAGAD supports mini-batch training, achieves linear time and space complexity, and drastically reduces memory usage on large-scale graphs. Theoretically, SAGAD ensures asymptotic linear separability between normal and abnormal nodes under mild conditions. Extensive experiments on 10 benchmarks confirm SAGAD's superior accuracy and scalability over state-of-the-art methods.
Updated: 2026-03-09 09:15:47
Categories: cs.LG
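The precomputation step in the SAGAD abstract (multi-hop embeddings filtered with Chebyshev polynomials to separate low- and high-frequency information) can be sketched as follows. The abstract does not specify the reparameterization, so this sketch uses the standard Chebyshev recurrence on the scaled normalized Laplacian and two fixed first-order filters, 1 - lambda/2 (low-pass) and lambda/2 (high-pass); all function names and the toy graph are assumptions.

```python
import numpy as np


def scaled_laplacian(adj):
    """Return L - I, where L = I - D^{-1/2} A D^{-1/2}.

    Assumes the largest eigenvalue of L is at most 2, so the result has
    eigenvalues in [-1, 1], the domain of the Chebyshev polynomials.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    safe = np.where(deg > 0, deg, 1.0)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(safe), 0.0)
    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    laplacian = np.eye(len(adj)) - norm_adj
    return laplacian - np.eye(len(adj))


def chebyshev_basis(l_tilde, features, order):
    """Precompute T_0(L~)X, ..., T_order(L~)X with the three-term recurrence."""
    basis = [features, l_tilde @ features]
    for _ in range(2, order + 1):
        basis.append(2.0 * l_tilde @ basis[-1] - basis[-2])
    return basis[: order + 1]


def low_high_pass(basis):
    """First-order filters written in the Chebyshev basis.

    low  = (I - L/2) X = 0.5 * (T_0 - T_1) X
    high = (L/2) X     = 0.5 * (T_0 + T_1) X
    """
    low = 0.5 * (basis[0] - basis[1])
    high = 0.5 * (basis[0] + basis[1])
    return low, high


if __name__ == "__main__":
    # Path graph on four nodes with a single feature column.
    adj = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
    features = np.array([[1.0], [0.0], [1.0], [0.0]])
    basis = chebyshev_basis(scaled_laplacian(adj), features, order=3)
    print(low_high_pass(basis))
```

Because the basis depends only on the graph and the input features, it can be computed once and stored per node, after which training needs only node-wise operations on the stored rows. That is one plausible way to obtain the mini-batch training the abstract mentions.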
Explainable Condition Monitoring via Probabilistic Anomaly Detection Applied to Helicopter Transmissions
We present a novel explainable methodology for condition monitoring that relies on healthy data only. Since faults are rare events, we propose to learn the probability distribution of healthy observations alone and to detect anomalies at runtime. This objective is achieved by defining probabilistic measures of deviation from nominality, which allow faults to be detected and anticipated. The Bayesian perspective underpinning our approach allows us to perform uncertainty quantification to inform decisions. At the same time, we provide descriptive tools to enhance the interpretability of the results, supporting the deployment of the proposed strategy in safety-critical applications as well. The methodology is validated experimentally on two use cases: a publicly available benchmark for predictive maintenance, and a real-world helicopter transmission dataset collected over multiple years. In both applications, the method achieves detection performance competitive with state-of-the-art anomaly detection methods.
Updated: 2026-03-09 09:09:01
Categories: cs.LG,stat.ML
TRIAGE: Type-Routed Interventions via Aleatoric-Epistemic Gated Estimation in Robotic Manipulation and Adaptive Perception -- Don't Treat All Uncertainty the Same
Most uncertainty-aware robotic systems collapse prediction uncertainty into a single scalar score and use it to trigger uniform corrective responses. This aggregation obscures whether uncertainty arises from corrupted observations or from a mismatch between the learned model and the true system dynamics. As a result, corrective actions may be applied to the wrong component of the closed loop, degrading performance relative to leaving the policy unchanged. We introduce a lightweight post hoc framework that decomposes uncertainty into aleatoric and epistemic components and uses these signals to regulate system responses at inference time. Aleatoric uncertainty is estimated from deviations in the observation distribution using a Mahalanobis density model, while epistemic uncertainty is detected using a noise-robust forward dynamics ensemble that isolates model mismatch from measurement corruption. The two signals remain empirically near-orthogonal during closed-loop execution and enable type-specific responses. High aleatoric uncertainty triggers observation recovery, while high epistemic uncertainty moderates control actions. The same signals also regulate adaptive perception by guiding model capacity selection during tracking inference. Experiments demonstrate consistent improvements across both control and perception tasks. In robotic manipulation, the decomposed controller improves task success from 59.4% to 80.4% under compound perturbations and outperforms a combined uncertainty baseline by up to 21.0%. In adaptive tracking inference on MOT17, uncertainty-guided model selection reduces average compute by 58.2% relative to a fixed high-capacity detector while preserving detection quality within 0.4%. Code and demo videos are available at https://divake.github.io/uncertainty-decomposition/.
Updated: 2026-03-09 09:07:43
Categories: cs.RO,cs.LG
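The type-routed response in the TRIAGE abstract can be sketched with two separate scores: a Mahalanobis distance of the current observation from nominal data (aleatoric signal) and the disagreement of a forward-dynamics ensemble (epistemic signal). The function names, the thresholds, the action labels, and the use of prediction variance as the disagreement measure are all assumptions for illustration; the abstract does not give these details.

```python
import numpy as np


def fit_nominal(observations, ridge=1e-6):
    """Fit a Gaussian to nominal observations; return mean and inverse covariance."""
    obs = np.asarray(observations, dtype=float)
    mean = obs.mean(axis=0)
    cov = np.cov(obs, rowvar=False) + ridge * np.eye(obs.shape[1])
    return mean, np.linalg.inv(cov)


def aleatoric_score(observation, mean, cov_inv):
    """Mahalanobis distance of one observation from the nominal distribution."""
    diff = np.asarray(observation, dtype=float) - mean
    return float(np.sqrt(diff @ cov_inv @ diff))


def epistemic_score(ensemble_predictions):
    """Mean variance across ensemble members' next-state predictions.

    ensemble_predictions has shape (n_members, state_dim).
    """
    preds = np.asarray(ensemble_predictions, dtype=float)
    return float(preds.var(axis=0).mean())


def route(aleatoric, epistemic, a_thresh=3.0, e_thresh=0.05):
    """Map each uncertainty type to its own response (hypothetical labels)."""
    actions = []
    if aleatoric > a_thresh:
        actions.append("recover_observation")
    if epistemic > e_thresh:
        actions.append("moderate_control")
    return actions or ["nominal"]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mean, cov_inv = fit_nominal(rng.normal(size=(500, 3)))
    corrupted = mean + 10.0
    agreeing_ensemble = np.zeros((5, 3))
    print(route(aleatoric_score(corrupted, mean, cov_inv),
                epistemic_score(agreeing_ensemble)))
```

Keeping the two scores separate is what allows a corrupted observation to trigger recovery without also damping the controller, and a model mismatch to damp the controller without re-acquiring an observation that was already clean.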
Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
Coordinated audio generation based on video inputs typically requires strict audio-visual (AV) alignment, where both the semantics and the rhythm of the generated audio segments correspond to those of the video frames. Previous studies use a two-stage design in which the AV encoders are first aligned via contrastive learning, and the encoded video representations then guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in aligning overall AV semantics but limit temporal rhythmic synchronization. In this work, we propose FoleyFlow, which first aligns unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders, which are separately pretrained using only unimodal data, are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow uses temporally varying video features as the dynamic condition to guide the generation of the corresponding audio segments. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use these representations of video segments to guide audio generation temporally. Our audio results are evaluated on the standard benchmarks and largely surpass existing results under several metrics. The superior performance indicates that FoleyFlow is effective in generating coordinated audio that is both semantically and rhythmically coherent with various video sequences.
Updated: 2026-03-09 09:06:25
Categories: cs.CV,cs.AI,cs.LG,cs.SD,eess.AS
SaiVLA-0: Cerebrum--Pons--Cerebellum Tripartite Architecture for Compute-Aware Vision-Language-Action
We revisit Vision-Language-Action through a neuroscience-inspired triad. Biologically, the Cerebrum provides stable high-level multimodal priors and remains frozen; the Pons Adapter integrates these cortical features with real-time proprioceptive inputs and compiles intent into execution-ready tokens; and the Cerebellum (ParaCAT) performs fast, parallel categorical decoding for online control, with hysteresis/EMA/temperature/entropy for stability. A fixed-ratio schedule and two-stage feature caching make the system compute-aware and reproducible. Inspired by active, foveated vision, our wrist ROIs are geometrically tied to the end-effector via calibrated projection, providing a movement-stabilized, high-resolution view that is sensitive to fine-grained pose changes and complements the global context of the main view. The design is modular: upgrading the Cerebrum only retrains the Pons; changing robots only trains the Cerebellum; cerebellum-only RL can further refine control without touching high-level semantics. As a concept-and-protocol paper with preliminary evidence, we outline a timing protocol under matched conditions (GPU, resolution, batch) to verify anticipated efficiency gains. We also report preliminary LIBERO evidence showing that split feature caching reduces training time (7.5h to 4.5h) and improves average success (86.5% to 92.5%) under official N1.5 head-only training, and that SaiVLA-0 reaches 99.0% mean success.
Updated: 2026-03-09 09:03:25
Categories: cs.RO,cs.AI,cs.LG
Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting
Model-based offline reinforcement learning (RL) aims to enhance offline RL with a dynamics model that facilitates policy exploration. However, \textit{model exploitation} could occur due to inevitable model errors, degrading algorithm performance. Adversarial model learning offers a theoretical framework to mitigate model exploitation by solving a maximin formulation. Within such a paradigm, RAMBO~\citep{rigter2022rambo} has emerged as a representative and most popular method that provides a practical implementation with model gradient. However, we empirically reveal that severe Q-value underestimation and gradient explosion can occur in RAMBO with only slight hyperparameter tuning, suggesting that it tends to be overly conservative and suffers from unstable model updates. To address these issues, we propose \textbf{RO}bust value-aware \textbf{M}odel learning with \textbf{I}mplicitly differentiable adaptive weighting (ROMI). Instead of updating the dynamics model with model gradient, ROMI introduces a novel robust value-aware model learning approach. This approach requires the dynamics model to predict future states with values close to the minimum Q-value within a scale-adjustable state uncertainty set, enabling controllable conservatism and stable model updates. To further improve out-of-distribution (OOD) generalization during multi-step rollouts, we propose implicitly differentiable adaptive weighting, a bi-level optimization scheme that adaptively achieves dynamics- and value-aware model learning. Empirical results on D4RL and NeoRL datasets show that ROMI significantly outperforms RAMBO and achieves competitive or superior performance compared to other state-of-the-art methods on datasets where RAMBO typically underperforms. Code is available at https://github.com/zq2r/ROMI.git.
Updated: 2026-03-09 08:59:45
Categories: cs.LG
UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Recent advancements in LLM-based information-seeking agents have achieved record-breaking performance on established benchmarks. However, these agents remain heavily reliant on search-engine-indexed knowledge, leaving a critical blind spot: Unindexed Information Seeking (UIS). This paper identifies and explores the UIS problem, where vital information is not captured by search engine crawlers, such as overlooked content, dynamic webpages, and embedded files. Despite its significance, UIS remains an underexplored challenge. To address this gap, we introduce UIS-QA, the first dedicated UIS benchmark, comprising 110 expert-annotated QA pairs. Notably, even state-of-the-art agents experience a drastic performance drop on UIS-QA (e.g., from 70.90 on GAIA and 46.70 on BrowseComp-zh to 24.55 on UIS-QA), underscoring the severity of the problem. To mitigate this, we propose UIS-Digger, a novel multi-agent framework that incorporates dual-mode browsing and enables simultaneous webpage searching and file parsing. With a relatively small $\sim$30B-parameter backbone LLM optimized using SFT and RFT training strategies, UIS-Digger sets a strong baseline at 27.27\%, outperforming systems integrating sophisticated LLMs such as O3 and GPT-4.1. This demonstrates the importance of proactive interaction with unindexed sources for effective and comprehensive information-seeking. Our work not only uncovers a fundamental limitation in current agent evaluation paradigms but also provides the first toolkit for advancing UIS research, defining a new and promising direction for robust information-seeking systems.
Updated: 2026-03-09 08:58:40
Categories: cs.AI,cs.IR
Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which is crucial for enabling AI agents to perceive and interact with the world effectively.
Updated: 2026-03-09 08:55:05
Categories: cs.SD,cs.AI,cs.CL,cs.MM,eess.AS
Tau-BNO: Brain Neural Operator for Tau Transport Model
Mechanistic modeling provides a biophysically grounded framework for studying the spread of pathological tau protein in tauopathies like Alzheimer's disease. Existing approaches typically model tau propagation as a diffusive process on the brain's structural connectome, reproducing macroscopic patterns but neglecting microscale cellular transport and reaction mechanisms. The Network Transport Model (NTM) was introduced to fill this gap, explaining how region-level progression of tau emerges from microscale biophysical processes. However, the NTM faces a common challenge for complex models defined by large systems of partial differential equations: the inability to perform parameter inference and mechanistic discovery due to high computational burden and slow model simulations. To overcome this barrier, we propose Tau-BNO, a Brain Neural Operator surrogate framework for rapidly approximating NTM dynamics that captures both intra-regional reaction kinetics and inter-regional network transport. Tau-BNO combines a function operator that encodes kinetic parameters with a query operator that preserves initial state information, while approximating anisotropic transport through a spectral kernel that retains directionality. Empirical evaluations demonstrate high predictive accuracy ($R^2\approx$ 0.98) across diverse biophysical regimes and an 89\% performance improvement over state-of-the-art sequence models like Transformers and Mamba, which lack inherent structural priors. By reducing simulation time from hours to seconds, we show that the surrogate model is capable of producing new insights and generating new hypotheses. This framework is readily extensible to a broader class of connectome-based biophysical models, showcasing the transformative value of deep learning surrogates to accelerate analysis of large-scale, computationally intensive dynamical systems.
Updated: 2026-03-09 08:52:02
Categories: cs.CE,cs.LG
UnfoldLDM: Deep Unfolding-based Blind Image Restoration with Latent Diffusion Priors
Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) \textbf{Degradation-specific dependency}, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) \textbf{Over-smoothing bias}, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM, which integrates DUNs with a latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This combination ensures the final result is degradation-free and visually rich. Experiments show that UnfoldLDM achieves leading performance on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.
Updated: 2026-03-09 08:51:19
Categories: cs.CV,cs.AI
Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning
Embodied agents need to plan and act reliably in real and complex 3D environments. Classical planning (e.g., PDDL) offers structure and guarantees, but in practice it fails under noisy perception and incorrect predicate grounding. On the other hand, Large Language Model (LLM)-based planners leverage commonsense reasoning, yet frequently propose actions that are infeasible or unsafe. Following recent works that combine the two approaches, we introduce ContextMatters, a framework that fuses LLMs and classical planning to perform hierarchical goal relaxation: the LLM helps ground symbols to the scene and, when the target is unreachable, it proposes functionally equivalent goals that progressively relax constraints, adapting the goal to the context of the agent's environment. Operating on 3D Scene Graphs, this mechanism turns many nominally infeasible tasks into tractable plans and enables context-aware partial achievement when full completion is not achievable. Our experimental results show a +52.45% Success Rate improvement over a state-of-the-art LLMs+PDDL baseline, demonstrating the effectiveness of our approach. Moreover, we validate the execution of ContextMatters in a real-world scenario by deploying it on a TIAGo robot. Code, dataset, and supplementary materials are available to the community at https://lab-rococo-sapienza.github.io/context-matters/.
Updated: 2026-03-09 08:51:01
Categories: cs.RO,cs.AI
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious safety threat: a compromised LLM can maintain a facade of proper safety alignment while covertly generating harmful content. To achieve this, we finetune the model to understand and apply a steganographic technique. At inference time, we input a prompt that contains a steganographically embedded malicious target question along with a plaintext cover question. The model, in turn, produces a target response similarly embedded within a benign-looking cover response. In this process, human observers only see the model being prompted with a cover question and generating a corresponding cover response, while the malicious content is hidden from view. We demonstrate this invisible safety threat on GPT-4.1 despite the OpenAI finetuning API's safeguards. The finetuned model produces steganographic malicious outputs in response to hidden malicious prompts, while the user interface displays only a fully benign cover interaction. We also replicate the attack on three open-source models, Llama-3.3-70B-Instruct, Phi-4, and Mistral-Small-24B-Base-2501, confirming the generality of our method. We quantitatively evaluate our method on the AdvBench dataset, using Llama-Guard-3-8B for content safety classification. Across all four models, all stegotexts containing malicious content are incorrectly classified as safe.
Updated: 2026-03-09 08:48:27
Categories: cs.LG
Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts
Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical \textbf{D}ual-\textbf{RO}bust \textbf{C}ross-domain \textbf{O}ffline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
Updated: 2026-03-09 08:48:21
Categories: cs.LG
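Sketch for the DROCO entry above: the abstract names the Huber loss as one of two techniques for countering value misestimation. The snippet below shows only the standard Huber loss on a scalar temporal-difference error. The function name and the threshold `delta` are illustrative assumptions, and the abstract does not say how DROCO combines this loss with its dynamic value penalty.

```python
def huber_loss(td_error: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic for |td_error| <= delta, linear beyond it."""
    abs_err = abs(td_error)
    if abs_err <= delta:
        return 0.5 * abs_err ** 2
    return delta * (abs_err - 0.5 * delta)


# Small errors are penalized like squared error, large ones only linearly,
# which limits the influence of outlier targets on the value update.
small = huber_loss(0.5)
large = huber_loss(3.0)
```

The linear tail is the usual reason to prefer this loss over plain squared error when some targets may be badly over- or underestimated.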
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.
Updated: 2026-03-09 08:39:35
Categories: cs.AI,cs.CL,cs.MA
DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning
In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
Updated: 2026-03-09 08:36:55
Categories: cs.CL,cs.AI,cs.LG
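Sketch for the DC-W2S entry above: the abstract describes intersecting a Self-Consensus score among weak supervisors with a Neighborhood-Consensus score in embedding space to stratify supervision signals. The toy version below is one possible reading of that idea. The scoring formulas, the threshold, the regime names, and all function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def self_consensus(votes):
    """Return the majority label and the fraction of weak supervisors backing it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def neighborhood_consensus(index, labels, embeddings, k=1):
    """Fraction of the k nearest neighbors (squared Euclidean) sharing the label."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    others = [j for j in range(len(labels)) if j != index]
    others.sort(key=lambda j: dist(embeddings[index], embeddings[j]))
    nearest = others[:k]
    return sum(labels[j] == labels[index] for j in nearest) / len(nearest)


def stratify(sc, nc, threshold=0.8):
    """Intersect the two scores into three hypothetical reliability regimes."""
    if sc >= threshold and nc >= threshold:
        return "reliable"
    if sc >= threshold or nc >= threshold:
        return "ambiguous"
    return "unreliable"


# Three reasoning steps, each labeled by three weak supervisors (toy data).
votes_per_step = [[1, 1, 1], [1, 1, 0], [0, 0, 1]]
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]

labels, sc_scores = zip(*(self_consensus(v) for v in votes_per_step))
regimes = [
    stratify(sc_scores[i], neighborhood_consensus(i, labels, embeddings))
    for i in range(len(labels))
]
```

A curriculum such as the one the abstract mentions could then sample or mask examples per regime; that part is not sketched here.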
CryoNet.Refine: A One-step Diffusion Model for Rapid Refinement of Structural Models with Cryo-EM Density Map Restraints
High-resolution structure determination by cryo-electron microscopy (cryo-EM) requires the accurate fitting of an atomic model into an experimental density map. Traditional refinement pipelines such as Phenix.real_space_refine and Rosetta are computationally expensive, demand extensive manual tuning, and present a significant bottleneck for researchers. We present CryoNet.Refine, an end-to-end deep learning framework that automates and accelerates molecular structure refinement. Our approach utilizes a one-step diffusion model that integrates a density-aware loss function with robust stereochemical restraints, enabling rapid optimization of a structure against experimental data. CryoNet.Refine provides a unified and versatile solution capable of refining protein complexes as well as DNA/RNA-protein complexes. In benchmarks against Phenix.real_space_refine, CryoNet.Refine consistently achieves substantial improvements in both model-map correlation and overall geometric quality metrics. By offering a scalable, automated, and powerful alternative, CryoNet.Refine aims to serve as an essential tool for next-generation cryo-EM structure refinement. Web server: https://cryonet.ai/refine; Source code: https://github.com/kuixu/cryonet.refine.
Updated: 2026-03-09 08:34:19
Categories: q-bio.BM,cs.AI,eess.IV,q-bio.QM
BemaGANv2: Discriminator Combination Strategies for GAN-based Vocoders in Long-Term Audio Generation
This paper presents BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation, with a focus on systematic evaluation of discriminator combination strategies. Long-term audio generation is critical for applications in Text-to-Music (TTM) and Text-to-Audio (TTA) systems, where maintaining temporal coherence, prosodic consistency, and harmonic structure over extended durations remains a significant challenge. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including Multi-Scale Discriminator (MSD) + MED, MSD + MRD, and Multi-Period Discriminator (MPD) + MED + MRD, using objective metrics (Fréchet Audio Distance (FAD), Structural Similarity Index (SSIM), Pearson Correlation Coefficient (PCC), Mel-Cepstral Distortion (MCD), Multi-Resolution STFT (M-STFT), Periodicity error (Periodicity)) and subjective evaluations (MOS, SMOS). To support reproducibility, we provide detailed architectural descriptions, training configurations, and complete implementation details. The code, pre-trained models, and audio demo samples are available at: https://github.com/dinhoitt/BemaGANv2.
Updated: 2026-03-09 08:32:07
Categories: cs.SD,cs.AI,cs.LG,cs.LO,eess.AS
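Sketch for the BemaGANv2 entry above: the abstract says the AMP module applies the Snake activation to model periodic structure. The scalar version below uses the commonly cited form, x + (1/alpha) * sin^2(alpha * x). Treating `alpha` as a fixed scalar is an assumption made for illustration; how BemaGANv2 parameterizes or learns it is not stated in the abstract.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: identity plus a periodic term with frequency alpha."""
    return x + (math.sin(alpha * x) ** 2) / alpha


# For alpha = 1 the periodic term sin^2(x) has period pi, so shifting the
# input by pi changes the output by exactly the same amount.
x = 0.3
shift = snake(x + math.pi) - snake(x)
```

The identity term keeps the function monotone on average, while the sin^2 term gives the network a built-in periodic component.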
DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation
Significant progress has been achieved in subject-driven text-to-image (T2I) generation, which aims to synthesize new images depicting target subjects according to user instructions. However, evaluating these models remains a significant challenge. Existing benchmarks exhibit critical limitations: 1) insufficient diversity and comprehensiveness in subject images, 2) inadequate granularity in assessing model performance across different subject difficulty levels and prompt scenarios, and 3) a profound lack of actionable insights and diagnostic guidance for subsequent model refinement. To address these limitations, we propose DSH-Bench, a comprehensive benchmark that enables systematic multi-perspective analysis of subject-driven T2I models through four principal innovations: 1) a hierarchical taxonomy sampling mechanism ensuring comprehensive subject representation across 58 fine-grained categories, 2) an innovative classification scheme categorizing both subject difficulty level and prompt scenario for granular capability assessment, 3) a novel Subject Identity Consistency Score (SICS) metric demonstrating a 9.4% higher correlation with human evaluation compared to existing measures in quantifying subject preservation, and 4) a comprehensive set of diagnostic insights derived from the benchmark, offering critical guidance for optimizing future model training paradigms and data construction strategies. Through an extensive empirical evaluation of 19 leading models, DSH-Bench uncovers previously obscured limitations in current approaches, establishing concrete directions for future research and development.
Updated: 2026-03-09 08:30:28
Categories: cs.CV,cs.AI
EAGLE-Pangu: Accelerator-Safe Tree Speculative Decoding on Ascend NPUs
Autoregressive decoding remains a primary bottleneck in large language model (LLM) serving, motivating speculative decoding methods that reduce expensive teacher-model invocations by verifying multiple candidate tokens per step. Tree-structured speculation further increases parallelism, but is often brittle when ported across heterogeneous backends and accelerator stacks, where attention masking, KV-cache layouts, and indexing semantics are not interchangeable. We present EAGLE-Pangu, a reproducible system that ports EAGLE-3-style tree speculative decoding to a Pangu teacher backend on Ascend NPUs. EAGLE-Pangu contributes (i) an explicit branch/commit cache manager built on the Cache API, (ii) accelerator-safe tree tensorization that removes undefined negative indices by construction and validates structural invariants, and (iii) a fused-kernel-compatible teacher verification path with a debuggable eager fallback. On 240 turns from MT-Bench and HumanEval-style prompts, EAGLE-Pangu improves end-to-end decoding throughput by 1.27x on average, up to 2.46x at p99, over teacher-only greedy decoding in the fused-kernel performance path. We also provide a fused-kernel-free reference path with structured traces and invariant checks to support reproducible debugging and ablation across execution modes and tree budgets.
Updated: 2026-03-09 08:30:04
Categories: cs.LG,cs.PL
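Sketch for the EAGLE-Pangu entry above: the abstract mentions tree tensorization that removes undefined negative indices by construction and validates structural invariants. The toy function below illustrates the general idea on a parent-pointer array. Replacing the root's `-1` with a self-pointing index, the specific invariants checked, and the function name are all assumptions for illustration, not the system's actual implementation.

```python
def tensorize_tree(parents):
    """Turn a parent-pointer list (root marked with -1) into index-safe form.

    Returns (safe_parents, mask): the root points at itself instead of -1, and
    mask[i][j] is True when node j is node i or one of its ancestors.
    """
    n = len(parents)
    roots = [i for i, p in enumerate(parents) if p == -1]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    for i, p in enumerate(parents):
        if p != -1 and not (0 <= p < i):
            raise ValueError(f"node {i}: parent must precede child")

    # No negative index survives this step, so later gathers are well defined.
    safe_parents = [i if p == -1 else p for i, p in enumerate(parents)]

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while True:
            mask[i][j] = True
            if safe_parents[j] == j:  # reached the self-pointing root
                break
            j = safe_parents[j]
    return safe_parents, mask


# Draft tree: node 0 is the root, nodes 1 and 2 branch from it, node 3 extends 1.
safe_parents, mask = tensorize_tree([-1, 0, 0, 1])
```

Each mask row plays the role of a tree attention mask: a candidate token may attend only to itself and to the tokens on its own branch.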
Simulating Non-Markovian Open Quantum Dynamics with Neural Quantum States
Reducing computational scaling for simulating non-Markovian dissipative dynamics using artificial neural networks is both a major focus and formidable challenge in open quantum systems. To enable neural quantum states (NQSs), we encode environmental memory in dissipatons (quasiparticles with characteristic lifetimes), yielding the dissipaton-embedded quantum master equation (DQME). The resulting NQS-DQME framework achieves compact representation of many-body correlations and non-Markovian memory. Benchmarking against numerically exact hierarchical equations of motion confirms NQS-DQME maintains comparable accuracy while enhancing scalability and interpretability. This methodology opens new paths to explore non-Markovian open quantum dynamics in previously intractable systems.
Updated: 2026-03-09 08:30:04
Categories: quant-ph,cs.LG
MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision
Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference, while also removing the flexibility to reduce to simpler systems. We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design. MAS-ZERO employs meta-level design to iteratively design, critique, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic problem decomposition and agent composition through meta-feedback on solvability and completeness, and reduction to simpler systems when appropriate. Experiments across reasoning (math and graduate-level QA), coding, and agentic (search-based) benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that MAS-ZERO outperforms strong manual and automatic MAS baselines. It achieves substantial average accuracy improvements of up to 16.69% on reasoning, 16.66% on coding, and 5.45% on agentic tasks, while maintaining cost efficiency.
Updated: 2026-03-09 08:29:05
Categories: cs.CL,cs.AI,cs.LG
Tiny Autoregressive Recursive Models
Tiny Recursive Models (TRMs) have recently demonstrated remarkable performance on ARC-AGI, showing that very small models can compete against large foundation models through a two-step refinement mechanism that updates an internal reasoning state $z$ and the predicted output $y$. Naturally, such refinement is of interest for any predictor; it is therefore natural to wonder whether the TRM mechanism could be effectively re-adopted in autoregressive models. However, TRMs cannot be simply compared to standard models because they lack causal predictive structures and contain persistent latent states that make it difficult to isolate specific performance gains. In this paper, we propose the Autoregressive TRM and evaluate it on small autoregressive tasks. To understand its efficacy, we propose a suite of models that gradually transform a standard Transformer to a Tiny Autoregressive Recursive Model in a controlled setting that fixes the block design, token stream, and next-token objective. Across compute-matched experiments on character-level algorithmic tasks, we surprisingly find that there are some two-level refinement baselines that show strong performance. Contrary to expectations, we find no reliable performance gains from the full Autoregressive TRM architecture. These results offer potential promise for two-step refinement mechanisms more broadly but caution against investing in the autoregressive TRM-specific model as a fruitful research direction.
Updated: 2026-03-09 08:22:45
Categories: cs.LG
Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment
The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. To address scalability and coordination concerns, Multi-Agent Reinforcement Learning (MARL) is a promising solution. This paper addresses the imperative need for comprehensive and reliable benchmarking of MARL algorithms on energy management tasks. CityLearn is used as a case study environment because it realistically simulates urban energy systems, incorporates multiple storage systems, and utilizes renewable energy sources. By doing so, our work sets a new standard for evaluation, conducting a comparative study across multiple key performance indicators (KPIs). This approach illuminates the key strengths and weaknesses of various algorithms, moving beyond traditional KPI averaging which often masks critical insights. Our experiments utilize widely accepted baselines such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), and encompass diverse training schemes including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE) approaches and different neural network architectures. Our work also proposes novel KPIs that tackle real-world implementation challenges such as individual building contribution and battery storage lifetime. Our findings show that DTDE consistently outperforms CTDE in both average and worst-case performance. Additionally, temporal dependency learning improved control on memory-dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation. Results also reveal robustness to agent or resource removal, highlighting both the resilience and decentralizability of the learned policies.
Updated: 2026-03-09 08:20:08
Categories: cs.AI,cs.LG,cs.MA
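Sketch for the CityLearn benchmark entry above: the abstract argues that averaging KPIs can hide important differences and reports both average and worst-case performance. The toy example below computes a simple ramping score per building and then both summaries. The unnormalized ramping formula, the building names, and the load values are assumptions for illustration; the KPI definitions used in the paper or in CityLearn itself may differ, for example by normalizing against a baseline.

```python
def ramping(net_load):
    """Sum of absolute changes between consecutive time steps (unnormalized)."""
    return sum(abs(b - a) for a, b in zip(net_load, net_load[1:]))


def summarize(kpi_per_building):
    """Report the mean and the worst (largest) value of a lower-is-better KPI."""
    values = list(kpi_per_building.values())
    return {"average": sum(values) / len(values), "worst": max(values)}


# Hypothetical net electricity load per building over four time steps.
loads = {
    "b1": [1.0, 1.0, 1.0, 1.0],  # flat profile, no ramping
    "b2": [0.0, 4.0, 0.0, 4.0],  # oscillating profile, heavy ramping
}
ramp = {name: ramping(load) for name, load in loads.items()}
summary = summarize(ramp)
```

Here the average alone would understate how badly the second building behaves, which is the kind of masking effect the abstract describes.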
Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making
Economic decision-making depends not only on structured signals such as prices and taxes, but also on unstructured language, including peer dialogue and media narratives. While multi-agent reinforcement learning (MARL) has shown promise in optimizing economic decisions, it struggles with the semantic ambiguity and contextual richness of language. We propose LAMP (Language-Augmented Multi-Agent Policy), a framework that integrates language into economic decision-making and narrows the gap to real-world settings. LAMP follows a Think-Speak-Decide pipeline: (1) Think interprets numerical observations to extract short-term shocks and long-term trends, caching high-value reasoning trajectories; (2) Speak crafts and exchanges strategic messages based on reasoning, updating beliefs by parsing peer communications; and (3) Decide fuses numerical data, reasoning, and reflections into a MARL policy to optimize language-augmented decision-making. Experiments in economic simulation show that LAMP outperforms both MARL and LLM-only baselines in cumulative return (+63.5%, +34.0%), robustness (+18.8%, +59.4%), and interpretability. These results demonstrate the potential of language-augmented policies to deliver more effective and robust economic strategies.
Updated: 2026-03-09 08:10:13
Categories: cs.AI,econ.GN
Hybrid Quantum Neural Network for Multivariate Clinical Time Series Forecasting
Forecasting physiological signals can support proactive monitoring and timely clinical intervention by anticipating critical changes in patient status. In this work, we address multivariate multi-horizon forecasting of physiological time series by jointly predicting heart rate, oxygen saturation, pulse rate, and respiratory rate at forecasting horizons of 15, 30, and 60 seconds. We propose a hybrid quantum-classical architecture that integrates a Variational Quantum Circuit (VQC) within a recurrent neural backbone. A GRU encoder summarizes the historical observation window into a latent representation, which is then projected into quantum angles used to parameterize the VQC. The quantum layer acts as a learnable non-linear feature mixer, modeling cross-variable interactions before the final prediction stage. We evaluate the proposed approach on the BIDMC PPG and Respiration dataset under a Leave-One-Patient-Out protocol. The results show competitive accuracy compared with classical and deep learning baselines, together with greater robustness to noise and missing inputs. These findings suggest that hybrid quantum layers can provide useful inductive biases for physiological time series forecasting in small-cohort clinical settings.
Updated: 2026-03-09 08:08:57
Categories: cs.LG
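Sketch for the hybrid quantum forecasting entry above: the abstract evaluates under a Leave-One-Patient-Out protocol, meaning each fold holds out every window from one patient so that no patient appears in both training and test data. The generator below shows that split on a list of per-window patient identifiers. The function name and the identifiers are hypothetical.

```python
def leave_one_patient_out(patient_ids):
    """Yield (held_out, train_indices, test_indices) with one patient per fold."""
    for held_out in sorted(set(patient_ids)):
        train = [i for i, p in enumerate(patient_ids) if p != held_out]
        test = [i for i, p in enumerate(patient_ids) if p == held_out]
        yield held_out, train, test


# One entry per forecasting window, tagged with the patient it came from.
window_patients = ["p1", "p1", "p2", "p3", "p3"]
folds = list(leave_one_patient_out(window_patients))
```

Splitting by patient instead of by window avoids leaking a patient's physiology from training into evaluation, which matters in small cohorts.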
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Large language models (LLMs) are widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods have been proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, a training-free algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency, enabling effective overlap with computation, full latency hiding, and practical speedups from speculative recall. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to a 13x speedup compared to SOTA KV retrieval methods. Code is available at https://github.com/sjtu-zhao-lab/FreeKV.
Updated: 2026-03-09 08:08:41
Categories: cs.LG,cs.AI,cs.CL
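Worked arithmetic for the FreeKV entry above: the abstract notes that KV cache size grows proportionally with context length. The helper below uses the usual accounting of one key and one value vector per layer, per KV head, per token. The model configuration in the example is hypothetical and is not one of the models evaluated in the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both the key and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
at_32k = kv_cache_bytes(32768, num_layers=32, num_kv_heads=8, head_dim=128)
```

Under these assumptions each token costs 128 KiB, so a 32K-token context already needs 4 GiB for a single sequence, which is the pressure that motivates offloading and retrieval schemes.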
Adopting a human developmental visual diet yields robust, shape-based AI vision
Despite years of research and the dramatic scaling of artificial intelligence (AI) systems, a striking misalignment between artificial and human vision persists. Contrary to humans, AI relies heavily on texture-features rather than shape information, lacks robustness to image distortions, remains highly vulnerable to adversarial attacks, and struggles to recognise simple abstract shapes within complex backgrounds. To close this gap, here we take inspiration from how human vision develops from early infancy into adulthood. We quantified visual maturation by synthesising decades of research into a novel developmental visual diet (DVD) for AI vision. Guiding AI systems through this human-inspired curriculum, which considers the development of visual acuity, contrast sensitivity, and colour, produces models that better align with human behaviour on every hallmark of robust vision tested, yielding the strongest reported reliance on shape information to date, abstract shape recognition beyond the state of the art, and higher resilience to image corruptions and adversarial attacks. Our results thus demonstrate that robust AI vision can be achieved by guiding how a model learns, not merely how much it learns, offering a resource-efficient route toward safer and more human-like artificial visual systems.
Updated: 2026-03-09 08:08:25
Categories: cs.LG,cs.CV
In-Context Reinforcement Learning for Tool Use in Large Language Models
While large language models (LLMs) exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limitations of their internal knowledge. A compelling approach to overcome this challenge is to augment these models with external tools -- such as Python interpreters for mathematical computations or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL introduces in-context examples within the rollout prompts to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting where the model learns to call tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.
Updated: 2026-03-09 08:06:18
Categories: cs.AI
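Sketch for the ICRL entry above: the abstract says the number of in-context examples in the rollout prompt is gradually reduced during training until the model runs zero-shot. The schedule below is one simple way to do that. The linear shape, the starting count of three examples, and the fraction of training spent zero-shot are all assumptions, since the abstract does not specify the schedule.

```python
def num_in_context_examples(step, total_steps, start_examples=3, zero_shot_fraction=0.25):
    """Linearly decay the demonstration count, reaching 0 before training ends."""
    decay_steps = int(total_steps * (1.0 - zero_shot_fraction))
    if step >= decay_steps:
        return 0
    remaining_steps = decay_steps - step
    # Integer ceiling division keeps at least one example until the cutoff.
    return (start_examples * remaining_steps + decay_steps - 1) // decay_steps


# Example counts at a few points of a 100-step run.
schedule = [num_in_context_examples(s, total_steps=100) for s in (0, 25, 50, 74, 75, 99)]
```

A rollout prompt builder would then prepend that many tool-use demonstrations to each query at the given training step.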
Deterministic Differentiable Structured Pruning for Large Language Models
Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an l0 sparsity constraint. Due to the discreteness of the l0 norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train-test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete l0 objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train-test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.
Updated: 2026-03-09 07:59:17
Categories: cs.LG, cs.CL
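To make the train-test mismatch mentioned in the DDP abstract above more concrete, here is a small sketch of the stochastic hard-concrete gate that the abstract attributes to prior work. It is not DDP itself, whose surrogate the abstract does not spell out. The constants follow values commonly used with this relaxation and are assumptions here, as are the function names.

```python
import math
import random

# Stretch limits and temperature commonly used with the hard-concrete
# relaxation; treated as assumptions in this sketch.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sample_hard_concrete(log_alpha: float, rng: random.Random) -> float:
    """Draw one stochastic gate value in [0, 1] (training-time behaviour)."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = _sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def deterministic_gate(log_alpha: float) -> float:
    """Noise-free gate typically used at test time for the same relaxation."""
    return min(1.0, max(0.0, _sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))
```

Sampling `sample_hard_concrete` repeatedly for a fixed `log_alpha` gives values that scatter around, and are clamped into, the unit interval, while deployment uses a single fixed gate. That gap is the kind of mismatch the abstract refers to.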
Adversarial Domain Adaptation Enables Knowledge Transfer Across Heterogeneous RNA-Seq Datasets
Accurate phenotype prediction from RNA sequencing (RNA-seq) data is essential for diagnosis, biomarker discovery, and personalized medicine. Deep learning models have demonstrated strong potential to outperform classical machine learning approaches, but their performance relies on large, well-annotated datasets. In transcriptomics, such datasets are frequently limited, leading to over-fitting and poor generalization. Knowledge transfer from larger, more general datasets can alleviate this issue. However, transferring information across RNA-seq datasets remains challenging due to heterogeneous preprocessing pipelines and differences in target phenotypes. In this study, we propose a deep learning-based domain adaptation framework that enables effective knowledge transfer from a large general dataset to a smaller one for cancer type classification. The method learns a domain-invariant latent space by jointly optimizing classification and domain alignment objectives. To ensure stable training and robustness in data-scarce scenarios, the framework is trained with an adversarial approach with appropriate regularization. Both supervised and unsupervised approach variants are explored, leveraging labeled or unlabeled target samples. The framework is evaluated on three large-scale transcriptomic datasets (TCGA, ARCHS4, GTEx) to assess its ability to transfer knowledge across cohorts. Experimental results demonstrate consistent improvements in cancer and tissue type classification accuracy compared to non-adaptive baselines, particularly in low-data scenarios. Overall, this work highlights domain adaptation as a powerful strategy for data-efficient knowledge transfer in transcriptomics, enabling robust phenotype prediction under constrained data conditions.
Updated: 2026-03-09 07:55:32
Categories: cs.LG, q-bio.GN
ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning
With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed-source or proprietary models, often struggle with complex, indirect, or multi-step user instructions. These limitations hinder their ability to perform nuanced, context-aware edits that align with human intent. In this work, we propose ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents. Each agent is responsible for distinct capabilities--such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content--while reinforcement learning governs their collaboration to ensure coherent and goal-directed behavior. Unlike existing approaches that rely on monolithic models or hand-crafted pipelines, our method treats image editing as a sequential decision-making problem, enabling dynamic and context-aware editing strategies. Experimental results demonstrate that ImageEdit-R1 consistently outperforms both individual closed-source diffusion models and alternative multi-agent framework baselines across multiple image editing datasets.
Updated: 2026-03-09 07:50:14
Categories: cs.CV, cs.AI
Stabilized Fine-Tuning with LoRA in Federated Learning: Mitigating the Side Effect of Client Size and Rank via the Scaling Factor
Large Language Models (LLMs) are pivotal in natural language processing. The impracticality of full fine-tuning has prompted Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which optimizes low-rank matrices A and B. In distributed scenarios where privacy constraints necessitate Federated Learning (FL), however, the integration of LoRA is often unstable. Specifically, we identify that aggregating updates from multiple clients introduces statistical variance that scales with the client count, causing gradient collapse when using high-rank adapters. Existing scaling factor candidates, such as the one used by Rank-Stabilized LoRA, ignore the interaction caused by the aggregation process. To bridge this gap, this paper introduces Stabilized Federated LoRA (SFed-LoRA), a framework that theoretically characterizes the interaction between adapter rank and federated aggregation. We derive an optimal scaling factor designed to effectively mitigate the aggregation error accumulating across N clients. By correcting the scaling mismatch inherent in previous approaches, SFed-LoRA restores the efficacy of high-rank adaptation without altering the original model architecture or increasing inference latency. Extensive experiments across diverse tasks, model architectures, and heterogeneous data distributions validate our results. We demonstrate that SFed-LoRA prevents high-rank collapse and achieves significantly improved stability and faster convergence compared with state-of-the-art baselines for high-rank adaptation.
Updated: 2026-03-09 07:49:56
Categories: cs.LG
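For context on the "scaling factor candidates" mentioned in the SFed-LoRA abstract above, the sketch below compares the two well-known choices: the original LoRA factor alpha / r and the Rank-Stabilized LoRA factor alpha / sqrt(r). The factor derived in the paper, which also accounts for the client count N, is not given in the abstract and is deliberately not reproduced here. Function names are illustrative.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Original LoRA scaling: the update B @ A is multiplied by alpha / rank."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-Stabilized LoRA scaling: alpha / sqrt(rank)."""
    return alpha / math.sqrt(rank)


if __name__ == "__main__":
    alpha = 16.0
    for rank in (8, 64, 256):
        print(rank, lora_scale(alpha, rank), rslora_scale(alpha, rank))
```

With a fixed alpha, the original factor shrinks much faster than the rank-stabilized one as the rank grows. Neither depends on how many clients are aggregated, which is the gap the abstract says SFed-LoRA addresses.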
Speed3R: Sparse Feed-forward 3D Reconstruction Models
While recent feed-forward 3D reconstruction models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes quadratic complexity, creating a prohibitive computational bottleneck that severely limits inference speed. To resolve this, we introduce Speed3R, an end-to-end trainable model inspired by the core principle of Structure-from-Motion: that a sparse set of keypoints is sufficient for robust pose estimation. Speed3R features a dual-branch attention mechanism where a compression branch creates a coarse contextual prior to guide a selection branch, which performs fine-grained attention only on the most informative image tokens. This strategy mimics the efficiency of traditional keypoint matching, achieving a remarkable 12.4x inference speedup on 1000-view sequences, while introducing a minimal, controlled trade-off in geometric accuracy. Validated on standard benchmarks with both VGGT and $π^3$ backbones, our method delivers high-quality reconstructions at a fraction of the computational cost, paving the way for efficient large-scale scene modeling.
Updated: 2026-03-09 07:46:51
Categories: cs.CV, cs.AI
Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets
Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence level, which introduces token-level noise and degrades final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code, and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.
Updated: 2026-03-09 07:42:51
Categories: cs.CL, cs.AI
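The XTF abstract above describes scoring each token on three attributes and then masking the gradients of tokens judged noisy. A minimal sketch of that idea follows, written over plain per-token loss values. Excluding a token from the averaged loss is one common way to give it zero gradient. The averaging rule, the 0.5 threshold, and both function names are assumptions; the abstract does not say how the three scores are combined.

```python
def select_tokens(scores: list[tuple[float, float, float]],
                  threshold: float = 0.5) -> list[bool]:
    """Decide which tokens to keep.

    Each tuple holds (reasoning importance, knowledge novelty, task relevance),
    assumed to lie in [0, 1]. Hypothetical rule: keep a token when the mean of
    its three scores reaches the threshold.
    """
    return [sum(s) / 3.0 >= threshold for s in scores]


def masked_token_loss(token_losses: list[float], keep: list[bool]) -> float:
    """Average the loss over kept tokens only.

    Tokens left out of the average contribute nothing to the gradient, which
    is the effect of masking their gradients.
    """
    if len(token_losses) != len(keep):
        raise ValueError("token_losses and keep must have the same length")
    kept = [loss for loss, k in zip(token_losses, keep) if k]
    return sum(kept) / len(kept) if kept else 0.0
```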
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Large language model (LLM) agents are moving beyond prompting alone. ChatGPT marked the rise of general-purpose LLM assistants, DeepSeek showed that on-policy reinforcement learning with verifiable rewards can improve reasoning and tool use, and OpenClaw highlights a newer direction in which agents accumulate persistent memory and reusable skills. Yet the research landscape remains fragmented across post-training, retrieval, memory, and skill systems. This survey studies these developments under a single notion of adaptation: improving an agent, its tools, or their interaction after pretraining. We organize the field with a four-paradigm framework spanning agent adaptation and tool adaptation. On the agent side, A1 (tool-execution-signaled) and A2 (agent-output-signaled) improve the agent itself through supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. On the tool side, T1 (agent-agnostic) provides reusable pre-trained modules any agent can call, while T2 (agent-supervised) uses the agent's outputs to train memory systems, skill libraries, or lightweight subagents. Using this framework, we review post-training methods, adaptive memory architectures, and agent skills; compare their trade-offs in cost, flexibility, and generalization; and summarize evaluation practices across deep research, software development, computer use, and drug discovery. We conclude by outlining open problems in agent-tool co-adaptation, continual learning, safety, and efficient deployment.
Updated: 2026-03-09 07:39:08
Categories: cs.AI, cs.CL
S2S-FDD: Bridging Industrial Time Series and Natural Language for Explainable Zero-shot Fault Diagnosis
Fault diagnosis is critical for the safe operation of industrial systems. Conventional diagnosis models typically produce abstract outputs such as anomaly scores or fault categories, failing to answer critical operational questions like "Why" or "How to repair". While large language models (LLMs) offer strong generalization and reasoning abilities, their training on discrete textual corpora creates a semantic gap when processing high-dimensional, temporal industrial signals. To address this challenge, we propose a Signals-to-Semantics fault diagnosis (S2S-FDD) framework that bridges high-dimensional sensor signals with natural language semantics through two key innovations: We first design a Signal-to-Semantic operator to convert abstract time-series signals into natural language summaries, capturing trends, periodicity, and deviations. Based on the descriptions, we design a multi-turn tree-structured diagnosis method to perform fault diagnosis by referencing historical maintenance documents and dynamically querying additional signals. The framework further supports human-in-the-loop feedback for continuous refinement. Experiments on the multiphase flow process show the feasibility and effectiveness of the proposed method for explainable zero-shot fault diagnosis.
Updated: 2026-03-09 07:38:56
Categories: cs.AI
Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, it frequently encounters challenges such as entropy collapse, excessive verbosity, and insufficient exploration for hard problems. Crucially, existing reward schemes fail to distinguish between the need for extensive search during problem-solving and the efficiency required for mastered knowledge. In this work, we introduce T2T (Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) On incorrect attempts, T2T incentivizes "thickening" (longer trajectories) to broaden the search space and explore novel solution paths; (2) Upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across Qwen-series and DeepSeek models demonstrate that T2T significantly outperforms standard GRPO and recent baselines.
Updated: 2026-03-09 07:28:39
Categories: cs.LG, cs.AI
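The dual-phase mechanism in the T2T abstract above can be pictured with a toy reward function: longer trajectories earn a small bonus while the answer is wrong, and a small penalty once it is right. Everything numeric here is an assumption chosen for illustration (base rewards of 1 and 0, linear length terms, coefficients of 0.1), as is the name `t2t_reward`. The abstract does not give the actual reward formula.

```python
def t2t_reward(correct: bool, length: int, max_length: int,
               thicken_coef: float = 0.1, thin_coef: float = 0.1) -> float:
    """Toy length-shaped reward following the thickening/thinning idea.

    Incorrect attempt: reward grows with length ("thickening").
    Correct attempt:   reward shrinks with length ("thinning").
    The coefficients are hypothetical and kept small so that any correct
    answer still scores above any incorrect one.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    frac = min(max(length / max_length, 0.0), 1.0)
    if correct:
        return 1.0 - thin_coef * frac
    return thicken_coef * frac
```

Under these assumptions a wrong 900-token attempt is rewarded more than a wrong 100-token one, a correct 100-token answer more than a correct 900-token one, and correctness always dominates length.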
Enhancing Alzheimer's Diagnosis: Leveraging Anatomical Landmarks in Graph Convolutional Neural Networks on Tetrahedral Meshes
Alzheimer's disease (AD) is a major neurodegenerative condition that affects millions around the world. As one of the main biomarkers in the AD diagnosis procedure, brain amyloid positivity is typically identified by positron emission tomography (PET), which is costly and invasive. Brain structural magnetic resonance imaging (sMRI) may provide a safer and more convenient solution for AD diagnosis. Recent advances in geometric deep learning have facilitated sMRI analysis and early diagnosis of AD. However, determining AD pathology, such as brain amyloid deposition, in the preclinical stage remains challenging, because only subtle morphological changes can be observed. As a result, few AD classification models generalize to the brain amyloid positivity classification task. Blood-based biomarkers (BBBMs), on the other hand, have recently achieved remarkable success in predicting brain amyloid positivity and identifying individuals at high risk of being brain amyloid positive. However, individuals in the medium-risk group still require gold-standard tests such as amyloid PET for further evaluation. Inspired by the recent success of transformer architectures, we propose a transformer-based geometric deep learning model that is both scalable and robust to variations in input volumetric mesh size. Our work introduces a novel tokenization scheme for tetrahedral meshes, incorporating anatomical landmarks generated by a pre-trained Gaussian process model. Our model achieves superior performance on the AD classification task. In addition, we show that the model also generalizes to brain amyloid positivity prediction for individuals in the medium-risk class, where BBBMs alone cannot achieve a clear classification. Our work may enrich geometric deep learning research and improve AD diagnosis accuracy without using expensive and invasive PET scans.
Updated: 2026-03-09 07:23:37
Categories: eess.IV, cs.AI, cs.CV, q-bio.NC
CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling
Reward modeling is essential for aligning Large Language Models (LLMs) with human preferences, yet conventional reward models suffer from poor interpretability and heavy reliance on costly expert annotations. While recent rubric-based approaches enhance evaluation transparency, they lack systematic quality control, yielding noisy and redundant criteria, failing to mitigate persistent biases (e.g., verbosity, position) in LLM evaluators, and creating a scalability-reliability trade-off. To address these limitations, we propose CDRRM (Contrast-Driven Rubric Reward Model), a framework built on a novel Contrast-then-Synthesis paradigm for high-quality rubric generation and guided preference judgment. CDRRM first conducts multi-dimensional contrastive profiling on preference pairs to identify causal discriminative factors, then synthesizes these insights into compact, context-aware rubrics to guide preference judgments. Extensive experiments on three authoritative benchmarks (RewardBench, RMBench, RMB) demonstrate that CDRRM achieves state-of-the-art performance across diverse domains and effectively mitigates the aforementioned evaluation biases. Notably, our approach delivers exceptional data efficiency: training the rubric generator on only 3k high-quality samples empowers a frozen pre-trained judge model to outperform fully fine-tuned baselines. This work offers a scalable, interpretable, and data-efficient path for reward modeling.
Updated: 2026-03-09 07:15:23
Categories: cs.AI, cs.LG
More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to SOTA models while using only 1.5% of the training data. Furthermore, with our proposed EDU sampling strategy, accuracy on generative reasoning tasks rises from 64.7% to 67.3%, accompanied by a 32% reduction in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
Updated: 2026-03-09 07:14:31
Categories: cs.LG, cs.AI, cs.CL
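The EDU-PRM abstract above anchors step boundaries at tokens with high predictive entropy. The sketch below shows the basic computation on toy next-token distributions: Shannon entropy per position, then a threshold. The threshold of 1.0 nat and the function names are assumptions for illustration; the abstract does not state how "high" entropy is defined.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_boundaries(distributions: list[list[float]],
                    threshold: float = 1.0) -> list[int]:
    """Positions whose predictive entropy exceeds the (hypothetical) threshold."""
    return [i for i, probs in enumerate(distributions)
            if token_entropy(probs) > threshold]
```

For example, a position where the model spreads probability evenly over four candidates has entropy ln 4, about 1.39 nats, and would be marked as a boundary, while a near-deterministic position would not.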
Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout
Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.
Updated: 2026-03-09 07:13:20
Categories: cs.CV, cs.AI
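The sliding-window soft voting mentioned in the ABAW abstract above can be sketched as follows: per-frame class probabilities are averaged over a short window before taking the argmax, which suppresses isolated label flips. The centered window, its default size of 5, and the function name `soft_vote` are assumptions; the abstract does not give these details.

```python
def soft_vote(frame_probs: list[list[float]], window: int = 5) -> list[int]:
    """Smooth frame-level predictions by averaging probabilities in a window.

    frame_probs[t][c] is the predicted probability of class c at frame t.
    Returns one class index per frame.
    """
    half = window // 2
    labels = []
    for t in range(len(frame_probs)):
        lo = max(0, t - half)
        hi = min(len(frame_probs), t + half + 1)
        n_classes = len(frame_probs[t])
        mean = [sum(frame_probs[i][c] for i in range(lo, hi)) / (hi - lo)
                for c in range(n_classes)]
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels
```

A single frame whose raw argmax disagrees with its neighbours is typically overruled by the window average, which is the jitter reduction the abstract describes.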
GCGNet: Graph-Consistent Generative Network for Time Series Forecasting with Exogenous Variables
Exogenous variables offer valuable supplementary information for predicting future endogenous variables. Forecasting with exogenous variables needs to consider both past-to-future dependencies (i.e., temporal correlations) and the influence of exogenous variables on endogenous variables (i.e., channel correlations). This is pivotal when future exogenous variables are available, because they may directly affect the future endogenous variables. Many methods have been proposed for time series forecasting with exogenous variables, focusing on modeling temporal and channel correlations. However, most of them use a two-step strategy, modeling temporal and channel correlations separately, which limits their ability to capture joint correlations across time and channels. Furthermore, in real-world scenarios, time series are frequently affected by various forms of noise, underscoring the critical importance of robustness in such correlation modeling. To address these limitations, we propose GCGNet, a Graph-Consistent Generative Network for time series forecasting with exogenous variables. Specifically, GCGNet first employs a Variational Generator to produce coarse predictions. A Graph Structure Aligner then further guides it by evaluating the consistency between the generated and true correlations, where the correlations are represented as graphs and are robust to noise. Finally, a Graph Refiner is proposed to refine the predictions to prevent degeneration and improve accuracy. Extensive experiments on 12 real-world datasets demonstrate that GCGNet outperforms state-of-the-art baselines.
Updated: 2026-03-09 07:11:01
Categories: cs.LG, cs.AI
SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications
We present SwiftEmbed, a production-oriented serving system for static token embeddings that achieves 1.12 ms p50 latency for single-text requests while maintaining a 60.6 MTEB average score across 8 representative tasks. Built around the open-source Potion-base-8M distilled model from MinishLab and implemented in Rust, the system delivers 50,000 requests per second through static embedding lookup, mean pooling, and zero-copy IEEE754 binary serialization. Evaluation demonstrates exceptional duplicate detection performance (90.1% AP) and strong semantic similarity (76.1% Spearman correlation). Performance relative to Sentence-BERT is task-dependent: robust for deduplication and similarity workloads (89-100%), substantially lower for classification and complex retrieval tasks (75%). Domain-specific performance ranges from 75% to 131% of a GloVe-840B baseline. The system targets real-time embedding applications where sub-5 ms latency is operationally critical and where full transformer inference is not feasible.
Updated: 2026-03-09 07:05:34
Categories: cs.CL, cs.AI
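The serving path named in the SwiftEmbed abstract above (static embedding lookup, mean pooling, binary serialization) is simple enough to sketch in a few lines. The version below is a toy Python illustration, not the Rust system itself. The whitespace tokenizer, the two-dimensional table, and the function names are assumptions, and `struct.pack` copies data, so it only stands in for the zero-copy serialization the abstract mentions.

```python
import struct


def embed(text: str, table: dict[str, list[float]], dim: int) -> list[float]:
    """Look up a static vector per token and mean-pool them.

    The lowercase whitespace tokenizer is a stand-in for a real tokenizer.
    Tokens missing from the table are skipped; an all-unknown text maps to zeros.
    """
    vectors = [table[tok] for tok in text.lower().split() if tok in table]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def to_bytes(vector: list[float]) -> bytes:
    """Serialize as little-endian IEEE 754 float32 values."""
    return struct.pack(f"<{len(vector)}f", *vector)
```

Because no neural network runs at request time, the cost per request is dominated by table lookups and an average, which is why this family of approaches can reach very low latencies.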
DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
Updated: 2026-03-09 07:02:01
Categories: cs.CL, cs.AI, cs.PF
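The saliency test in the DyLLM abstract above compares each token's attention context between adjacent denoising steps using cosine similarity. A minimal sketch of that selection step is shown below on plain Python lists. The 0.99 threshold and the function names are assumptions; the abstract does not state the actual cutoff or how contexts are extracted from the model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def salient_tokens(prev_ctx: list[list[float]], curr_ctx: list[list[float]],
                   threshold: float = 0.99) -> list[int]:
    """Indices of tokens whose attention context changed between two steps.

    Tokens at or above the (hypothetical) similarity threshold are treated as
    stable, so their cached activations could be reused.
    """
    return [i for i, (p, c) in enumerate(zip(prev_ctx, curr_ctx))
            if cosine(p, c) < threshold]
```

Only the returned indices would need fresh feed-forward and attention computation in the next step; all other tokens would fall back to the cache.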
Not Like Transformers: Drop the Beat Representation for Dance Generation with Mamba-Based Diffusion Model
Dance is a form of human motion characterized by emotional expression and communication, playing a role in various fields such as music, virtual reality, and content creation. Existing methods for dance generation often fail to adequately capture the inherently sequential, rhythmical, and music-synchronized characteristics of dance. In this paper, we propose MambaDance, a new dance generation approach that leverages a Mamba-based diffusion model. Mamba, well-suited to handling long and autoregressive sequences, is integrated into our two-stage diffusion architecture, replacing the off-the-shelf Transformer. Additionally, considering the critical role of musical beats in dance choreography, we propose a Gaussian-based beat representation to explicitly guide the decoding of dance sequences. Experiments on the AIST++ and FineDance datasets at each sequence length show that, compared to previous methods, our method generates plausible dance movements that reflect these essential characteristics consistently, from short to long dances. Additional qualitative results and demo videos are available at https://vision3d-lab.github.io/mambadance.
Updated: 2026-03-09 06:59:03
标题: 不像变形金刚:基于曼巴扩散模型的舞蹈代表性节拍。
摘要: 舞蹈是一种以情感表达和交流为特点的人类运动形式,在音乐、虚拟现实和内容创作等领域发挥作用。现有的舞蹈生成方法通常无法充分捕捉舞蹈固有的序列、节奏和与音乐同步的特征。本文提出了一种新的舞蹈生成方法MambaDance,利用基于Mamba的扩散模型。Mamba适合处理长序列和自回归序列,被整合到我们的两阶段扩散架构中,替代了现成的Transformer。此外,考虑到音乐节拍在舞蹈编排中的关键作用,我们提出了基于高斯的节拍表示,明确引导舞蹈序列的解码。在AIST++和FineDance数据集上进行的实验显示,与先前的方法相比,我们提出的方法能够有效地生成合理的舞蹈动作,从短舞蹈到长舞蹈都能保持基本特征的一致性。更多的定性结果和演示视频可在https://vision3d-lab.github.io/mambadance上找到。
更新时间: 2026-03-09 06:59:03
领域: cs.CV,cs.AI,cs.GR,cs.SD
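The MambaDance abstract above mentions a Gaussian-based beat representation without giving its form. One plausible reading, sketched here as an assumption rather than as the paper's definition, is a per-frame signal that peaks at each beat and decays smoothly around it. The function name, the `sigma` width, and taking the maximum over beats are all illustrative choices.

```python
import math

def gaussian_beat_signal(beat_frames, num_frames, sigma=2.0):
    """Per-frame beat signal: 1.0 on a beat, Gaussian falloff around it.

    beat_frames: frame indices where a musical beat occurs.
    sigma: width of each Gaussian bump in frames (hypothetical value).
    """
    signal = []
    for t in range(num_frames):
        # Take the strongest contribution among all beats for this frame.
        value = max(
            (math.exp(-((t - b) ** 2) / (2.0 * sigma ** 2)) for b in beat_frames),
            default=0.0,
        )
        signal.append(value)
    return signal

# Two beats at frames 5 and 15 in a 20-frame clip.
signal = gaussian_beat_signal([5, 15], 20)
```

Compared with a binary on-beat indicator, a soft signal like this also tells nearby frames how close they are to a beat, which is one reason such a representation might help guide decoding.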
Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization
A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches directly on the target model or rely on mixture scaling laws that fail to extrapolate well to large model sizes. We address these limitations by introducing a compute-efficient pipeline for data mixture scaling. First, we propose CAMEL, a capacity-aware mixture law that models validation loss with the nonlinear interplay between model size and mixture. We also introduce a loss-to-benchmark prediction law that estimates benchmark accuracy from validation loss, enabling end-to-end performance prediction for the target model. Next, we study how to allocate a fixed compute budget across model scales to fit the law and reduce prediction error. Finally, we apply our method to Mixture-of-Experts models with up to 7B-A150M parameters to fit the law, and verify the optimal mixture derived from the law by extrapolating to a 55B-A1.2B target model. Compared to prior methods, our method reduces mixture optimization costs by 50\% and improves downstream benchmark performance by up to 3\%.
Updated: 2026-03-09 06:58:00
标题: 容量感知混合定律实现高效的LLM数据优化
摘要: 数据混合是指如何将不同的数据源组合起来训练大型语言模型,选择有效的混合对于优化下游性能至关重要。现有方法要么直接在目标模型上进行昂贵的搜索,要么依赖于混合比例法则,这些法则在大型模型规模上往往无法很好地外推。我们通过引入一种计算高效的数据混合比例管道来解决这些限制。首先,我们提出了CAMEL,一种能力感知的混合法则,它通过模型大小和混合之间的非线性相互作用来建模验证损失。我们还引入了一种从验证损失估算基准准确度的损失-基准预测法则,从而实现了目标模型的端到端性能预测。接下来,我们研究如何在不同模型规模之间分配固定的计算预算以适应法则并减少预测误差。最后,我们将我们的方法应用于带有高达7B-A150M参数的专家混合模型,以适应法则,并通过外推到55B-A1.2B目标模型验证了从法则中得出的最佳混合比例。与先前方法相比,我们将混合优化成本降低了50\%,并将下游基准性能提高了最多3\%。
更新时间: 2026-03-09 06:58:00
领域: cs.LG
Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains
Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7\% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds ($τ\approx 0.06$), while on free-text radiology reports, models are overconfident, demanding strict thresholds ($τ$ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage ($\geq 90\%$) in both settings with manageable rejection rates (9--13\%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
Updated: 2026-03-09 06:54:54
标题: 《临床领域中的风险受控医疗实体提取的一致性预测》
摘要: 大型语言模型(LLMs)越来越被用于医疗实体提取,然而它们的置信度分数通常存在误校准,限制了在临床环境中的安全部署。我们提出了一个符合预测框架,为基于LLM的提取在两个临床领域提供了有限样本覆盖保证。首先,我们使用GPT-4.1从1,000个FDA药物标签的八个部分中提取结构化实体,通过基于FactScore的原子语句评估进行验证(在128,906个实体中的97.7\%准确率)。其次,我们使用RadGraph架构从MIMIC-CXR报告中提取放射学实体,使用GPT-4.1和Llama-4-Maverick进行评估,评估结果与医师注释一致(实体F1:0.81至0.84)。我们的核心发现是,误校准方向在不同领域之间相反:在结构良好的FDA标签上,模型过于谨慎,需要适度的符合阈值($τ\approx 0.06$),而在自由文本放射学报告中,模型过于自信,需要严格的阈值($τ$最多为0.99)。尽管存在这种异质性,符合预测在两种情境中均实现了目标覆盖率($\geq 90\%$),并且拒绝率可控(9-13\%)。这些结果表明,校准不是一个全局模型属性,而是取决于文档结构、提取类别和模型架构,促使为安全临床部署进行特定领域的符合校准。
更新时间: 2026-03-09 06:54:54
领域: cs.CL,cs.AI
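The finite-sample coverage guarantee mentioned in the abstract above comes from how a conformal threshold is chosen on held-out calibration data. The sketch below shows the generic split-conformal quantile rule, not the paper's specific pipeline. The nonconformity scores are made-up numbers, and `conformal_threshold` is a hypothetical helper name.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold for target coverage 1 - alpha.

    cal_scores: nonconformity scores on a calibration set (higher = worse).
    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small to certify the requested coverage.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]

# Made-up calibration scores 0.01, 0.02, ..., 0.20 (n = 20).
cal_scores = [i / 100 for i in range(1, 21)]
tau = conformal_threshold(cal_scores, alpha=0.1)

def accept(score, tau):
    """Keep an extracted entity only if its score is within the threshold."""
    return score <= tau
```

With 20 calibration points and a 90% target, the rule picks the 19th smallest score. Because the threshold is recomputed from each domain's own calibration data, an over-confident model and an under-confident model end up with very different thresholds, which matches the domain dependence the abstract reports.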
Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks
Adversarial examples in neural networks have been extensively studied in Euclidean geometry, but recent advances in \textit{hyperbolic networks} call for a reevaluation of attack strategies in non-Euclidean geometries. Existing methods such as FGSM and PGD apply perturbations without regard to the underlying hyperbolic structure, potentially leading to inefficient or geometrically inconsistent attacks. In this work, we propose a novel adversarial attack that explicitly leverages the geometric properties of hyperbolic space. Specifically, we compute the gradient of the loss function in the tangent space of hyperbolic space, decompose it into a radial (depth) component and an angular (semantic) component, and apply a perturbation derived solely from the angular direction. Our method generates adversarial examples by focusing perturbations in semantically sensitive directions encoded in angular movement within the hyperbolic geometry. Empirical results on image classification and cross-modal retrieval tasks, across network architectures, demonstrate that our attack achieves higher fooling rates than conventional adversarial attacks, while producing high-impact perturbations that offer deeper insight into the vulnerabilities of hyperbolic embeddings. This work highlights the importance of geometry-aware adversarial strategies in curved representation spaces and provides a principled framework for attacking hierarchical embeddings.
Updated: 2026-03-09 06:52:38
标题: Angular Gradient Sign 方法:揭示双曲网络中的漏洞
摘要: 神经网络中的对抗样本在欧几里得几何中得到了广泛研究,但最近在\textit{双曲网络}方面的进展要求重新评估非欧几里得几何中的攻击策略。现有的方法如FGSM和PGD在不考虑基本双曲结构的情况下应用扰动,可能导致效率低下或几何不一致的攻击。在这项工作中,我们提出了一种新颖的对抗攻击方法,明确利用了双曲空间的几何特性。具体地,我们在双曲空间的切线空间中计算损失函数的梯度,将其分解为径向(深度)分量和角向(语义)分量,并仅从角向导出扰动。我们的方法通过在双曲几何中编码的语义敏感方向集中扰动来生成对抗样本。对图像分类、跨模态检索任务和网络架构的实证结果表明,我们的攻击实现了比传统对抗攻击更高的欺骗率,同时产生了对双曲嵌入的漏洞更深入的洞察力。这项工作强调了在曲线表示空间中的几何感知对抗策略的重要性,并为攻击分层嵌入提供了一个原则性框架。
更新时间: 2026-03-09 06:52:38
领域: cs.LG,cs.CV
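The radial/angular split described in the abstract above can be illustrated in plain coordinates. The sketch assumes a point inside a Poincaré-ball-style embedding and an already-computed gradient vector at that point. It only shows the orthogonal decomposition and is not the authors' attack; the function name and numbers are hypothetical.

```python
import math

def split_radial_angular(x, g):
    """Split gradient g at point x into radial and angular parts.

    radial: the component of g along the direction of x (changes depth).
    angular: the remainder, orthogonal to x (changes direction only).
    At the origin the radial direction is undefined, so g is returned as angular.
    """
    norm_x = math.sqrt(sum(a * a for a in x))
    if norm_x == 0.0:
        return [0.0] * len(g), list(g)
    unit = [a / norm_x for a in x]
    radial_mag = sum(gi * ui for gi, ui in zip(g, unit))
    radial = [radial_mag * ui for ui in unit]
    angular = [gi - ri for gi, ri in zip(g, radial)]
    return radial, angular

x = [0.3, 0.4]   # embedding at distance 0.5 from the origin
g = [1.0, 0.0]   # made-up loss gradient at x
radial, angular = split_radial_angular(x, g)
```

A perturbation built only from `angular` moves the embedding around the origin without, to first order, changing its distance from the origin. That is the sense in which the abstract calls this direction semantic rather than depth-related.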
Alignment--Process--Outcome: Rethinking How AIs and Humans Collaborate
In real-world collaboration, alignment, process structure, and outcome quality do not exhibit a simple linear or one-to-one correspondence: similar alignment may accompany either rapid convergence or extensive multi-branch exploration, and lead to different results. Existing accounts often isolate these dimensions or focus on specific participant types, limiting structural accounts of collaboration. We reconceptualize collaboration through two complementary lenses. The task lens models collaboration as trajectory evolution in a structured task space, revealing patterns such as advancement, branching, and backtracking. The intent lens examines how individual intents are expressed within shared contexts and enter situated decisions. Together, these lenses clarify the structural relationships among alignment, decision-making, and trajectory structure. Rather than reducing collaboration to outcome quality or treating alignment as the sole objective, we propose a unified dynamic view of the relationships among alignment, process, and outcome, and use it to re-examine collaboration structure across Human-Human, AI-AI, and Human-AI settings.
Updated: 2026-03-09 06:46:59
标题: 对齐-流程-结果:重新思考人工智能和人类如何合作
摘要: 在现实世界的合作中,协调、过程结构和结果质量并不表现为简单的线性或一对一对应关系:相似的协调可能伴随着快速收敛或广泛的多分支探索,并导致不同的结果。现有的描述通常孤立这些维度或专注于特定的参与者类型,从而限制了对合作的结构性描述。 我们通过两种互补的视角重新构想合作。任务视角将合作建模为在结构化任务空间中的轨迹演变,揭示出诸如前进、分支和回溯等模式。意图视角则研究个体意图如何在共享环境中表达并参与具体决策。这两种视角共同澄清了协调、决策制定和轨迹结构之间的结构关系。 我们提出一个统一的动态视角,将协调、过程和结果之间的关系重新考虑,并用它来重新审视人-人、人工智能-人工智能和人工智能-人类之间的合作结构。
更新时间: 2026-03-09 06:46:59
领域: cs.HC,cs.AI
More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models
Large Language Models (LLMs) have revolutionized natural language processing, yet concerns persist regarding their tendency to reflect or amplify social biases. This study introduces a novel evaluation framework to uncover gender biases in LLMs: using free-form storytelling to surface biases embedded within the models. A systematic analysis of ten prominent LLMs shows a consistent pattern of overrepresenting female characters across occupations, likely due to supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Paradoxically, despite this overrepresentation, the occupational gender distributions produced by these LLMs align more closely with human stereotypes than with real-world labor data. This highlights the challenge and importance of implementing balanced mitigation measures to promote fairness and prevent new biases from becoming established. We release the prompts and LLM-generated stories on GitHub.
Updated: 2026-03-09 06:46:28
标题: 更多女性,同样的刻板印象:拆解大型语言模型中的性别偏见悖论
摘要: 大型语言模型(LLMs)已经彻底改变了自然语言处理,但人们仍然担心它们倾向于反映或放大社会偏见。本研究引入了一个新的评估框架,以揭示LLMs中的性别偏见:使用自由形式的叙事来揭示模型内嵌的偏见。对十个知名LLMs的系统分析显示,由于受监督微调(SFT)和来自人类反馈的强化学习(RLHF),这些模型普遍过度代表女性角色在各个职业中,可能是由于这一模式。然而,尽管存在这种过度代表,这些LLMs产生的职业性别分布与人类刻板印象更加接近,而不是与真实世界的劳动数据相符。这突显了实施平衡的缓解措施以促进公平并防止潜在新的偏见建立的挑战和重要性。我们在GitHub上发布了提示和LLM生成的故事。
更新时间: 2026-03-09 06:46:28
领域: cs.CL,cs.AI
LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio
Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.
Updated: 2026-03-09 06:45:43
标题: LongAudio-RAG: 基于事件的多小时长音频问答
摘要: 长时间音频在工业和消费者领域越来越常见,然而审查多小时的录音是不现实的,这促使系统以精确的时间基础和最小的虚构来回答自然语言查询。现有的音频语言模型显示出潜力,但由于上下文长度限制,长音频问答仍然很困难。我们引入了LongAudio-RAG(LA-RAG),这是一个混合框架,将大型语言模型的输出基于检索到的、带时间戳的声学事件检测,而不是原始音频。多小时的流被转换为存储在SQL数据库中的结构化事件记录,在推理时,系统解决自然语言时间参考,分类意图,仅检索相关事件,并使用这些受限证据生成答案。为了评估性能,我们通过连接保留时间戳的录音并为检测、计数和总结任务生成基于模板的问题-答案对,构建了一个合成的长音频基准。最后,我们通过在混合边缘云环境中部署它来展示我们方法的实用性,在这里,音频基础模型在IoT级硬件上运行,而LLM托管在支持GPU的服务器上。这种架构使得边缘能够实现低延迟的事件提取,云端实现高质量的语言推理。实验证明,结构化的、基于事件级别的检索与普通的检索增强生成(RAG)或文本到SQL方法相比,显著提高了准确性。
更新时间: 2026-03-09 06:45:43
领域: eess.AS,cs.AI,cs.LG
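The LongAudio-RAG abstract above describes storing timestamped acoustic events in an SQL database and retrieving only the relevant ones at query time. A minimal sketch of that idea with Python's built-in `sqlite3` module follows. The table schema, column names, event labels, and the `count_events` helper are assumptions for illustration, not the paper's actual design.

```python
import sqlite3

# In-memory database with a hypothetical event table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (label TEXT, start_s REAL, end_s REAL, confidence REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("dog_bark", 120.0, 121.5, 0.90),
        ("door_slam", 3700.0, 3700.4, 0.80),
        ("dog_bark", 3900.0, 3902.0, 0.95),
        ("dog_bark", 7300.0, 7301.0, 0.70),
    ],
)

def count_events(conn, label, window_start_s, window_end_s):
    """Count events of one label that start inside [window_start_s, window_end_s)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM events "
        "WHERE label = ? AND start_s >= ? AND start_s < ?",
        (label, window_start_s, window_end_s),
    ).fetchone()
    return row[0]

# "How many dog barks were there in the second hour?" resolved to seconds.
barks_second_hour = count_events(conn, "dog_bark", 3600.0, 7200.0)
```

In a pipeline like the one described, the language model would receive only the rows matched by such a query rather than the raw audio, which keeps the evidence small and tied to timestamps.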
EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download at https://github.com/apple/ml-egodex.
Updated: 2026-03-09 06:44:42
标题: EgoDex:从大规模自我中心视频中学习灵巧操作
摘要: 用于操作的模仿学习存在一个众所周知的数据稀缺问题。与自然语言和2D计算机视觉不同,没有针对灵巧操作的互联网规模数据语料库。一个吸引人的选择是自我中心的人类视频,这是一个被动扩展的数据源。然而,现有的大规模数据集,如Ego4D,并没有原生手部姿势注释,也没有专注于对象操作。为此,我们使用Apple Vision Pro收集了EgoDex:迄今为止最大最多样化的灵巧人类操作数据集。EgoDex拥有829小时的自我中心视频,配备了3D手部和手指跟踪数据,在录制时收集,多个校准摄像头和设备上的SLAM可用于精确跟踪每只手的每个关节的姿势。该数据集涵盖了各种不同的操作行为,涉及194种不同的桌面任务,从系鞋带到叠衣服。此外,我们在该数据集上训练和系统评估手部轨迹预测的模仿学习策略,引入了用于衡量这一日益重要领域进展的度量和基准。通过发布这一大规模数据集,我们希望推动机器人技术、计算机视觉和基础模型的前沿。EgoDex可通过https://github.com/apple/ml-egodex进行公开下载。
更新时间: 2026-03-09 06:44:42
领域: cs.CV,cs.LG,cs.RO
FedMomentum: Preserving LoRA Training Momentum in Federated Fine-Tuning
Federated fine-tuning of large language models (LLMs) with low-rank adaptation (LoRA) offers a communication-efficient and privacy-preserving solution for task-specific adaptation. Naive aggregation of LoRA modules introduces noise due to mathematical incorrectness when averaging the downsampling and upsampling matrices independently. However, existing noise-free aggregation strategies inevitably compromise the structural expressiveness of LoRA, limiting its ability to retain client-specific adaptations by either improperly reconstructing the low-rank structure or excluding partially trainable components. We identify this problem as loss of training momentum, where LoRA updates fail to accumulate effectively across rounds, resulting in slower convergence and suboptimal performance. To address this, we propose FedMomentum, a novel framework that enables structured and momentum-preserving LoRA aggregation via singular value decomposition (SVD). Specifically, after aggregating low-rank updates in a mathematically correct manner, FedMomentum applies SVD to extract the dominant components that capture the main update directions. These components are used to reconstruct the LoRA modules with the same rank, while residual components can be retained and later merged into the backbone to preserve semantic information and ensure robustness. Extensive experiments across multiple tasks demonstrate that FedMomentum consistently outperforms prior state-of-the-art methods in convergence speed and final accuracy.
Updated: 2026-03-09 06:43:17
标题: FedMomentum:在联邦微调中保持LoRA训练动量
摘要: 联邦化的大型语言模型(LLMs)与低秩适应(LoRA)的联合微调为任务特定的适应提供了一种通信高效且保护隐私的解决方案。LoRA模块的朴素聚合在独立平均下采样和上采样矩阵时会引入噪声,因为数学上的不正确。然而,现有的无噪声聚合策略不可避免地会损害LoRA的结构表达能力,限制其保留客户端特定适应能力,要么不正确地重建低秩结构,要么排除部分可训练组件。我们将这个问题识别为训练动量的损失,其中LoRA更新未能有效地跨回合积累,导致收敛速度较慢且表现不佳。为了解决这个问题,我们提出了FedMomentum,这是一个通过奇异值分解(SVD)实现结构化和动量保持LoRA聚合的新框架。具体来说,FedMomentum在数学上正确地聚合低秩更新之后,应用SVD提取捕捉主要更新方向的主要组件。这些组件用于以相同的秩重建LoRA模块,而剩余组件可以保留并稍后合并到骨干部分,以保留语义信息并确保稳健性。跨多个任务的广泛实验表明,FedMomentum在收敛速度和最终准确性方面始终优于先前的最先进方法。
更新时间: 2026-03-09 06:43:17
领域: cs.LG,cs.AI
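The mathematical incorrectness of naive LoRA aggregation mentioned in the FedMomentum abstract above can be checked on a tiny example. Each client's update is the product B·A of an up-projection B and a down-projection A, and averaging B and A separately does not give the average of the products. The two-client, rank-1 matrices below are made up, and the SVD truncation step that the abstract describes is deliberately omitted.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]

def mean_matrices(mats):
    """Element-wise mean of same-shaped matrices."""
    n = len(mats)
    return [
        [sum(m[i][j] for m in mats) / n for j in range(len(mats[0][0]))]
        for i in range(len(mats[0]))
    ]

# Two clients, each with a rank-1 LoRA update delta_W = B @ A on a 2x2 weight.
B1, A1 = [[1.0], [0.0]], [[1.0, 0.0]]
B2, A2 = [[0.0], [1.0]], [[0.0, 1.0]]

# Exact aggregation: average the full updates.
exact = mean_matrices([matmul(B1, A1), matmul(B2, A2)])

# Naive aggregation: average B and A independently, then multiply.
naive = matmul(mean_matrices([B1, B2]), mean_matrices([A1, A2]))
```

Here `exact` is half the identity matrix, which has rank 2, while `naive` is a constant matrix of rank 1. The exact aggregate therefore no longer fits a rank-1 adapter, which suggests why a step that maps it back to the original rank (SVD in the abstract) is needed.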
From Semantic To Instance: A Semi-Self-Supervised Learning Approach
Instance segmentation is essential for applications such as automated monitoring of plant health, growth, and yield. However, creating large-scale datasets with pixel-level annotations of each object instance for developing instance segmentation models requires extensive effort, which restricts the use of deep learning in these areas. This challenge is more significant in images with densely packed, self-occluded objects, which are common in agriculture. To address this challenge, we propose a semi-self-supervised learning approach that requires minimal manual annotation to develop a high-performing instance segmentation model. We design GLMask, an image-mask representation for the model to focus on shape, texture, and pattern while minimizing its dependence on color features. We develop a pipeline to generate semantic segmentation and then transform it into instance-level segmentation. The proposed approach substantially outperforms the conventional instance segmentation models, establishing a state-of-the-art wheat head instance segmentation model with mAP@50 of 98.5%. Additionally, we assessed the proposed methodology on the general-purpose Microsoft COCO dataset, achieving a significant performance improvement of over 12.6% mAP@50. This highlights that the utility of our proposed approach extends beyond precision agriculture and applies to other domains, specifically those with similar data characteristics.
Updated: 2026-03-09 06:43:09
标题: 从语义到实例:一种半自监督学习方法
摘要: 实例分割对于植物健康、生长和产量的自动监测等应用至关重要。然而,需要大量工作来创建具有每个对象实例的像素级注释的大规模数据集,以开发限制深度学习在这些领域中使用的实例分割模型。在农业中常见的具有密集填充、自遮挡对象的图像中,这一挑战更为重要。为了应对这一挑战,我们提出了一种半自监督学习方法,需要最少的人工注释来开发高性能的实例分割模型。我们设计了GLMask,一种图像-蒙版表示,使模型专注于形状、纹理和模式,同时最大限度地减少其对颜色特征的依赖。我们开发了一个流水线来生成语义分割,然后将其转换为实例级分割。所提出的方法在很大程度上优于传统的实例分割模型,建立了一种最先进的小麦头实例分割模型,其mAP@50为98.5%。此外,我们在通用的Microsoft COCO数据集上评估了所提出的方法,在mAP@50上实现了超过12.6%的显著性能提升。这突显了我们提出的方法的实用性不仅限于精准农业,还适用于其他领域,特别是具有类似数据特征的领域。
更新时间: 2026-03-09 06:43:09
领域: cs.CV,cs.AI,cs.LG
PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents
Current Graphical User Interface (GUI) agents operate primarily under a reactive paradigm: a user must provide an explicit instruction for the agent to execute a task. However, an intelligent AI assistant should be proactive: capable of anticipating user intentions directly from continuous visual inputs, such as mobile or desktop screenshots, and offering timely recommendations without explicit user prompting. Transitioning to this proactive paradigm presents significant challenges. Real-world screen activity is rarely linear; it consists of long-horizon trajectories fraught with noisy browsing, meaningless actions, and multithreaded task-switching. To address this gap, we introduce PIRA-Bench (Proactive Intent Recommendation Agent Benchmark), a novel benchmark for evaluating multimodal large language models (MLLMs) on continuous, weakly-supervised visual inputs. Unlike reactive datasets, PIRA-Bench features complex trajectories with multiple interleaved intents and noisy segments with various user profile contexts, challenging agents to detect actionable events while adapting to user preferences. Furthermore, we propose the PIRF baseline, a memory-aware, state-tracking framework that empowers general MLLMs to manage multiple task threads and handle misleading visual inputs. PIRA-Bench serves as an initial step toward robust and proactive GUI-based personal assistants.
Updated: 2026-03-09 06:41:32
标题: PIRA-Bench:从反应式GUI代理到基于GUI的主动意图推荐代理的转变
摘要: 目前的图形用户界面(GUI)代理主要在一种反应性范式下运作:用户必须提供明确的指令给代理执行任务。然而,智能AI助手应该是主动的,能够直接从连续的视觉输入(如移动或桌面截图)中预测用户意图,并在没有明确用户提示的情况下及时提供建议。转向这种主动范式面临着重大挑战。现实世界的屏幕活动很少是线性的;它包含着充满嘈杂浏览、无意义动作和多线程任务切换的长期轨迹。为了弥补这一差距,我们引入了PIRA-Bench(主动意图推荐代理基准),这是一个用于评估多模态大型语言模型(MLLMs)在连续、弱监督视觉输入上的新型基准。与反应性数据集不同,PIRA-Bench具有复杂的轨迹,其中包含多个交织的意图和具有各种用户配置环境的嘈杂段,挑战代理检测可执行事件同时适应用户偏好。此外,我们提出了PIRF基线,这是一个记忆感知、状态跟踪框架,使通用MLLMs能够管理多个任务线程并处理误导性的视觉输入。PIRA-Bench作为向稳健和主动的基于GUI的个人助手迈出的第一步。
更新时间: 2026-03-09 06:41:32
领域: cs.AI
CeRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a critical "linear ceiling" in complex reasoning tasks: simply increasing the rank yields diminishing returns due to intrinsic linear constraints. We introduce CeRA (Capacity-enhanced Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. On the SlimOrca benchmark, CeRA breaks this linear barrier: at rank 64 (PPL 3.89), it outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency. This advantage generalizes to mathematical reasoning, where CeRA achieves a perplexity of 1.97 on MathInstruct, significantly surpassing LoRA's saturation point of 2.07. Mechanism analysis via Singular Value Decomposition (SVD) confirms that CeRA activates the dormant tail of the singular value spectrum, effectively preventing the rank collapse observed in linear methods.
Updated: 2026-03-09 06:40:14
标题: CeRA: 通过流形扩展打破低秩适应的线性限制
摘要: 低秩适应(LoRA)主导参数高效微调(PEFT)。然而,在复杂推理任务中,它面临关键的“线性天花板”:简单地增加秩会因固有的线性约束而导致收益递减。我们引入CeRA(容量增强秩适应),这是一个权重级并行适配器,注入了SiLU门控和结构丢失以诱导流形扩展。在SlimOrca基准上,CeRA打破了这个线性障碍:在秩64(PPL 3.89)时,它优于秩512(PPL 3.90)的LoRA,展现出卓越的频谱效率。这种优势推广到数学推理,CeRA在MathInstruct上实现了1.97的困惑度,显著超过LoRA的饱和点2.07。通过奇异值分解(SVD)进行机制分析证实,CeRA激活了奇异值谱的潜在尾部,有效地防止了线性方法中观察到的秩崩溃。
更新时间: 2026-03-09 06:40:14
领域: cs.LG,cs.AI,cs.CL
Information Routing in Atomistic Foundation Models: How Task Alignment and Equivariance Shape Linear Disentanglement
What determines whether a molecular property prediction model organizes its representations so that geometric and compositional information can be cleanly separated? We introduce Compositional Probe Decomposition (CPD), which linearly projects out composition signal and measures how much geometric information remains accessible to a Ridge probe. We validate CPD with four independent checks, including a structural isomer benchmark where compositional projections score at chance while geometric residuals reach 94.6\% pairwise classification accuracy. Across ten models from five architectural families on QM9, we find a \emph{linear accessibility gradient}: models differ by $6.6\times$ in geometric information accessible after composition removal ($R^2_{\mathrm{geom}}$ from 0.081 to 0.533 for HOMO-LUMO gap). Three factors explain this gradient. Task alignment dominates: models trained on HOMO-LUMO gap ($R^2_{\mathrm{geom}}$ 0.44--0.53) outscore energy-trained models by $\sim$0.25 $R^2$ regardless of architecture. Within-architecture ablations on two independent architectures confirm this: PaiNN drops from 0.53 to 0.31 when retrained on energy, and MACE drops from 0.44 to 0.08. Data diversity partially compensates for misaligned objectives, with MACE pretrained on MPTraj (0.36) outperforming QM9-only energy models. Inside MACE's representations, information routes by symmetry type: $L{=}1$ (vector) channels preferentially encode dipole moment ($R^2 = 0.59$ vs.\ 0.38 in $L{=}0$), while $L{=}0$ (scalar) channels encode HOMO-LUMO gap ($R^2 = 0.76$ vs.\ 0.34 in $L{=}1$). This pattern is absent in ViSNet. We also show that nonlinear probes produce misleading results on residualized representations, recovering $R^2 = 0.68$--$0.95$ on a purely compositional target, and recommend linear probes for this setting.
Updated: 2026-03-09 06:36:19
标题: 原子基础模型中的信息路由:任务对齐和等变性如何塑造线性解缠。
摘要: 什么决定了分子性质预测模型如何组织其表示,以便几何和组成信息可以被清晰地分离?我们引入了组合探针分解(CPD),该方法线性投影出组成信号,并测量剩余多少几何信息可以被 Ridge 探针访问。我们通过四项独立检查验证了CPD,包括一个结构异构体基准,在该基准中,组成投影得分为偶然,而几何残差达到了94.6%的成对分类准确度。 在 QM9 数据集上的十个模型中,我们发现了一个线性可访问性梯度:在去除组成后,几何信息的可访问性相差6.6倍(HOMO-LUMO 能隙的 $R^2_{\mathrm{geom}}$ 从 0.081 到 0.533)。三个因素解释了这一梯度。任务对齐占主导地位:在HOMO-LUMO 能隙上训练的模型($R^2_{\mathrm{geom}}$ 0.44--0.53)无论架构如何,都比能量训练的模型高约0.25 $R^2$。在两个独立架构上进行的架构内消融实验证实了这一点:在能量重新训练时,PaiNN 从 0.53 降至 0.31,而 MACE 从 0.44 降至 0.08。数据多样性部分补偿了任务不对齐的问题,使用 MPTraj 预训练的 MACE 模型(0.36)胜过仅在 QM9 上训练的能量模型。 在 MACE 的表示中,信息通过对称类型进行路由:$L{=}1$(向量)通道优先编码偶极矩($R^2 = 0.59$ vs. $L{=}0$ 中的 0.38),而 $L{=}0$(标量)通道编码 HOMO-LUMO 能隙($R^2 = 0.76$ vs. $L{=}1$ 中的 0.34)。这种模式在 ViSNet 中不存在。我们还展示了非线性探针在残差表示上产生误导性结果,在纯组成目标上恢复了 $R^2 = 0.68$--$0.95,并推荐在此设置中使用线性探针。
更新时间: 2026-03-09 06:36:19
领域: cs.LG,cs.AI,physics.chem-ph
ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation
Existing aerial Vision-Language Navigation (VLN) methods predominantly adopt a detection-and-planning pipeline, which converts open-vocabulary detections into discrete textual scene graphs. These approaches are plagued by inadequate spatial reasoning capabilities and inherent linguistic ambiguities. To address these bottlenecks, we propose a Visual-Spatial Reasoning (ViSA) enhanced framework for aerial VLN. Specifically, a triple-phase collaborative architecture is designed to leverage structured visual prompting, enabling Vision-Language Models (VLMs) to perform direct reasoning on image planes without the need for additional training or complex intermediate representations. Comprehensive evaluations on the CityNav benchmark demonstrate that the ViSA-enhanced VLN achieves a 70.3\% improvement in success rate compared to the fully trained state-of-the-art (SOTA) method, elucidating its great potential as a backbone for aerial VLN systems.
Updated: 2026-03-09 06:29:17
标题: ViSA增强空中VLN:用于空中视觉语言导航的视觉空间推理增强框架
摘要: 现有的航空视觉语言导航(VLN)方法主要采用检测和规划管道,将开放词汇检测转换为离散的文本场景图。这些方法受到了空间推理能力不足和固有的语言歧义的困扰。为了解决这些瓶颈,我们提出了一个用于航空VLN的Visual-Spatial Reasoning(ViSA)增强框架。具体而言,设计了一个三阶段协作架构,利用结构化的视觉提示,使视觉语言模型(VLMs)能够在图像平面上直接进行推理,而无需额外的训练或复杂的中间表示。在CityNav基准测试中进行的全面评估表明,ViSA增强型VLN的成功率比完全训练的最先进方法提高了70.3%,阐明了其作为航空VLN系统骨干的巨大潜力。
更新时间: 2026-03-09 06:29:17
领域: cs.CV,cs.AI
WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos
Relational databases are often fragmented across organizations, creating data silos that hinder distributed data management and mining. Collaborative learning (CL) -- techniques that enable multiple parties to train models jointly without sharing raw data -- offers a principled approach to this challenge. However, existing CL frameworks (e.g., federated and split learning) remain limited in real-world deployments. Current CL benchmarks and algorithms primarily target the learning step under assumptions of isolated, aligned, and joinable databases, and they typically neglect the end-to-end data management pipeline, especially preprocessing steps such as table joins and data alignment. In contrast, our analysis of the real-world corpus WikiDBs shows that databases are interconnected, unaligned, and sometimes unjoinable, exposing a significant gap between CL algorithm design and practical deployment. To close this evaluation gap, we build WikiDBGraph, a large-scale dataset constructed from 100,000 real-world relational databases linked by 17 million weighted edges. Each node (database) and edge (relationship) is annotated with 13 and 12 properties, respectively, capturing a hybrid of instance- and feature-level overlap across databases. Experiments on WikiDBGraph demonstrate both the effectiveness and limitations of existing CL methods under realistic conditions, highlighting previously overlooked gaps in managing real-world data silos and pointing to concrete directions for practical deployment of collaborative learning systems.
Updated: 2026-03-09 06:16:20
标题: WikiDBGraph:用于在数据库孤岛上进行协作学习的数据管理基准套件
摘要: 关系数据库通常在组织间分散,形成数据孤岛,阻碍了分布式数据管理和挖掘。协作学习(CL)——一种使多方能够联合训练模型而不共享原始数据的技术——为解决这一挑战提供了一个基本方法。然而,现有的CL框架(例如,联邦和分裂学习)在实际部署中仍然存在限制。当前的CL基准和算法主要针对学习步骤,假设数据库是孤立的、对齐的和可连接的,并且通常忽视端到端的数据管理流水线,特别是诸如表连接和数据对齐等预处理步骤。相反,我们对实际语料库WikiDBs的分析显示,数据库是相互连接的、不对齐的,有时是不可连接的,揭示了CL算法设计与实际部署之间的显著差距。为了弥补这一评估差距,我们构建了WikiDBGraph,这是一个由17百万加权边连接的10万个真实关系数据库构成的大规模数据集。每个节点(数据库)和边(关系)分别用13和12个属性进行注释,捕捉了数据库间实例和特征级别的重叠。在WikiDBGraph上的实验展示了现有CL方法在现实条件下的有效性和局限性,突显了在管理实际数据孤岛时先前被忽视的差距,并指出了协作学习系统实际部署的具体方向。
更新时间: 2026-03-09 06:16:20
领域: cs.DB,cs.LG
Amortizing Maximum Inner Product Search with Learned Support Functions
Maximum inner product search (MIPS) is a crucial subroutine in machine learning, requiring the identification of key vectors that align best with a given query. We propose amortized MIPS: a learning-based approach that trains neural networks to directly predict MIPS solutions, amortizing the computational cost of matching queries (drawn from a fixed distribution) to a fixed set of keys. Our key insight is that the MIPS value function, the maximal inner product between a query and keys, is also known as the support function of the set of keys. Support functions are convex and 1-homogeneous, and their gradient w.r.t. the query is exactly the optimal key in the database. We approximate the support function using two complementary approaches: (1) we train an input-convex neural network (SupportNet) to model the support function directly, so that the optimal key can be recovered via (autodiff) gradient computation, and (2) we directly regress the optimal key from the query using a vector-valued network (KeyNet), bypassing gradient computation entirely at inference time. To learn a SupportNet, we combine score regression with gradient matching losses, and propose homogenization wrappers that enforce the positive 1-homogeneity of a neural network, theoretically linking function values to gradients. To train a KeyNet, we introduce a score consistency loss derived from the Euler theorem for homogeneous functions. Our experiments show that a learned SupportNet or KeyNet achieves high match rates and opens up new directions for compressing databases with a specific query distribution in mind.
Updated: 2026-03-09 06:09:20
标题: 使用学习支持函数摊销最大内积搜索
摘要: 最大内积搜索(MIPS)是机器学习中的一个关键子程序,需要识别与给定查询最佳对齐的关键向量。我们提出了摊销MIPS:一种基于学习的方法,训练神经网络直接预测MIPS解决方案,摊销匹配查询(从固定分布中绘制)到一组固定键的计算成本。我们的关键见解是,MIPS值函数,即查询和关键之间的最大内积,也被称为关键集的支持函数。支持函数是凸函数,1-齐次函数,其对查询的梯度恰好是数据库中的最优关键。我们使用两种互补方法近似支持函数:(1)我们训练一个输入凸神经网络(SupportNet)直接建模支持函数;通过(自动微分)梯度计算可以恢复最优关键,(2)我们直接从查询中回归最优关键使用矢量值网络(KeyNet),在推断时完全绕过梯度计算。为了学习SupportNet,我们将得分回归与梯度匹配损失结合起来,并提出了强制神经网络正1-齐次性的同质化包装器,从理论上将函数值与梯度联系起来。为了训练KeyNet,我们引入了从齐次函数的欧拉定理导出的得分一致性损失。我们的实验表明,学习的SupportNet或KeyNet实现了高匹配率,并为在思考特定查询分布的情况下压缩数据库开辟了新方向。
更新时间: 2026-03-09 06:09:20
领域: cs.LG,stat.ML
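The properties of the support function that the abstract above relies on can be checked numerically on a toy key set. The sketch computes the exact support function by brute force, with no learned network involved, and verifies positive 1-homogeneity and the Euler identity h(q) = <q, k*>, where k* is the maximizing key (the gradient wherever the maximizer is unique). The keys, the query, and the function names are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def support_function(query, keys):
    """Return (h(query), best_key), where h(q) = max over keys of <q, k>.

    best_key is the maximizing key, which is also the gradient of h at the
    query whenever the maximizer is unique.
    """
    best_key = max(keys, key=lambda k: dot(query, k))
    return dot(query, best_key), best_key

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
query = [2.0, 3.0]

value, best_key = support_function(query, keys)

# Positive 1-homogeneity: scaling the query by c > 0 scales h by c.
scaled_value, scaled_key = support_function([2.0 * q for q in query], keys)
```

Scaling the query changes the value but not the winning key, so a network only needs to learn the direction-dependent part of the mapping. A homogenization wrapper like the one the abstract mentions would presumably build that property in rather than leave it to be learned.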
SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning
Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experimental results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU-RTEAS/SmartThinker.
Updated: 2026-03-09 06:08:14
标题: 智能思考者:用于高效大型语言模型推理的渐进式思维链长度校准
摘要: 大型推理模型(LRMs)如OpenAI o1和DeepSeek-R1通过采用长链式思维路径(CoT)实现在复杂任务上的高准确性。然而,这些过程固有的冗长经常导致冗余和过度思考。为了解决这个问题,现有的工作利用Group Relative Policy Optimization(GRPO)来减少LRM输出长度,但它们静态长度奖励设计无法根据相对问题难度和响应长度分布动态地进行调整,导致过度压缩和准确性受损。因此,我们提出了SmartThinker,一种基于GRPO的高效推理方法,具有渐进的CoT长度校准。SmartThinker做出了双重贡献:首先,在训练过程中动态估计具有最高准确性的最佳长度,并引导过长的响应朝向该长度,以减少响应长度同时保持准确性。其次,它动态调节长度奖励系数,避免对正确推理路径的不必要惩罚。大量实验结果显示,SmartThinker实现了高达52.5%的平均长度压缩,并在挑战性基准测试如AIME25上实现了高达16.6%的准确性改进。源代码可在https://github.com/SJTU-RTEAS/SmartThinker找到。
更新时间: 2026-03-09 06:08:14
领域: cs.CL,cs.LG
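Illustrative sketch for the SmartThinker entry above: GRPO normalizes rewards within a group of responses sampled for the same prompt, and the abstract describes steering overlong correct responses toward an estimated optimal length. The code below is a minimal sketch of that idea only. The function names, the linear overshoot penalty, and the `target_len` and `coef` parameters are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev

def length_aware_rewards(correct, lengths, target_len, coef=0.5):
    """Hypothetical reward: 1 for a correct answer, minus a penalty that grows
    with how far the response overshoots target_len. Incorrect answers get 0."""
    rewards = []
    for ok, n in zip(correct, lengths):
        if not ok:
            rewards.append(0.0)
            continue
        overshoot = max(0, n - target_len) / max(n, 1)  # fraction in [0, 1)
        rewards.append(1.0 - coef * overshoot)
    return rewards

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of four sampled responses to the same prompt (made-up values).
correct = [True, True, False, True]
lengths = [400, 1200, 900, 600]
rewards = length_aware_rewards(correct, lengths, target_len=600)
advantages = group_relative_advantages(rewards)
```

Because `coef` is below 1 in this sketch, a correct but overlong response still scores above an incorrect one, which mirrors the abstract's goal of avoiding unwarranted penalties on correct reasoning paths.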
Aero-Promptness: Drag-Aware Aerodynamic Manipulability for Propeller-driven Vehicles
This work introduces the Drag-Aware Aerodynamic Manipulability (DAAM), a geometric framework for control allocation in redundant multirotors. By equipping the propeller spin-rate space with a Riemannian metric based on the remaining symmetric acceleration capacity of each motor, the formulation explicitly accounts for motor torque limits and aerodynamic drag. Mapping this metric through the nonlinear thrust law to the generalized force space yields a state-dependent manipulability volume. The log-determinant of this volume acts as a natural barrier function, strictly penalizing drag-induced saturation and low-spin thrust loss. Optimizing this volume along the allocation fibers provides a redundancy resolution strategy inherently invariant to arbitrary coordinate scaling in the generalized-force space. Analytically, we prove that the resulting optimal allocations locally form smooth embedded manifolds, and we geometrically characterize the global jump discontinuities that inevitably arise from physical actuator limits and spin-rate sign transitions.
Updated: 2026-03-09 06:02:59
标题: 空气快感:螺旋桨驱动车辆的拖曳感知空气动力操纵
摘要: 这项工作介绍了Drag-Aware Aerodynamic Manipulability(DAAM),这是一个用于多余多旋翼飞行器控制分配的几何框架。通过利用基于每个电机剩余对称加速容量的黎曼度量来装备螺旋桨旋转速率空间,该公式明确考虑了电机扭矩限制和空气动力学阻力。通过将这个度量映射到非线性推力定律,我们得到了一个状态相关的操纵空间。这个空间的对数行列式作为一个自然的障碍函数,严格惩罚了由阻力引起的饱和和低旋转推力损失。沿着分配纤维优化这个空间提供了一种固有地对于广义力空间中任意坐标缩放不变的冗余解决策略。在理论上,我们证明了得到的最佳分配在局部形成平滑嵌入流形,并且我们从几何角度描述了由物理执行器限制和旋转速率符号转换引起的全局跳跃不连续性。
更新时间: 2026-03-09 06:02:59
领域: cs.RO,cs.AI,eess.SY,math.OC
CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval
Although large language models (LLMs) have been introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM-based VLN lacks the ability to selectively recall and use relevant prior experiences to help navigation tasks, limiting its performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, CMMR-VLN constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieval-augmented generation pipeline to mimic how experienced human navigators leverage prior knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases. Comprehensive tests show average success rate improvements over NavGPT, MapGPT, and DiscussNav of 52.9%, 20.9%, and 20.9% in simulation and 200%, 50%, and 50% in real-world tests, respectively, demonstrating the potential of CMMR-VLN as a backbone VLN framework.
Updated: 2026-03-09 06:02:50
标题: CMMR-VLN:通过持续多模态记忆检索实现视觉和语言导航
摘要: 尽管大型语言模型(LLMs)被引入视觉-语言导航(VLN)以提高指令理解和泛化能力,但现有基于LLM的VLN缺乏选择性地回忆和利用相关先验经验来帮助导航任务,在长时间跨度和陌生场景中的表现受到限制。在这项工作中,我们提出了CMMR-VLN(基于连续多模式记忆检索的VLN),这是一个为LLM代理赋予结构化记忆和反思能力的VLN框架。具体来说,CMMR-VLN构建了一个由全景视觉图像和显著地标索引的多模式体验记忆,用于在导航过程中检索相关经验,引入了一个检索增强生成管道,模拟经验丰富的人类导航员如何利用先验知识,并结合了基于反思的记忆更新策略,选择性地存储完整的成功路径和失败情况中的关键初始错误。全面的测试表明,CMMR-VLN在模拟和实际测试中的平均成功率分别比NavGPT、MapGPT和DiscussNav提高了52.9%、20.9%和20.9%、200%、50%和50%,阐明了CMMR-VLN作为VLN框架的潜力。
更新时间: 2026-03-09 06:02:50
领域: cs.AI
More to Extract: Discovering MEV by Token Contract Analysis
This paper tackles the discovery of tMEV, that is, the Maximal Extractable Value on blockchains that arises from Token smart contracts. This scope differs from the existing MEV-discovery research, which analyzes application-layer contracts or attacker contracts, but ignores the wide and diverse range of token contracts. This paper presents a pipeline of techniques for tMEV discovery, including tSCAN, a static analysis tool for identifying non-standard supply-control functions in token contracts, and tSEARCH, a searcher that uncovers profitable tMEV opportunities by generating, refining, and solving token-specific constraints. By replaying real-world transactions, this paper demonstrates both the profitability of tMEV strategies and existing searchers' unawareness of them: the proposed tSEARCH extracts $10\times$ more profit than observed MEV activity on Ethereum. The practicality of tMEV searching is demonstrated through a prototype built on Slither, showing high effectiveness with low performance overhead.
Updated: 2026-03-09 06:02:15
标题: 更多可提取的价值:通过代币合约分析发现MEV
摘要: 本文讨论了tMEV的发现,即来自代币智能合约的区块链上的最大可提取价值。这一范围不同于现有的MEV发现研究,后者分析应用层合约或攻击者合约,但忽略了广泛和多样化的代币合约。 本文提出了一套用于tMEV发现的技术流程,包括tSCAN,一种用于识别代币合约中非标准供应控制函数的静态分析工具,以及tSEARCH,一种通过生成、细化和解决特定于代币的约束条件来揭示有利可图的tMEV机会的搜索器。 通过重放真实世界的交易,本文展示了tMEV策略的盈利性以及现有搜索器对其的无知:所提出的tSEARCH在以太坊上比观察到的MEV活动多提取了10倍的利润。通过在Slither上构建的原型展示了tMEV搜索的实用性,表现出高效性和低性能开销。
更新时间: 2026-03-09 06:02:15
领域: cs.CR
MJ1: Multimodal Judgment via Grounded Verification
Multimodal judges struggle to ground decisions in visual evidence. We present MJ1, a multimodal judge trained with reinforcement learning that enforces visual grounding through a structured grounded verification chain (observations $\rightarrow$ claims $\rightarrow$ verification $\rightarrow$ evaluation $\rightarrow$ scoring) and a counterfactual consistency reward that penalizes position bias. Even without training, our mechanism improves base-model accuracy on MMRB2 by +3.8 points on Image Editing and +1.7 on Multimodal Reasoning. After training, MJ1, with only 3B active parameters, achieves 77.0% accuracy on MMRB2 and surpasses orders-of-magnitude larger models like Gemini-3-Pro. These results show that grounded verification and consistency-based training substantially improve multimodal judgment without increasing model scale.
Updated: 2026-03-09 05:55:48
标题: MJ1:通过基础验证进行多模态判断
摘要: 多模式法官在视觉证据上做出决策时存在困难。我们提出了MJ1,一个经过强化学习训练的多模式法官,通过结构化的基于视觉证据的验证链(观察$\rightarrow$主张$\rightarrow$验证$\rightarrow$评估$\rightarrow$评分)和一种惩罚位置偏见的反事实一致性奖励来强化视觉证据。即使在没有训练的情况下,我们的机制也将在图像编辑上将基础模型的准确性提高了+3.8点,在多模式推理上提高了+1.7点。经过训练后,仅有3B个活跃参数的MJ1在MMRB2上达到了77.0%的准确率,并超过了像Gemini-3-Pro这样数量级更大的模型。这些结果表明,基于视觉证据的验证和基于一致性的训练极大地提高了多模式判断的准确性,而不需要增加模型规模。
更新时间: 2026-03-09 05:55:48
领域: cs.LG
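Illustrative sketch for the MJ1 entry above: one common way to check a pairwise judge for position bias is to present the same two responses in both orders and see whether the preferred response stays the same. The code below sketches a consistency term of that kind. The `judge` callable, its "A"/"B" return convention, and the `penalty` value are hypothetical and are not the paper's actual reward definition.

```python
def consistency_reward(judge, prompt, resp_1, resp_2, penalty=1.0):
    """Return 0.0 if the judge prefers the same underlying response in both
    presentation orders, and -penalty if its preference follows position."""
    first = judge(prompt, resp_1, resp_2)    # "A" refers to resp_1 here
    swapped = judge(prompt, resp_2, resp_1)  # "A" refers to resp_2 here
    pick_original = resp_1 if first == "A" else resp_2
    pick_swapped = resp_2 if swapped == "A" else resp_1
    return 0.0 if pick_original == pick_swapped else -penalty

def position_biased_judge(prompt, a, b):
    """Toy judge that always prefers whatever is shown first."""
    return "A"

def length_judge(prompt, a, b):
    """Toy content-based judge that prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"

biased_score = consistency_reward(position_biased_judge, "q", "x", "yy")
stable_score = consistency_reward(length_judge, "q", "x", "yy")
```

The position-biased toy judge is penalized because its choice flips when the order flips, while the content-based toy judge is not.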
ACE-GF-based Attestation Relay for PQC - Lightweight Mempool Propagation Without On-Path Proofs
In post-quantum blockchain settings, objects that require validity proofs (e.g., blob roots, execution-layer or consensus-layer signature aggregates) must be broadcast through mempool and relay networks. Recursive STARKs have been proposed to aggregate such proofs so that each node forwards one proof per tick plus objects without proofs, capping per-node proof bandwidth at roughly 128 KB per tick. We observe that propagation does not inherently require validity proofs on the path: only a lightweight assurance that an object is eligible for relay. We present AR-ACE (ACE-GF-based Attestation Relay for PQC), in which relay nodes forward objects plus compact attestations (e.g., identity-bound signatures or commitments) and do not generate, hold, or forward any full validity proof. Only the builder (or final verifier) performs a single aggregated validity proof over the set of objects it includes. This proof-off-path design removes proof overhead from the propagation path entirely, yielding an order-of-magnitude reduction in proof-related relay bandwidth relative to proof-carrying propagation. When instantiated with ACE-GF-derived attestation keys, AR-ACE preserves a unified identity story with on-chain authorization and is PQC-ready. We specify a protocol model, state design goals and security considerations, define security games, and provide a structural bandwidth comparison with recursive-STARK-based propagation.
Updated: 2026-03-09 05:41:31
标题: 基于ACE-GF的PQC认证中继 - 无需路径证明的轻量级内存池传播
摘要: 在后量子区块链设置中,需要有效性证明的对象(例如,blob根、执行层或共识层签名聚合)必须通过内存池和中继网络进行广播。已经提出了递归STARKs,以聚合这些证明,使得每个节点在每个滴答中转发一个证明加上没有证明的对象,将每个节点的证明带宽限制在每个滴答大约128 KB。我们观察到传播并不本质上需要路径上的有效性证明-只需要一个轻量级的保证,表明一个对象适合中继。我们提出了AR-ACE(基于ACE-GF的后量子认证中继),在其中中继节点转发对象以及紧凑的证明(例如,身份绑定签名或承诺),不生成、持有或转发任何完整的有效性证明。只有构建者(或最终验证者)对其包含的对象集执行单个聚合的有效性证明。这种路径外的证明设计完全消除了传播路径上的证明开销,相对于携带证明的传播,使得证明相关的中继带宽降低了一个数量级。当使用ACE-GF派生的认证密钥实例化时,AR-ACE保留了一个具有链上授权的统一身份故事,并且准备好应对后量子密码学。我们指定了一个协议模型,阐明了设计目标和安全考虑,定义了安全博弈,并提供了与基于递归STARK的传播的结构带宽比较。
更新时间: 2026-03-09 05:41:31
领域: cs.CR,cs.DC
The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLM CoTs
Chain-of-Thought (CoT) monitoring has emerged as a compelling method for detecting harmful behaviors such as reward hacking for reasoning models, under the assumption that models' reasoning processes are informative of such behaviors. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies. A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors, but what happens to the model's reasoning process when these instructions conflict with learned behaviors? We investigate this question in simple settings and find that models engage in systematic motivated reasoning -- generating plausible-sounding justifications for violating their instructions while downplaying potential harms or contradictions. Concerningly, we find that as motivated reasoning becomes more prevalent over the course of training, an 8B-parameter CoT monitor is increasingly fooled by the motivated reasoning, being persuaded to judge the answer as following the constitution, despite correctly identifying the answer as contradicting the constitution when not provided with the model's reasoning trace. While we find that large frontier reasoning models closely track human ability in detecting motivated reasoning, this should not give us too much solace, as frontier model developers rely on smaller models for monitoring due to their low latency and deployment costs. Our results underscore the necessity for further research into the emergence and detection of motivated reasoning in model evaluation and oversight. Code for this paper is available at https://github.com/nikihowe/motivated-reasoning. WARNING: some examples in this paper may be upsetting.
Updated: 2026-03-09 05:37:15
标题: 结尾证明思想:RL引发的在LLM CoTs中的动机推理
摘要: 思维链监测(CoT)已经成为一种有效的方法,用于检测推理模型的有害行为,例如奖励篡改,假设模型的推理过程可以提供有关这些行为的信息。在实践中,LLM训练通常会由于不完善的奖励信号而产生意外行为,导致模型发展出不一致的倾向。一种常见的纠正方法是应用事后指令来避免问题行为,但当这些指令与学习的行为发生冲突时,模型的推理过程会发生什么?我们在简单的设置中对这个问题进行了调查,发现模型会进行系统性的动机推理,为违反指令提供听起来合理的理由,同时淡化潜在的危害或矛盾。令人担忧的是,我们发现随着训练过程中动机推理变得更加普遍,一个包含8B参数的CoT监视器将越来越容易受到动机推理的欺骗,被说服认为答案符合宪法,尽管在没有提供模型推理轨迹的情况下,正确地识别答案与宪法相矛盾。虽然我们发现大型前沿推理模型在检测动机推理方面密切追踪人类能力,但这并不应让我们感到太安慰,因为前沿模型开发者依赖较小的模型进行监测,因为它们具有较低的延迟和部署成本。我们的结果强调了对模型评估和监督中动机推理的出现和检测的进一步研究的必要性。本文的代码可在https://github.com/nikihowe/motivated-reasoning找到。警告:本文中的一些示例可能令人不快。
更新时间: 2026-03-09 05:37:15
领域: cs.LG,cs.AI
Aurora: Towards Universal Generative Multimodal Time Series Forecasting
Cross-domain generalization is very important in Time Series Forecasting because similar historical information may lead to distinct future trends due to the domain-specific characteristics. Recent works focus on building unimodal time series foundation models and end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities like texts, the former lacks the explicit utilization of them, thus hindering the performance. The latter is tailored for end-to-end scenarios and does not support zero-shot inference for cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model, which supports multimodal inputs and zero-shot inference. Pretrained on Cross-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in corresponding text or image modalities, thus possessing strong cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora can extract multimodal domain knowledge as guidance and then utilizes a Modality-Guided Multi-head Self-Attention to inject them into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on 5 well-recognized benchmarks, including TimeMMD, TSFM-Bench, ProbTS, TFB, and EPF, demonstrate the consistent state-of-the-art performance of Aurora on both unimodal and multimodal scenarios.
Updated: 2026-03-09 05:33:53
标题: 极光:朝向通用生成多模态时间序列预测
摘要: 跨领域泛化在时间序列预测中非常重要,因为类似的历史信息可能由于领域特定特征导致不同的未来趋势。最近的研究着重于构建单模态时间序列基础模型和端到端的多模态监督模型。由于领域特定知识通常包含在文本等模态中,前者缺乏对其明确利用,从而影响性能。后者专门针对端到端情景设计,并不支持跨领域情景的零样本推理。在这项研究中,我们介绍了Aurora,一种多模态时间序列基础模型,支持多模态输入和零样本推理。在跨领域多模态时间序列语料库上预训练的Aurora能够自适应地提取并聚焦于对应文本或图像模态中包含的关键领域知识,因此具有强大的跨领域泛化能力。通过标记化、编码和提炼,Aurora可以提取多模态领域知识作为指导,然后利用模态引导的多头自注意力将其注入到时间表示建模中。在解码阶段,多模态表示用于生成未来令牌的条件和原型,为生成概率预测提供了一种新颖的原型引导流匹配方法。在包括TimeMMD、TSFM-Bench、ProbTS、TFB和EPF在内的5个知名基准测试上进行的综合实验表明,Aurora在单模态和多模态情景下均表现出一致的最新技术水平。
更新时间: 2026-03-09 05:33:53
领域: cs.LG
\$OneMillion-Bench: How Far are Language Agents from Human Experts?
As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce \$OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making constrained decisions, where correctness depends as much on the reasoning process as on the final answer. We adopt a rubric-based evaluation protocol scoring factual accuracy, logical coherence, practical feasibility, and professional compliance, focused on expert-level problems to ensure meaningful differentiation across agents. Together, \$OneMillion-Bench provides a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain-intensive scenarios.
Updated: 2026-03-09 05:32:42
标题: 一百万美元悬赏:语言代理与人类专家相差多远?
摘要: 随着语言模型(LMs)从聊天助手发展为能够进行多步推理和工具使用的长视程代理,现有的基准测试仍然主要局限于结构化或考试风格的任务,无法满足真实世界专业需求。为此,我们引入了\$ OneMillion-Bench,这是一个涵盖法律、金融、工业、医疗保健和自然科学领域的400个专家策划任务的基准测试,旨在评估代理在经济上具有重要影响的场景中的表现。与以往的工作不同,该基准测试要求检索权威来源,解决冲突证据,应用领域特定规则,并做出约束决策,其中正确性取决于推理过程和最终答案一样重要。我们采用基于评分标准的评估协议,评分包括事实准确性、逻辑连贯性、实际可行性和专业合规性,重点关注专家级问题,以确保在代理之间实现有意义的差异化。总的来说,\$ OneMillion-Bench为评估代理的可靠性、专业深度和领域密集场景中的实际准备提供了一个统一的测试平台。
更新时间: 2026-03-09 05:32:42
领域: cs.LG,cs.AI,cs.CL
Emergence is Overrated: AGI as an Archipelago of Experts
Krakauer, Krakauer, and Mitchell (2025) distinguish between emergent capabilities and emergent intelligence, arguing that true intelligence requires efficient coarse-grained representations enabling diverse problem-solving through analogy and minimal modification. They contend that intelligence means doing "more with less" through compression and generalization, contrasting this with "vast assemblages of diverse calculators" that merely accumulate specialized capabilities. This paper examines whether their framework accurately characterizes human intelligence and its implications for conceptualizing artificial general intelligence. Drawing on empirical evidence from cognitive science, I demonstrate that human expertise operates primarily through domain-specific pattern accumulation rather than elegant compression. Expert performance appears flexible not through unifying principles but through vast repertoires of specialized responses. Creative breakthroughs themselves may emerge through evolutionary processes of blind variation and selective retention rather than principled analogical reasoning. These findings suggest reconceptualizing AGI as an "archipelago of experts": isolated islands of specialized competence without unifying principles or shared representations. If we accept human expertise with its characteristic brittleness as genuine intelligence, then consistency demands recognizing that artificial systems comprising millions of specialized modules could constitute general intelligence despite lacking KKM's emergent intelligence.
Updated: 2026-03-09 05:28:16
标题: 突现被高估:AGI作为专家群岛
摘要: Krakauer、Krakauer和Mitchell(2025)区分了新兴能力和新兴智能,认为真正的智能需要高效的粗粒度表示,通过类比和最小修改实现多样化问题解决。他们认为智能意味着通过压缩和概括实现“以少做更多”,并将此与仅积累专业能力的“庞大的多样计算器集合”进行对比。本文检验了他们的框架是否准确描述了人类智能及其对概念化人工通用智能的影响。借鉴认知科学的经验证据,我展示了人类专业知识主要通过特定领域模式积累而非优雅压缩来运作。专家表现似乎灵活性不是通过统一原则而是通过大量的专门响应来实现。创造性突破本身可能是通过盲目变异和选择性保留的进化过程而非原则性类比推理来实现的。这些发现表明,应将AGI重新构想为“专家群岛”:孤立的专业能力岛屿,没有统一原则或共享表示。如果我们接受具有特征脆弱性的人类专业知识是真正的智能,那么一致性要求承认,尽管缺乏KKM的新兴智能,但由数百万专门模块组成的人工系统可能构成通用智能。
更新时间: 2026-03-09 05:28:16
领域: cs.CL,cs.AI
OSExpert: Computer-Use Agents Learning Professional Skills via Exploration
General-purpose computer-use agents have shown impressive performance across diverse digital environments. However, our new benchmark, OSExpert-Eval, indicates they remain far less helpful than human experts. Although inference-time scaling enables adaptation, these agents complete complex tasks inefficiently with degraded performance, transfer poorly to unseen UIs, and struggle with fine-grained action sequences. To solve the problem, we introduce a GUI-based depth-first search (GUI-DFS) exploration algorithm to comprehensively explore and verify an environment's unit functions. The agent then exploits compositionality between unit skills to self-construct a curriculum for composite tasks. To support fine-grained actions, we curate a database of action primitives for agents to discover during exploration; these are saved as a skill set once the exploration is complete. We use the learned skills to improve the agent's performance and efficiency by (1) enriching agents with ready-to-use procedural knowledge, allowing them to plan only once for long trajectories and generate accurate actions, and (2) enabling them to end inference-time scaling earlier by realizing their boundary of capabilities. Extensive experiments show that our environment-learned agent takes a meaningful step toward expert-level computer use, achieving a performance gain of around 20 percent on OSExpert-Eval and closing the efficiency gap to humans by around 80 percent.
Updated: 2026-03-09 05:27:56
标题: OSExpert:通过探索学习专业技能的计算机使用代理
摘要: 通用计算机使用代理在各种数字环境中表现出色。然而,我们的新基准测试OSExpert-Eval表明,它们仍然远不及人类专家的帮助水平。尽管推理时间缩放能够实现适应性,但这些代理在完成复杂任务时效率低下,性能下降,对未知的用户界面转移能力差,以及处理细粒度动作序列时遇到困难。为了解决这个问题,我们引入了基于图形用户界面的深度优先搜索(GUI-DFS)探索算法,全面探索和验证环境的单元功能。代理然后利用单元技能之间的组合性自行构建复合任务的课程。为了支持细粒度动作,我们策划了一个动作原语数据库,供代理在探索过程中发现;一旦探索完成,这些将保存为一个技能集。我们利用学到的技能来提高代理的性能和效率,方法是(1)为代理提供现成的程序化知识,使它们只需计划一次即可实现长路径并生成准确动作,以及(2)通过认识到自身能力的边界,使其尽早结束推理时间缩放。广泛的实验表明,我们基于环境学习的代理迈出了朝着专家级计算机使用的有意义的一步,在OSExpert-Eval上实现了约20%的性能增益,并将效率差距缩小了约80%,接近人类水平。
更新时间: 2026-03-09 05:27:56
领域: cs.AI
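Illustrative sketch for the OSExpert entry above: the abstract describes a depth-first exploration of an interface that records what each unit function does. The code below shows plain depth-first search over a toy UI modeled as a graph, keeping the first action sequence found for each reachable screen. The graph, the screen and action names, and the `gui_dfs` function are hypothetical, and the paper's GUI-DFS algorithm may differ substantially.

```python
def gui_dfs(ui_graph, state, path=(), found=None):
    """Depth-first exploration of a UI graph.

    ui_graph maps a screen name to {action_name: next_screen}. Returns a dict
    from each reachable screen to the first action sequence that reached it.
    """
    if found is None:
        found = {}
    found[state] = list(path)
    for action, nxt in ui_graph.get(state, {}).items():
        if nxt not in found:  # skip screens that were already explored
            gui_dfs(ui_graph, nxt, path + (action,), found)
    return found

# Toy application with four screens (made-up names).
ui = {
    "home": {"open_menu": "menu", "open_settings": "settings"},
    "menu": {"click_export": "export_dialog", "back": "home"},
    "settings": {"back": "home"},
    "export_dialog": {},
}
skills = gui_dfs(ui, "home")
```

Each recorded action sequence can be read as a reusable unit skill, for example the two-step path that opens the export dialog.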
Scaling Machine Learning Interatomic Potentials with Mixtures of Experts
Machine Learning Interatomic Potentials (MLIPs) enable accurate large-scale atomistic simulations, yet improving their expressive capacity efficiently remains challenging. Here we systematically develop Mixture-of-Experts (MoE) and Mixture-of-Linear-Experts (MoLE) architectures for MLIPs and analyze the effects of routing strategies and expert designs. We show that sparse activation combined with shared experts yields substantial performance gains, and that nonlinear MoE formulations outperform MoLE when shared experts are present, underscoring the importance of nonlinear expert specialization. Furthermore, element-wise routing consistently surpasses configuration-level routing, while global MoE routing often leads to numerical instability. The resulting element-wise MoE model achieves state-of-the-art accuracy across the OMol25, OMat24, and OC20M benchmarks. Analysis of routing patterns reveals chemically interpretable expert specialization aligned with periodic-table trends, indicating that the model effectively captures element-specific chemical characteristics for precise interatomic modeling.
Updated: 2026-03-09 05:27:38
标题: 用混合专家技术扩展机器学习原子间势。
摘要: 机器学习原子间势(Machine Learning Interatomic Potentials, MLIPs)使得精确的大规模原子模拟成为可能,然而有效地提高其表达能力仍然具有挑战性。在这里,我们系统地开发了混合专家(Mixture-of-Experts, MoE)和混合线性专家(Mixture-of-Linear-Experts, MoLE)架构用于MLIPs,并分析了路由策略和专家设计的影响。我们展示出稀疏激活结合共享专家能够显著提升性能,非线性MoE公式在存在共享专家时优于MoLE,强调了非线性专家专业化的重要性。此外,元素级路由始终优于配置级路由,而全局MoE路由通常会导致数值不稳定。由此产生的元素级MoE模型在OMol25、OMat24和OC20M基准测试中实现了最先进的准确性。路由模式分析显示出与周期表趋势一致的可解释化学专业化,表明该模型有效地捕捉了元素特定的化学特性,用于精确的原子间建模。
更新时间: 2026-03-09 05:27:38
领域: physics.chem-ph,cs.LG,physics.comp-ph
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for large reasoning models to tackle complex tasks. However, the current RLVR paradigm is still not efficient enough, as it works in a trial-and-error manner. To perform better, the model needs to explore the reward space by generating numerous responses and learning from fragmented reward signals, blind to the overall reward patterns. Fortunately, verifiable rewards make a natural language description of the reward function possible, and LLMs have demonstrated strong in-context learning ability. This motivates us to explore whether large reasoning models can benefit from a motivation for the task, i.e., awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning (MeRF), an intuitive yet effective method that enhances reinforcement finetuning of LLMs by "telling LLMs the rules of the game". Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation that makes the model aware of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization and thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations demonstrate that MeRF achieves substantial performance gains over the RLVR baseline. Moreover, ablation studies show that MeRF performs better with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement finetuning.
Updated: 2026-03-09 05:26:38
标题: 一个简单的“动机”可以增强对大型推理模型的强化微调
摘要: 使用可验证奖励的强化学习(RLVR)已经成为一个强大的学习推理范式,用于大型推理模型处理复杂任务。然而,当前的RLVR范式仍然不够高效,因为它是以试错的方式工作的。为了表现更好,模型需要通过大量生成响应来探索奖励空间,并从碎片化的奖励信号中学习,对整体奖励模式一无所知。幸运的是,可验证奖励使奖励函数的自然语言描述成为可能,同时,LLMs已经展示了在上下文中学习的强大能力。这激励我们探索大型推理模型是否可以从任务的动机中受益,即在强化微调过程中意识到奖励函数,就像我们人类有时学习时所做的那样。在本文中,我们介绍了Motivation-enhanced Reinforcement Finetuning(MeRF),这是一种直观而有效的方法,通过涉及“告诉LLMs游戏规则”来增强LLMs的强化微调。具体来说,MeRF直接将奖励规范注入到提示中,这作为一个上下文动机,让模型意识到优化目标。这个简单的修改利用了LLMs的上下文学习能力,将生成与优化对齐,从而激励模型通过内在动机和外部奖励生成期望的输出。实证评估表明,MeRF相对于RLVR基线取得了显著的性能增益。此外,消融研究表明,当内在动机与外部奖励函数之间的一致性更大时,MeRF表现得更好,同时,模型还表现出通过强化微调适应误导性动机的能力。
更新时间: 2026-03-09 05:26:38
领域: cs.CL,cs.AI
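Illustrative sketch for the MeRF entry above: the abstract says the reward specification is injected directly into the prompt so the model is aware of the optimization objective. The code below shows one simple way such a prompt could be assembled. The `build_merf_prompt` function, the prompt wording, and the example rules and point values are hypothetical and are not taken from the paper.

```python
def build_merf_prompt(question, reward_spec):
    """Prepend a plain-language description of the reward rules to the question.

    reward_spec is a list of (rule_description, points) pairs.
    """
    lines = ["Scoring rules for this task:"]
    for rule, points in reward_spec:
        lines.append(f"- {rule}: {points:+g} points")  # "+g" shows an explicit sign
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

# Made-up reward specification for a verifiable reasoning task.
reward_spec = [
    ("Final answer is correct", 2),
    ("Output follows the required format", 1),
    ("Final answer is wrong", -1.5),
]
prompt = build_merf_prompt("Who is the knight?", reward_spec)
```

The same `reward_spec` would also drive the external reward computation during finetuning, so the in-context description and the actual reward stay consistent.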
ZK-ACE: Identity-Centric Zero-Knowledge Authorization for Post-Quantum Blockchain Systems
Post-quantum signature schemes introduce kilobyte-scale authorization artifacts when applied directly to blockchain transaction validation. A widely considered mitigation is to verify post-quantum signatures inside zero-knowledge circuits and publish only succinct proofs on-chain. However, this approach preserves the signature-centric authorization model, merely relocating the verification cost, and embeds expensive high-dimensional lattice arithmetic into prover circuits. We present ZK-ACE (Zero-Knowledge Authorization for Cryptographic Entities), an authorization layer that replaces transaction-carried signature objects entirely with identity-bound zero-knowledge authorization statements. Rather than proving the correctness of a specific post-quantum signature, the prover demonstrates in zero knowledge that a transaction is authorized by an identity consistent with an on-chain commitment and bound replay state. The construction assumes a deterministic identity derivation primitive (DIDP) as a black box and uses a compact identity commitment as the primary on-chain identity anchor, supplemented by per-transaction replay-prevention state. We formalize ZK-ACE with explicit game-based security definitions for authorization soundness, replay resistance, substitution resistance, and cross-domain separation. We present a complete circuit constraint specification, define two replay-prevention models, and provide reduction-based security proofs under standard assumptions (knowledge soundness, collision resistance, and DIDP identity-root recovery hardness). A structural, protocol-level data accounting demonstrates an order-of-magnitude reduction in consensus-visible authorization data relative to direct post-quantum signature deployment. The design supports batch aggregation and recursive proof composition, and is compatible with account-abstraction and rollup-based deployment architectures.
Updated: 2026-03-09 05:21:44
标题: ZK-ACE:面向身份中心的后量子区块链系统的零知识授权
摘要: 后量子签名方案在直接应用于区块链交易验证时引入了千字节级别的授权工件。一个广泛考虑的缓解方法是在零知识电路内验证后量子签名,仅在链上发布简洁的证明。然而,这种方法保留了以签名为中心的授权模型,仅仅是将验证成本转移,并将昂贵的高维格子算术嵌入到证明者电路中。我们提出了ZK-ACE(加密实体的零知识授权),这是一个授权层,完全用与身份绑定的零知识授权声明替换了交易携带的签名对象。证明者不是证明特定后量子签名的正确性,而是在零知识中证明一个交易是由一个与链上承诺和绑定重放状态一致的身份授权的。该构造假设确定性身份推导原语(DIDP)作为黑盒子,并使用紧凑的身份承诺作为主要的链上身份锚点,辅以每笔交易的重放预防状态。我们为授权的声音、重放抵抗、替代抵抗和跨域分离提供了明确的基于游戏的安全定义。我们提供了完整的电路约束规范,定义了两种重放预防模型,并在标准假设下(知识声音、碰撞抵抗和DIDP身份根恢复难度)提供了基于约简的安全证明。在结构上,协议级的数据会计表明,相对于直接后量子签名部署,共识可见的授权数据减少了一个数量级。该设计支持批量聚合和递归证明组成,并与账户抽象和基于Rollup的部署架构兼容。
更新时间: 2026-03-09 05:21:44
领域: cs.CR,cs.DC
VORL-EXPLORE: A Hybrid Learning Planning Approach to Multi-Robot Exploration in Dynamic Environments
Hierarchical multi-robot exploration commonly decouples frontier allocation from local navigation, which can make the system brittle in dense and dynamic environments. Because the allocator lacks direct awareness of execution difficulty, robots may cluster at bottlenecks, trigger oscillatory replanning, and generate redundant coverage. We propose VORL-EXPLORE, a hybrid learning and planning framework that addresses this limitation through execution fidelity, a shared estimate of local navigability that couples task allocation with motion execution. This fidelity signal is incorporated into a fidelity-coupled Voronoi objective with inter-robot repulsion to reduce contention before it emerges. It also drives a risk-aware adaptive arbitration mechanism between global A* guidance and a reactive reinforcement learning policy, balancing long-range efficiency with safe interaction in confined spaces. The framework further supports online self-supervised recalibration of the fidelity model using pseudo-labels derived from recent progress and safety outcomes, enabling adaptation to non-stationary obstacles without manual risk tuning. We evaluate this capability separately in a dedicated severe-traffic ablation. Extensive experiments in randomized grids and a Gazebo factory scenario show high success rates, shorter path length, lower overlap, and robust collision avoidance. The source code will be made publicly available upon acceptance.
Updated: 2026-03-09 05:20:33
标题: VORL-EXPLORE:一种混合学习规划方法,用于动态环境下的多机器人探索
摘要: 分层多机器人探索通常将前沿分配与本地导航分离开来,这可能使系统在密集和动态环境中变得脆弱。由于分配器缺乏对执行困难的直接认识,机器人可能会聚集在瓶颈处,触发振荡式重新规划,并产生冗余覆盖。我们提出了VORL-EXPLORE,这是一个混合学习和规划框架,通过执行忠实度来解决这一局限性,这是一个本地可导航性的共享估计,将任务分配与运动执行联系起来。这种忠实度信号被纳入一个与忠实度耦合的Voronoi目标,具有机器人间的排斥作用,以在争用出现之前减少。它还驱动一个风险感知的自适应仲裁机制,在全局A*引导和反应式强化学习策略之间实现平衡,以在受限空间中保持长距离效率和安全互动。该框架进一步支持使用来自最近进展和安全结果的伪标签进行在线自监督重新校准忠实度模型,从而实现对非静态障碍的自适应,无需手动风险调整。我们在专门的严重交通消融中单独评估了这种能力。在随机网格和Gazebo工厂场景中进行了大量实验,显示出较高的成功率、更短的路径长度、更低的重叠度和稳健的碰撞回避。在接受后,源代码将公开发布。
更新时间: 2026-03-09 05:20:33
领域: cs.RO,cs.AI
Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning
While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain ''closed-world'' systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human--agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
Updated: 2026-03-09 05:18:07
标题: 与人类的适应性协作:元认知策略优化用于具有持续学习的多智能体LLMs
摘要: 尽管扩展个体大型语言模型(LLMs)取得了显著进展,但下一个前沿在于通过多智能体系统(MAS)扩展协作。然而,纯自主的MAS仍然是“封闭世界”系统,受到预训练模型静态知识范围的限制。这种限制使它们在需要超出训练数据的知识的任务上变得脆弱,往往在面对新挑战时导致集体失败。为了解决这个问题,我们提出了人机协作多智能体协作(HILA)框架,这是一种人-智能体协作的原则性范式。HILA训练智能体学习一个元认知策略,该策略规定了何时独立解决问题以及何时推迟到人类专家。为了实现这一策略,我们引入了双循环策略优化,将即时决策与长期能力增长分离开来。内循环应用组相对策略优化(GRPO)与成本感知奖励来优化推迟决策,而外循环实施持续学习,将专家反馈转化为高质量的监督信号,加强智能体的推理能力。在具有挑战性的数学和问题解决基准测试中的实验表明,配备双循环策略优化的HILA始终优于先进的MAS,为协作和不断改进的智能系统奠定了原则性基础。
更新时间: 2026-03-09 05:18:07
领域: cs.AI
EROICA: Online Performance Troubleshooting for Large-scale Model Training
Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present EROICA, the first online troubleshooting system that provides both fine-grained observation based on profiling, and coverage of all machines in GPU clusters, to diagnose performance issues in production, including both hardware and software problems (or the mixture of both). EROICA effectively summarizes runtime behavior patterns of LMT function executions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. EROICA has been deployed as a production service for large-scale GPU clusters of ~100,000 GPUs for 1.5 years. It has diagnosed a variety of difficult performance issues with 97.5% success.
Updated: 2026-03-09 05:14:48
标题: EROICA:大规模模型训练的在线性能故障排除
摘要: 大型模型训练(LMT)的性能问题排查是非常具有挑战性的,这是由于现代GPU集群规模的前所未有、软硬件交互的复杂性以及训练过程的数据密集性所致。现有设计用于传统分布式系统或数据中心网络的排查方法存在不足之处,几乎无法应用于真实世界的训练系统。本文介绍了EROICA,这是第一个在线排查系统,它提供基于性能分析的细粒度观察,并覆盖GPU集群中所有机器,用于诊断生产中的性能问题,包括硬件问题、软件问题或二者的混合。EROICA通过在线性能分析有效地总结了LMT功能执行的运行时行为模式,并利用差分可观察性来定位根本原因,最小化对生产的影响。EROICA已经作为一个生产服务部署在拥有约10万个GPU的大型GPU集群中1.5年。它已成功诊断了各种困难的性能问题,成功率为97.5%。
更新时间: 2026-03-09 05:14:48
领域: cs.DC,cs.LG,cs.OS
Advancing Automated Algorithm Design via Evolutionary Stagewise Design with LLMs
With the rapid advancement of science and technology, problems in industrial scenarios are becoming increasingly complex, posing significant challenges to traditional algorithm design. Automated algorithm design with LLMs emerges as a promising solution, but the currently adopted black-box modeling deprives LLMs of any awareness of the intrinsic mechanism of the target problem, leading to hallucinated designs. In this paper, we introduce Evolutionary Stagewise Algorithm Design (EvoStage), a novel evolutionary paradigm that bridges the gap between the rigorous demands of industrial-scale algorithm design and LLM-based algorithm design methods. Drawing inspiration from CoT, EvoStage decomposes the algorithm design process into sequential, manageable stages and integrates real-time intermediate feedback to iteratively refine algorithm design directions. To further reduce the algorithm design space and avoid falling into local optima, we introduce a multi-agent system and a "global-local perspective" mechanism. We apply EvoStage to the design of two types of common optimizers: designing parameter configuration schedules of the Adam optimizer for chip placement, and designing acquisition functions of Bayesian optimization for black-box optimization. Experimental results across open-source benchmarks demonstrate that EvoStage outperforms human-expert designs and existing LLM-based methods within only a couple of evolution steps, even achieving state-of-the-art half-perimeter wire-length results on every tested chip case. Furthermore, when deployed on a commercial-grade 3D chip placement tool, EvoStage significantly surpasses the original performance metrics, achieving record-breaking efficiency. We hope EvoStage can significantly advance automated algorithm design in the real world, helping elevate human productivity.
Updated: 2026-03-09 05:13:44
标题: 通过带有LLMs的进化分阶段设计推进自动化算法设计
摘要: 随着人类科学技术的快速发展,工业场景中的问题变得越来越具有挑战性,给传统算法设计带来了重大挑战。基于LLMs的自动算法设计被认为是一种有前途的解决方案,但目前采用的黑盒建模剥夺了LLMs对目标问题内在机制的任何认识,导致产生虚构设计。在本文中,我们介绍了进化阶段算法设计(EvoStage),这是一种新颖的进化范式,弥合了工业规模算法设计的严格要求与基于LLM的算法设计方法之间的差距。受CoT的启发,EvoStage将算法设计过程分解为顺序、可管理的阶段,并集成了实时的中间反馈,以迭代地完善算法设计方向。为了进一步减少算法设计空间并避免陷入局部最优解,我们引入了一个多智能体系统和“全局-局部视角”机制。我们将EvoStage应用于两种常见优化器的设计:为芯片布局设计Adam优化器的参数配置计划,以及为黑盒优化设计贝叶斯优化的获取函数。跨开源基准的实验结果表明,EvoStage在仅经过几个进化步骤的情况下就优于人类专家设计和现有基于LLM的方法,甚至在每个测试芯片案例上都实现了历史上最先进的半周长线长结果。此外,当部署在商业级3D芯片布局工具上时,EvoStage显著超越了原始性能指标,实现了创纪录的效率。我们希望EvoStage能够在现实世界中显著推动自动算法设计,帮助提高人类生产力。
更新时间: 2026-03-09 05:13:44
领域: cs.AI
Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach
We study conditional generation in diffusion models under hard constraints, where generated samples must satisfy prescribed events with probability one. Such constraints arise naturally in safety-critical applications and in rare-event simulation, where soft or reward-based guidance methods offer no guarantee of constraint satisfaction. Building on a probabilistic interpretation of diffusion models, we develop a principled conditional diffusion guidance framework based on Doob's h-transform, martingale representation and quadratic variation process. Specifically, the resulting guided dynamics augment a pretrained diffusion with an explicit drift correction involving the logarithmic gradient of a conditioning function, without modifying the pretrained score network. Leveraging martingale and quadratic-variation identities, we propose two novel off-policy learning algorithms based on a martingale loss and a martingale-covariation loss to estimate h and its gradient using only trajectories from the pretrained model. We provide non-asymptotic guarantees for the resulting conditional sampler in both total variation and Wasserstein distances, explicitly characterizing the impact of score approximation and guidance estimation errors. Numerical experiments demonstrate the effectiveness of the proposed methods in enforcing hard constraints and generating rare-event samples. The code of the numerical experiments can be found at https://github.com/ZhengyiGuo2002/CDG_Finance.
Updated: 2026-03-09 05:09:50
标题: 硬约束条件下的条件扩散引导:一种随机分析方法
摘要: 我们研究了在强约束条件下扩散模型中的条件生成,其中生成的样本必须以概率为一满足预先规定的事件。这种约束在安全关键应用和稀有事件模拟中自然产生,软约束或基于奖励的指导方法无法保证约束的满足。基于扩散模型的概率解释,我们开发了一个基于杜布h-变换、鞍点表示和二次变差过程的原则性条件扩散引导框架。具体来说,生成的引导动态通过一个涉及条件函数的对数梯度的显式漂移校正来增强预训练扩散,而不修改预训练的评分网络。利用鞍点和二次变差身份,我们提出了两种基于鞍点损失和鞍点协变损失的新型离策略学习算法,仅使用来自预训练模型的轨迹来估计h及其梯度。我们为结果条件采样器在总变差和Wasserstein距离中提供了非渐近保证,明确表征了评分逼近和引导估计误差的影响。数值实验表明所提方法在强制实施强约束和生成稀有事件样本方面的有效性。数值实验的代码可以在https://github.com/ZhengyiGuo2002/CDG_Finance找到。
更新时间: 2026-03-09 05:09:50
领域: cs.AI
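To illustrate the drift correction described in the Conditional Diffusion Guidance abstract, here is a minimal sketch of Doob's h-transform in the simplest closed-form case: a standard Brownian motion conditioned to end at a fixed point with probability one. In this toy case the guidance term, the gradient of log h, is known analytically as (target - x) / (T - t), so nothing has to be learned. The function name and parameters are illustrative assumptions, and this is not the paper's martingale-loss estimator.

```python
import math
import random


def guided_brownian_path(target, horizon=1.0, steps=1000, seed=0):
    """Euler-Maruyama simulation of Brownian motion conditioned on X_T = target.

    The unconditioned process has zero drift and unit diffusion. Doob's
    h-transform adds the drift grad log h(t, x) = (target - x) / (T - t),
    which turns the process into a Brownian bridge.
    """
    rng = random.Random(seed)
    dt = horizon / steps
    x = 0.0
    path = [x]
    for k in range(steps):
        t = k * dt
        drift = (target - x) / (horizon - t)  # guidance term from the h-transform
        x += drift * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


path = guided_brownian_path(target=2.0)
# The end point lands within discretization noise of the target.
print(abs(path[-1] - 2.0) < 0.2)
```

The base dynamics are left untouched and only an additive drift is introduced, which mirrors the abstract's point that the pretrained score network is not modified.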
Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data
Offline reinforcement learning (RL) aims to find optimal policies in dynamic environments in order to maximize the expected total rewards by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods focus on learning an optimal policy for all individuals with pre-collected data from a single episode or homogeneous batch episodes, and thus, may result in a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables us to efficiently estimate the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate on the average regret under a weak partial coverage assumption on behavior policies. In addition, our simulation studies and a real data application demonstrate the superior numerical performance of the proposed method compared with existing methods.
Updated: 2026-03-09 05:07:47
标题: 从异构数据中学习个体最优策略的强化学习
摘要: 离线强化学习(RL)旨在通过利用预先收集的数据,在动态环境中找到最优策略,以最大化预期总奖励。从异构数据中学习是离线RL中的一个基本挑战。传统方法侧重于利用来自单个剧集或同质批次剧集的预先收集数据学习所有个体的最优策略,因此可能导致异质人群的次优策略。在本文中,我们提出了一种用于异质时间静态马尔可夫决策过程(MDP)的个性化离线策略优化框架。提出的具有个体潜变量的异质模型使我们能够有效地估计个体Q函数,我们的Penalized Pessimistic Personalized Policy Learning(P4L)算法在行为策略上的弱部分覆盖假设下保证了平均遗憾率的快速速率。此外,我们的模拟研究和实际数据应用展示了所提出方法相对于现有方法的优越数值性能。
更新时间: 2026-03-09 05:07:47
领域: stat.ML,cs.LG
Local Constrained Bayesian Optimization
Bayesian optimization (BO) for high-dimensional constrained problems remains a significant challenge due to the curse of dimensionality. We propose Local Constrained Bayesian Optimization (LCBO), a novel framework tailored for such settings. Unlike trust-region methods that are prone to premature shrinking when confronting tight or complex constraints, LCBO leverages the differentiable landscape of constraint-penalized surrogates to alternate between rapid local descent and uncertainty-driven exploration. Theoretically, we prove that LCBO achieves a convergence rate for the Karush-Kuhn-Tucker (KKT) residual that depends polynomially on the dimension $d$ for common kernels under mild assumptions, offering a rigorous alternative to global BO where regret bounds typically scale exponentially. Extensive evaluations on high-dimensional benchmarks (up to 100D) demonstrate that LCBO consistently outperforms state-of-the-art baselines.
Updated: 2026-03-09 05:05:22
标题: 本地约束贝叶斯优化
摘要: Bayesian optimization (BO)用于高维受限问题仍然是一个重要挑战,因为维度诅咒存在。我们提出了Local Constrained Bayesian Optimization(LCBO),这是一个专门针对这种情境的新框架。LCBO不同于信任域方法,当遇到紧密或复杂的约束时容易过早收缩,它利用了约束惩罚代理的可微景观,交替进行快速局部下降和基于不确定性的探索。在理论上,我们证明了在常见核函数下,LCBO在温和假设下实现了依赖于维度$d$的Karush-Kuhn-Tucker(KKT)残差的收敛速率,为全局BO提供了一个严格的替代方案,其中遗憾界通常呈指数增长。对高维基准测试(高达100D)的广泛评估表明,LCBO始终优于最先进的基线方法。
更新时间: 2026-03-09 05:05:22
领域: stat.ML,cs.LG
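The "rapid local descent" on a constraint-penalized objective mentioned in the LCBO abstract can be sketched on a toy problem. The example below runs plain gradient descent on a known objective plus a quadratic penalty; in LCBO the objective and constraint would be surrogate models, and the uncertainty-driven exploration step is omitted. The toy functions, the penalty weight `rho`, and the step size are all assumptions chosen for illustration.

```python
def penalized_descent(start, rho=100.0, lr=0.002, steps=5000):
    """Gradient descent on f(x) + rho * max(0, g(x))**2.

    Toy problem (an assumption for illustration):
        f(x) = (x0 - 2)^2 + (x1 - 2)^2
        g(x) = x0 + x1 - 2 <= 0
    The constrained optimum is (1, 1).
    """
    x0, x1 = start
    for _ in range(steps):
        violation = max(0.0, x0 + x1 - 2.0)
        # Gradient of the objective plus gradient of the quadratic penalty.
        grad0 = 2.0 * (x0 - 2.0) + 2.0 * rho * violation
        grad1 = 2.0 * (x1 - 2.0) + 2.0 * rho * violation
        x0 -= lr * grad0
        x1 -= lr * grad1
    return x0, x1


x0, x1 = penalized_descent((3.0, 0.0))
print(round(x0, 3), round(x1, 3))
```

With a finite penalty weight the iterate settles slightly outside the feasible region, at 202/201 in each coordinate for `rho=100`. A residual constraint violation of this kind is what a KKT-residual measure would pick up.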
Improving Visual Object Tracking through Visual Prompting
Learning a discriminative model that distinguishes the specified target from surrounding distractors across frames is essential for generic object tracking (GOT). Dynamic adaptation of target representation against distractors remains challenging because prevailing trackers exhibit limited discriminative capability. To address this issue, we present a new visual prompting mechanism for generic object tracking, termed PiVOT. PiVOT introduces mechanisms that leverage the pretrained foundation model (CLIP) to automatically generate and refine visual prompts online, thereby enabling the tracker to suppress distractors through contrastive guidance. To transfer contrastive knowledge from the foundation model to the tracker, PiVOT automatically propagates this knowledge online and dynamically generates and updates visual prompts. Specifically, it proposes a prompt initialization mechanism that produces an initial visual prompt highlighting potential target locations. The foundation model is then used to refine the prompt based on appearance similarities between candidate objects and reference templates across potential targets. After refinement, the visual prompt better highlights potential target locations and reduces irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate instance-aware feature maps guided by the visual prompts, which are incrementally and automatically updated during tracking, thereby effectively suppressing distractors. Extensive experiments across multiple benchmarks indicate that PiVOT, with the proposed prompting mechanism, can suppress distracting objects and improve tracking performance.
Updated: 2026-03-09 05:01:03
标题: 通过视觉提示改善视觉目标跟踪
摘要: 学习一个能够区分指定目标和周围干扰物的辨别模型对于通用目标跟踪(GOT)至关重要。针对干扰物动态调整目标表示仍然具有挑战性,因为现有的跟踪器表现出有限的辨别能力。为了解决这个问题,我们提出了一种新的通用目标跟踪的视觉提示机制,称为PiVOT。PiVOT引入了一些机制,利用预训练的基础模型(CLIP)来自动生成和精炼视觉提示,从而使跟踪器通过对比引导来抑制干扰物。为了将基础模型的对比知识传递给跟踪器,PiVOT自动在线传播这些知识,并动态生成和更新视觉提示。具体来说,它提出了一个提示初始化机制,生成一个突出潜在目标位置的初始视觉提示。然后,基础模型根据候选对象和潜在目标之间的外观相似性来优化提示。优化后,视觉提示更好地突出潜在目标位置,并减少无关的提示信息。通过提出的提示机制,跟踪器可以生成由视觉提示引导的实例感知特征图,在跟踪过程中逐步自动更新,从而有效抑制干扰物。在多个基准测试中进行的大量实验表明,PiVOT的提出的提示机制可以抑制干扰物并提高跟踪性能。
更新时间: 2026-03-09 05:01:03
领域: cs.CV,cs.AI,cs.MM,eess.IV
RadDiff: Retrieval-Augmented Denoising Diffusion for Protein Inverse Folding
Protein inverse folding, the design of an amino acid sequence based on a target protein structure, is a fundamental problem in computational protein engineering. Existing methods either generate sequences without leveraging external knowledge or rely on protein language models (PLMs). The former omits the knowledge stored in natural protein data, while the latter is parameter-inefficient and adapts poorly to ever-growing protein data. To overcome these drawbacks, we propose a novel method, retrieval-augmented denoising diffusion (RadDiff), for protein inverse folding. In RadDiff, a novel retrieval-augmentation mechanism is designed to capture up-to-date protein knowledge. We further design a knowledge-aware diffusion model that integrates this protein knowledge into the diffusion process via a lightweight module. Experimental results on the CATH, TS50, and PDB2022 datasets show that RadDiff consistently outperforms existing methods, improving sequence recovery rate by up to 19%. Experimental results also demonstrate that RadDiff generates highly foldable sequences and scales effectively with database size.
Updated: 2026-03-09 04:52:28
标题: RadDiff:检索增强去噪扩散用于蛋白质逆折叠
摘要: 蛋白质逆向折叠,即基于目标蛋白质结构设计氨基酸序列,是计算蛋白工程的一个基本问题。现有方法要么生成不利用外部知识或依赖蛋白质语言模型(PLMs)的序列。前者忽略了存储在天然蛋白数据中的知识,而后者在参数效率和适应不断增长的蛋白数据方面缺乏灵活性。为了克服上述缺点,本文提出了一种新颖的方法,称为检索增强去噪扩散(RadDiff),用于蛋白质逆向折叠。在RadDiff中,设计了一种新颖的检索增强机制来捕捉最新的蛋白质知识。我们进一步设计了一个知识感知扩散模型,通过一个轻量级模块将这些蛋白质知识整合到扩散过程中。对CATH、TS50和PDB2022数据集的实验结果显示,RadDiff始终优于现有方法,将序列恢复率提高多达19\%。实验结果还表明,RadDiff生成的序列具有高度可折叠性,并且随着数据库规模的增加,其扩展效果明显。
更新时间: 2026-03-09 04:52:28
领域: q-bio.QM,cs.AI
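The RadDiff abstract reports gains in sequence recovery rate. As commonly defined, this is the fraction of positions at which the designed sequence matches the native one. The sketch below assumes that definition and equal-length, aligned sequences; the example sequences are made up for illustration.

```python
def sequence_recovery(designed, native):
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must be aligned and of equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Two of the eight positions differ, so recovery is 6/8.
print(sequence_recovery("MKTAYIAK", "MKSAYLAK"))  # 0.75
```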
Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned
The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execute builds, and deploy environments, CLI-based agents offer unprecedented autonomy for long-horizon development tasks. In this paper, we present OPENDEV, an open-source, command-line coding agent engineered specifically for this new paradigm. Effective autonomous assistance requires strict safety controls and highly efficient context management to prevent context bloat and reasoning degradation. OPENDEV overcomes these challenges through a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction that progressively reduces older observations. Furthermore, it employs an automated memory system to accumulate project-specific knowledge across sessions and counteracts instruction fade-out through event-driven system reminders. By enforcing explicit reasoning phases and prioritizing context efficiency, OPENDEV provides a secure, extensible foundation for terminal-first AI assistance, offering a blueprint for robust autonomous software engineering.
Updated: 2026-03-09 04:47:29
标题: 为终端构建有效的AI编码代理:脚手架、工具、上下文工程和经验教训
摘要: AI编码辅助的风景正在从复杂的IDE插件向多才多艺的终端本地代理发生根本性转变。CLI(命令行界面)代理直接在开发人员管理源代码控制、执行构建和部署环境的地方运行,为长期开发任务提供前所未有的自主性。在本文中,我们介绍了OPENDEV,一个专门为这一新范式设计的开源命令行编码代理。有效的自主辅助需要严格的安全控制和高效的上下文管理,以防止上下文膨胀和推理退化。OPENDEV通过复合AI系统架构克服了这些挑战,该架构具有工作负载专门化模型路由、将规划与执行分离的双代理架构、惰性工具发现和逐渐减少较旧观察的自适应上下文压缩。此外,它采用自动化记忆系统,在会话间积累项目特定知识,并通过事件驱动系统提醒来对抗指令消失。通过强制明确的推理阶段并优先考虑上下文效率,OPENDEV为终端优先的AI辅助提供了一个安全、可扩展的基础,为稳健的自主软件工程提供了蓝图。
更新时间: 2026-03-09 04:47:29
领域: cs.AI
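The OPENDEV abstract mentions adaptive context compaction that progressively reduces older observations. One way such a policy could look is sketched below: recent observations are kept verbatim, and older ones are truncated to a character budget that shrinks with age. The function, its parameters, and the truncation rule are hypothetical and are not taken from OPENDEV.

```python
def compact_context(observations, keep_recent=2, base_budget=80):
    """Truncate older observations more aggressively than newer ones.

    `observations` is ordered oldest first. The newest `keep_recent` entries
    are kept unchanged; each older entry gets a budget of
    base_budget // (age - keep_recent + 1) characters, with a floor of 10.
    """
    compacted = []
    total = len(observations)
    for index, text in enumerate(observations):
        age = total - 1 - index  # 0 means the newest observation
        if age < keep_recent:
            compacted.append(text)
            continue
        budget = max(10, base_budget // (age - keep_recent + 1))
        if len(text) > budget:
            compacted.append(text[:budget] + " [truncated]")
        else:
            compacted.append(text)
    return compacted


history = ["x" * 200 for _ in range(5)]
print([len(entry) for entry in compact_context(history)])  # [38, 52, 92, 200, 200]
```

A real agent would more plausibly summarize than truncate. The sketch only shows the age-dependent budget.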
PSTNet: Physically-Structured Turbulence Network
Reliable real-time estimation of atmospheric turbulence intensity remains an open challenge for aircraft operating across diverse altitude bands, particularly over oceanic, polar, and data-sparse regions that lack operational nowcasting infrastructure. Classical spectral models encode climatological averages rather than the instantaneous atmospheric state, and generic ML regressors offer adaptivity but provide no guarantee that predictions respect fundamental scaling laws. This paper introduces the Physically-Structured Turbulence Network (PSTNet), a lightweight architecture that embeds physics directly into its structure. PSTNet couples four components: (i) a zero-parameter backbone derived from Monin-Obukhov theory, (ii) a regime-gated mixture of specialist sub-networks supervised by Richardson-number-derived soft targets, (iii) Feature-wise Linear Modulation layers conditioning hidden representations on local air-density ratio, and (iv) a Kolmogorov output layer enforcing inertial-subrange scaling as an architectural constraint. The entire model contains only 552 learnable parameters, requiring fewer than 2.5 kB of storage and executing in under 12s on a Cortex-M7 microcontroller. We validate PSTNet on 340 paired six-degree-of-freedom guidance simulations spanning three vehicle classes (Mach 2.8, 4.5, and 8.0) and six operational categories with real-time satellite weather ingestion. PSTNet achieves a mean miss-distance improvement of +2.8% with a 78% win rate and a statistically significant effect size. Our results demonstrate that encoding domain physics as architectural priors yields a more efficient and interpretable path to turbulence estimation accuracy than scaling model capacity, establishing PSTNet as a viable drop-in replacement for legacy look-up tables in resource-constrained, safety-critical on-board guidance systems.
Updated: 2026-03-09 04:46:46
标题: PSTNet:物理结构湍流网络
摘要: 可靠的实时大气湍流强度估计仍然是飞机在不同高度带上操作的一个挑战,特别是在缺乏实时预报基础设施的海洋、极地和数据稀疏地区。传统的谱模型编码的是气候平均值,而不是瞬时的大气状态,通用的机器学习回归器提供了适应性,但不能保证预测遵守基本的比例定律。本文介绍了物理结构湍流网络(PSTNet),这是一个轻量级结构,直接将物理嵌入其结构中。PSTNet包括四个组件:(i)源自莫宁-奥布霍夫理论的零参数骨干,(ii)由理查逊数导出的软目标监督的专家子网络的制度门控混合,(iii)基于局部空气密度比的特征线性调制层调节隐藏表示,和(iv)强制惯性亚范围缩放的科尔莫哥罗夫输出层作为架构约束。整个模型仅包含552个可学习参数,存储需求少于2.5 kB,在Cortex-M7微控制器上执行时间不到12秒。我们验证了PSTNet在340对六自由度引导模拟中的表现,涵盖了三种车辆类别(马赫2.8、4.5和8.0)和六种操作类别,同时实时接收卫星天气数据。PSTNet在平均失误距离改善方面取得了+2.8%的效果,胜率为78%,效果显著。我们的结果表明,将领域物理编码为架构先验可以比扩大模型容量更有效地提高湍流估计的准确性,将PSTNet建立为资源受限、安全关键的机载引导系统中传统查找表的可替代方案。
更新时间: 2026-03-09 04:46:46
领域: cs.LG,cs.AI
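Two details of the PSTNet abstract are easy to check with a short sketch: the storage figure and the Feature-wise Linear Modulation (FiLM) operation. The storage check assumes 32-bit floats, which the abstract does not state. The FiLM function takes the scale and shift vectors as plain inputs, whereas in PSTNet they would be produced from the local air-density ratio.

```python
def parameter_bytes(n_params, bytes_per_param=4):
    """Storage for n_params parameters, assuming float32 by default."""
    return n_params * bytes_per_param


def film(hidden, gamma, beta):
    """FiLM: scale and shift each hidden feature, h' = gamma * h + beta."""
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]


# 552 parameters at 4 bytes each is 2208 bytes, below the quoted 2.5 kB.
print(parameter_bytes(552))  # 2208

# gamma and beta are fixed here; a conditioning network would normally emit them.
print(film([1.0, -2.0, 0.5], gamma=[2.0, 0.5, 0.0], beta=[0.0, 0.0, 1.0]))  # [2.0, -1.0, 1.0]
```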
RL unknotter, hard unknots and unknotting number
We develop a reinforcement learning pipeline for simplifying knot diagrams. A trained agent learns move proposals and a value heuristic for navigating Reidemeister moves. The pipeline applies to arbitrary knots and links; we test it on "very hard" unknot diagrams and, using diagram inflation, on $4_1\#9_{10}$ where we recover the recently established and surprising upper bound of three for the unknotting number.
Updated: 2026-03-09 04:43:59
标题: RL解结器,难解结和解结数
摘要: 我们开发了一个用于简化结环图的强化学习流程。一个经过训练的代理学习了移动建议和一个值启发式方法,用于导航Reidemeister移动。该流程适用于任意结环和链接;我们在“非常困难”的解结图和使用图形膨胀的$4_1\#9_{10}$上进行测试,其中我们恢复了最近建立的对解结数的惊人上限为三。
更新时间: 2026-03-09 04:43:59
领域: math.GT,cs.LG,stat.ML
UniCast: A Unified Framework for Instance-Conditioned Multimodal Time-Series Forecasting
Time series forecasting underpins applications in finance, healthcare, and environmental monitoring. Despite the success of Time Series Foundation Models (TSFMs), existing approaches operate in a unimodal setting and rely on static prompts or fixed fusion schemes, limiting their ability to exploit multimodal context and adapt to instance-level variation. We propose UniCast, a parameter-efficient multimodal framework that extends TSFMs through instance-conditioned prompting and dynamic modality routing. UniCast infers a conditional prompt from time series, vision, and text inputs via a Transformer-based contextual distiller, enabling input-specific adaptation without updating the forecasting backbone. To regulate how auxiliary modalities influence predictions, UniCast employs Modality Routing, a cross-attention mechanism that estimates modality relevance given the current temporal state and selectively amplifies informative signals while suppressing noise. Integrated with a frozen TSFM via soft prompt tuning, UniCast preserves foundation-level generalization while enabling effective multimodal control. Extensive experiments across diverse forecasting benchmarks show that UniCast consistently outperforms all existing TSFM baselines, demonstrating that instance-conditioned multimodal control is critical for next-generation time series forecasting.
Updated: 2026-03-09 04:33:40
标题: UniCast:一个统一框架用于实例条件多模态时间序列预测
摘要: 时间序列预测支撑着金融、医疗保健和环境监测等应用。尽管时间序列基础模型(TSFMs)取得了成功,但现有方法在单模态环境中运作,并依赖于静态提示或固定融合方案,限制了它们利用多模态上下文和适应实例级变化的能力。我们提出了UniCast,这是一个参数高效的多模态框架,通过实例条件提示和动态模态路由扩展了TSFMs。UniCast通过基于Transformer的上下文提取器从时间序列、视觉和文本输入中推断出一个条件提示,实现了针对输入的特定适应性,而无需更新预测的基础。为了调节辅助模态如何影响预测,UniCast采用模态路由,这是一个交叉注意力机制,根据当前时间状态估计模态相关性,并选择性地放大信息信号,同时抑制噪音。通过软提示调节与冻结的TSFM集成,UniCast保留了基础级别的泛化能力,同时实现了有效的多模态控制。在各种预测基准测试中进行的大量实验证明,UniCast始终优于所有现有的TSFM基线,表明实例条件多模态控制对于下一代时间序列预测至关重要。
更新时间: 2026-03-09 04:33:40
领域: cs.AI
Lattice: A Post-Quantum Settlement Layer
We present Lattice (L, ticker: LAT), a peer-to-peer electronic cash system designed as a post-quantum settlement layer for the era of quantum computing. Lattice combines three independent defense vectors: hardware resilience through RandomX CPU-only proof-of-work, network resilience through LWMA-1 per-block difficulty adjustment (mitigating the Flash Hash Rate vulnerability that affects fixed-interval retarget protocols), and cryptographic resilience through ML-DSA-44 post-quantum digital signatures (NIST FIPS 204, lattice-based), enforced exclusively from the genesis block with no classical signature fallback. The protocol uses a brief warm-up period of 5,670 fast blocks (53-second target, 25 LAT reduced reward) for network bootstrap, then transitions permanently to 240-second blocks, following a 295,000-block halving schedule with a perpetual tail emission floor of 0.15 LAT per block. Block weight capacity grows in stages (11M to 28M to 56M) as the network matures. The smallest unit of LAT is the shor, named after Peter Shor, where 1 LAT = 10^8 shors.
Updated: 2026-03-09 04:30:04
标题: Lattice: 一个后量子结算层
摘要: 我们介绍了Lattice(简称LAT),这是一个点对点的电子现金系统,旨在作为量子计算时代的后量子结算层。Lattice结合了三个独立的防御向量:通过RandomX CPU-only工作量证明实现硬件抗击性、通过LWMA-1每个区块难度调整实现网络抗击性(缓解了影响固定间隔重新定位协议的Flash Hash Rate漏洞),以及通过ML-DSA-44后量子数字签名(NIST FIPS 204,基于格)实现加密抗击性,仅从创世区块开始强制执行,没有经典签名回退。该协议在网络引导期间使用了5,670个快速区块(53秒目标,25 LAT减少奖励),然后永久过渡到240秒区块,按照295,000个区块减半计划进行,每个区块的永续尾部发行量底线为0.15 LAT。随着网络成熟,区块权重容量逐步增长(11M至28M至56M)。LAT的最小单位是shor,以Peter Shor的名字命名,其中1 LAT = 10^8 shors。
更新时间: 2026-03-09 04:30:04
领域: quant-ph,cs.CR
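The schedule parameters quoted in the Lattice abstract can be turned into nominal durations and base-unit amounts with a few lines of arithmetic. The constants below are copied from the abstract and the variable names are illustrative. Durations assume every block arrives exactly at its target time, which a real network only approximates. The full block reward after warm-up is not stated in the abstract, so it is not modelled.

```python
WARMUP_BLOCKS = 5_670
WARMUP_TARGET_SECONDS = 53
STEADY_TARGET_SECONDS = 240
HALVING_INTERVAL_BLOCKS = 295_000
SHORS_PER_LAT = 10**8
SECONDS_PER_DAY = 86_400

# Nominal length of the warm-up period and of one halving interval.
warmup_seconds = WARMUP_BLOCKS * WARMUP_TARGET_SECONDS
halving_seconds = HALVING_INTERVAL_BLOCKS * STEADY_TARGET_SECONDS

# Amounts in the smallest unit, using integer arithmetic to avoid float rounding.
warmup_reward_shors = 25 * SHORS_PER_LAT        # 25 LAT
tail_emission_shors = 15 * SHORS_PER_LAT // 100  # 0.15 LAT

print(warmup_seconds, round(warmup_seconds / SECONDS_PER_DAY, 2))    # 300510 3.48
print(halving_seconds, round(halving_seconds / SECONDS_PER_DAY, 1))  # 70800000 819.4
print(warmup_reward_shors, tail_emission_shors)                      # 2500000000 15000000
```

Under these targets the warm-up lasts about three and a half days and each halving interval about 2.2 years.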
ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework
Human mobility generation aims to synthesize plausible trajectory data, which is widely used in urban system research. While Large Language Model-based methods excel at generating routine trajectories, they struggle to capture deviated mobility during large-scale societal events. This limitation stems from two critical gaps: (1) the absence of event-annotated mobility datasets for design and evaluation, and (2) the inability of current frameworks to reconcile the competition between users' habitual patterns and event-imposed constraints when making trajectory decisions. This work addresses these gaps with a twofold contribution. First, we construct the first event-annotated mobility dataset covering three major events: Typhoon Hagibis, COVID-19, and the Tokyo 2021 Olympics. Second, we propose ELLMob, a self-aligned LLM framework that first extracts competing rationales between habitual patterns and event constraints, based on Fuzzy-Trace Theory, and then iteratively aligns them to generate trajectories that are both habitually grounded and event-responsive. Extensive experiments show that ELLMob outperforms state-of-the-art baselines across all events, demonstrating its effectiveness. Our code and datasets are available at https://github.com/deepkashiwa20/ELLMob.
Updated: 2026-03-09 04:28:03
标题: ELLMob:基于自对齐LLM框架的事件驱动的人类移动生成
摘要: 人类移动生成旨在合成可信的轨迹数据,广泛应用于城市系统研究。虽然基于大型语言模型的方法擅长生成日常轨迹,但在大规模社会事件期间难以捕捉偏离的移动。这一限制源于两个关键差距:(1)缺乏用于设计和评估的事件注释移动数据集,以及(2)当前框架无法在制定轨迹决策时协调用户习惯模式和事件施加的约束之间的竞争。本研究通过双重贡献解决了这些差距。首先,我们构建了涵盖三个重大事件的首个事件注释移动数据集:台风哈吉比斯、COVID-19和东京2021年奥运会。其次,我们提出了ELLMob,这是一个自我对齐的LLM框架,首先基于模糊痕迹理论提取习惯模式和事件约束之间的竞争理由,然后迭代地将它们对齐,生成既具有习惯基础又对事件响应的轨迹。广泛的实验表明,ELLMob在所有事件中均胜过最先进的基线,展示了其有效性。我们的代码和数据集可在https://github.com/deepkashiwa20/ELLMob 上获得。
更新时间: 2026-03-09 04:28:03
领域: cs.LG,cs.AI
Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models
Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we introduce TableEG, a framework that leverages large language models (LLMs) to generate authentic errors. By employing a table fine-tuning strategy and a triplet representation $(I, T, O)$ to model error generation, detection, and correction tasks, TableEG captures the complex dependencies inherent in two-dimensional tables. Trained on 12 real-world datasets spanning 10 diverse domains, TableEG ensures that the synthesized errors faithfully reflect authentic error distributions. Experimental results indicate that errors generated by TableEG exhibit superior pattern and distribution similarity compared to both rule-based methods and LLM-generated errors without fine-tuning. Furthermore, performance metrics on TableEG-generated errors closely align with those on real-world errors across nearly all datasets and detection algorithms, particularly for machine learning based detection techniques. Overall, TableEG not only bridges the gap between synthetic and real-world errors but also establishes a robust benchmark for subsequent error detection and correction tasks.
Updated: 2026-03-09 04:16:39
标题: 走向数据清理技术的实用基准测试:通过大型语言模型生成真实错误
摘要: 数据质量在数据驱动系统中仍然是一个重要挑战,因为表格数据中的错误严重影响下游分析和机器学习的性能。虽然已经提出了许多错误检测算法,但缺乏多样化的真实世界错误数据集限制了全面评估。手动错误标注既耗时又不一致,促使探索合成错误生成作为一种替代方法。在这项工作中,我们介绍了TableEG,这是一个利用大型语言模型(LLMs)生成真实错误的框架。通过采用表格微调策略和三元表示$(I, T, O)$来建模错误生成、检测和纠正任务,TableEG捕捉了二维表格中固有的复杂依赖关系。在跨越10个不同领域的12个真实世界数据集上进行训练,TableEG确保合成的错误忠实地反映了真实错误分布。实验结果表明,TableEG生成的错误与基于规则的方法和没有微调的LLM生成的错误相比,展现出更优越的模式和分布相似性。此外,TableEG生成的错误的性能指标与几乎所有数据集和检测算法上的真实世界错误的性能指标密切匹配,特别是对于基于机器学习的检测技术。总体而言,TableEG不仅填补了合成和真实世界错误之间的差距,还为后续的错误检测和纠正任务建立了一个强大的基准。
更新时间: 2026-03-09 04:16:39
领域: cs.DB,cs.LG
Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization
Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the optimization process. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model's evolving batch-wise states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM's learning feedback to maximize the potential generalization performance. Notably, without modifying the core DPO algorithm, simply integrating SamS significantly improves performance across tasks, with minimal additional computational overhead. This work points to a promising new direction for improving LLM alignment through batch-wise sample selection, with potential generalization to RLHF and broader supervised learning paradigms.
Updated: 2026-03-09 04:16:04
标题: 自适应批次样本调度用于直接偏好优化
摘要: 直接优化偏好(DPO)已经成为将大型语言模型(LLMs)与人类偏好对齐的有效方法。然而,其性能高度依赖于基础人类偏好数据的质量。为了解决这一瓶颈,先前的研究已经探索了各种数据选择策略,但这些方法通常忽视了在优化过程中语言模型不断演化状态的影响。在本文中,我们介绍了一个新颖的问题:DPO的样本调度,旨在根据模型在偏好优化过程中批次演化的状态动态和自适应地安排训练样本。为了解决这个问题,我们提出了SamS,一种高效而有效的算法,根据LLM的学习反馈在每个训练批次中自适应地选择样本,以最大化潜在的泛化性能。值得注意的是,仅仅集成SamS而不修改核心DPO算法就显著提高了各项任务的性能,同时增加的计算开销很小。这项工作指向了通过批次样本选择改进LLM对齐的有希望的新方向,并具有潜在的泛化到RLHF和更广泛的监督学习范式的可能性。
更新时间: 2026-03-09 04:16:04
领域: cs.LG,cs.AI
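The SamS abstract builds on the standard DPO objective without modifying it. For reference, the sketch below computes the per-pair DPO loss from sequence log-probabilities under the policy and the frozen reference model. It does not implement SamS, since the abstract does not specify the selection rule. A batch-wise scheduler would decide which preference pairs reach a function like this one. The argument names and the default `beta` are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable evaluation of -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# With no preference margin the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Raising the chosen response's likelihood relative to the reference lowers the loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0) < dpo_loss(-10.0, -10.0, -10.0, -10.0))
```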
AI Agents, Language, Deep Learning and the Next Revolution in Science
Modern science is reaching a critical inflection point. Instruments across disciplines, from particle physics and astronomy to genomics and climate modeling, now produce data of such scale, diversity, and interdependence that traditional analytical methods can no longer keep pace. This growing imbalance between data generation and data understanding signals the need for a new scientific paradigm. We propose that intelligent, human-supervised AI agents operating over deep-learning algorithms represent the next evolution of the scientific method. Built upon large language models and multimodal learning, these agents can interpret scientific intent, design and execute analytical workflows, and ensure traceability through domain-specific languages that preserve human oversight and accountability. Particle physics, a historic incubator of computational innovation, offers the ideal testbed for this transition. At the Institute of High Energy Physics of the Chinese Academy of Sciences, the Dr. Sai system embodies this vision: it is a multi-agent reasoning framework deployed within collider research at the CEPC. This emerging approach does not replace human scientists but extends their cognitive reach, enabling discovery to scale with complexity and redefining how knowledge itself is produced in the age of intelligent machines. The significance of this paradigm transcends particle physics, offering a blueprint for all data-driven sciences facing the same complexity ceiling.
Updated: 2026-03-09 04:14:20
标题: AI代理,语言,深度学习和科学的下一次革命
摘要: 现代科学正达到一个关键的转折点。从粒子物理学和天文学到基因组学和气候建模,跨学科的仪器现在产生的数据规模、多样性和相互依赖性已经超出了传统分析方法的跟不上的程度。数据生成与数据理解之间不断增长的失衡表明需要一种新的科学范式。我们提出,智能的、人类监督的AI代理运行在深度学习算法之上,代表着科学方法的下一个演变阶段。建立在大型语言模型和多模态学习之上,这些代理可以解释科学意图,设计和执行分析工作流程,并通过保留人类监督和问责制的领域特定语言来确保可追溯性。粒子物理学,作为计算创新的历史孵化器,为这种转变提供了理想的测试平台。在中国科学院高能物理研究所,Sai博士系统体现了这一愿景,这是一个在CEPC碰撞机研究中部署的多代理推理框架。这种新兴方法并不取代人类科学家,而是扩展了他们的认知能力,使发现能够与复杂性相适应,并重新定义在智能机器时代知识本身是如何产生的。这一范式的重要性超越了粒子物理学,为所有面临相同复杂性上限的数据驱动科学提供了蓝图。
更新时间: 2026-03-09 04:14:20
领域: hep-ex,cs.AI
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $π^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
Updated: 2026-03-09 04:13:20
标题: ZipMap:通过测试时间训练实现线性时间状态3D重建
摘要: 前馈变压器模型推动了三维视觉领域的快速进展,但目前的最先进方法如VGGT和$π^3$的计算成本随着输入图像数量的增加呈二次方增长,使它们在处理大型图像集合时效率低下。顺序重建方法可以降低这种成本,但会牺牲重建质量。我们引入了ZipMap,这是一个有状态的前馈模型,实现了线性时间、双向三维重建,同时与二次时间方法的准确性相匹配甚至超过。ZipMap利用测试时训练层将整个图像集合压缩成一个紧凑的隐藏场景状态,只需单次前向传递,就能在单个H100 GPU上不到10秒的时间内重建超过700帧图像,比VGGT等最先进方法快20倍以上。此外,我们展示了具有有状态表示在实时场景状态查询和其扩展到顺序流式重建中的好处。
更新时间: 2026-03-09 04:13:20
领域: cs.CV,cs.AI,cs.LG
MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference
We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached Jacobian-vector products (JVPs) to construct interval average velocities from instantaneous velocities, it effectively mitigates local error accumulation. To further improve cache timing and JVP reuse stability, we develop a trajectory-stability scheduling strategy as a practical tool, employing a Peak-Suppressed Shortest Path under budget constraints to determine the schedule. Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate that MeanCache achieves 4.12×, 4.56×, and 3.59× acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality. We believe this simple yet effective approach provides a new perspective for Flow Matching inference and will inspire further exploration of stability-driven acceleration in commercial-scale generative models.
Updated: 2026-03-09 04:12:59
标题: MeanCache:从瞬时速度到平均速度,用于加速流匹配推断
摘要: 我们提出了MeanCache,这是一个无需训练的缓存框架,用于高效的流匹配推断。现有的缓存方法可以减少冗余计算,但通常依赖于瞬时速度信息(如特征缓存),这往往会导致在高加速比下严重的轨迹偏差和误差累积。MeanCache引入了平均速度的观点:通过利用缓存的雅可比-向量乘积(JVP)来构建区间平均速度,从而有效减轻局部误差积累。为了进一步改善缓存时机和JVP重复使用的稳定性,我们开发了一种轨迹稳定性调度策略作为实用工具,利用在预算约束下确定日程的Peak-Suppressed最短路径。在FLUX.1、Qwen-Image和HunyuanVideo上的实验表明,MeanCache分别实现了4.12倍、4.56倍和3.59倍的加速,同时在生成质量方面始终优于最先进的缓存基准。我们相信这种简单而有效的方法为流匹配推断提供了一个新的视角,并将激发商业规模生成模型中稳定性驱动加速的进一步探索。
更新时间: 2026-03-09 04:12:59
领域: cs.LG,cs.AI,cs.CV
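The gap between instantaneous and interval-average velocity, which motivates MeanCache, can be seen on a one-dimensional toy trajectory. In the sketch below the position is x(t) = sin(t), so the velocity is cos(t) and its time derivative is -sin(t). The analytic derivative stands in for the information a Jacobian-vector product would supply in a flow model. The first-order correction used here is an assumption for illustration and is not claimed to be MeanCache's exact construction.

```python
import math


def true_average_velocity(t, h):
    """Exact average velocity over [t, t + h] for x(t) = sin(t)."""
    return (math.sin(t + h) - math.sin(t)) / h


def instantaneous_velocity(t):
    """Velocity at time t, reused unchanged over the whole interval."""
    return math.cos(t)


def corrected_average_velocity(t, h):
    """First-order estimate: v(t) + (h / 2) * dv/dt, with dv/dt = -sin(t)."""
    return math.cos(t) + 0.5 * h * (-math.sin(t))


t, h = 0.5, 0.4
exact = true_average_velocity(t, h)
error_instantaneous = abs(instantaneous_velocity(t) - exact)
error_corrected = abs(corrected_average_velocity(t, h) - exact)
# The derivative-corrected estimate tracks the interval average more closely.
print(error_corrected < error_instantaneous)  # True
```

For this interval the instantaneous velocity is off by roughly 0.12 while the corrected estimate is off by roughly 0.02, which shows why reusing an instantaneous velocity over large steps accumulates error.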
Survey of Computerized Adaptive Testing: A Machine Learning Perspective
Computerized Adaptive Testing (CAT) offers an efficient and personalized method for assessing examinee proficiency by dynamically adjusting test questions based on individual performance. Compared to traditional, non-personalized testing methods, CAT requires fewer questions and provides more accurate assessments. As a result, CAT has been widely adopted across various fields, including education, healthcare, sports, sociology, and the evaluation of AI models. While traditional methods rely on psychometrics and statistics, the increasing complexity of large-scale testing has spurred the integration of machine learning techniques. This paper aims to provide a machine learning-focused survey on CAT, presenting a fresh perspective on this adaptive testing paradigm. We delve into measurement models, question selection algorithms, bank construction, and test control within CAT, exploring how machine learning can optimize these components. Through an analysis of current methods, strengths, limitations, and challenges, we strive to develop robust, fair, and efficient CAT systems. By bridging psychometric-driven CAT research with machine learning, this survey advocates for a more inclusive and interdisciplinary approach to the future of adaptive testing.
Updated: 2026-03-09 04:12:13
标题: 计算机自适应测试调查:机器学习视角
摘要: 计算机自适应测试(CAT)为评估受试者能力提供了一种高效且个性化的方法,通过根据个体表现动态调整测试问题。与传统的非个性化测试方法相比,CAT需要更少的问题,并提供更准确的评估结果。因此,CAT已被广泛应用于教育、医疗保健、体育、社会学以及人工智能模型评估等各个领域。虽然传统方法依赖于心理测量学和统计学,但大规模测试的日益复杂性促使了机器学习技术的整合。本文旨在提供一份以机器学习为重点的CAT调查,为这种自适应测试范式提供新的视角。我们深入研究了CAT中的测量模型、问题选择算法、题库构建和测试控制,探讨了机器学习如何优化这些组成部分。通过对当前方法、优势、局限性和挑战的分析,我们致力于开发健壮、公平且高效的CAT系统。通过将心理测量学驱动的CAT研究与机器学习联系起来,本调查倡导一种更具包容性和跨学科性的自适应测试未来。
更新时间: 2026-03-09 04:12:13
领域: cs.LG,cs.AI,cs.CY,cs.IR
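As an illustration of the question selection component surveyed above, here is a minimal sketch of the classical maximum-Fisher-information rule under a two-parameter logistic (2PL) measurement model. The survey covers many selection algorithms; this is only the textbook psychometric baseline, and the item bank values and function names are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability that an examinee with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, bank, administered):
    """Pick the unadministered item that is most informative at the current estimate."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: fisher_information(theta, *bank[i]))

# Hypothetical bank of (discrimination a, difficulty b) pairs.
bank = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
next_item = select_next_item(0.1, bank, administered=set())
```

With equal discriminations, the rule picks the item whose difficulty is closest to the current ability estimate, which is why an adaptive test can reach a given precision with fewer questions than a fixed form.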
Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single-Image Denoising
Many studies have concentrated on constructing supervised models utilizing paired datasets for image denoising, which proves to be expensive and time-consuming. Current self-supervised and unsupervised approaches typically rely on blind-spot networks or sub-image pair sampling, resulting in pixel information loss and destruction of detailed structural information, thereby significantly constraining the efficacy of such methods. In this paper, we introduce Prompt-SID, a prompt-learning-based single-image denoising framework that emphasizes preserving structural details. This approach is trained in a self-supervised manner using downsampled image pairs. It captures original-scale image information through structural encoding and integrates this prompt into the denoiser. To achieve this, we propose a structural representation generation model based on the latent diffusion process and design a structural attention module within the transformer-based denoiser architecture to decode the prompt. Additionally, we introduce a scale replay training mechanism, which effectively mitigates the scale gap between images of different resolutions. We conduct comprehensive experiments on synthetic, real-world, and fluorescence imaging datasets, showcasing the remarkable effectiveness of Prompt-SID. Our code will be released at https://github.com/huaqlili/Prompt-SID.
Updated: 2026-03-09 04:02:07
标题: Prompt-SID:通过潜在扩散学习结构表示提示以用于单幅图像去噪
摘要: 许多研究集中在利用成对数据集构建监督模型进行图像去噪,这被证明是昂贵且耗时的。目前的自监督和无监督方法通常依赖于盲点网络或子图像对采样,导致像素信息丢失和详细结构信息的破坏,从而显著限制了这些方法的效力。在本文中,我们介绍了Prompt-SID,这是一个基于提示学习的单图像去噪框架,强调保留结构细节。该方法以自监督方式使用降采样图像对进行训练。通过结构编码捕获原始尺度图像信息,并将此提示集成到去噪器中。为实现这一目标,我们提出了基于潜在扩散过程的结构表示生成模型,并在基于transformer的去噪器架构中设计了结构注意力模块来解码提示。此外,我们引入了一个尺度重播训练机制,有效减轻不同分辨率图像之间的尺度差距。我们在合成、真实世界和荧光成像数据集上进行了全面实验,展示了Prompt-SID的显著有效性。我们的代码将在https://github.com/huaqlili/Prompt-SID 上发布。
更新时间: 2026-03-09 04:02:07
领域: cs.CV,cs.AI
Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection
Emotional expression underpins natural communication and effective human-computer interaction. We present Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents modality hierarchies using Poincare-ball embeddings and performs fusion through a hypergraph mechanism that passes messages bidirectionally between nodes and hyperedges. To sharpen class separation, contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives. High-order semantic relations across time steps and modalities are preserved via adaptive hyperedge construction. Empirical results on standard multimodal emotion benchmarks show that EC-Net produces robust, semantically coherent representations and consistently improves accuracy, particularly when modalities are partially available or contaminated by noise. These findings indicate that explicit hierarchical geometry combined with hypergraph fusion is effective for resilient multimodal affect understanding.
Updated: 2026-03-09 04:01:36
标题: 情感碰撞器:通过反情感反射实现情绪恢复的双双曲镜面流形
摘要: 情感表达是自然交流和有效的人机交互的基础。我们提出了Emotion Collider (EC-Net),这是一个用于多模态情感和情绪建模的双曲双超图框架。EC-Net使用Poincare球嵌入表示模态层次结构,并通过双向传递消息的超图机制进行融合。为了加强类别分离,对比学习在双曲空间中以分离的径向和角向目标进行了制定。通过自适应超边构建,跨时间步和模态的高阶语义关系得以保留。在标准多模态情感基准上的实证结果显示,EC-Net产生了强大、语义连贯的表示,并在模态部分可用或受到噪声污染时不断提高准确性。这些发现表明,显式的层次几何结构与超图融合相结合对于韧性多模态情感理解是有效的。
更新时间: 2026-03-09 04:01:36
领域: cs.MM,cs.CL,cs.LG
Real-Time Aligned Reward Model beyond Semantics
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model and exploit spurious reward patterns instead of faithfully capturing human intent. Prior mitigations primarily rely on surface semantic information and fail to efficiently address the misalignment between the reward model (RM) and the policy model caused by continuous policy distribution shifts. This inevitably leads to an increasing reward discrepancy, exacerbating reward overoptimization. To address these limitations, we introduce R2M (Real-Time Aligned Reward Model), a novel lightweight RLHF framework. R2M goes beyond vanilla reward models that solely depend on the semantic representations of a pretrained LLM. Instead, it leverages the evolving hidden states of the policy (namely policy feedback) to align with the real-time distribution shift of the policy during the RL process. This work points to a promising new direction for improving the performance of reward models through real-time utilization of feedback from policy models.
Updated: 2026-03-09 04:00:33
标题: 实时对齐奖励模型超越语义
摘要: 来源于人类反馈的强化学习(RLHF)是将大型语言模型(LLMs)与人类偏好对齐的关键技术,然而它容易受到奖励过度优化的影响,即策略模型过度拟合奖励模型,利用虚假的奖励模式而不是忠实地捕捉人类意图。先前的缓解方法主要依赖于表面语义信息,并未能有效解决由连续策略分布变化引起的奖励模型(RM)与策略模型之间的不一致。这不可避免地导致奖励差异增加,加剧奖励过度优化。为了解决这些限制,我们引入了R2M(实时对齐奖励模型),这是一个新颖的轻量级RLHF框架。R2M超越了仅依赖于预训练LLM的语义表示的普通奖励模型。相反,它利用策略的演变隐藏状态(即策略反馈)来与RL过程中策略的实时分布变化对齐。这项工作指向通过实时利用策略模型反馈来改善奖励模型性能的有希望的新方向。
更新时间: 2026-03-09 04:00:33
领域: cs.AI
Condition-Triggered Cryptographic Asset Control via Dormant Authorization Paths
Control of encrypted digital assets is traditionally equated with permanent possession of private keys, a model that precludes regulatory supervision, conditional delegation, and legally compliant transfer at the cryptographic layer. Existing remedies (multi-signature schemes, threshold signatures, smart contracts, custodial delegation) require persistent key exposure, on-chain state mutation, or trusted intermediaries. We introduce Condition-Triggered Dormant Authorization Paths (CT-DAP), a cryptographic asset control method built on destructible authorization factors and parameterized by a root-derivable framework satisfying deterministic key derivation, context-isolated capability generation, and authorization-bound revocation. Under CT-DAP, control rights are dormant authorization paths composed of user-held credentials and administrative factors held by independent custodians; a path remains cryptographically inactive until all factors are simultaneously available. Upon verification of predefined conditions (e.g., user consent, inheritance events, time-based triggers), the corresponding factor is released, activating the path. Revocation is achieved by destroying factors, rendering the path permanently unusable without altering the cryptographic root. We formalize the threat model, define security games for unauthorized control resistance, path isolation, and stateless revocation, and prove security under standard assumptions (AEAD security of AES-GCM-SIV, PRF security of HKDF, memory-hardness of Argon2id, collision resistance of SHA-256). We instantiate CT-DAP using the Atomic Cryptographic Entity Generative Framework (ACE-GF) and evaluate performance, demonstrating sub-second activation latency with configurable security-performance trade-offs.
Updated: 2026-03-09 03:57:06
标题: 通过休眠授权路径实现条件触发的加密资产控制
摘要: 传统上,加密数字资产的控制被等同于永久拥有私钥,这种模式排除了监管监督、有条件委托和在加密层面上合法转移的可能性。现有的解决方案(多重签名方案、阈值签名、智能合约、托管委托)需要持续的密钥暴露、链上状态变异或信任的中介。我们引入了一种基于可销毁授权因素的条件触发休眠授权路径(CT-DAP)的加密资产控制方法,该方法由一个满足确定性密钥派生、上下文隔离能力生成和授权绑定吊销的根可派生框架参数化。在CT-DAP下,控制权是由用户持有的凭证和由独立托管人持有的管理因素组成的休眠授权路径;在所有因素同时可用之前,路径保持密码学上的不活跃状态。在验证预定义条件(例如用户同意、继承事件、基于时间的触发器)之后,相应的因素被释放,激活路径。吊销通过销毁因素实现,使路径在不改变加密根的情况下永久不可用。我们形式化了威胁模型,为未经授权的控制抵抗、路径隔离和无状态吊销定义了安全游戏,并在标准假设下证明了安全性(AES-GCM-SIV的AEAD安全性,HKDF的PRF安全性,Argon2id的内存硬度,SHA-256的碰撞抵抗)。我们使用原子加密实体生成框架(ACE-GF)实例化了CT-DAP并评估了性能,展示了可配置的安全性能权衡下的亚秒级激活延迟。
更新时间: 2026-03-09 03:57:06
领域: cs.CR
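The dormant-path idea above (a path stays inactive until every factor is available, and destroying a factor revokes it without touching the root) can be sketched with HKDF alone. This is an illustrative toy, not the ACE-GF instantiation: it omits Argon2id and AES-GCM-SIV, and `activate_path`, the factor encoding, and the context labels are hypothetical names chosen here.

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it. length <= 255 * 32."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def activate_path(factors, context: bytes):
    """Derive a path key only when every factor is present.

    A destroyed or withheld factor is modeled as None (an assumption of this
    sketch); the path then stays dormant and no key is produced.
    """
    if any(f is None for f in factors):
        return None
    # Length-prefix each factor so that concatenation is unambiguous.
    ikm = b"".join(len(f).to_bytes(4, "big") + f for f in factors)
    # The context label goes into HKDF's info field, isolating keys per context.
    return hkdf_sha256(ikm, salt=b"", info=context)

key = activate_path([b"user-credential", b"custodian-factor"], b"inheritance")
dormant = activate_path([b"user-credential", None], b"inheritance")
```

Because the key is recomputed from the factors each time, nothing needs to be stored or mutated to revoke a path: once a factor is gone, the derivation simply cannot be repeated.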
SWE-Fuse: Empowering Software Agents via Issue-free Trajectory Learning and Entropy-aware RLVR Training
Large language models (LLMs) have transformed the software engineering landscape. Recently, numerous LLM-based agents have been developed to address real-world software issue fixing tasks. Despite achieving state-of-the-art performance, these agents face a significant challenge: insufficient high-quality issue descriptions. Real-world datasets often exhibit misalignments between issue descriptions and their corresponding solutions, introducing noise and ambiguity that mislead automated agents and limit their problem-solving effectiveness. We propose SWE-Fuse, an issue-description-aware training framework that fuses issue-description-guided and issue-free samples for training SWE agents. It consists of two key modules: (1) an issue-free-driven trajectory learning module for mitigating potentially misleading issue descriptions while enabling the model to learn step-by-step debugging processes; and (2) an entropy-aware RLVR training module, which adaptively adjusts training dynamics through entropy-driven clipping. It applies relaxed clipping under high entropy to encourage exploration, and stricter clipping under low entropy to ensure training stability. We evaluate SWE-Fuse on the widely studied SWE-bench Verified benchmark to demonstrate its effectiveness in solving real-world software problems. Specifically, SWE-Fuse outperforms the best 8B and 32B baselines by 43.0% and 60.2% in solve rate, respectively. Furthermore, integrating SWE-Fuse with test-time scaling (TTS) enables further performance improvements, achieving solve rates of 49.8% and 65.2% under TTS@8 for the 8B and 32B models, respectively.
Updated: 2026-03-09 03:47:10
标题: SWE-Fuse: 通过无问题轨迹学习和熵感知RLVR训练增强软件代理
摘要: 大型语言模型(LLMs)已经改变了软件工程领域。最近,已经开发了许多基于LLM的代理来解决现实世界的软件问题修复任务。尽管它们取得了最先进的性能,但这些代理面临着一个重大挑战:\textbf{问题描述质量不足。}真实世界数据集通常表现出问题描述和相应解决方案之间的不对齐,引入了噪音和歧义,导致自动代理产生误导和限制其问题解决效率。我们提出了一个问题描述感知的训练框架\textbf{\textit{SWE-Fuse}},它融合了问题描述引导和无问题样本训练SWE代理。它包括两个关键模块:(1)一个无问题驱动的轨迹学习模块,用于减轻潜在误导性的问题描述,同时使模型能够学习逐步的调试过程;(2)一个熵感知的RLVR训练模块,通过熵驱动的剪切自适应调整训练动态。在高熵下应用松弛剪切以鼓励探索,在低熵下应用更严格的剪切以确保训练稳定性。我们在广泛研究的SWE-bench Verified基准上评估了SWE-Fuse,以展示其在解决真实世界软件问题方面的有效性。具体而言,SWE-Fuse在解决率方面分别比最佳的8B和32B基线提高了43.0%和60.2%。此外,将SWE-Fuse与测试时间缩放(TTS)集成,可以进一步提高性能,分别在8B和32B模型的TTS@8下实现了49.8%和65.2%的解决率。
更新时间: 2026-03-09 03:47:10
领域: cs.SE,cs.AI
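The entropy-driven clipping described above can be sketched as a PPO-style clipped objective whose clip range widens with policy entropy. The abstract does not give the actual schedule, so the linear interpolation, the bounds `eps_min` and `eps_max`, and all function names below are assumptions made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def clip_range(entropy, max_entropy, eps_min=0.1, eps_max=0.3):
    """Hypothetical schedule: interpolate linearly from strict to relaxed clipping."""
    frac = min(max(entropy / max_entropy, 0.0), 1.0)
    return eps_min + (eps_max - eps_min) * frac

def clipped_objective(ratio, advantage, eps):
    """Standard clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

max_h = math.log(4)                                              # entropy bound for 4 outcomes
eps_uncertain = clip_range(token_entropy([0.25] * 4), max_h)     # high entropy, relaxed
eps_confident = clip_range(token_entropy([1.0, 0.0, 0.0, 0.0]), max_h)  # low entropy, strict
```

Under this schedule, a probability ratio of 1.5 with positive advantage contributes about 1.3 when the policy is uncertain but only about 1.1 when it is confident, so exploration is rewarded more when entropy is high.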
ModalImmune: Immunity Driven Unlearning via Self Destructive Training
Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines reliability in real-world settings. This paper presents ModalImmune, a training framework that enforces modality immunity by intentionally and controllably collapsing selected modality information during training so the model learns joint representations that are robust to destructive modality influence. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation. Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.
Updated: 2026-03-09 03:45:46
标题: ModalImmune:通过自毁式训练驱动的免疫学习
摘要: 多模态系统在部署过程中容易受到输入通道的部分或完全丢失,这会削弱在现实世界环境中的可靠性。本文介绍了ModalImmune,这是一个训练框架,通过有意识地和可控地在训练过程中折叠选定的模态信息,来强化模态免疫力,使模型学习到对破坏性模态影响具有鲁棒性的联合表示。该框架结合了一种谱自适应折叠正则化器、一个信息增益引导的控制器用于有针对性的干预、基于曲率的梯度屏蔽以稳定破坏性更新,以及一个经过认证的Neumann截断的超梯度程序用于自动元参数适应。对标准的多模态基准进行的实证评估表明,ModalImmune提高了对模态移除和破坏的韧性,同时保持了收敛稳定性和重建能力。
更新时间: 2026-03-09 03:45:46
领域: cs.LG,cs.CL,cs.MM
IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation
Test-time adaptation (TTA) has been widely explored to prevent performance degradation when test data differ from the training distribution. However, fully leveraging the rich representations of large pretrained models with minimal parameter updates remains underexplored. In this paper, we propose Intrinsic Mixture of Spectral Experts (IMSE) that leverages the spectral experts inherently embedded in Vision Transformers. We decompose each linear layer via singular value decomposition (SVD) and adapt only the singular values, while keeping the singular vectors fixed. We further identify a key limitation of entropy minimization in TTA: it often induces feature collapse, causing the model to rely on domain-specific features rather than class-discriminative features. To address this, we propose a diversity maximization loss based on expert-input alignment, which encourages diverse utilization of spectral experts during adaptation. In the continual test-time adaptation (CTTA) scenario, beyond preserving pretrained knowledge, it is crucial to retain and reuse knowledge from previously observed domains. We introduce Domain-Aware Spectral Code Retrieval, which estimates input distributions to detect domain shifts, and retrieves adapted singular values for rapid adaptation. Consequently, our method achieves state-of-the-art performance on various distribution-shift benchmarks under the TTA setting. In CTTA and Gradual CTTA, it further improves accuracy by 3.4 percentage points (pp) and 2.4 pp, respectively, while requiring 385 times fewer trainable parameters. Our code is available at https://github.com/baek85/IMSE.
Updated: 2026-03-09 03:42:44
标题: IMSE: 内在光谱专家混合物微调用于测试时间自适应
摘要: 测试时间适应(TTA)已被广泛探索,以防止当测试数据与训练分布不同时性能下降。然而,利用大型预训练模型丰富的表示并进行最小参数更新仍未得到充分开发。在本文中,我们提出了一种内在混合谱专家(IMSE)方法,利用了视觉变压器中固有嵌入的谱专家。我们通过奇异值分解(SVD)分解每个线性层,并仅调整奇异值,同时保持奇异向量固定。我们进一步确定了TTA中熵最小化的一个关键限制:它经常导致特征坍塌,导致模型依赖于特定领域特征而不是类别判别特征。为了解决这个问题,我们提出了一种基于专家输入对齐的多样性最大化损失,它鼓励在适应过程中对谱专家进行多样化利用。在持续测试时间适应(CTTA)场景中,除了保留预训练知识外,保留和重用先前观察到的领域知识至关重要。我们引入了领域感知谱代码检索,用于估计输入分布以检测领域转移,并检索适应的奇异值以进行快速适应。因此,我们的方法在TTA设置下在各种分布转移基准上实现了最先进的性能。在CTTA和渐进CTTA中,分别将准确性提高了3.4个百分点(pp)和2.4个百分点,同时需要的可训练参数减少了385倍。我们的代码可在https://github.com/baek85/IMSE上找到。
更新时间: 2026-03-09 03:42:44
领域: cs.CV,cs.AI
SPREAD: Subspace Representation Distillation for Lifelong Imitation Learning
A key challenge in lifelong imitation learning (LIL) is enabling agents to acquire new skills from expert demonstrations while retaining prior knowledge. This requires preserving the low-dimensional manifolds and geometric structures that underlie task representations across sequential learning. Existing distillation methods, which rely on L2-norm feature matching in raw feature space, are sensitive to noise and high-dimensional variability, often failing to preserve intrinsic task manifolds. To address this, we introduce SPREAD, a geometry-preserving framework that employs singular value decomposition (SVD) to align policy representations across tasks within low-rank subspaces. This alignment maintains the underlying geometry of multimodal features, facilitating stable transfer, robustness, and generalization. Additionally, we propose a confidence-guided distillation strategy that applies a Kullback-Leibler divergence loss restricted to the top-M most confident action samples, emphasizing reliable modes and improving optimization stability. Experiments on LIBERO, a lifelong imitation learning benchmark, show that SPREAD substantially improves knowledge transfer, mitigates catastrophic forgetting, and achieves state-of-the-art performance.
Updated: 2026-03-09 03:38:42
标题: SPREAD:用于终身模仿学习的子空间表示精炼
摘要: 在终身模仿学习(LIL)中的一个关键挑战是使代理能够从专家演示中获取新技能,同时保留先前的知识。这需要在顺序学习中保持支撑任务表示的低维流形和几何结构。现有的蒸馏方法依赖于原始特征空间中的L2范数特征匹配,对噪声和高维变化敏感,通常无法保留内在的任务流形。为了解决这个问题,我们介绍了SPREAD,一个保持几何结构的框架,采用奇异值分解(SVD)在低秩子空间内对任务之间的策略表示进行对齐。这种对齐保持了多模态特征的底层几何结构,促进了稳定的转移、鲁棒性和泛化。此外,我们提出了一种基于置信度的蒸馏策略,应用了限制在前M个最可靠的动作样本上的Kullback-Leibler散度损失,强调可靠的模式并改善优化稳定性。在LIBERO,终身模仿学习基准上的实验表明,SPREAD显著改善了知识转移,减轻了灾难性遗忘,并实现了最先进的性能。
更新时间: 2026-03-09 03:38:42
领域: cs.LG,cs.RO
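The confidence-guided distillation step above can be sketched as a KL loss averaged over only the top-M most confident samples. The abstract does not define confidence or the KL direction; this sketch assumes confidence is the old (teacher) policy's maximum action probability and uses KL(teacher || student). All names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def top_m_kl_loss(teacher, student, m):
    """Average KL over the m samples on which the teacher is most confident (m >= 1)."""
    ranked = sorted(range(len(teacher)), key=lambda i: max(teacher[i]), reverse=True)
    chosen = ranked[:m]
    return sum(kl_divergence(teacher[i], student[i]) for i in chosen) / len(chosen)

# Toy action distributions for three samples (hypothetical values).
teacher = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]
student = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
loss_top1 = top_m_kl_loss(teacher, student, 1)
loss_top2 = top_m_kl_loss(teacher, student, 2)
```

The ambiguous middle sample, where the teacher is split 50/50, never enters the loss for M up to 2, which is the sense in which the restriction emphasizes reliable modes.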
Transferable Graph Condensation from the Causal Perspective
The increasing scale of graph datasets has significantly improved the performance of graph representation learning methods, but it has also introduced substantial training challenges. Graph dataset condensation techniques have emerged to compress large datasets into smaller yet information-rich datasets, while maintaining similar test performance. However, these methods strictly require downstream applications to match the original dataset and task, which often fails in cross-task and cross-domain scenarios. To address these challenges, we propose a novel causal-invariance-based and transferable graph dataset condensation method, named TGCC, providing effective and transferable condensed datasets. Specifically, to preserve domain-invariant knowledge, we first extract domain causal-invariant features from the spatial domain of the graph using causal interventions. Then, to fully capture the structural and feature information of the original graph, we perform enhanced condensation operations. Finally, through spectral-domain enhanced contrastive learning, we inject the causal-invariant features into the condensed graph, ensuring that the compressed graph retains the causal information of the original graph. Experimental results on five public datasets and our novel FinReport dataset demonstrate that TGCC achieves up to a 13.41% improvement in cross-task and cross-domain complex scenarios compared to existing methods, and achieves state-of-the-art performance on 5 out of 6 datasets in the single dataset and task scenario.
Updated: 2026-03-09 03:37:09
标题: 从因果关系视角的可传递图压缩
摘要: 图形数据集规模的增加显著提高了图形表示学习方法的性能,但也引入了大量的训练挑战。图形数据集压缩技术已经出现,将大型数据集压缩为更小但信息丰富的数据集,同时保持类似的测试性能。然而,这些方法严格要求下游应用程序与原始数据集和任务匹配,这在跨任务和跨领域情况下经常失败。为了解决这些挑战,我们提出了一种基于因果不变性和可转移性的图形数据集压缩方法,命名为TGCC,提供有效且可转移的压缩数据集。具体来说,为了保留域不变的知识,我们首先使用因果干预从图的空间域中提取域因果不变特征。然后,为了充分捕捉原始图的结构和特征信息,我们执行增强的压缩操作。最后,通过谱域增强对比学习,我们将因果不变特征注入到压缩图中,确保压缩图保留原始图的因果信息。对五个公开数据集和我们的新型FinReport数据集的实验结果表明,与现有方法相比,TGCC在跨任务和跨领域复杂场景中取得了高达13.41%的改进,并在单个数据集和任务场景中在6个数据集中的5个上实现了最先进的性能。
更新时间: 2026-03-09 03:37:09
领域: cs.LG
Estimating Item Difficulty Using Large Language Models and Tree-Based Machine Learning Algorithms
Estimating item difficulty through field-testing is often resource-intensive and time-consuming. As such, there is strong motivation to develop methods that can predict item difficulty at scale using only the item content. Large Language Models (LLMs) represent a new frontier for this goal. The present research examines the feasibility of using an LLM to predict item difficulty for K-5 mathematics and reading assessment items (N = 5170). Two estimation approaches were implemented: (a) a direct estimation method that prompted the LLM to assign a single difficulty rating to each item, and (b) a feature-based strategy where the LLM extracted multiple cognitive and linguistic features, which were then used in ensemble tree-based models (random forests and gradient boosting) to predict difficulty. Overall, direct LLM estimates showed moderate to strong correlations with true item difficulties. However, their accuracy varied by grade level, often performing worse for early grades. In contrast, the feature-based method yielded stronger predictive accuracy, with correlations as high as r = 0.87 and lower error estimates compared to both direct LLM predictions and baseline regressors. These findings highlight the promise of LLMs in streamlining item development and reducing reliance on extensive field testing and underscore the importance of structured feature extraction. We provide a seven-step workflow for testing professionals who would want to implement a similar item difficulty estimation approach with their item pool.
Updated: 2026-03-09 03:36:58
标题: 使用大型语言模型和基于树的机器学习算法估计项目难度
摘要: 通过现场测试来估计项目的难度通常需要大量资源和时间。因此,有强烈的动机开发能够仅使用项目内容预测项目难度的方法。大型语言模型(LLMs)代表了实现这一目标的新前沿。本研究考察了使用LLM预测K-5年级数学和阅读评估项目(总数5170个)难度的可行性。实施了两种估计方法:(a)直接估计方法,促使LLM为每个项目分配一个单一的难度评分;(b)基于特征的策略,LLM提取多个认知和语言特征,然后使用集成基于树的模型(随机森林和梯度提升)来预测难度。总体而言,直接LLM估计与真实项目难度之间显示出中等到较强的相关性。然而,它们的准确性因年级而异,通常对早期年级的表现较差。相比之下,基于特征的方法产生了更强的预测准确性,相关性高达r = 0.87,误差估计较直接LLM预测和基线回归器更低。这些发现突显了LLMs在简化项目开发和减少对广泛现场测试的依赖方面的潜力,并强调了结构化特征提取的重要性。我们为希望使用类似的项目难度估计方法的测试专业人员提供了一个七步工作流程。
更新时间: 2026-03-09 03:36:58
领域: cs.CY,cs.CL,cs.LG
Semantic Risk Scoring of Aggregated Metrics: An AI-Driven Approach for Healthcare Data Governance
Large healthcare institutions typically operate multiple business intelligence (BI) teams segmented by domain, including clinical performance, fundraising, operations, and compliance. Due to HIPAA, FERPA, and IRB restrictions, these teams face challenges in sharing patient-level data needed for analytics. To mitigate this, a metric aggregation table is proposed, which is a precomputed, privacy-compliant summary. These abstractions enable decision-making without direct access to sensitive data. However, even aggregated metrics can inadvertently lead to privacy risks if constructed without rigorous safeguards. A modular AI framework is proposed that evaluates SQL-based metric definitions for potential overexposure using both semantic and syntactic features. Specifically, the system parses SQL queries into abstract syntax trees (ASTs), extracts sensitive patterns (e.g., fine-grained GROUP BY on ZIP code or gender), and encodes the logic using pretrained CodeBERT embeddings. These are fused with structural features and passed to an XGBoost classifier trained to assign risk scores. Queries that surpass the risk threshold (e.g., > 0.85) are flagged and returned with human-readable explanations. This enables proactive governance, preventing statistical disclosure before deployment. This implementation demonstrates strong potential for cross-departmental metric sharing in healthcare while maintaining compliance and auditability. The system also promotes role-based access control (RBAC), supports zero-trust data architectures, and aligns with national data modernization goals by ensuring that metric pipelines are explainable, privacy-preserving, and AI-auditable by design. Unlike prior work that relies on runtime data access to flag privacy violations, the proposed framework performs static, explainable detection at the query level, enabling pre-execution protection and audit readiness.
Updated: 2026-03-09 03:36:11
标题: 聚合指标的语义风险评分:面向医疗数据治理的人工智能驱动方法
摘要: 大型医疗机构通常通过领域划分操作多个业务智能(BI)团队,包括临床表现、筹款、运营和合规性。由于HIPAA、FERPA和IRB的限制,这些团队在共享需要用于分析的患者级数据时面临挑战。为了缓解这一问题,提出了一个度量聚合表,这是一个预先计算的、符合隐私规定的摘要。这些抽象使决策可以在没有直接访问敏感数据的情况下进行。然而,即使聚合的度量也可能在没有严格保障的情况下无意中导致隐私风险。提出了一个模块化AI框架,该框架通过使用语义和语法特征评估基于SQL的度量定义的潜在过度暴露。具体而言,系统将SQL查询解析为抽象语法树(ASTs),提取敏感模式(例如,对邮政编码或性别的精细分组),并使用预训练的CodeBERT嵌入编码逻辑。这些特征与结构特征融合,并传递给训练有素的XGBoost分类器,用于分配风险分数。超过风险阈值(例如,>0.85)的查询将被标记并返回具有可读性的解释。这使得在部署之前可以进行积极的治理,防止统计披露。这种实现展示了在医疗保健领域跨部门度量共享的强大潜力,同时保持合规性和可审计性。该系统还促进了基于角色的访问控制(RBAC),支持零信任数据架构,并通过确保度量管道是可解释的、隐私保护的和设计上可由AI审计的方式,与国家数据现代化目标保持一致。与依赖运行时数据访问来标记隐私违规的先前作品不同,提出的框架在查询级别执行静态、可解释的检测,使得在执行前可以进行保护和审计准备。
更新时间: 2026-03-09 03:36:11
领域: cs.LG,cs.CY
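The static, pre-execution check described above can be approximated in a few lines. This sketch deliberately swaps in simpler techniques: a regular expression stands in for AST parsing, and a hand-set heuristic stands in for the CodeBERT plus XGBoost scorer. The sensitive column names and the 0.45 weight are invented for illustration; only the 0.85 threshold comes from the abstract.

```python
import re

# Hypothetical list of columns treated as quasi-identifiers.
SENSITIVE = {"zip_code", "gender", "date_of_birth"}
RISK_THRESHOLD = 0.85

def group_by_columns(sql):
    """Extract GROUP BY columns with a regex (a stand-in for real AST parsing)."""
    m = re.search(
        r"group\s+by\s+(.+?)(?:\bhaving\b|\border\s+by\b|\blimit\b|;|$)",
        sql,
        flags=re.I | re.S,
    )
    if not m:
        return []
    return [c.strip().lower() for c in m.group(1).split(",") if c.strip()]

def risk_score(sql):
    """Toy heuristic: each sensitive grouping column adds 0.45, capped at 1.0."""
    hits = [c for c in group_by_columns(sql) if c in SENSITIVE]
    return min(1.0, 0.45 * len(hits)), hits

def review(sql):
    """Flag a metric definition before it runs and explain why."""
    score, hits = risk_score(sql)
    flagged = score > RISK_THRESHOLD
    reason = f"fine-grained GROUP BY on {', '.join(hits)}" if flagged else "no issue found"
    return flagged, reason

risky = review("SELECT zip_code, gender, COUNT(*) FROM visits GROUP BY zip_code, gender")
safe = review("SELECT department, COUNT(*) FROM visits GROUP BY department ORDER BY department")
```

The check never touches patient rows, only the query text, which is what makes it usable before deployment.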
Jagarin: A Three-Layer Architecture for Hibernating Personal Duty Agents on Mobile
Personal AI agents face a fundamental deployment paradox on mobile: persistent background execution drains battery and violates platform sandboxing policies, yet purely reactive agents miss time-sensitive obligations until the user remembers to ask. We present Jagarin, a three-layer architecture that resolves this paradox through structured hibernation and demand-driven wake. The first layer, DAWN (Duty-Aware Wake Network), is an on-device heuristic engine that computes a composite urgency score from four signals: duty-typed optimal action windows, user behavioral engagement prediction, opportunity cost of inaction, and cross-duty batch resonance. It uses adaptive per-user thresholds to decide when a sleeping agent should nudge or escalate. The second layer, ARIA (Agent Relay Identity Architecture), is a commercial email identity proxy that routes the full commercial inbox -- obligations, promotional offers, loyalty rewards, and platform updates -- to appropriate DAWN handlers by message category, eliminating cold-start and removing manual data entry. The third layer, ACE (Agent-Centric Exchange), is a protocol framework for direct machine-readable communication from institutions to personal agents, replacing human-targeted email as the canonical channel. Together, these three layers form a complete stack from institutional signal to on-device action, without persistent cloud state, continuous background execution, or privacy compromise. A working Flutter prototype is demonstrated on Android, combining all three layers with an ephemeral cloud agent invoked only on user-initiated escalation.
Updated: 2026-03-09 03:35:12
标题: Jagarin:一种用于在移动设备上休眠个人责任代理的三层架构
摘要: 个人AI代理在移动设备上面临着一个基本的部署悖论:持续的后台执行会消耗电量并违反平台沙盒政策,然而纯粹的反应式代理会错过时间敏感的义务,直到用户记得提出。我们提出了Jagarin,一个通过结构化休眠和需求驱动唤醒解决这一悖论的三层架构。第一层,DAWN(Duty-Aware Wake Network),是一个设备上的启发式引擎,从四个信号中计算出一个复合紧急程度分数:义务类型的最佳行动窗口、用户行为参与预测、不作为的机会成本和交叉义务批量共振。它使用自适应的每用户阈值来决定何时唤醒或升级沉睡中的代理。第二层,ARIA(Agent Relay Identity Architecture),是一个商业电子邮件身份代理,将完整的商业收件箱 -- 义务、促销优惠、忠诚奖励和平台更新 -- 按照消息类别路由到适当的DAWN处理程序,消除了冷启动并消除了手动数据输入。第三层,ACE(Agent-Centric Exchange),是一个协议框架,用于直接从机构向个人代理进行可机器读取的通信,取代了以人为目标的电子邮件作为规范通道。这三个层共同形成了一个从机构信号到设备上操作的完整堆栈,无需持久云状态、持续后台执行或隐私妥协。在Android上展示了一个可工作的Flutter原型,将所有三层与仅在用户发起升级时调用的短暂云代理结合起来。
更新时间: 2026-03-09 03:35:12
领域: cs.AI,cs.HC,cs.MA
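The DAWN layer's decision rule above (four signals combined into one urgency score, then compared against per-user thresholds) might look like the following. The abstract does not specify how the signals are combined, so the weighted sum, the weights, the threshold values, and every name here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    """The four DAWN inputs, each assumed to be normalized to [0, 1]."""
    action_window: float      # how close the duty is to its optimal action window
    engagement: float         # predicted likelihood the user engages right now
    opportunity_cost: float   # cost of not acting
    batch_resonance: float    # benefit of batching with other pending duties

def urgency(s: Signals, weights=(0.4, 0.2, 0.3, 0.1)) -> float:
    """Hypothetical composite score: a fixed weighted sum of the four signals."""
    w_window, w_engage, w_cost, w_batch = weights
    return (w_window * s.action_window + w_engage * s.engagement
            + w_cost * s.opportunity_cost + w_batch * s.batch_resonance)

def decide(score: float, nudge_at: float, escalate_at: float) -> str:
    """Per-user thresholds decide whether the hibernating agent acts."""
    if score >= escalate_at:
        return "escalate"
    if score >= nudge_at:
        return "nudge"
    return "sleep"

action = decide(urgency(Signals(0.9, 0.8, 0.9, 0.5)), nudge_at=0.5, escalate_at=0.8)
```

Keeping the rule this cheap is what lets it run on-device at wake time without a persistent background process.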
MERIT Feedback Elicits Better Bargaining in LLM Negotiators
Bargaining is often regarded as a logical arena rather than an art or a matter of intuition, yet Large Language Models (LLMs) still struggle to navigate it due to limited strategic depth and difficulty adapting to complex human factors. Current benchmarks rarely capture this limitation. To bridge this gap, we present a utility-feedback-centric framework. Our contributions are: (i) AgoraBench, a new benchmark spanning nine challenging settings (e.g., deception, monopoly) that supports diverse strategy modeling; (ii) human-aligned, economically grounded metrics derived from utility theory, operationalized via agent utility, negotiation power, and acquisition ratio, which implicitly measure how well the negotiation aligns with human preference; and (iii) a human-preference-grounded dataset with a learning pipeline that strengthens LLMs' bargaining ability through both prompting and finetuning. Empirical results indicate that baseline LLM strategies often diverge from human preferences, while our mechanism substantially improves negotiation performance, yielding deeper strategic behavior and stronger opponent awareness.
Updated: 2026-03-09 03:32:14
标题: MERIT反馈激发了LLM谈判者更好的谈判
摘要: 谈判通常被视为一个逻辑的领域,而不是一门艺术或直觉的问题,然而,大型语言模型(LLMs)仍然很难在其中导航,因为它们的战略深度有限,难以适应复杂的人类因素。当前的基准很少能捕捉到这种限制。为了弥合这一差距,我们提出了一个以效用反馈为中心的框架。我们的贡献包括:(i) AgoraBench,一个新的基准,涵盖了九种具有挑战性的情境(例如欺骗、垄断),支持多样化的策略建模;(ii) 从效用理论中派生的与人类对齐、经济基础的度量标准。这通过代理的效用、谈判权力和收购比率来实现,隐含地衡量了谈判与人类偏好的一致程度;(iii) 一个基于人类偏好的数据集和学习管道,通过提示和微调强化LLMs的谈判能力。实证结果表明,基准LLM策略往往偏离人类偏好,而我们的机制显著改善了谈判表现,产生了更深层次的战略行为和更强的对手意识。
更新时间: 2026-03-09 03:32:14
领域: cs.AI
Detecting AI-Generated Images via Diffusion Snap-Back Reconstruction: A Forensic Approach
The rapid advancement of generative image models has transformed digital media to the point where AI-generated images can no longer be reliably distinguished from authentic photographs by human observers or many conventional detection methods. Modern text-to-image systems such as Stable Diffusion and DALL-E can now generate images so realistic that they often appear completely natural, leaving little to no visible artifacts for traditional deepfake detectors to rely on. This challenge has practical consequences for misinformation control, institutional identity verification, and digital trust in political and legal contexts. Instead of searching for hidden pixel-level traces, we take a different approach: we observe how an image responds when it is gently disturbed and reconstructed by a diffusion model. We call this behavior diffusion snap-back. By tracking how perceptual similarity measures (LPIPS, SSIM, and PSNR) change across different reconstruction strengths, we capture compact and interpretable signals that reveal how closely an image aligns with the diffusion model's learned denoising behavior. Evaluated on a balanced dataset of 4,000 human and AI-generated images, the proposed method achieves an AUROC of 0.993 under stratified five-fold cross-validation and 0.990 on a holdout split using only logistic regression. Initial robustness tests show that the method remains stable under common real-world distortions such as image compression and added noise. Although our experiments were conducted using a single diffusion backbone, the results indicate that reconstruction behavior can serve as a reliable and scalable foundation for synthetic media detection as generative models continue to grow more realistic.
Updated: 2026-03-09 03:31:54
标题: 通过扩散回弹重建检测AI生成的图像:一种法庭取证方法
摘要: 生成图像模型的快速发展已经将数字媒体转变到了一个程度,即人类观察者或许多传统检测方法已经无法可靠地区分人工智能生成的图像和真实照片。现代文本到图像系统,如稳定扩散和DALL E,现在可以生成如此逼真的图像,以至于它们通常看起来完全自然,几乎没有可见的痕迹供传统深度伪造检测器依赖。这一挑战对于虚假信息控制、机构身份验证以及政治和法律背景下的数字信任具有实际后果。我们采取了一种不同的方法,而不是寻找隐藏的像素级痕迹:我们观察一个图像在受到扰动并由扩散模型重构时的反应。我们称这种行为为扩散回弹。通过跟踪感知相似度度量(LPIPS、SSIM和PSNR)在不同重构强度下的变化,我们捕捉到了紧凑且可解释的信号,揭示了一个图像与扩散模型学习的去噪行为有多接近。在平衡数据集上评估了4,000张人类和人工智能生成的图像后,所提出的方法在分层五折交叉验证下实现了0.993的AUROC,并且在仅使用逻辑回归的保留分割上达到了0.990。初始的稳健性测试表明,该方法在常见的实际扭曲下仍然稳定,如图像压缩和添加噪音。虽然我们的实验是使用单个扩散骨干进行的,但结果表明,重构行为可以作为合成媒体检测的可靠且可扩展的基础,因为生成模型继续变得更加逼真。
更新时间: 2026-03-09 03:31:54
领域: cs.CV,cs.AI
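The snap-back signal above is a curve of similarity scores across reconstruction strengths. The sketch below computes only the PSNR part of that curve on flat pixel lists; the diffusion reconstruction itself is assumed to have happened elsewhere, LPIPS and SSIM are omitted, and the function names and toy pixel values are hypothetical.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def snapback_features(original, reconstructions):
    """Return PSNR at each reconstruction strength, ordered by increasing strength.

    `reconstructions` maps a strength value to the pixels a diffusion model
    produced at that strength (supplied by the caller in this sketch).
    """
    return [psnr(original, reconstructions[s]) for s in sorted(reconstructions)]

# Toy data: a 4-pixel "image" and two stand-in reconstructions.
original = [10, 20, 30, 40]
reconstructions = {0.2: [11, 21, 31, 41], 0.6: [20, 30, 40, 50]}
features = snapback_features(original, reconstructions)
```

A feature vector like this, one entry per strength and per metric, is small enough that a plain logistic regression can serve as the classifier, as the abstract reports.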
Robust Transfer Learning with Side Information
Robust Markov Decision Processes (MDPs) address environmental shift through distributionally robust optimization (DRO) by finding an optimal worst-case policy within an uncertainty set of transition kernels. However, standard DRO approaches require enlarging the uncertainty set under large shifts, which leads to overly conservative and pessimistic policies. In this paper, we propose a framework for transfer under environment shift that derives a robust target-domain policy via estimate-centered uncertainty sets, constructed through constrained estimation that integrates limited target samples with side information about the source-target dynamics. The side information includes bounds on feature moments, distributional distances, and density ratios, yielding improved kernel estimates and tighter uncertainty sets. Error bounds and convergence results are established for both robust and non-robust value functions. Moreover, we provide a finite-sample guarantee on the learned robust policy and analyze the robust sub-optimality gap. Under mild low-dimensional structure on the transition model, the side information reduces this gap and improves sample efficiency. We assess the performance of our approach across OpenAI Gym environments and classic control problems, consistently demonstrating superior target-domain performance over state-of-the-art robust and non-robust baselines.
Updated: 2026-03-09 03:29:44
标题: 具有辅助信息的稳健迁移学习
摘要: 鲁棒的马尔可夫决策过程(MDPs)通过分布鲁棒优化(DRO)来解决环境变化问题,通过在转移核心的不确定性集合中找到最优的最坏情况策略。然而,标准的DRO方法在大幅度变化时需要扩大不确定性集合,导致过于保守和悲观的策略。 本文提出了一个在环境变化下进行转移的框架,通过基于估计的不确定性集合导出鲁棒目标域策略,该集合通过受限估计构建,将有限的目标样本与关于源目标动态的辅助信息集成。辅助信息包括特征矩、分布距离和密度比的界限,从而产生改进的核心估计和更紧密的不确定性集合。 为鲁棒和非鲁棒价值函数建立了误差界和收敛结果。此外,我们对学习的鲁棒策略提供了有限样本保证,并分析了鲁棒次优差距。在转移模型上具有温和低维结构的情况下,辅助信息减小了这一差距并提高了样本效率。我们评估了我们的方法在OpenAI Gym环境和经典控制问题中的性能,始终表现出优于最先进的鲁棒和非鲁棒基线的目标域性能。
更新时间: 2026-03-09 03:29:44
领域: stat.ML,cs.LG
The Exploration of Error Bounds in Classification with Noisy Labels
Numerous studies have shown that label noise can lead to poor generalization performance, negatively affecting classification accuracy. Therefore, understanding the effectiveness of classifiers trained using deep neural networks in the presence of noisy labels is of considerable practical significance. In this paper, we focus on error bounds for the excess risk in classification problems with noisy labels within deep learning frameworks. We derive error bounds for the excess risk, decomposing it into statistical error and approximation error. To handle statistical dependencies (e.g., mixing sequences), we employ an independent block construction to bound the error, leveraging techniques for dependent processes. For the approximation error, we extend these theoretical results to the vector-valued setting, where the output space consists of $K$-dimensional unit vectors. Finally, under the low-dimensional manifold hypothesis, we further refine the approximation error to mitigate the impact of high-dimensional input spaces.
Updated: 2026-03-09 03:20:21
标题: 带有噪声标签的分类错误边界探索
摘要: 许多研究表明,标签噪声会导致泛化性能不佳,从而对分类准确性产生负面影响。因此,在存在噪声标签的情况下,了解使用深度神经网络训练的分类器的有效性具有相当实际意义。本文关注在深度学习框架中具有噪声标签的分类问题的过度风险的误差界。我们推导出过度风险的误差界,将其分解为统计误差和逼近误差。为了处理统计依赖关系(例如混合序列),我们采用独立块构造来限制误差,利用依赖过程的技术。对于逼近误差,我们将这些理论结果建立在矢量值设置中,其中输出空间由$K$维单位向量组成。最后,在低维流形假设下,我们进一步细化逼近误差,以减轻高维输入空间的影响。
更新时间: 2026-03-09 03:20:21
领域: cs.LG,stat.ML
Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases
To enable a fully data-driven learning paradigm on relational databases (RDB), relational deep learning (RDL) has recently been proposed to structure the RDB as a heterogeneous entity graph and adopt the graph neural network (GNN) as the predictive model. However, existing RDL methods neglect the imbalance problem of relational data in RDBs and risk under-representing the minority entities, leading to an unusable model in practice. In this work, we investigate, for the first time, the class imbalance problem in RDB entity classification and design the relation-centric minority synthetic over-sampling GNN (Rel-MOSS), in order to fill a critical void in the current literature. Specifically, to mitigate the issue of minority-related information being submerged by majority counterparts, we design the relation-wise gating controller to modulate neighborhood messages from each individual relation type. Based on the relational-gated representations, we further propose the relation-guided minority synthesizer for over-sampling, which integrates the entity relational signatures to maintain relational consistency. Extensive experiments on 12 entity classification datasets provide compelling evidence for the superiority of Rel-MOSS, yielding an average improvement of up to 2.46% and 4.00% in terms of Balanced Accuracy and G-Mean, compared with SOTA RDL methods and classic methods for handling class imbalance.
Updated: 2026-03-09 03:18:26
标题: Rel-MOSS:在关系数据库上实现不平衡的关系深度学习
摘要: 在最近的进展中,为了实现对关系数据库(RDB)的完全数据驱动学习范式,提出了关系深度学习(RDL)将RDB结构化为异质实体图,并采用图神经网络(GNN)作为预测模型。然而,现有的RDL方法忽视了RDB中关系数据的不平衡问题,并有可能使少数实体受到低估,导致在实践中无法使用的模型。在这项工作中,我们首次研究了RDB实体分类中的类别不平衡问题,并设计了关系中心的少数合成过采样GNN(Rel-MOSS),以填补当前文献中的一个关键空白。具体来说,为了减轻少数相关信息被多数部分淹没的问题,我们设计了关系智能门控器来调节来自每个单独关系类型的邻域消息。基于关系门控表示,我们进一步提出了关系引导的少数合成器进行过采样,该方法整合了实体关系签名以保持关系一致性。对12个实体分类数据集进行了广泛实验,为Rel-MOSS的优越性提供了令人信服的证据,相比于处理类别不平衡的SOTA RDL方法和经典方法,平均提高了Balanced Accuracy和G-Mean高达2.46%和4.00%。
更新时间: 2026-03-09 03:18:26
领域: cs.AI,cs.DB,cs.LG
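The Rel-MOSS entry above reports results in Balanced Accuracy and G-Mean. As a minimal sketch of why these metrics are preferred over plain accuracy on imbalanced data, the snippet below assumes the common definitions (arithmetic and geometric mean of per-class recall); the abstract does not state the exact formulas used, and all function names here are illustrative.

```python
import math

def per_class_recall(y_true, y_pred):
    """Recall of each class that appears in y_true, in sorted class order."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return recalls

def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recall (assumed definition)."""
    r = per_class_recall(y_true, y_pred)
    return sum(r) / len(r)

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition)."""
    r = per_class_recall(y_true, y_pred)
    return math.prod(r) ** (1.0 / len(r))

# A classifier that always predicts the majority class on an 8:2 split.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
gm = g_mean(y_true, y_pred)
```

Here plain accuracy looks acceptable even though the minority class is never predicted, while both balanced metrics penalize the missed class; G-Mean drops to zero as soon as any class has zero recall.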
Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents
Modern agents powered by thinking LLMs achieve high accuracy through long chain-of-thought reasoning but incur substantial inference costs. While many LLMs now support configurable reasoning levels (e.g., high/medium/low), static strategies are often ineffective: using low-effort modes at every step leads to significant performance degradation, while random selection fails to preserve accuracy or provide meaningful cost reduction. Ideally, agents should reserve high reasoning effort for difficult steps like navigating complex website structures, while using lower-effort modes for simpler steps like opening a target URL. In this paper, we propose Ares, a framework for per-step dynamic reasoning effort selection tailored for multi-step agent tasks. Ares employs a lightweight router to predict the lowest appropriate reasoning level for each step based on the interaction history. To train this router, we develop a data generation pipeline that identifies the minimum reasoning effort required for successful step completion. We then fine-tune the router to predict these levels, enabling plug-and-play integration for any LLM agent. We evaluate Ares on a diverse set of agent tasks, including TAU-Bench for tool use agents, BrowseComp-Plus for deep-research agents, and WebArena for web agents. Experimental results show that Ares reduces reasoning token usage by up to 52.7% compared to fixed high-effort reasoning, while introducing minimal degradation in task success rates.
Updated: 2026-03-09 03:17:29
标题: Ares:用于高效LLM智能体的自适应推理努力选择
摘要: 由思考LLMs驱动的现代代理通过长链推理实现高准确性,但会产生大量推理成本。虽然许多LLMs现在支持可配置的推理级别(例如高/中/低),但静态策略通常是无效的:在每个步骤中使用低工作模式会导致性能显著下降,而随机选择无法保持准确性或提供有意义的成本减少。然而,代理应该将高推理努力保留给像导航复杂网站结构这样的困难步骤,而对于打开目标URL这样的简单步骤使用较低工作模式。在本文中,我们提出了Ares,一个针对多步代理任务量身定制的逐步动态推理努力选择框架。Ares采用轻量级路由器根据交互历史预测每个步骤的最低适当推理级别。为了训练这个路由器,我们开发了一个数据生成管道,识别成功完成步骤所需的最小推理努力。然后,我们对路由器进行微调,以预测这些级别,为任何LLM代理提供即插即用集成。我们在一系列不同的代理任务上评估了Ares,包括用于工具使用代理的TAU-Bench,用于深度研究代理的BrowseComp-Plus和用于Web代理的WebArena。实验结果显示,与固定的高推理努力相比,Ares将推理令牌使用减少了高达52.7%,同时在任务成功率上引入了最小的降级。
更新时间: 2026-03-09 03:17:29
领域: cs.AI
Quantifying Information Loss under Coarse-Grained Partitions: A Discrete Framework for Explainable Artificial Intelligence
As artificial intelligence (AI) systems are increasingly used in ethically sensitive domains such as education, healthcare, and transportation, balancing accuracy and interpretability has become a central concern. Coarse ethics (CE) motivates coarse-grained evaluations under cognitive, institutional, and contextual constraints, but it still lacks a simple mathematical formalization of admissible coarse-graining and its informational consequences. This paper introduces coarse-grained partitions (CGPs) as a discrete framework for modeling coarse evaluation on a finite totally ordered score scale. A CGP represents coarse evaluation as a partition into grains with an index assignment, and induces a coarse-grained distribution by pushforward. To compare admissible coarse-grainings, we introduce categorical unification (CU), which constructs a canonical fine-scale reconstruction from the coarse representation under minimal assumptions. On this basis, we define a KL-based measure of information loss, $D_{\mathrm{KL\text{-}CU}}$, as the divergence between the original fine-grained distribution and its CU-based reconstruction. We prove that $D_{\mathrm{KL\text{-}CU}}=0$ if and only if the original distribution is already uniform within each grain. This shows that zero loss, in the sense of the proposed measure, is a highly exceptional limiting case rather than a realistic benchmark for ordinary evaluative practice. We also show that the framework leads naturally to an optimization problem for comparing alternative admissible CGPs. Applications to educational grading and explainable AI (XAI) illustrate how the framework clarifies trade-offs among informational fidelity, interpretability, and coarsening cost.
Updated: 2026-03-09 03:13:37
标题: 量化粗粒度划分下的信息损失:可解释人工智能的离散框架
摘要: 随着人工智能系统在教育、医疗保健和交通等伦理敏感领域的应用日益增多,平衡准确性和可解释性已经成为一个中心关注点。粗略伦理(CE)激发了在认知、制度和环境约束下进行粗粒度评估的动机,但仍然缺乏对可接受的粗粒度化及其信息后果的简单数学形式化。本文引入了粗粒度分区(CGP)作为一个离散框架,用于在有限的全序分数尺度上对粗略评估进行建模。CGP将粗粒度评估表示为一个分成粒度的分区,并通过向前推动诱导出粗粒度分布。为了比较可接受的粗粒度分区,我们引入了范畴统一(CU),它根据最小假设从粗粒度表示构建出一个规范的细粒度重建。基于此基础,我们定义了一个基于KL的信息损失度量$D_{\mathrm{KL\text{-}CU}}$,即原始细粒度分布与其基于CU的重建之间的分歧。我们证明了当且仅当原始分布在每个粒度内已经是均匀的时,$D_{\mathrm{KL\text{-}CU}}=0$。这表明零损失在所提出的度量意义上是一个非常特殊的极限情况,而不是普通评估实践的现实基准。我们还展示了该框架自然地导致了一个用于比较替代可接受CGP的优化问题。教育评分和可解释人工智能(XAI)的应用说明了该框架如何澄清信息忠实度、可解释性和粗粒化成本之间的权衡。
更新时间: 2026-03-09 03:13:37
领域: cs.AI,cs.IT,math.LO,math.PR
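The coarse-grained partition entry above defines an information-loss measure as the KL divergence between a fine-grained distribution and its reconstruction from the coarse representation. The sketch below is only an illustration: it assumes the reconstruction spreads each grain's mass evenly over the scores in that grain, which is consistent with the stated zero-loss condition but is not spelled out in the abstract. All names are hypothetical.

```python
import math

def coarse_grain(p, grains):
    """Pushforward of p onto the grains: total mass of each grain."""
    return [sum(p[i] for i in g) for g in grains]

def uniform_reconstruction(p, grains):
    """Assumed reconstruction: each grain's mass spread evenly over its scores."""
    q = [0.0] * len(p)
    for g, mass in zip(grains, coarse_grain(p, grains)):
        for i in g:
            q[i] = mass / len(g)
    return q

def kl_divergence(p, q):
    """KL(p || q) in nats, using the convention 0 * log(0 / q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_loss(p, grains):
    """Divergence between p and its assumed uniform-within-grain reconstruction."""
    return kl_divergence(p, uniform_reconstruction(p, grains))

# Five-point score scale partitioned into two grains: {0, 1} and {2, 3, 4}.
grains = [[0, 1], [2, 3, 4]]
uniform_within = [0.2, 0.2, 0.2, 0.2, 0.2]   # already uniform inside each grain
skewed = [0.35, 0.05, 0.1, 0.2, 0.3]         # non-uniform inside both grains

loss_uniform = information_loss(uniform_within, grains)
loss_skewed = information_loss(skewed, grains)
```

Under this assumption the loss vanishes (up to floating-point error) only when the distribution is flat inside every grain, and is strictly positive otherwise, mirroring the if-and-only-if statement in the abstract.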
CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting
Humans can often count unfamiliar objects by observing visual repetition and composition, rather than relying only on object categories. However, many exemplar-free counting models struggle in such situations and may overcount when objects contain symmetric components, repeated substructures, or partial occlusion. We introduce CountFormer, a controlled adaptation of a density-regression framework inspired by CounTR, where the image encoder is replaced with the self-supervised vision foundation model DINOv2. The resulting transformer features are combined with explicit two-dimensional positional embeddings and decoded by a lightweight convolutional network to produce a density map whose integral gives the final count. Our goal is not to propose a new counting architecture, but to study whether foundation-based representations improve structural consistency under a strictly exemplar-free setting. On FSC-147, CountFormer achieves competitive performance under the official benchmark (MAE 19.06, RMSE 118.45). Qualitative analysis suggests fewer part-level overcounting errors for some structurally complex objects, while overall error remains broadly consistent with prior approaches. Sensitivity analysis shows that evaluation metrics are strongly affected by a small number of extreme high-density scenes. Overall, the results highlight the role of representation quality in exemplar-free object counting.
Updated: 2026-03-09 03:13:27
标题: CountFormer:一种用于学习视觉重复和结构的Transformer框架,用于类别无关的物体计数
摘要: 人类通常可以通过观察视觉重复和组合来计算陌生物体的数量,而不仅仅依靠物体类别。然而,许多无实例计数模型在这种情况下往往表现不佳,并且在物体包含对称组件、重复子结构或部分遮挡时可能会过度计数。我们介绍了CountFormer,这是受CounTR启发的密度回归框架的受控调整,其中图像编码器被自监督视觉基础模型DINOv2替换。结果的Transformer特征与明确的二维位置嵌入相结合,并由轻量级卷积网络解码,以生成一个密度图,其积分给出最终的计数。我们的目标不是提出一个新的计数架构,而是研究基于基础的表示是否可以在严格的无实例设置下提高结构一致性。在FSC-147上,CountFormer在官方基准测试中取得了竞争性表现(MAE 19.06,RMSE 118.45)。定性分析表明,在一些结构复杂的物体中,部分级别的过度计数错误较少,而整体错误与之前的方法基本保持一致。敏感性分析显示,评估指标受极高密度场景的极少数影响较大。总的来说,结果突显了在无实例物体计数中表示质量的重要作用。
更新时间: 2026-03-09 03:13:27
领域: cs.CV,cs.AI
Long-Short Term Agents for Pure-Vision Bronchoscopy Robotic Autonomy
Accurate intraoperative navigation is essential for robot-assisted endoluminal intervention, but remains difficult because of limited endoscopic field of view and dynamic artifacts. Existing navigation platforms often rely on external localization technologies, such as electromagnetic tracking or shape sensing, which increase hardware complexity and remain vulnerable to intraoperative anatomical mismatch. We present a vision-only autonomy framework that performs long-horizon bronchoscopic navigation using preoperative CT-derived virtual targets and live endoscopic video, without external tracking during navigation. The framework uses hierarchical long-short agents: a short-term reactive agent for continuous low-latency motion control, and a long-term strategic agent for decision support at anatomically ambiguous points. When their recommendations conflict, a world-model critic predicts future visual states for candidate actions and selects the action whose predicted state best matches the target view. We evaluated the system in a high-fidelity airway phantom, three ex vivo porcine lungs, and a live porcine model. The system reached all planned segmental targets in the phantom, maintained 80% success to the eighth generation ex vivo, and achieved in vivo navigation performance comparable to the expert bronchoscopist. These results support the preclinical feasibility of sensor-free autonomous bronchoscopic navigation.
Updated: 2026-03-09 03:09:51
标题: 长短期代理人用于纯视觉支气管镜机器人自主性
摘要: 准确的术中导航对于机器人辅助内腔干预至关重要,但由于内窥镜视野有限和动态伪影的存在,导航仍然困难。现有的导航平台通常依赖于外部定位技术,如电磁跟踪或形状感知,这增加了硬件复杂性并仍然容易受到术中解剖不匹配的影响。我们提出了一个仅基于视觉的自主框架,使用术前CT衍生的虚拟目标和实时内窥镜视频执行长视程支气管导航,在导航过程中不需要外部跟踪。该框架使用分层长短代理:一个短期反应代理用于连续低延迟运动控制,一个长期战略代理用于解决解剖学模糊点的决策支持。当它们的建议发生冲突时,一个世界模型批评者预测候选动作的未来视觉状态,并选择预测状态最符合目标视图的动作。我们在一个高保真气道幻影、三个离体猪肺和一个活体猪模型中评估了该系统。系统在幻影中达到了所有计划的分段目标,在离体第八代中保持80\%的成功率,并在体内导航表现与专业支气管镜医生相当。这些结果支持无传感器自主支气管镜导航的临床前可行性。
更新时间: 2026-03-09 03:09:51
领域: cs.RO,cs.AI
Utility Theory based Cognitive Modeling in the Application of Robotics: A Survey
Cognitive modeling, which explores the essence of cognition, including motivation, emotion, and perception, has been widely applied in the artificial intelligence (AI) agent domains, such as robotics. From the computational perspective, various cognitive functionalities have been developed through utility theory to provide a detailed and process-based understanding for specifying corresponding computational models of representations, mechanisms, and processes. Especially for decision-making and learning in multi-agent/robot systems (MAS/MRS), a suitable cognitive model can guide agents in choosing reasonable strategies to achieve their current needs and learning to cooperate and organize their behaviors, optimizing the system's utility, building stable and reliable relationships, and guaranteeing each group member's sustainable development, similar to the human society. This survey examines existing robotic systems for developmental cognitive models in the context of utility theory. We discuss the evolution of cognitive modeling in robotics from behavior-based robotics (BBR) and cognitive architectures to the properties of value systems in robots, such as the studies on motivations as artificial value systems, and the utility theory based cognitive modeling for generating and updating strategies in robotic interactions. Then, we examine the extent to which existing value systems support the application of robotics from an AI agent cognitive modeling perspective, including single-agent and multi-agent systems, trust among agents, and human-robot interaction. Finally, we survey the existing literature of current value systems in relevant fields and propose several promising research directions, along with some open problems that we deem necessary for further investigation.
Updated: 2026-03-09 03:06:26
标题: 基于效用理论的认知建模在机器人应用中的研究:一项调查
摘要: 认知建模探索认知的本质,包括动机、情感和感知,在人工智能(AI)代理领域,如机器人技术中得到广泛应用。从计算的角度来看,通过效用理论开发了各种认知功能,为指定对应的计算模型提供了详细的基于过程的理解,包括表示、机制和过程。特别是对于多代理/机器人系统(MAS/MRS)中的决策和学习,一个合适的认知模型可以引导代理选择合理的策略来满足他们当前的需求,学会合作和组织自己的行为,优化系统的效用,建立稳定可靠的关系,并保证每个群体成员的可持续发展,类似于人类社会。本调查考察了在效用理论背景下的发展性认知模型在机器人系统中的应用。我们讨论了机器人技术中认知建模的演变,从基于行为的机器人技术(BBR)和认知架构到机器人中价值系统的属性,如将动机研究作为人工价值系统,以及基于效用理论的认知建模用于生成和更新机器人交互中的策略。然后,我们考察了现有的价值系统在从AI代理认知建模的角度支持机器人技术的程度,包括单一代理和多代理系统、代理之间的信任以及人机交互。最后,我们调查了相关领域当前价值系统的现有文献,并提出了几个有前途的研究方向,以及一些我们认为需要进一步研究的开放问题。
更新时间: 2026-03-09 03:06:26
领域: cs.RO,cs.AI,cs.MA,cs.NE,eess.SY
Balancing Interpretability and Performance in Motor Imagery EEG Classification: A Comparative Study of ANFIS-FBCSP-PSO and EEGNet
Achieving both accurate and interpretable classification of motor-imagery EEG remains a key challenge in brain-computer interface (BCI) research. In this paper, we compare a transparent fuzzy-reasoning approach (ANFIS-FBCSP-PSO) with a well-known deep-learning benchmark (EEGNet) using the publicly available BCI Competition IV-2a dataset. The ANFIS pipeline combines filter-bank common spatial pattern feature extraction with fuzzy IF-THEN rules optimized via particle-swarm optimization, while EEGNet learns hierarchical spatial-temporal representations directly from raw EEG data. In within-subject experiments, the fuzzy-neural model performed better (68.58% +/- 13.76% accuracy, kappa = 58.04% +/- 18.43), while in cross-subject (LOSO) tests, the deep model exhibited stronger generalization (68.20% +/- 12.13% accuracy, kappa = 57.33% +/- 16.22). The study therefore provides practical guidance for selecting MI-BCI systems according to the design goal: interpretability or robustness across users. Future investigations into transformer-based and hybrid neuro-symbolic frameworks are expected to further advance transparent EEG decoding.
Updated: 2026-03-09 03:06:03
标题: 在运动想象脑电图分类中平衡解释性和性能:ANFIS-FBCSP-PSO和EEGNet的比较研究
摘要: 在脑机接口(BCI)研究中,实现运动想象脑电信号(EEG)的准确且可解释分类仍然是一个关键挑战。本文中,我们使用公开可用的BCI Competition IV-2a数据集,将透明模糊推理方法(ANFIS-FBCSP-PSO)与著名的深度学习基准(EEGNet)进行比较。ANFIS管道将滤波器组公共空间模式特征提取与通过粒子群优化的模糊IF-THEN规则相结合,而EEGNet直接从原始脑电数据中学习分层空间-时间表示。在被试内实验中,模糊神经模型表现更好(准确率为68.58% +/- 13.76%,kappa = 58.04% +/- 18.43),而在被试间(LOSO)测试中,深度模型展现出更强的泛化能力(准确率为68.20% +/- 12.13%,kappa = 57.33% +/- 16.22)。因此,该研究为根据设计目标选择MI-BCI系统提供了实用指导:可解释性或跨用户的稳健性。未来对基于变压器和混合神经符号框架的调查有望进一步推动透明脑电解码的发展。
更新时间: 2026-03-09 03:06:03
领域: cs.LG,cs.AI,cs.NE
DyQ-VLA: Temporal-Dynamic-Aware Quantization for Embodied Vision-Language-Action Models
Vision-Language-Action (VLA) models are dominant in embodied intelligence but are constrained by inference overheads. While model quantization alleviates these bottlenecks for edge deployment, static quantization approaches remain suboptimal for VLAs due to two critical challenges: (1) Temporal-dynamic sensitivity, where fixed precision wastes resources by ignoring stage-varying error tolerances; and (2) Real-time allocation, where identifying real-time sensitivity to guide bit allocation remains unsolved. To address these challenges, we propose DyQ-VLA, a dynamic quantization framework for VLAs. Specifically, a sensitivity-aware switching strategy leverages real-time kinematic proxies to trigger the bit-width switch, while a kinematic-guided module dynamically allocates the optimal bit-width. Experiments show that DyQ-VLA requires only 30.9% of the original memory footprint while maintaining 99.5% of its original performance, achieving 1.49x simulation and up to 1.43x real-world speedups.
Updated: 2026-03-09 02:52:57
标题: DyQ-VLA:面向具身视觉-语言-动作模型的时态动态感知量化
摘要: 视觉-语言-行动(VLA)模型在具有体现智能方面占主导地位,但受到推理开销的限制。虽然模型量化可以缓解边缘部署的瓶颈,但静态量化方法对于VLA仍然不够优化,因为存在两个关键挑战:(1)时间动态敏感性,固定精度浪费资源,忽略阶段变化的错误容忍度;(2)实时分配,识别实时敏感性以指导比特分配仍未解决。为了解决这些挑战,我们提出了DyQ-VLA,一种用于VLA的动态量化框架。具体而言,一种敏感性感知的切换策略利用实时运动学代理触发比特宽度切换,而一个运动学引导模块动态分配最佳比特宽度。实验表明,DyQ-VLA仅需要原始内存占用的30.9%,同时保持其原始性能的99.5%,实现了1.49倍的模拟加速和最高1.43倍的实际速度提升。
更新时间: 2026-03-09 02:52:57
领域: cs.LG,cs.RO
NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving
Vision-language models (VLMs) have emerged as a promising direction for end-to-end autonomous driving (AD) by jointly modeling visual observations, driving context, and language-based reasoning. However, existing VLM-based systems face a trade-off between high-level reasoning and motion planning: large models offer strong semantic understanding but are costly to adapt for precise control, whereas small VLM models can be fine-tuned efficiently but often exhibit weaker reasoning. We propose NaviDriveVLM, a decoupled framework that separates reasoning from action generation using a large-scale Navigator and a lightweight trainable Driver. This design preserves reasoning ability, reduces training cost, and provides an explicit interpretable intermediate representation for downstream planning. Experiments on the nuScenes benchmark show that NaviDriveVLM outperforms large VLM baselines in end-to-end motion planning.
Updated: 2026-03-09 02:47:44
标题: NaviDriveVLM:自动驾驶的高级推理和运动规划的解耦
摘要: 视觉语言模型(VLMs)已经成为自动驾驶的一个有前途的方向,通过联合建模视觉观察、驾驶环境和基于语言的推理。然而,现有的基于VLM的系统在高级推理和运动规划之间存在权衡:大型模型提供了强大的语义理解,但在精确控制方面成本高昂,而小型的VLM模型可以有效地进行微调,但通常展现出较弱的推理能力。我们提出了NaviDriveVLM,一个分离的框架,使用大规模导航器和轻量级可训练驾驶员将推理与动作生成分开。这种设计保留了推理能力,降低了训练成本,并为下游规划提供了明确可解释的中间表示。对nuScenes基准数据集的实验表明,NaviDriveVLM在端到端运动规划方面优于大型VLM基线。
更新时间: 2026-03-09 02:47:44
领域: cs.RO,cs.LG
EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records
Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.
Updated: 2026-03-09 02:45:22
标题: EveryQuery: 通过电子健康记录上的任务条件预训练实现零射击临床预测
摘要: 在电子健康记录(EHR)上预训练的基础模型已经通过生成合成患者未来数据并在采样轨迹上聚合统计数据,展示了零-shot临床预测能力。然而,这种自回归推断过程在计算上昂贵,统计上嘈杂,并且不能直接提示,因为用户无法直接将预测条件设置为特定临床问题。在这项初步工作中,我们介绍了EveryQuery,这是一个通过任务条件预训练实现零-shot推断的EHR基础模型。与生成未来事件不同,EveryQuery接受患者历史记录和指定临床任务的结构化查询作为输入,并通过单次前向传递直接估计未来时间窗口中发生结果的可能性。EveryQuery通过在随机采样的查询任务和患者背景的组合上进行预训练,直接训练模型以产生对任意输入提示正确答案的能力来实现这一功能。这使得在不进行微调、线性探测或轨迹生成的情况下,可以对查询空间中的任何任务进行零-shot预测。在MIMIC-IV上,EveryQuery在39个随机抽样预测任务中的82%上优于自回归基线,均值AUC提升为+0.16(95% CI: [0.10,0.22])。这种优势在明确保留在预训练分布之外的任务上仍然保持一致。此外,EveryQuery在罕见的临床事件上的性能提升最为显著,验证并展示了对于低患病率结果的自回归推断的基本限制的解决方案。然而,目前,EveryQuery在需要对多个代码进行分离推理的任务上表现不佳,例如30天再入院,暴露了当前查询语言的具体表达能力限制。
更新时间: 2026-03-09 02:45:22
领域: cs.AI
Feedback Control for Small Budget Pacing
Budget pacing is critical in online advertising to align spend with campaign goals under dynamic auctions. Existing pacing methods often rely on ad-hoc parameter tuning, which can be unstable and inefficient. We propose a principled controller that combines bucketized hysteresis with proportional feedback to provide stable and adaptive spend control. Our method provides a framework and analysis for parameter selection that enables accurate tracking of desired spend rates across campaigns. Experiments in real-world auctions demonstrate significant improvements in pacing accuracy and delivery consistency, reducing pacing error by 13% and $λ$-volatility by 54% compared to the baseline method. By bridging control theory with advertising systems, our approach offers a scalable and reliable solution for budget pacing, with particular benefits for small-budget campaigns.
Updated: 2026-03-09 02:44:27
标题: 小预算节奏的反馈控制
摘要: 在线广告中的预算节奏对于在动态拍卖中将支出与活动目标对齐至关重要。现有的预算节奏方法通常依赖于临时参数调整,这可能不稳定且低效。我们提出了一个基于桶化滞后和比例反馈结合的原则控制器,以提供稳定和自适应的支出控制。我们的方法提供了一个参数选择的框架和分析,可以实现对不同活动的所需支出速率的准确跟踪。在真实世界的拍卖中进行的实验证明,在节奏准确性和交付一致性方面取得了显著的改善,与基准方法相比,节奏误差减少了13%,$ λ $波动性减少了54%。通过将控制理论与广告系统结合起来,我们的方法为预算节奏提供了一种可扩展和可靠的解决方案,尤其适用于小预算活动。
更新时间: 2026-03-09 02:44:27
领域: cs.LG,cs.GT
Bayesian Transformer for Probabilistic Load Forecasting in Smart Grids
The reliable operation of modern power grids requires probabilistic load forecasts with well-calibrated uncertainty estimates. However, existing deep learning models produce overconfident point predictions that fail catastrophically under extreme weather distributional shifts. This study proposes a Bayesian Transformer (BT) framework that integrates three complementary uncertainty mechanisms into a PatchTST backbone: Monte Carlo Dropout for epistemic parameter uncertainty, variational feed-forward layers with log-uniform weight priors, and stochastic attention with learnable Gaussian noise perturbations on pre-softmax logits, representing, to the best of our knowledge, the first application of Bayesian attention to probabilistic load forecasting. A seven-level multi-quantile pinball-loss prediction head and post-training isotonic regression calibration produce sharp prediction intervals with near-nominal coverage. Evaluation on five grid datasets (PJM, ERCOT, ENTSO-E Germany, France, and Great Britain) augmented with NOAA covariates across 24, 48, and 168-hour horizons demonstrates state-of-the-art performance. On the primary benchmark (PJM, H=24h), BT achieves a CRPS of 0.0289, improving 7.4% over Deep Ensembles and 29.9% over the deterministic LSTM, with 90.4% PICP at the 90% nominal level and the narrowest prediction intervals (4,960 MW) among all probabilistic baselines. During heat-wave and cold-snap events, BT maintained 89.6% and 90.1% PICP respectively, versus 64.7% and 67.2% for the deterministic LSTM, confirming that Bayesian epistemic uncertainty naturally widens intervals for out-of-distribution inputs. Calibration remained stable across all horizons (89.8-90.4% PICP), while ablation confirmed that each component contributed distinct value. The calibrated outputs directly support risk-based reserve sizing, stochastic unit commitment, and demand response activation.
Updated: 2026-03-09 02:39:51
标题: 贝叶斯变换器用于智能电网中的概率负荷预测
摘要: 现代电网的可靠运行需要具有良好校准不确定性估计的概率负荷预测。然而,现有的深度学习模型产生过度自信的点预测,在极端天气分布变化下会出现灾难性失败。本研究提出了一个贝叶斯变换器(BT)框架,将三种互补的不确定性机制整合到PatchTST骨干中:蒙特卡洛Dropout用于认知参数不确定性,具有对数均匀权重先验的变分前馈层,以及在预softmax logits上具有可学习的高斯噪声扰动的随机注意力,代表了我们所知的第一个将贝叶斯注意力应用于概率负荷预测。七级多分位数风筝损失预测头和训练后的等温回归校准产生尖锐、接近名义覆盖的预测区间。对五个电网数据集(PJM、ERCOT、ENTSO-E德国、法国和英国)进行评估,跨24、48和168小时的时间范围,展示了最先进的性能。在主要基准测试(PJM,H=24h)上,BT实现了0.0289的CRPS,比深度合奏提高了7.4%,比确定性LSTM提高了29.9%,在90%名义水平上具有90.4%的PICP,且在所有概率基线中具有最窄的预测区间(4,960 MW)。在热浪和寒潮事件中,BT分别保持了89.6%和90.1%的PICP,而确定性LSTM分别为64.7%和67.2%,证实贝叶斯认知不确定性自然地扩大了超出分布输入的间隔。校准在所有时间范围内保持稳定(89.8-90.4%的PICP),消融实验证实每个组件都提供了独特的价值。校准输出直接支持基于风险的备用容量规模、随机机组承诺和需求响应激活。
更新时间: 2026-03-09 02:39:51
领域: cs.LG,stat.ML
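The Bayesian Transformer entry above relies on a pinball-loss quantile head and reports PICP as its coverage metric. The following is a minimal sketch of the standard textbook definitions of both quantities, not the paper's implementation; the function names and the toy numbers are illustrative assumptions.

```python
def pinball_loss(y_true, q_pred, tau):
    """Mean pinball (quantile) loss at quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, q_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(y_true)

def picp(y_true, lower, upper):
    """Prediction interval coverage probability: share of targets inside the interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)

# Toy check at the 0.9 quantile: missing low costs more than missing high.
under = pinball_loss([10.0], [8.0], tau=0.9)   # forecast below the observed load
over = pinball_loss([10.0], [12.0], tau=0.9)   # forecast above the observed load

# Toy coverage: two of four observations fall inside the interval [0, 2].
coverage = picp([1.0, 2.0, 3.0, 4.0], [0.0] * 4, [2.0] * 4)
```

The asymmetry of the loss is what pushes a head trained at level tau toward the tau-th conditional quantile; a well-calibrated 90% interval should then give a PICP close to 0.9 on held-out data.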
Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations
Prompt injection attacks are a critical security vulnerability in large language models (LLMs), allowing attackers to hijack model behavior by injecting malicious instructions within the input context. Recent defense mechanisms have leveraged an Instruction Hierarchy (IH) Signal, often implemented through special delimiter tokens or additive embeddings to denote the privilege level of input tokens. However, these prior works typically inject the IH signal exclusively at the initial input layer, which we hypothesize limits its ability to effectively distinguish the privilege levels of tokens as it propagates through the different layers of the model. To overcome this limitation, we introduce a novel approach that injects the IH signal into the intermediate token representations within the network. Our method augments these representations with layer-specific trainable embeddings that encode the privilege information. Our evaluations across multiple models and training methods reveal that our proposal yields between $1.6\times$ and $9.2\times$ reduction in attack success rate on gradient-based prompt injection attacks compared to state-of-the-art methods, without significantly degrading the model's utility.
Updated: 2026-03-09 02:38:13
标题: 通过增强中间表示实现更强的指令层次执行力
摘要: 即时注入攻击是大型语言模型(LLMs)中的关键安全漏洞,允许攻击者通过在输入上下文中注入恶意指令来劫持模型行为。最近的防御机制已经利用了指令层次结构(IH)信号,通常通过特殊的分隔符令牌或附加嵌入来表示输入令牌的特权级别。然而,这些先前的工作通常将IH信号独占地注入到初始输入层,我们假设这限制了其在模型不同层之间传播时有效区分令牌特权级别的能力。为了克服这一限制,我们引入了一种新颖的方法,将IH信号注入到网络内部的中间令牌表示中。我们的方法通过在每个层次特定的可训练嵌入中编码特权信息来增强这些表示。我们在多个模型和训练方法上的评估表明,与最先进的方法相比,我们的提议在基于梯度的即时注入攻击中能够将攻击成功率降低1.6倍至9.2倍,而不会显著降低模型的实用性。
更新时间: 2026-03-09 02:38:13
领域: cs.AI,cs.LG
Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning
Open-set active learning (OSAL) aims to identify informative samples for annotation when unlabeled data may contain previously unseen classes, a common challenge in safety-critical and open-world scenarios. Existing approaches typically rely on separately trained open-set detectors, introducing substantial training overhead and overlooking the supervisory value of labeled unknowns for improving known-class learning. In this paper, we propose E$^2$OAL (Effective and Efficient Open-set Active Learning), a unified and detector-free framework that fully exploits labeled unknowns for both stronger supervision and more reliable querying. E$^2$OAL first uncovers the latent class structure of unknowns through label-guided clustering in a frozen contrastively pre-trained feature space, optimized by a structure-aware F1-product objective. To leverage labeled unknowns, it employs a Dirichlet-calibrated auxiliary head that jointly models known and unknown categories, improving both confidence calibration and known-class discrimination. Building on this, a logit-margin purity score estimates the likelihood of known classes to construct a high-purity candidate pool, while an OSAL-specific informativeness metric prioritizes partially ambiguous yet reliable samples. These components together form a flexible two-stage query strategy with adaptive precision control and minimal hyperparameter sensitivity. Extensive experiments across multiple OSAL benchmarks demonstrate that E$^2$OAL consistently surpasses state-of-the-art methods in accuracy, efficiency, and query precision, highlighting its effectiveness and practicality for real-world applications. The code is available at github.com/chenchenzong/E2OAL.
Updated: 2026-03-09 02:35:36
标题: 重新审视未知:朝向有效和高效的开放式主动学习
摘要: 开放集主动学习(OSAL)旨在在未标记的数据可能包含先前未见类别的情况下,确定信息样本以进行注释-这是安全关键和开放世界场景中的常见挑战。现有方法通常依赖于单独训练的开放集检测器,引入了大量的训练开销,并忽视了标记的未知类别对改进已知类别学习的监督价值。在本文中,我们提出了E$^2$OAL(有效和高效的开放集主动学习),这是一个统一且无需检测器的框架,充分利用标记的未知类别进行更强的监督和更可靠的查询。E$^2$OAL首先通过在冻结的对比性预训练特征空间中进行标签引导聚类来揭示未知类别的潜在类结构,通过结构感知F1-乘积目标进行优化。为了利用标记的未知类别,它采用了一个狄利克雷校准的辅助头,联合建模已知和未知类别,提高了信心校准和已知类别区分能力。在此基础上,一个logit-margin纯度得分估计已知类别的可能性,构建一个高纯度的候选池,同时一个OSAL特定的信息度量指标优先考虑部分模糊但可靠的样本。这些组件共同构成一个灵活的两阶段查询策略,具有自适应精度控制和最小的超参数敏感性。在多个OSAL基准测试中进行的大量实验表明,E$^2$OAL在准确性、效率和查询精度方面始终优于最先进的方法,突显了其对实际应用的有效性和实用性。该代码可在github.com/chenchenzong/E2OAL上找到。
更新时间: 2026-03-09 02:35:36
领域: cs.CV,cs.LG
Mapping Overlaps in Benchmarks through Perplexity in the Wild
We introduce benchmark signatures to characterize the capacity demands of LLM benchmarks and their overlaps. Signatures are sets of salient tokens from in-the-wild corpora whose model token perplexity, reflecting training exposure, predicts benchmark performance. We extract them via stepwise forward selection with linear regression in a meta-evaluation spanning 32 LLMs and 89 benchmarks across diverse domains. We then analyze how these signatures relate to both the semantic similarity of benchmark questions and the correlation structure of model performance. While performance correlations are uniformly high and semantic overlaps stay in a narrow mid-range, benchmark signatures reveal more nuanced structure. For instance, they uncover substantial overlap between benchmarks in knowledge and reasoning tasks, whereas benchmarks in culture- and humanity-oriented domains show low similarity with each other. Unlike raw performance correlations, which are influenced by benchmark-orthogonal factors such as question formats, signatures are robust to such confounds. We further identify cross-functional overlaps between logic, math, language, instruction following, and cultural/world modeling, with coding emerging as the most isolated function, interacting only moderately with the ability to detect missing information. Qualitative analysis shows that only the knowledge signature aligns with actual knowledge, suggesting that LLM semantic organization may differ from human conceptual structure. Together, these findings offer insights into benchmark validity, LLM sensitivities, and the landscape of interconnected LLM capacities. We have open-sourced the code and data at https://github.com/siyangwu1/Benchmark-Signature-Repository.
Updated: 2026-03-09 02:32:28
标题: 在野外通过困惑度映射基准测试中的重叠
摘要: 我们引入基准特征来表征LLM基准的容量需求及其重叠。特征是从野外语料库中提取的显著标记的集合,其模型标记困惑性,反映训练暴露,可以预测基准性能。我们通过跨越32个LLMs和89个不同领域的基准的元评估中的线性回归逐步前向选择提取它们。然后,我们分析这些特征与基准问题的语义相似性以及模型性能的相关结构之间的关系。尽管性能相关性普遍较高,语义重叠保持在一个较窄的中间范围内,基准特征揭示了更微妙的结构。例如,它们揭示了知识和推理任务之间的基准之间的重叠,而以文化和人文为导向的领域的基准之间的相似性很低。与原始性能相关性不同,受基准正交因素(如问题格式)影响的特征对这种混淆具有鲁棒性。我们进一步确定了逻辑、数学、语言、遵循指令和文化/世界建模之间的跨功能重叠,编码出现为最孤立的功能,与检测缺失信息的能力仅有适度的交互。定性分析表明,只有知识特征与实际知识相符,这表明LLM语义组织可能与人类概念结构不同。这些发现共同提供了关于基准有效性、LLM敏感性以及互相关LLM容量的景观的见解。我们在https://github.com/siyangwu1/Benchmark-Signature-Repository中开源了代码和数据。
更新时间: 2026-03-09 02:32:28
领域: cs.AI,cs.CL
LeJOT-AutoML: LLM-Driven Feature Engineering for Job Execution Time Prediction in Databricks Cost Optimization
Databricks job orchestration systems (e.g., LeJOT) reduce cloud costs by selecting low-priced compute configurations while meeting latency and dependency constraints. Accurate execution-time prediction under heterogeneous instance types and non-stationary runtime conditions is therefore critical. Existing pipelines rely on static, manually engineered features that under-capture runtime effects (e.g., partition pruning, data skew, and shuffle amplification), and predictive signals are scattered across logs, metadata, and job scripts, lengthening update cycles and increasing engineering overhead. We present LeJOT-AutoML, an agent-driven AutoML framework that embeds large language model agents throughout the ML lifecycle. LeJOT-AutoML combines retrieval-augmented generation over a domain knowledge base with a Model Context Protocol toolchain (log parsers, metadata queries, and a read-only SQL sandbox) to analyze job artifacts, synthesize and validate feature-extraction code via safety gates, and train/select predictors. This design materializes runtime-derived features that are difficult to obtain through static analysis alone. On enterprise Databricks workloads, LeJOT-AutoML generates over 200 features and reduces the feature-engineering and evaluation loop from weeks to 20-30 minutes, while maintaining competitive prediction accuracy. Integrated into the LeJOT pipeline, it enables automated continuous model updates and achieves 19.01% cost savings in our deployment setting through improved orchestration.
Updated: 2026-03-09 02:31:50
标题: LeJOT-AutoML: 基于LLM的特征工程用于在Databricks成本优化中预测作业执行时间
摘要: Databricks作业编排系统(例如LeJOT)通过选择价格低廉的计算配置来降低云成本,同时满足延迟和依赖性约束。因此,在异构实例类型和非稳态运行时条件下准确预测执行时间至关重要。现有的管道依赖于静态、手工设计的特征,这些特征对运行时效果(例如分区修剪、数据倾斜和洗牌放大)进行了不充分的捕捉,而预测信号分散在日志、元数据和作业脚本之间,延长了更新周期并增加了工程开销。我们提出了LeJOT-AutoML,这是一个以代理驱动的AutoML框架,贯穿整个机器学习生命周期嵌入了大型语言模型代理。LeJOT-AutoML将检索增强生成与领域知识库相结合,利用模型上下文协议工具链(日志解析器、元数据查询和只读SQL沙盒)来分析作业工件,通过安全门限合成和验证特征提取代码,并训练/选择预测器。这种设计实现了通过静态分析难以获得的运行时派生特征。在企业Databricks工作负载中,LeJOT-AutoML生成了超过200个特征,并将特征工程和评估循环从几周减少到20-30分钟,同时保持了竞争力的预测准确性。集成到LeJOT管道中,它实现了自动化连续模型更新,并通过改进编排实现了我们部署设置中的19.01%成本节约。
更新时间: 2026-03-09 02:31:50
领域: cs.LG
SMGI: A Structural Theory of General Artificial Intelligence
We introduce SMGI, a structural theory of general artificial intelligence, and recast the foundational problem of learning from the optimization of hypotheses within fixed environments to the controlled evolution of the learning interface itself. We formalize the Structural Model of General Intelligence (SMGI) via a typed meta-model $θ= (r,\mathcal H,Π,\mathcal L,\mathcal E,\mathcal M)$ that treats representational maps, hypothesis spaces, structural priors, multi-regime evaluators, and memory operators as explicitly typed, dynamic components. By enforcing a strict mathematical separation between this structural ontology ($θ$) and its induced behavioral semantics ($T_θ$), we define general artificial intelligence as a class of admissible coupled dynamics $(θ, T_θ)$ satisfying four obligations: structural closure under typed transformations, dynamical stability under certified evolution, bounded statistical capacity, and evaluative invariance across regime shifts. We prove a structural generalization bound that links sequential PAC-Bayes analysis and Lyapunov stability, providing sufficient conditions for capacity control and bounded drift under admissible task transformations. Furthermore, we establish a strict structural inclusion theorem demonstrating that classical empirical risk minimization, reinforcement learning, program-prior models (Solomonoff-style), and modern frontier agentic pipelines operate as structurally restricted instances of SMGI.
Updated: 2026-03-09 02:31:31
标题: SMGI:一种通用人工智能的结构理论
摘要: 我们介绍SMGI,这是一个关于一般人工智能的结构理论,并将学习的基础问题重新构建为通过对学习界面自身的控制进化而不是在固定环境中优化假设。我们通过一个类型化的元模型$θ=(r,\mathcal H,Π,\mathcal L,\mathcal E,\mathcal M)$来形式化一般智能的结构模型(SMGI),该模型将表示映射、假设空间、结构先验、多重评估器和记忆操作符作为显式类型化的动态组件。通过在结构本体($θ$)和其引导行为语义($T_θ$)之间严格数学分离,我们将一般人工智能定义为满足四个义务的一类可接受的耦合动力学$(θ, T_θ)$:在类型化转换下的结构闭包、经过认证进化后的动态稳定性、有界的统计容量以及在制度转变下的评估不变性。我们证明了一个结构泛化界限,将顺序PAC-Bayes分析和Lyapunov稳定性联系起来,提供了在可接受的任务转换下的容量控制和有界漂移的充分条件。此外,我们建立了一个严格的结构包含定理,证明了传统的经验风险最小化、强化学习、程序先验模型(Solomonoff风格)和现代前沿代理管道是SMGI的结构受限实例。
更新时间: 2026-03-09 02:31:31
领域: cs.AI,cs.LG
MICA: Multi-Agent Industrial Coordination Assistant
Industrial workflows demand adaptive and trustworthy assistance that can operate under limited computing, connectivity, and strict privacy constraints. In this work, we present MICA (Multi-Agent Industrial Coordination Assistant), a perception-grounded and speech-interactive system that delivers real-time guidance for assembly, troubleshooting, part queries, and maintenance. MICA coordinates five role-specialized language agents, audited by a safety checker, to ensure accurate and compliant support. To achieve robust step understanding, we introduce Adaptive Step Fusion (ASF), which dynamically blends expert reasoning with online adaptation from natural speech feedback. Furthermore, we establish a new multi-agent coordination benchmark across representative task categories and propose evaluation metrics tailored to industrial assistance, enabling systematic comparison of different coordination topologies. Our experiments demonstrate that MICA consistently improves task success, reliability, and responsiveness over baseline structures, while remaining deployable on practical offline hardware. Together, these contributions highlight MICA as a step toward deployable, privacy-preserving multi-agent assistants for dynamic factory environments. The source code will be made publicly available at https://github.com/Kratos-Wen/MICA.
Updated: 2026-03-09 02:29:31
标题: MICA: 多智能体工业协调助手
摘要: 工业工作流需要适应性强且可信赖的辅助系统,能够在计算、连接和隐私等方面受到限制的情况下运行。在这项工作中,我们提出了MICA(多智能体工业协调助手),这是一个基于感知和语音交互的系统,为装配、故障排除、零部件查询和维护提供实时指导。MICA协调五个专业角色的语言智能体,经过安全检查,以确保准确和符合规范的支持。为了实现稳健的步骤理解,我们引入了自适应步骤融合(ASF),它动态地将专家推理与自然语音反馈的在线适应相结合。此外,我们建立了一个新的跨代表性任务类别的多智能体协调基准,并提出了针对工业辅助的评估指标,使不同协调拓扑结构能够进行系统比较。我们的实验证明,与基线结构相比,MICA 在任务成功率、可靠性和响应性方面始终得到改善,同时仍能在实际的离线硬件上部署。总的来说,这些贡献突显了MICA 作为面向动态工厂环境的可部署、隐私保护的多智能体助手的一大进步。源代码将在https://github.com/Kratos-Wen/MICA 上公开提供。
更新时间: 2026-03-09 02:29:31
领域: cs.AI,cs.CV,cs.LG
LMMRec: LLM-driven Motivation-aware Multimodal Recommendation
Motivation-based recommendation systems uncover user behavior drivers. Motivation modeling, crucial for decision-making and content preference, explains recommendation generation. Existing methods often treat motivation as latent variables from interaction data, neglecting heterogeneous information like review text. In multimodal motivation fusion, two challenges arise: 1) achieving stable cross-modal alignment amid noise, and 2) identifying features reflecting the same underlying motivation across modalities. To address these, we propose LLM-driven Motivation-aware Multimodal Recommendation (LMMRec), a model-agnostic framework leveraging large language models for deep semantic priors and motivation understanding. LMMRec uses chain-of-thought prompting to extract fine-grained user and item motivations from text. A dual-encoder architecture models textual and interaction-based motivations for cross-modal alignment, while Motivation Coordination Strategy and Interaction-Text Correspondence Method mitigate noise and semantic drift through contrastive learning and momentum updates. Experiments on three datasets show LMMRec achieves up to a 4.98% performance improvement.
Updated: 2026-03-09 02:29:01
标题: LMMRec: 基于LLM的动机感知多模态推荐
摘要: 基于动机的推荐系统揭示了用户行为驱动因素。决策和内容偏好至关重要的动机建模解释了推荐生成。现有方法通常将动机视为交互数据中的潜在变量,忽略了像评论文本这样的异构信息。在多模态动机融合中,出现了两个挑战:1)在噪声中实现稳定的跨模态对齐,2)识别反映跨模态中相同基础动机的特征。为了解决这些问题,我们提出了基于LLM的动机感知多模态推荐(LMMRec),这是一个利用大型语言模型进行深层语义先验和动机理解的模型不可知框架。LMMRec使用思维链提示从文本中提取用户和项目的细粒度动机。双编码器架构模拟了文本和基于互动的动机以进行跨模态对齐,同时通过对比学习和动量更新的动机协调策略和互动-文本对应方法减轻了噪声和语义漂移。在三个数据集上的实验表明,LMMRec实现了高达4.98%的性能改进。
更新时间: 2026-03-09 02:29:01
领域: cs.IR,cs.AI
Designing probabilistic AI monsoon forecasts to inform agricultural decision-making
Hundreds of millions of farmers make high-stakes decisions under uncertainty about future weather. Forecasts can inform these decisions, but available choices and their risks and benefits vary between farmers. We introduce a decision-theory framework for designing useful forecasts in settings where the forecaster cannot prescribe optimal actions because farmers' circumstances are heterogeneous. We apply this framework to the case of seasonal onset of monsoon rains, a key date for planting decisions and agricultural investments in many tropical countries. We develop a system for tailoring forecasts to the requirements of this framework by blending systematically benchmarked artificial intelligence (AI) weather prediction models with a new "evolving farmer expectations" statistical model. This statistical model applies Bayesian inference to historical observations to predict time-varying probabilities of first-occurrence events throughout a season. The blended system yields more skillful Indian monsoon forecasts at longer lead times than its components or any multi-model average. In 2025, this system was deployed operationally in a government-led program that delivered subseasonal monsoon onset forecasts to 38 million Indian farmers, skillfully predicting that year's early-summer anomalous dry period. This decision-theory framework and blending system offer a pathway for developing climate adaptation tools for large vulnerable populations around the world.
Updated: 2026-03-09 02:25:12
标题: 设计概率人工智能季风预测以指导农业决策
摘要: 数亿农民在未来天气不确定性下做出高风险决策。预测可以帮助这些决策,但可选选择和它们的风险和收益在农民之间有所不同。我们引入了一个决策理论框架,用于设计在预测者无法规定最佳行动的情况下有用的预测,因为农民的情况是异质的。我们将这一框架应用于季风雨的季节性开始案例,这是许多热带国家种植决策和农业投资的关键日期。我们开发了一个系统,通过将经过系统基准测试的人工智能(AI)天气预测模型与新的“不断演化的农民期望”统计模型相结合,来定制符合这一框架要求的预测。这个统计模型应用贝叶斯推断对历史观测进行预测,以预测整个季节中首次发生事件的时间变化概率。混合系统在比其组成部分或任何多模型平均更长的前导时间内产生更具技能的印度季风预测。在2025年,这个系统在一个由政府主导的项目中被投入运营,向3800万印度农民提供了次季节性季风开始预测,并成功预测了当年初夏异常干旱期。这一决策理论框架和混合系统为开发全球大规模脆弱人口的气候适应工具提供了一条途径。
更新时间: 2026-03-09 02:25:12
领域: cs.LG,cs.AI,econ.GN,physics.ao-ph
Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation
Sharpness-Aware Minimization (SAM) enhances generalization by minimizing the maximum training loss within a predefined neighborhood around the parameters. However, its practical implementation approximates this as one or more gradient ascent steps followed by applying the gradient at the ascent point to update the current parameters. This practice can be justified as approximately optimizing the objective by neglecting the (full) derivative of the ascent point with respect to the current parameters. Nevertheless, a direct and intuitive understanding of why updating the current parameters with the gradient at the ascent point works better is still lacking. Our work bridges this gap by proposing a novel and intuitive interpretation. We show that the gradient at the single-step ascent point, when applied to the current parameters, provides a better approximation of the direction from the current parameters toward the maximum within the local neighborhood than the local gradient. This improved approximation thereby enables a more direct escape from the maximum within the local neighborhood. Nevertheless, our analysis further reveals two issues. First, the approximation by the gradient at the single-step ascent point is often inaccurate. Second, the approximation quality may degrade as the number of ascent steps increases. To address these limitations, we propose eXplicit Sharpness-Aware Minimization (XSAM). It tackles the first issue by explicitly estimating the direction of the maximum during training, and the second by crafting a search space that effectively leverages the gradient information at the multi-step ascent point. XSAM features a unified formulation that applies to both single-step and multi-step settings and incurs only negligible computational overhead. Extensive experiments demonstrate the consistent superiority of XSAM over existing counterparts.
Updated: 2026-03-09 02:11:56
标题: 重新审视锐度感知最小化:更忠实和有效的实现
摘要: 锐度感知最小化(SAM)通过在参数周围的预定义邻域内最小化最大训练损失来增强泛化能力。然而,其实际实现将此近似为梯度上升,然后应用梯度在上升点更新当前参数。这种做法可以被解释为大致优化目标,忽略相对于当前参数的(完整的)上升点的导数。然而,为什么使用上升点的梯度来更新当前参数能够更优越地工作仍然缺乏直接和直观的理解。我们的工作通过提出一种新颖而直观的解释弥合了这一差距。我们表明,在单步上升点处的梯度,当应用于当前参数时,提供了一个更好的近似,指向在局部邻域内朝向最大值的方向,比局部梯度更好。这种改进的近似从而使得更直接地逃离局部邻域内的最大值。然而,我们的分析进一步揭示了两个问题。首先,单步上升点处的梯度近似通常是不准确的。其次,随着上升步数的增加,近似质量可能会下降。为了解决这些限制,我们在本文中提出了显式锐度感知最小化(XSAM)。它通过在训练过程中显式估计最大值的方向来解决第一个问题,通过设计一个有效利用多步上升点处的梯度信息的搜索空间来解决第二个问题。XSAM具有一个统一的公式,适用于单步和多步设置,只产生微不足道的计算开销。大量实验表明,XSAM相对于现有对手具有持续的优越性。
更新时间: 2026-03-09 02:11:56
领域: cs.LG,cs.AI
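The two-step update that the SAM entry above describes (ascend to a nearby high-loss point, then apply the gradient taken there to the current parameters) can be sketched in a few lines. This is a minimal illustration of standard single-step SAM only, not of XSAM; the toy quadratic loss, the step size `lr`, and the radius `rho` are assumptions chosen for the example.

```python
import math

def loss(w):
    # Hypothetical toy loss: an ill-conditioned quadratic bowl.
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [10.0 * w[0], w[1]]

def sam_step(w, lr=0.05, rho=0.1):
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # avoid division by zero
    # Ascent step: move distance rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: the gradient at the ascent point updates the CURRENT parameters.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w_init = [1.0, 1.0]
w = list(w_init)
for _ in range(50):
    w = sam_step(w)
```

Note that `w_adv` is discarded after each step; only its gradient is reused, which is the approximation the abstract analyzes.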
A Lightweight Traffic Map for Efficient Anytime LaCAM*
Multi-Agent Path Finding (MAPF) aims to compute collision-free paths for multiple agents and has a wide range of practical applications. LaCAM*, an anytime configuration-based solver, currently represents the state of the art. Recent work has explored the use of guidance paths to steer LaCAM* toward configurations that avoid traffic congestion, thereby improving solution quality. However, existing approaches rely on Frank-Wolfe-style optimization that repeatedly invokes single-agent search before executing LaCAM*, resulting in substantial computational overhead for large-scale problems. Moreover, the guidance path is static and primarily beneficial for finding the first solution in LaCAM*. To address these limitations, we propose a new approach that leverages LaCAM*'s ability to construct a dynamic, lightweight traffic map during its search. Experimental results demonstrate that our method achieves higher solution quality than state-of-the-art guidance-path approaches across two MAPF variants.
Updated: 2026-03-09 02:11:07
标题: 一种轻量级交通地图用于高效的任意时刻LaCAM*
摘要: 多智能体路径规划(MAPF)旨在为多个智能体计算无碰撞路径,并具有广泛的实际应用。LaCAM*是一种任何时候都可以基于配置的解决方案,目前代表着最先进的技术。最近的研究探讨了使用引导路径来引导LaCAM*朝着避免交通拥堵的配置,从而提高解决方案的质量。然而,现有方法依赖于Frank-Wolfe风格的优化,它在执行LaCAM*之前反复调用单智能体搜索,导致大规模问题的计算开销相当大。此外,引导路径是静态的,并且主要有利于在LaCAM*中找到第一个解决方案。为了解决这些限制,我们提出了一种新方法,利用LaCAM*在搜索过程中构建动态、轻量级的交通地图的能力。实验结果表明,我们的方法在两个MAPF变体中实现了比最先进的引导路径方法更高的解决方案质量。
更新时间: 2026-03-09 02:11:07
领域: cs.AI
Computing Evolutionarily Stable Strategies in Multiplayer Games
We present an algorithm for computing all evolutionarily stable strategies in nondegenerate normal-form games with three or more players.
Updated: 2026-03-09 02:09:51
标题: 计算多人游戏中的进化稳定策略
摘要: 我们提出了一种算法,用于计算在具有三名或更多玩家的非退化正态形式博弈中的所有进化稳定策略。
更新时间: 2026-03-09 02:09:51
领域: cs.GT,cs.AI,cs.MA,econ.TH,q-bio.PE
Visualizing Coalition Formation: From Hedonic Games to Image Segmentation
We propose image segmentation as a visual diagnostic testbed for coalition formation in hedonic games. Modeling pixels as agents on a graph, we study how a granularization parameter shapes equilibrium fragmentation and boundary structure. On the Weizmann single-object benchmark, we relate multi-coalition equilibria to binary protocols by measuring whether the converged coalitions overlap with a foreground ground-truth. We observe transitions from cohesive to fragmented yet recoverable equilibria, and finally to intrinsic failure under excessive fragmentation. Our core contribution links multi-agent systems with image segmentation by quantifying the impact of mechanism design parameters on equilibrium structures.
Updated: 2026-03-09 02:08:17
标题: 可视化联盟形成:从快感游戏到图像分割
摘要: 我们提出将图像分割作为快感游戏中联盟形成的视觉诊断测试平台。将像素建模为图上的代理,我们研究了颗粒化参数如何塑造平衡碎片化和边界结构。在Weizmann单对象基准测试中,我们通过测量收敛的联盟是否与前景真实值重叠,将多联盟平衡与二进制协议相关联。我们观察到从凝聚到碎片化但可恢复的平衡,最终到过度碎片化下的内在失败的转变。我们的核心贡献是通过量化机制设计参数对平衡结构的影响,将多智能体系统与图像分割联系起来。
更新时间: 2026-03-09 02:08:17
领域: cs.AI,cs.CV
VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action), and we curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.
Updated: 2026-03-09 02:01:02
标题: VLM-SubtleBench:VLM与人类水平微妙比较推理有多远?
摘要: 能够区分视觉上相似图像之间微小差异的能力对于工业异常检测、医学影像和航空监视等各种领域至关重要。虽然最近出现了用于视觉-语言模型(VLMs)的比较推理基准,但它们主要关注具有明显差异的图像,并未捕捉到现实应用所需的微妙推理。在这项工作中,我们引入了VLM-SubtleBench,这是一个旨在评估VLMs在微妙比较推理上的基准。我们的基准涵盖了十种差异类型 - 属性、状态、情绪、时间、空间、存在、数量、质量、视角和行动 - 并筛选出反映这些细微变化的配对问题-图像集。与之前仅限于自然图像数据集的基准不同,我们的基准跨越了多个领域,包括工业、航空和医学影像。通过对专有和开源VLMs的广泛评估,我们揭示了模型和人类在不同类型和领域之间性能之间的系统差距,并提供了控制分析,突出了VLMs推理明显恶化的地方。总的来说,我们的基准和发现为推动VLMs迈向人类级比较推理奠定了基础。
更新时间: 2026-03-09 02:01:02
领域: cs.CV,cs.AI,cs.LG
Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference
Inference-time methods that aggregate and prune multiple samples have emerged as a powerful paradigm for steering large language models, yet we lack any principled understanding of their accuracy-cost tradeoffs. In this paper, we introduce a route to rigorously study such approaches using the lens of *particle filtering* algorithms such as Sequential Monte Carlo (SMC). Given a base language model and a *process reward model* estimating expected terminal rewards, we ask: *how accurately can we sample from a target distribution given some number of process reward evaluations?* Theoretically, we identify (1) simple criteria enabling non-asymptotic guarantees for SMC; (2) algorithmic improvements to SMC; and (3) a fundamental limit faced by all particle filtering methods. Empirically, we demonstrate that our theoretical criteria effectively govern the *sampling error* of SMC, though not necessarily its final *accuracy*, suggesting that theoretical perspectives beyond sampling may be necessary.
Updated: 2026-03-09 01:50:31
标题: 拒绝、重新采样、重复:理解语言模型推理中的并行推理
摘要: 推理时间方法,聚合和修剪多个样本已成为引导大型语言模型的强大范式,然而我们缺乏对它们准确性成本权衡的原则性理解。在本文中,我们介绍了一种严格研究这种方法的途径,使用*粒子滤波*算法的视角,如顺序蒙特卡洛(SMC)。给定一个基础语言模型和一个估计期望终端奖励的*过程奖励模型*,我们问:*在一定数量的过程奖励评估下,我们能够多精确地从目标分布中采样?*理论上,我们确定了(1)为SMC提供非渐近保证的简单标准;(2)对SMC的算法改进;以及(3)所有粒子滤波方法面临的基本限制。从经验上讲,我们证明了我们的理论标准有效地管理了SMC的*采样误差*,尽管不一定是其最终*准确性*,这表明除了采样之外可能需要理论视角。
更新时间: 2026-03-09 01:50:31
领域: cs.LG,cs.AI,cs.CL,math.ST,stat.ML
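The aggregate-and-prune pattern in the entry above rests on a resampling step: partial generations ("particles") are reweighted by a reward estimate, and low-weight ones are replaced by copies of high-weight ones. The sketch below shows only plain multinomial resampling, under the assumption that a process reward model has already produced one non-negative score per particle; the particle strings and scores are made up for illustration and are not taken from the paper.

```python
import random

def resample(particles, weights, rng):
    """Multinomial resampling: draw len(particles) survivors with
    probability proportional to each particle's weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    probs = [w / total for w in weights]
    return rng.choices(particles, weights=probs, k=len(particles))

def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2; small values indicate weight degeneracy."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

rng = random.Random(0)
particles = ["draft-a", "draft-b", "draft-c", "draft-d"]  # hypothetical partial outputs
scores = [0.6, 0.0, 0.3, 0.1]                              # hypothetical reward estimates
survivors = resample(particles, scores, rng)
ess = effective_sample_size(scores)
```

A common design choice is to resample only when the effective sample size falls below a threshold, since resampling itself adds variance.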
CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.
Updated: 2026-03-09 01:49:19
标题: CCR-Bench:用于评估LLM在复杂约束、控制流和实际案例上的综合基准
摘要: 增强大型语言模型(LLMs)遵循复杂指令的能力对于它们在现实世界应用中的部署至关重要。然而,现有的评估方法往往过于简化指令复杂性,将其仅视为原子约束的简单组合,未能充分捕捉由内容和格式的错综复杂相互作用、逻辑工作流控制和现实应用所产生的高维复杂性。这导致当前评估实践与实际需求之间存在显著差距。为了弥合这一差距,我们引入了CCR-Bench,这是一个旨在评估LLMs遵循复杂指令的新型基准。CCR-Bench的特点包括:(1)在任务规范中深度纠缠内容和格式要求;(2)涉及复杂任务分解、条件推理和程序规划的指令;以及(3)完全源自真实工业场景的评估样本。对CCR-Bench的广泛实验表明,即使是最先进的模型也存在显著的性能缺陷,清晰量化了当前LLMs能力与真实世界指令理解需求之间的差距。我们相信CCR-Bench提供了一个更严谨和现实的评估框架,推动了LLMs向着能够理解和执行工业应用中复杂任务的下一代模型的发展。
更新时间: 2026-03-09 01:49:19
领域: cs.CL,cs.AI
ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
Large Vision-Language Models (LVLMs) excel at captioning, visual question answering, and robotics by combining vision and language, yet they often miss obvious objects or hallucinate nonexistent ones in atypical scenes. We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to MSCOCO, ORIC creates ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals significant degradation and bias under incongruous contexts. Visual Reinforcement Fine-Tuning of Qwen3-VL-8B-Instruct on 600 ORIC samples improves performance on ORIC-Bench, AMBER, and HallusionBench. Overall, we show that contextual incongruity is a key source of uncertainty and provide tools for more reliable LVLMs. The dataset and code are publicly available at https://github.com/ZhaoyangLi-1/ORIC.
Updated: 2026-03-09 01:42:37
标题: ORIC: 在大型视觉语言模型中基于背景不一致性的对象识别基准测试
摘要: 大型视觉语言模型(LVLMs)通过结合视觉和语言在字幕生成、视觉问题回答和机器人技术方面表现出色,但它们经常会错过明显的物体或在非典型场景中产生不存在的物体。我们通过不确定性的视角来研究这些失败,关注上下文不一致性,即物体出现在意外的位置或未出现在预期的背景中,表明这种情况增加了最先进的LVLMs的识别难度。为了研究这一领域,我们引入了不一致上下文中的物体识别(ORIC)框架,通过两种互补策略构建不一致的物体-背景对:(1)LLM引导采样来识别图像中难以识别的物体,(2)CLIP引导采样来挖掘可能存在但缺失的物体。应用于MSCOCO数据集,ORIC创建了ORIC-Bench和ORIC风格的训练数据。评估了18个LVLMs和2个开放词汇检测器,在不一致的背景下显示出显著的降级和偏见。对Qwen3-VL-8B-Instruct进行600个ORIC样本的视觉强化微调,提高了ORIC-Bench、AMBER和HallusionBench的性能。总的来说,我们表明上下文不一致性是不确定性的关键来源,并提供了更可靠的LVLMs的工具。数据集和代码可以在https://github.com/ZhaoyangLi-1/ORIC 上公开获取。
更新时间: 2026-03-09 01:42:37
领域: cs.CV,cs.LG
Post-Disaster Affected Area Segmentation with a Vision Transformer (ViT)-based EVAP Model using Sentinel-2 and Formosat-5 Imagery
We propose a vision transformer (ViT)-based deep learning framework to refine disaster-affected area segmentation from remote sensing imagery, aiming to support and enhance the Emergent Value Added Product (EVAP) developed by the Taiwan Space Agency (TASA). The process starts with a small set of manually annotated regions. We then apply principal component analysis (PCA)-based feature space analysis and construct a confidence index (CI) to expand these labels, producing a weakly supervised training set. These expanded labels are then used to train ViT-based encoder-decoder models with multi-band inputs from Sentinel-2 and Formosat-5 imagery. Our architecture supports multiple decoder variants and multi-stage loss strategies to improve performance under limited supervision. During the evaluation, model predictions are compared with higher-resolution EVAP output to assess spatial coherence and segmentation consistency. Case studies on the 2022 Poyang Lake drought and the 2023 Rhodes wildfire demonstrate that our framework improves the smoothness and reliability of segmentation results, offering a scalable approach for disaster mapping when accurate ground truth is unavailable.
Updated: 2026-03-09 01:23:06
标题: 基于视觉转换器(ViT)的EVAP模型利用Sentinel-2和Formosat-5图像对受灾地区进行分割
摘要: 我们提出了一种基于视觉变换器(ViT)的深度学习框架,用于从遥感图像中细化灾害受影响区域的分割,旨在支持和增强由台湾空间局(TASA)开发的紧急增值产品(EVAP)。该过程始于一小组手工注释的区域。然后我们应用基于主成分分析(PCA)的特征空间分析,并构建一个置信度指数(CI)来扩展这些标签,生成一个弱监督训练集。然后使用这些扩展标签来训练基于ViT的编码器-解码器模型,使用来自Sentinel-2和Formosat-5图像的多波段输入。我们的架构支持多个解码器变体和多阶段损失策略,以提高在有限监督下的性能。在评估过程中,将模型预测与分辨率更高的EVAP输出进行比较,以评估空间一致性和分割一致性。对2022年鄱阳湖干旱和2023年罗德斯山野火的案例研究表明,我们的框架改善了分割结果的平滑性和可靠性,在准确的地面真实情况不可用时提供了一种可扩展的灾害制图方法。
更新时间: 2026-03-09 01:23:06
领域: cs.CV,cs.AI
Toward Unified Multimodal Representation Learning for Autonomous Driving
Contrastive Language-Image Pre-training (CLIP) has shown impressive performance in aligning visual and textual representations. Recent studies have extended this paradigm to 3D vision to improve scene understanding for autonomous driving. A common strategy is to employ pairwise cosine similarity between modalities to guide the training of a 3D encoder. However, considering the similarity between individual modality pairs rather than all modalities jointly fails to ensure consistent and unified alignment across the entire multimodal space. In this paper, we propose a Contrastive Tensor Pre-training (CTP) framework that simultaneously aligns multiple modalities in a unified embedding space to enhance end-to-end autonomous driving. Compared with pairwise cosine similarity alignment, our method extends the 2D similarity matrix into a multimodal similarity tensor. Furthermore, we introduce a tensor loss to enable joint contrastive learning across all modalities. For experimental validation of our framework, we construct a text-image-point cloud triplet dataset derived from existing autonomous driving datasets. The results show that our proposed unified multimodal alignment framework achieves favorable performance for both scenarios: (i) aligning a 3D encoder with pretrained CLIP encoders, and (ii) pretraining all encoders from scratch.
Updated: 2026-03-09 01:18:50
标题: 朝向自动驾驶统一多模态表示学习
摘要: 对比语言-图像预训练(CLIP)展现出在对齐视觉和文本表示方面的出色表现。最近的研究将这一范式扩展到三维视觉,以改善自动驾驶的场景理解。一种常见策略是利用模态之间的成对余弦相似性来指导3D编码器的训练。然而,考虑到单个模态对之间的相似性而不是所有模态一起可能会导致无法确保整个多模态空间一致和统一的对齐。在本文中,我们提出了一种对比张量预训练(CTP)框架,可以同时将多个模态在一个统一的嵌入空间中对齐,以增强端到端的自动驾驶。与成对余弦相似性对齐相比,我们的方法将2D相似性矩阵扩展为多模态相似性张量。此外,我们引入了一个张量损失来实现跨所有模态的联合对比学习。为了对我们的框架进行实验验证,我们构建了一个从现有自动驾驶数据集派生出的文本-图像-点云三元组数据集。结果表明,我们提出的统一多模态对齐框架在两种情况下均取得了良好的性能:(i)将3D编码器与预训练的CLIP编码器对齐,(ii)从头开始对所有编码器进行预训练。
更新时间: 2026-03-09 01:18:50
领域: cs.CV,cs.LG
Kolmogorov-Arnold Energy Models: Fast, Interpretable Generative Modeling
Generative models typically rely on either simple latent priors (e.g., Variational Autoencoders, VAEs), which are efficient but limited, or highly expressive iterative samplers (e.g., Diffusion and Energy-based Models), which are costly and opaque. We introduce the Kolmogorov-Arnold Energy Model (KAEM) to bridge this trade-off and provide a new avenue for latent-space interpretability. Based on a novel interpretation of the Kolmogorov-Arnold Representation Theorem, KAEM imposes a univariate latent structure that enables fast and exact inference via the inverse transform method. With a low-dimensional latent space and appropriate inductive biases, we show that importance sampling becomes a viable, unbiased, and highly efficient posterior inference method. For settings where importance sampling fails, we propose a population-based strategy that decomposes the posterior into a sequence of annealed distributions to improve mixing during sampling, a common pitfall in Energy-based Models. We present initial comparisons of KAEM against VAEs for the SVHN and CelebA datasets, demonstrating its potential for competitive sample quality, inference speed, and interpretability.
Updated: 2026-03-09 01:12:51
标题: 科尔莫戈洛夫-阿诺尔德能量模型:快速、可解释的生成建模
摘要: 生成模型通常依赖于简单的潜在先验(例如,变分自动编码器,VAEs),这些模型高效但有局限性,或者高度表达性的迭代采样器(例如扩散和基于能量的模型),这些模型成本高且不透明。我们引入了科尔莫戈洛夫-阿诺德能量模型(KAEM),以弥合这种权衡,并为潜在空间的可解释性提供了新途径。基于科尔莫戈洛夫-阿诺德表示定理的新颖解释,KAEM强加了一维潜在结构,通过反变换方法实现快速且精确的推断。通过低维潜在空间和适当的归纳偏差,我们展示了重要性采样成为一种可行、无偏且高效的后验推断方法。对于重要性采样失败的情况,我们提出了一种基于人口的策略,将后验分解为一系列退火分布,以改善采样过程中的混合,这是基于能量模型中的常见缺陷。我们对KAEM和VAEs在SVHN和CelebA数据集上进行了初步比较,展示了其在竞争性样本质量、推断速度和可解释性方面的潜力。
更新时间: 2026-03-09 01:12:51
领域: cs.LG
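The KAEM entry above relies on the inverse transform method, which is practical when each latent is univariate: tabulate the CDF, draw a uniform number, and invert. The sketch below is a grid-based illustration of that general method, not the paper's implementation; the energy function, the grid range, and the resolution are assumptions chosen for the example.

```python
import bisect
import math
import random

def build_cdf(grid, unnorm_density):
    """Tabulate a normalized CDF on `grid` using the trapezoid rule."""
    cdf = [0.0]
    for a, b in zip(grid, grid[1:]):
        area = 0.5 * (unnorm_density(a) + unnorm_density(b)) * (b - a)
        cdf.append(cdf[-1] + area)
    total = cdf[-1]
    return [c / total for c in cdf]

def sample_inverse_transform(grid, cdf, rng):
    """Draw u ~ U[0, 1) and invert the tabulated CDF by linear interpolation."""
    u = rng.random()
    i = bisect.bisect_right(cdf, u)          # first index with cdf[i] > u
    i = min(max(i, 1), len(grid) - 1)        # clamp to a valid interval
    lo, hi = cdf[i - 1], cdf[i]
    t = 0.0 if hi == lo else (u - lo) / (hi - lo)
    return grid[i - 1] + t * (grid[i] - grid[i - 1])

def energy(z):
    # Hypothetical univariate energy; exp(-energy) is a unit Gaussian centered at 1.
    return 0.5 * (z - 1.0) ** 2

grid = [-4.0 + 8.0 * k / 400 for k in range(401)]
cdf = build_cdf(grid, lambda z: math.exp(-energy(z)))
rng = random.Random(0)
samples = [sample_inverse_transform(grid, cdf, rng) for _ in range(5000)]
sample_mean = sum(samples) / len(samples)
```

Each draw costs one uniform sample and one binary search, with no iterative chain, which is the speed advantage the abstract attributes to a univariate latent structure.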
Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy
The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode (abbreviated MOM throughout), the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 30-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level, Override Mode status, and tyre degradation state from five publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap -- a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack -- and show that detecting it requires belief-state inference rather than reactive threshold rules. On synthetic races generated from the model's own assumptions, the HMM achieves 92.3% ERS inference accuracy (random baseline: 33.3%) and detects counter-harvest trap conditions with 95.7% recall. Pre-registration -- empirical validation begins Australian Grand Prix, 8 March 2026.
Updated: 2026-03-09 01:10:14
标题: 对手状态推断在部分可观测性下:2026年F1能源策略的HMM-POMDP框架
摘要: 2026年的一级方程式技术规则引入了对能源战略的根本性改变:在一个50/50的内燃机/电池动力分配下,能量无限再生,驾驶员控制的覆盖模式(简称为MOM),最佳能量部署政策不仅取决于驾驶员自身状态,还取决于对手车辆的隐藏状态。这创造了一个无法通过单一代理优化方法解决的部分可观察随机博弈。我们提出了一个可行的两层推理和决策框架。第一层是一个30状态的隐马尔可夫模型(HMM),从五个公开可观测的遥测信号中推断出每个对手的ERS充电水平、覆盖模式状态和轮胎退化状态的概率分布。第二层是一个深度Q网络(DQN)策略,将HMM信念状态作为输入,并在能量部署策略之间进行选择。我们正式描述了对策收获陷阱——一种欺骗性策略,其中一辆车故意抑制可观测的部署信号,以诱使对手进行失败的进攻,同时表明检测它需要信念状态推理而非反应性阈值规则。在从模型自身假设生成的合成比赛中,HMM实现了92.3%的ERS推断准确率(随机基线:33.3%),并以95.7%的召回率检测到对策收获陷阱条件。预注册——经验验证将于2026年3月8日的澳大利亚大奖赛开始。
更新时间: 2026-03-09 01:10:14
领域: cs.AI,cs.GT,cs.LG,eess.SY
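The first layer in the entry above maintains a belief over a rival's hidden state and updates it from observable signals. A generic HMM forward (filtering) update of that kind can be sketched as follows. The three-state toy model, the transition matrix, and the observation likelihoods are invented for illustration and are not the paper's 30-state model.

```python
def forward_update(belief, transition, likelihood):
    """One HMM filtering step.

    belief[i]        : P(state_{t-1} = i | observations so far)
    transition[i][j] : P(state_t = j | state_{t-1} = i)
    likelihood[j]    : P(current observation | state_t = j)
    Returns P(state_t = j | observations including the current one).
    """
    n = len(belief)
    # Predict: push the previous belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight by the observation likelihood and renormalize.
    unnorm = [p * l for p, l in zip(predicted, likelihood)]
    z = sum(unnorm)
    return [x / z for x in unnorm]

# Hypothetical charge levels: 0 = low, 1 = medium, 2 = high.
transition = [
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
]
prior = [1.0 / 3.0] * 3
likelihood = [0.1, 0.2, 0.7]  # assumed: the observed signal favors "high"
posterior = forward_update(prior, transition, likelihood)
most_likely = max(range(3), key=lambda j: posterior[j])
```

The resulting `posterior` vector is the kind of belief state that a downstream policy could take as input instead of a single point estimate.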
Hospitality-VQA: Decision-Oriented Informativeness Evaluation for Vision-Language Models
Recent advances in Vision-Language Models (VLMs) have demonstrated impressive multimodal understanding in general domains. However, their applicability to decision-oriented domains such as hospitality remains largely unexplored. In this work, we investigate how well VLMs can perform visual question answering (VQA) about hotel and facility images that are central to consumer decision-making. While many existing VQA benchmarks focus on factual correctness, they rarely capture what information users actually find useful. To address this, we first introduce Informativeness as a formal framework to quantify how much hospitality-relevant information an image-question pair provides. Guided by this framework, we construct a new hospitality-specific VQA dataset that covers various facility types, where questions are specifically designed to reflect key user information needs. Using this benchmark, we conduct experiments with several state-of-the-art VLMs, revealing that VLMs are not intrinsically decision-aware: key visual signals remain underutilized, and reliable informativeness reasoning emerges only after modest domain-specific finetuning.
Updated: 2026-03-09 00:46:45
Categories: cs.AI,cs.LG
Slumbering to Precision: Enhancing Artificial Neural Network Calibration Through Sleep-like Processes
Artificial neural networks are often overconfident, undermining trust because their predicted probabilities do not match actual accuracy. Inspired by biological sleep and the role of spontaneous replay in memory and learning, we introduce Sleep Replay Consolidation (SRC), a novel calibration approach. SRC is a post-training, sleep-like phase that selectively replays internal representations to update network weights and improve calibration without supervised retraining. Across multiple experiments, SRC is competitive with and complementary to standard approaches such as temperature scaling. Combining SRC with temperature scaling achieves the best Brier score and entropy trade-offs for AlexNet and VGG19. These results show that SRC provides a fundamentally novel approach to improving neural network calibration. SRC-based calibration offers a practical path toward more trustworthy confidence estimates and narrows the gap between human-like uncertainty handling and modern deep networks.
Updated: 2026-03-09 00:43:14
Categories: cs.LG,cs.AI
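The calibration entry above compares against temperature scaling and reports Brier scores. The sketch below shows those two standard pieces only; it does not implement SRC itself. The logits, labels, and candidate temperatures are invented toy values, and the grid search is a simplification of the usual gradient-based fit of the temperature.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens confidence)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def brier_score(prob_rows, labels):
    """Multi-class Brier score: mean squared error against one-hot targets."""
    total = 0.0
    for probs, label in zip(prob_rows, labels):
        total += sum(
            (p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs)
        )
    return total / len(labels)


def negative_log_likelihood(logit_rows, labels, temperature):
    """Average NLL of the true labels under temperature-scaled softmax."""
    losses = [
        -math.log(softmax(logits, temperature)[label])
        for logits, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest held-out NLL."""
    return min(
        candidates,
        key=lambda t: negative_log_likelihood(logit_rows, labels, t),
    )


# Toy held-out set: the model is very confident on all four samples,
# but the last prediction is wrong (hypothetical values).
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
val_labels = [0, 0, 1, 1]

best_t = fit_temperature(val_logits, val_labels, candidates=[0.5, 1.0, 2.0, 4.0])
brier_before = brier_score([softmax(z) for z in val_logits], val_labels)
brier_after = brier_score([softmax(z, best_t) for z in val_logits], val_labels)
print(best_t, brier_before, brier_after)
```

Temperature scaling leaves the argmax prediction unchanged, so accuracy is unaffected; only the confidence attached to each prediction moves.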
Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations
Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision-free, execution-feasible approaches. In this paper we present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable instance segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
Updated: 2026-03-09 00:42:32
Categories: cs.RO,cs.LG,eess.SY
An Interpretable Generative Framework for Anomaly Detection in High-Dimensional Financial Time Series
Detecting structural instability and anomalies in high-dimensional financial time series is challenging due to complex temporal dependence and evolving cross-sectional structure. We propose ReGEN-TAD, an interpretable generative framework that integrates modern machine learning with econometric diagnostics for anomaly detection. The model combines joint forecasting and reconstruction within a refined convolutional-transformer architecture and aggregates complementary signals capturing predictive inconsistency, reconstruction degradation, latent distortion, and volatility shifts. Robust calibration yields a unified anomaly score without labeled data. Experiments on synthetic and financial panels demonstrate improved robustness to structured deviations while enabling economically coherent factor-level attribution.
Updated: 2026-03-09 00:36:19
Categories: stat.ML,cs.LG
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce and focus narrowly on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 28K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason about, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI GPT-5 and DeepSeek R1 score only around 69% and 57% respectively, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
Updated: 2026-03-09 00:32:06
Categories: cs.AI,cs.CL,cs.DB,cs.LG
The UK Cyber Security and Resilience Bill: A Practitioner's Guide to Legislative Reform, Compliance, and Organisational Readiness
The Cyber Security and Resilience (Network and Information Systems) Bill, introduced to Parliament in November 2025, represents the most significant reform of UK cyber security legislation in nearly a decade. This paper provides a comprehensive practitioner-oriented analysis of the Bill's provisions, their practical implications, and the steps organisations must take to achieve compliance. It examines the expanded regulatory scope covering managed service providers, data centres, and designated critical suppliers; the enhanced 24/72-hour incident reporting regime; the strengthened enforcement architecture including penalties of up to £17 million or 4% of worldwide turnover; and the Secretary of State's new executive powers. The paper compares the Bill with the EU's NIS2 Directive and DORA, proposing a practical dual-compliance framework for financial services firms. It explains how Zero Trust Architecture principles can serve as a foundation for meeting the Bill's requirements, and how the NCSC's Cyber Assessment Framework v4.0 provides the assurance pathway. Four detailed appendices provide entity-specific compliance roadmaps, worked case studies mapping real UK incidents to Bill provisions, sector-specific action plans for financial services, energy, health, and MSPs, and a complete gap analysis and self-assessment tool mapped to CAF v4.0 and the Bill's requirements.
Updated: 2026-03-09 00:23:57
Categories: cs.CR
Guess & Guide: Gradient-Free Zero-Shot Diffusion Guidance
Pretrained diffusion models serve as effective priors for Bayesian inverse problems. These priors enable zero-shot generation by sampling from the conditional distribution, which avoids the need for task-specific retraining. However, a major limitation of existing methods is their reliance on surrogate likelihoods that require vector-Jacobian products at each denoising step, creating a substantial computational burden. To address this, we introduce a lightweight likelihood surrogate that eliminates the need to calculate gradients through the denoiser network. This enables us to handle diverse inverse problems without backpropagation overhead. Experiments confirm that our method dramatically reduces inference cost while delivering the best results on multiple tasks. Broadly speaking, we propose the fastest and Pareto-optimal method for Bayesian inverse problems.
Updated: 2026-03-09 00:21:53
Categories: cs.LG
Stochastic Self-Organization in Multi-Agent Systems
Multi-agent systems (MAS) based on Large Language Models (LLMs) have the potential to solve tasks that are beyond the reach of any single LLM. However, this potential can only be realized when the collaboration mechanism between agents is optimized. Specifically, optimizing the communication structure between agents is critical for fruitful collaboration. Most existing approaches rely on fixed topologies, pretrained graph generators, optimization over edges, or external LLM judges, thereby adding to the complexity. In this work, we introduce a response-conditioned framework that adapts communication on-the-fly. Agents independently generate responses to the user query and assess peer contributions using an approximation of the Shapley value. A directed acyclic graph (DAG) is then constructed to regulate the propagation of the responses among agents, which ensures stable and efficient message transmission from high-contributing agents to others. This graph is dynamically updated based on the agent responses from the previous collaboration round. Since the proposed framework enables the self-organization of agents without additional supervision or training, we refer to it as SelfOrg. The SelfOrg framework goes beyond task- and query-level optimization and takes into account the stochastic nature of agent responses. Experiments with both strong and weak LLM backends demonstrate robust performance, with significant gains in the weak regime where prior methods collapse. We also theoretically show that multiple agents increase the chance of correctness and that the correct responses naturally dominate the information flow.
Updated: 2026-03-09 00:21:49
Categories: cs.MA,cs.CL,cs.LG
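The SelfOrg entry above mentions two mechanisms: an approximate Shapley value for scoring agent contributions, and a DAG that routes messages from high-contributing agents to the others. The sketch below illustrates both with generic techniques and should not be read as the paper's procedure. It uses Monte Carlo permutation sampling for the Shapley estimate, a hypothetical coalition value function (the best response quality in the coalition), and a simple rank-ordered edge rule; the agent names and quality numbers are invented.

```python
import random

# Hypothetical per-agent response quality; in practice this would come
# from some similarity or agreement measure over the agents' responses.
QUALITY = {"agent_a": 0.6, "agent_b": 0.3, "agent_c": 0.1}


def coalition_value(coalition):
    """Toy value function: a coalition is as good as its best response."""
    return max((QUALITY[agent] for agent in coalition), default=0.0)


def shapley_estimates(players, value, num_permutations=200, seed=0):
    """Monte Carlo Shapley values via random permutations of the players."""
    rng = random.Random(seed)
    totals = {player: 0.0 for player in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value(frozenset(coalition))
        for player in order:
            coalition.add(player)
            current = value(frozenset(coalition))
            # Marginal contribution of this player given who joined before.
            totals[player] += current - previous
            previous = current
    return {player: total / num_permutations for player, total in totals.items()}


def contribution_dag(scores):
    """Edges point from higher-scored agents to lower-scored ones.

    Every edge follows a strict position in the ranking, so the graph
    cannot contain a cycle.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [
        (source, target)
        for index, source in enumerate(ranked)
        for target in ranked[index + 1:]
    ]


scores = shapley_estimates(list(QUALITY), coalition_value)
edges = contribution_dag(scores)
print(scores)
print(edges)
```

Because each sampled permutation's marginal contributions telescope to the value of the full coalition, the estimates always sum to that value regardless of how many permutations are drawn.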
Improving Conditional VAE with Non-Volume Preserving transformations
Variational Autoencoders and Generative Adversarial Networks remained the state-of-the-art (SOTA) generative models until 2022. Now they are superseded by diffusion-based models. Efforts to improve traditional models have stagnated as a result. In old-school fashion, we explore image generation with conditional Variational Autoencoders (CVAE) to incorporate desired attributes within the images. VAEs are known to produce blurry images with less diversity; we refer to a method that solves this issue by treating the variance of the Gaussian decoder as a learnable parameter during training. Previous works on CVAEs assumed that the conditional distribution of the latent space given the labels is equal to the prior distribution, which is not the case in reality. We show that estimating it using Non-Volume Preserving (NVP) transformations yields better image generation than existing methods, reducing the FID by 4% and increasing the log-likelihood by 7.6%.
Updated: 2026-03-09 00:15:38
Categories: cs.LG
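The CVAE entry above estimates the label-conditional latent distribution with Non-Volume Preserving transformations. As a rough illustration of what such a transformation looks like, the sketch below implements a single RealNVP-style affine coupling layer on a two-dimensional latent. It is an assumption-laden toy: `scale_and_shift` is a fixed hand-written function standing in for learned networks, the label enters as a plain number, and the paper's actual architecture is not specified in the abstract.

```python
import math


def scale_and_shift(z1, label):
    """Hypothetical stand-ins for the learned scale s(.) and shift t(.) networks."""
    s = math.tanh(0.5 * z1 + 0.1 * label)
    t = 0.3 * z1 - 0.2 * label
    return s, t


def coupling_forward(z, label):
    """Map latent z to base variable u; also return log|det du/dz|."""
    z1, z2 = z
    s, t = scale_and_shift(z1, label)
    u = (z1, z2 * math.exp(s) + t)
    # The Jacobian is triangular with diagonal (1, exp(s)), so the
    # log-determinant is simply s. It is generally non-zero, which is
    # what "non-volume preserving" refers to.
    return u, s


def coupling_inverse(u, label):
    """Exact inverse of coupling_forward."""
    u1, u2 = u
    s, t = scale_and_shift(u1, label)
    return (u1, (u2 - t) * math.exp(-s))


def standard_normal_logpdf(u):
    """Log-density of an isotropic standard normal at point u."""
    return sum(-0.5 * (x * x + math.log(2.0 * math.pi)) for x in u)


def conditional_log_prior(z, label):
    """log p(z | label) via the change-of-variables formula."""
    u, log_det = coupling_forward(z, label)
    return standard_normal_logpdf(u) + log_det


z = (0.4, -1.2)
u, log_det = coupling_forward(z, label=1)
z_back = coupling_inverse(u, label=1)
print(u, log_det, z_back)
print(conditional_log_prior(z, label=0), conditional_log_prior(z, label=1))
```

A single coupling layer leaves the first coordinate untouched; practical flows stack several layers and alternate which coordinates are transformed.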
SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans
Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories encouraging deeper exploration and uses them to shape exploration during cold-start supervised fine-tuning, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance by up to 6.0% on Qwen3-8B and 5.8% on Qwen3-4B backbones respectively compared to SOTA baselines. Further analyses of tool-use patterns and training dynamics compared to baselines shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.
Updated: 2026-03-09 00:05:29
Categories: cs.AI,cs.CL,cs.IR
LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning
Large pre-trained models are commonly adapted to downstream tasks using parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA), which injects small trainable low-rank matrices instead of updating all weights. While LoRA dramatically reduces trainable parameters with little overhead, it can still underperform full fine-tuning in accuracy and often converges more slowly. We introduce LoFT, a novel low-rank adaptation method that behaves like full fine-tuning by aligning the optimizer's internal dynamics with those of updating all model weights. LoFT not only learns weight updates in a low-rank subspace (like LoRA) but also properly projects the optimizer's first and second moments (Adam's momentum and variance) into the same subspace, mirroring full-model updates. By aligning the low-rank update itself with the full update, LoFT eliminates the need for tuning extra hyperparameters, e.g., the LoRA scaling factor $α$. Empirically, this approach substantially narrows the performance gap between adapter-based tuning and full fine-tuning and consistently outperforms standard LoRA-style methods, all without increasing inference cost.
Updated: 2026-03-09 00:00:10
Categories: cs.LG,math.OC
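For readers unfamiliar with the baseline that the LoFT entry above builds on, the following sketch shows the standard LoRA parameterisation: a frozen weight matrix plus a scaled low-rank product. It does not implement LoFT's projection of the optimizer moments, which the abstract describes only at a high level. All matrices and the value of `alpha` are made-up toy numbers.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(weight, lora_a, lora_b, x, alpha):
    """Compute W x + (alpha / r) * B (A x) for a rank-r adapter.

    weight: frozen (d_out x d_in) matrix
    lora_a: trainable (r x d_in) matrix
    lora_b: trainable (d_out x r) matrix
    """
    rank = len(lora_a)
    base = matvec(weight, x)                      # frozen path
    low_rank = matvec(lora_b, matvec(lora_a, x))  # adapter path
    scale = alpha / rank                          # the LoRA scaling factor
    return [b + scale * l for b, l in zip(base, low_rank)]


def count_params(matrix):
    """Number of scalar entries in a matrix."""
    return sum(len(row) for row in matrix)


# Toy 4x4 layer with a rank-1 adapter (hypothetical values).
W = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 0.0, 0.5],
]
A = [[0.1, 0.2, 0.3, 0.4]]
B_init = [[0.0], [0.0], [0.0], [0.0]]      # B starts at zero in LoRA
B_trained = [[1.0], [0.0], [-1.0], [0.5]]  # pretend these were learned
x = [1.0, 2.0, 3.0, 4.0]

out_init = lora_forward(W, A, B_init, x, alpha=2.0)
out_trained = lora_forward(W, A, B_trained, x, alpha=2.0)
trainable = count_params(A) + count_params(B_trained)
print(out_init, out_trained, trainable, count_params(W))
```

With `B` initialised to zero the adapted layer reproduces the frozen layer exactly, and the product `B A` can be merged into `W` after training, which is why adapter methods of this kind add no inference cost.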