Arxiv Day: Article

Diamond: End-to-End Forward-secure and Compact Authenticated Encryption for Internet of Things

Resource-constrained Internet of Things (IoT) devices, from medical implants to small drones, must transmit sensitive telemetry under adversarial wireless channels while operating under stringent computing and energy budgets. Authenticated Encryption (AE) is essential to ensure confidentiality, integrity, and authenticity. However, existing lightweight AE standards lack forward-security guarantees, compact tag aggregation, and offline-online (OO) optimizations required for modern high-throughput IoT pipelines. We introduce Diamond , the first provably secure Forward-secure and Aggregate Authenticated Encryption (FAAE) framework that extends and generalizes prior FAAE constructions through a lightweight key evolution mechanism, an OOoptimized computation pipeline, and a set of performance-tier instantiations. Diamond substantially reduces amortized offline preprocessing (up to 47%) and achieves up to an order-of-magnitude reduction in end-toend latency for large telemetry batches. Our comprehensive evaluation on 64-bit ARM Cortex-A72, 32-bit ARM Cortex-M4 and 8-bit AVR architectures confirms that Diamond outperforms baseline FAAE variants in authenticated encryption throughput and end-to-end verification latency while maintaining compact tag aggregation and strong breach resilience. Diamond outperforms NIST lightweight AE candidates for medium and large payloads, while remaining competitive for small messages when amortized across batches. We formally prove the security of Diamond and provide two concrete instantiations optimized for compliance and high efficiency. Our open-source release enables reproducibility and seamless integration into IoT platforms.

Updated: 2026-03-30 23:17:29

标题: 钻石：面向物联网的端到端前向安全和紧凑认证加密

摘要: 资源受限的物联网（IoT）设备，从医疗植入物到小型无人机，必须在对抗性无线信道下传输敏感的遥测数据，同时在严格的计算和能量预算下运行。经过身份验证的加密（AE）对于确保机密性、完整性和真实性至关重要。然而，现有的轻量级AE标准缺乏前向安全性保证、紧凑的标签聚合以及现代高吞吐量IoT管道所需的离线-在线（OO）优化。我们引入了Diamond，这是第一个经过证明安全的前向安全和聚合身份验证加密（FAAE）框架，通过轻量级密钥演化机制、OO优化的计算管道和一组性能层实例扩展和概括了先前的FAAE构造。Diamond大大减少了摊销的离线预处理（多达47%），并实现了大批量遥测数据的端到端延迟降低一个数量级。我们在64位ARM Cortex-A72、32位ARM Cortex-M4和8位AVR体系结构上进行了全面评估，确认Diamond在认证加密吞吐量和端到端验证延迟方面优于基线FAAE变体，同时保持紧凑的标签聚合和强大的泄漏韧性。Diamond在中等和大型负载方面优于NIST轻量级AE候选，而在跨批次摊销时，对于小型消息仍具有竞争力。我们正式证明了Diamond的安全性，并提供了两种针对合规性和高效性优化的具体实例。我们的开源发布使得结果可复制，并且无缝集成到IoT平台中。

更新时间: 2026-03-30 23:17:29

领域: cs.CR

下载: http://arxiv.org/abs/2601.00353v2

Uncovering Relationships between Android Developers, User Privacy, and Developer Willingness to Reduce Fingerprinting Risks

The major mobile platforms, Android and iOS, have introduced changes that restrict user tracking to improve user privacy, yet apps continue to covertly track users via device fingerprinting. We study the opportunity to improve this dynamic with a case study on mobile fingerprinting that evaluates developers' perceptions of how well platforms protect user privacy and how developers perceive platform privacy interventions. Specifically, we study developers' willingness to make changes to protect users from fingerprinting and how developers consider trade-offs between user privacy and developer effort. We do this via a survey of 246 Android developers, presented with a hypothetical Android change that protects users from fingerprinting at the cost of additional developer effort. We find developers overwhelmingly (89%) support this change, even when they anticipate significant effort, yet prefer the change be optional versus required. Surprisingly, developers who use fingerprinting are six times more likely to support the change, despite being most impacted by it. We also find developers are most concerned about compliance and enforcement. In addition, our results show that while most rank iOS above Android for protecting user privacy, this distinction significantly reduces among developers very familiar with fingerprinting. Thus there is an important opportunity for platforms and developers to collaboratively build privacy protections, and we present actionable ways platforms can facilitate this.

Updated: 2026-03-30 23:01:09

标题: 揭示安卓开发者、用户隐私和开发者减少指纹识别风险意愿之间的关系

摘要: 主要移动平台Android和iOS已经引入了限制用户跟踪以改善用户隐私的变化，然而应用程序仍然通过设备指纹技术秘密跟踪用户。我们通过一项关于移动指纹技术的案例研究来研究如何改善这一动态，评估开发人员对平台如何保护用户隐私以及开发人员如何看待平台隐私干预的看法。具体而言，我们研究了开发人员愿意做出改变以保护用户免受指纹技术影响的意愿，以及开发人员如何权衡用户隐私和开发工作之间的取舍。我们通过对246名Android开发人员进行一项调查来进行研究，这些开发人员面临一项假设的Android变化，以保护用户免受指纹技术影响，但需要额外的开发工作。我们发现，绝大多数开发人员（89%）支持这一变化，即使他们预计需要付出重大努力，但他们更希望这一变化是可选的而非必需的。令人惊讶的是，使用指纹技术的开发人员支持这一变化的可能性是其它开发人员的六倍，尽管他们受到影响最大。我们还发现，开发人员最关心的是合规性和执行。此外，我们的结果显示，尽管大多数人认为iOS在保护用户隐私方面优于Android，但在熟悉指纹技术的开发人员中，这种区别显著减少。因此，平台和开发人员有重要的机会共同建立隐私保护，并提出平台可以促进这一过程的可行方式。

更新时间: 2026-03-30 23:01:09

领域: cs.CR,cs.HC

下载: http://arxiv.org/abs/2603.29063v1

CivicShield: A Cross-Domain Defense-in-Depth Framework for Securing Government-Facing AI Chatbots Against Multi-Turn Adversarial Attacks

LLM-based chatbots in government services face critical security gaps. Multi-turn adversarial attacks achieve over 90% success against current defenses, and single-layer guardrails are bypassed with similar rates. We present CivicShield, a cross-domain defense-in-depth framework for government-facing AI chatbots. Drawing on network security, formal verification, biological immune systems, aviation safety, and zero-trust cryptography, CivicShield introduces seven defense layers: (1) zero-trust foundation with capability-based access control, (2) perimeter input validation, (3) semantic firewall with intent classification, (4) conversation state machine with safety invariants, (5) behavioral anomaly detection, (6) multi-model consensus verification, and (7) graduated human-in-the-loop escalation. We present a formal threat model covering 8 multi-turn attack families, map the framework to NIST SP 800-53 controls across 14 families, and evaluate using ablation analysis. Theoretical analysis shows layered defenses reduce attack probability by 1-2 orders of magnitude versus single-layer approaches. Simulation against 1,436 scenarios including HarmBench (416), JailbreakBench (200), and XSTest (450) achieves 72.9% combined detection [69.5-76.0% CI] with 2.9% effective false positive rate after graduated response, while maintaining 100% detection of multi-turn crescendo and slow-drift attacks. The honest drop on real benchmarks versus author-generated scenarios (71.2% vs 76.7% on HarmBench, 47.0% vs 70.0% on JailbreakBench) validates independent evaluation importance. CivicShield addresses an open gap at the intersection of AI safety, government compliance, and practical deployment.

Updated: 2026-03-30 22:58:04

标题: CivicShield：一种跨领域深度防御框架，用于保护面向政府的AI聊天机器人免受多轮对抗攻击

摘要: 基于LLM的政府服务聊天机器人面临严重的安全漏洞。多轮对抗攻击在当前的防御中取得了超过90%的成功率，而单层防护栏被以类似的速率绕过。我们提出了CivicShield，一个面向政府AI聊天机器人的跨领域深度防御框架。CivicShield借鉴了网络安全、形式验证、生物免疫系统、航空安全和零信任密码学，引入了七个防御层：（1）基于能力的零信任基础访问控制，（2）周边输入验证，（3）带有意图分类的语义防火墙，（4）带有安全不变性的对话状态机，（5）行为异常检测，（6）多模型共识验证，以及（7）逐级人机协同升级。我们提出了一个覆盖8种多轮攻击家族的正式威胁模型，将该框架映射到14个家族的NIST SP 800-53控制中，并使用消融分析进行评估。理论分析显示，分层防御将攻击概率降低1-2个数量级，与单层方法相比。对1,436个场景进行模拟，包括HarmBench（416）、JailbreakBench（200）和XSTest（450），在逐级响应后实现了72.9%的综合检测率[69.5-76.0% CI]，有效误报率为2.9%，同时保持了对多轮高潮和缓慢漂移攻击的100%检测率。在真实基准测试中，对比作者生成的场景下的结果（HarmBench上的71.2% vs 76.7%，JailbreakBench上的47.0% vs 70.0%），验证了独立评估的重要性。CivicShield解决了AI安全、政府合规和实际部署交汇处的一个开放漏洞。

更新时间: 2026-03-30 22:58:04

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2603.29062v1

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.

Updated: 2026-03-30 22:10:28

标题: 特洛伊语：通过对抗微调实现绕过宪法分类器，无需越狱税

摘要: 主要人工智能提供商提供的API细化为对手可以通过有针对性的微调绕过安全措施的新攻击面。我们引入了特洛伊语（Trojan-Speak），这是一种对抗性微调方法，可以绕过Anthropic的宪法分类器。我们的方法使用课程学习结合基于GRPO的混合强化学习，教授模型一种通信协议，可以规避基于LLM的内容分类。至关重要的是，虽然先前的对抗性微调方法在推理基准上报告了超过25%的能力下降，但特洛伊语微调只产生不到5%的下降，同时实现了对具有14B+参数的模型的99%以上的分类器逃避。我们证明，微调模型可以对Anthropic的宪法分类器漏洞赏金计划中的专业级CBRN（化学、生物、放射性和核）查询提供详细响应。我们的研究发现，仅基于LLM的内容分类器无法阻止对手在进行微调访问时危险信息的披露，并且我们表明激活级别的探针可以显著提高对此类攻击的鲁棒性。

更新时间: 2026-03-30 22:10:28

领域: cs.CR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2603.29038v1

Access Hoare Logic

Following Hoare's seminal invention, now called Hoare logic, to reason about correctness of computer programs, we advocate a related but fundamentally different approach to reason about access security of computer programs such as access control. We define the formalism, which we denote access Hoare logic, and present examples which demonstrate its usefulness and fundamental difference to Hoare logic. We prove soundness and completeness of access Hoare logic, and provide a link between access Hoare logic and standard Hoare logic. We also demonstrate a fundamental difference of access Hoare logic to other approaches, in particular incorrectness logic.

Updated: 2026-03-30 22:00:29

标题: Hoare逻辑访问

摘要: 在Hoare的开创性发明（现在称为Hoare逻辑）的基础上，用于推理计算机程序正确性的方法，我们提倡一种相关但基本不同的方法，用于推理计算机程序的访问安全性，例如访问控制。我们定义了形式化，我们将其称为访问Hoare逻辑，并提供示例来展示其有用性和与Hoare逻辑的基本区别。我们证明了访问Hoare逻辑的完备性和完备性，并提供了访问Hoare逻辑与标准Hoare逻辑之间的联系。我们还展示了访问Hoare逻辑与其他方法（特别是不正确逻辑）之间的基本差异。

更新时间: 2026-03-30 22:00:29

领域: cs.LO,cs.CR,cs.SC

下载: http://arxiv.org/abs/2511.01754v3

Robust Safety Monitoring of Language Models via Activation Watermarking

Large language models (LLMs) can be misused to reveal sensitive information, such as weapon-making instructions or writing malware. LLM providers rely on $\emph{monitoring}$ to detect and flag unsafe behavior during inference. An open security challenge is $\emph{adaptive}$ adversaries who craft attacks that simultaneously (i) evade detection while (ii) eliciting unsafe behavior. Adaptive attackers are a major concern as LLM providers cannot patch their security mechanisms, since they are unaware of how their models are being misused. We cast $\emph{robust}$ LLM monitoring as a security game, where adversaries who know about the monitor try to extract sensitive information, while a provider must accurately detect these adversarial queries at low false positive rates. Our work (i) shows that existing LLM monitors are vulnerable to adaptive attackers and (ii) designs improved defenses through $\emph{activation watermarking}$ by carefully introducing uncertainty for the attacker during inference. We find that $\emph{activation watermarking}$ outperforms guard baselines by up to $52\%$ under adaptive attackers who know the monitoring algorithm but not the secret key.

Updated: 2026-03-30 21:43:49

标题: 通过激活水印技术实现语言模型的稳健安全监测

摘要: 大型语言模型（LLMs）可能被滥用来泄露敏感信息，例如制造武器的指导或编写恶意软件。LLM提供商依赖于$\emph{监控}$来在推理过程中检测并标记不安全的行为。一个开放的安全挑战是$\emph{适应性}$的对手，他们制定攻击策略，同时（i）规避检测，同时（ii）引发不安全的行为。适应性攻击者是一个主要关注点，因为LLM提供商无法修补他们的安全机制，因为他们不知道他们的模型是如何被滥用的。我们将$\emph{鲁棒}$的LLM监控视为一个安全游戏，对手了解监视器的情况下会尝试提取敏感信息，而提供商必须以低误报率准确检测这些敌对查询。我们的工作（i）表明现有的LLM监控对适应性攻击者是脆弱的，（ii）通过在推理过程中小心引入不确定性，设计了通过$\emph{激活水印}$来改进防御措施。我们发现，$\emph{激活水印}$在适应性攻击者知道监控算法但不知道秘钥的情况下，比防护基线表现更好，提高了高达$52\%$。

更新时间: 2026-03-30 21:43:49

领域: cs.CR,cs.AI,cs.CY,cs.LG

下载: http://arxiv.org/abs/2603.23171v2

Privacy-Preserving Machine Learning for IoT: A Cross-Paradigm Survey and Future Roadmap

The rapid proliferation of the Internet of Things has intensified demand for robust privacy-preserving machine learning mechanisms to safeguard sensitive data generated by large-scale, heterogeneous, and resource-constrained devices. Unlike centralized environments, IoT ecosystems are inherently decentralized, bandwidth-limited, and latency-sensitive, exposing privacy risks across sensing, communication, and distributed training pipelines. These characteristics render conventional anonymization and centralized protection strategies insufficient for practical deployments. This survey presents a comprehensive IoT-centric, cross-paradigm analysis of privacy-preserving machine learning. We introduce a structured taxonomy spanning perturbation-based mechanisms such as differential privacy, distributed paradigms such as federated learning, cryptographic approaches including homomorphic encryption and secure multiparty computation, and generative synthesis techniques based on generative adversarial networks. For each paradigm, we examine formal privacy guarantees, computational and communication complexity, scalability under heterogeneous device participation, and resilience against threats including membership inference, model inversion, gradient leakage, and adversarial manipulation. We further analyze deployment constraints in wireless IoT environments, highlighting trade-offs between privacy, communication overhead, model convergence, and system efficiency within next-generation mobile architectures. We also consolidate evaluation methodologies, summarize representative datasets and open-source frameworks, and identify open challenges including hybrid privacy integration, energy-aware learning, privacy-preserving large language models, and quantum-resilient machine learning.

Updated: 2026-03-30 21:05:16

标题: 隐私保护机器学习在物联网中的应用：跨范式调查和未来发展路线图

摘要: 物联网的快速发展加剧了对强大的隐私保护机制的需求，以保护大规模、异构和资源受限设备生成的敏感数据。与集中式环境不同，物联网生态系统本质上是分散的、带宽受限的，并且对延迟敏感，暴露了在感知、通信和分布式训练管道中的隐私风险。这些特征使传统的匿名化和集中式保护策略对于实际部署不足够。本调查提供了一个全面的以物联网为中心、跨范式的隐私保护机器学习分析。我们介绍了一个结构化的分类学，涵盖基于扰动的机制，如差分隐私，分布式范例，如联邦学习，包括同态加密和安全多方计算的密码学方法，以及基于生成对抗网络的生成合成技术。对于每个范例，我们检查了形式化的隐私保证、计算和通信复杂性、在异构设备参与下的可伸缩性，以及对抗包括成员推断、模型反演、梯度泄漏和对抗性篡改的威胁的弹性。我们进一步分析了在无线物联网环境中的部署约束，突出了在下一代移动架构中隐私、通信开销、模型收敛和系统效率之间的权衡。我们还整合了评估方法论，总结了代表性数据集和开源框架，并确定了一些挑战，包括混合隐私集成、能源感知学习、隐私保护的大型语言模型和量子强度机器学习。

更新时间: 2026-03-30 21:05:16

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2603.13570v3

Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

As Large Language Models (LLMs) and multi-agent AI systems are demonstrating increasing potential in cybersecurity operations, organizations, policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such AI systems to achieve more autonomous SOCs (security operation centers) and reduce manual effort. In particular, the AI and cybersecurity communities have recently developed several benchmarks for evaluating the red team capabilities of multi-agent AI systems. However, because the operations in SOCs are dominated by blue team operations, the capabilities of AI systems & agents to achieve more autonomous SOCs cannot be evaluated without a benchmark focused on blue team operations. To our best knowledge, no systematic benchmark for evaluating coordinated multi-task blue team AI has been proposed in the literature. Existing blue team benchmarks focus on a particular task. The goal of this work is to develop a set of design principles for the construction of a benchmark, which is denoted as SOC-bench, to evaluate the blue team capabilities of AI. Following these design principles, we have developed a conceptual design of SOC-bench, which consists of a family of five blue team tasks in the context of large-scale ransomware attack incident response.

Updated: 2026-03-30 21:01:00

标题: 多智能体人工智能系统安全运营能力评估基准构建的设计原则

摘要: 随着大型语言模型（LLMs）和多智能体人工智能系统在网络安全操作中展现出越来越大的潜力，AI和网络安全社区的组织机构、决策者、模型提供商和研究人员对量化此类AI系统的能力感兴趣，以实现更自主的SOC（安全操作中心）并减少手动工作。特别是，AI和网络安全社区最近开发了几个用于评估多智能体AI系统红队能力的基准测试。然而，由于SOC中的操作主要由蓝队操作主导，因此在没有专注于蓝队操作的基准测试的情况下，无法评估AI系统和代理实现更自主SOC的能力。据我们所知，文献中尚未提出用于评估协调多任务蓝队AI的系统性基准测试。现有的蓝队基准测试侧重于特定任务。本研究的目标是制定一套设计原则，用于构建一个名为SOC-bench的基准测试，以评估AI的蓝队能力。遵循这些设计原则，我们已经开发了SOC-bench的概念设计，其中包括在大规模勒索软件攻击事件响应背景下的五个蓝队任务系列。

更新时间: 2026-03-30 21:01:00

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2603.28998v1

Attesting LLM Pipelines: Enforcing Verifiable Training and Release Claims

Modern Large Language Model (LLM) systems are assembled from third-party artifacts such as pre-trained weights, fine-tuning adapters, datasets, dependency packages, and container images, fetched through automated pipelines. This speed comes with supply-chain risks, including compromised dependencies, malicious hub artifacts, unsafe deserialization, forged provenance, and backdoored models. A core gap is that training and release claims (e.g., data and code lineage, build environment, and security scanning results) are rarely cryptographically bound to the artifacts they describe, making enforcement inconsistent across teams and stages. We propose an attestation-aware promotion gate: before an artifact is admitted into trusted environments (training, fine-tuning, deployment), the gate verifies claim evidence, enforces safe loading and static scanning policies, and applies secure-by-default deployment constraints. When organizations operate runtime security tooling, the same gate can optionally ingest standardized dynamic signals via plugins to reduce uncertainty for high-risk artifacts. We outline a practical claims-to-controls mapping and an evaluation blueprint using representative supply-chain scenarios and operational metrics (coverage and decisions), charting a path toward a full research paper.

Updated: 2026-03-30 20:37:48

标题: 验证LLM管道：实施可验证的培训和发布声明

摘要: 现代大型语言模型（LLM）系统由第三方工件组装而成，如预训练权重、微调适配器、数据集、依赖包和容器映像，通过自动化流水线获取。这种速度带来了供应链风险，包括受损的依赖关系、恶意中心工件、不安全的反序列化、伪造的来源和后门模型。一个核心缺陷是训练和发布声明（如数据和代码渊源、构建环境和安全扫描结果）很少与它们描述的工件进行加密绑定，使得在团队和阶段之间的执行不一致。我们提出了一个证明感知的晋升门：在工件被准入到受信任环境（训练、微调、部署）之前，该门验证声明证据，执行安全加载和静态扫描政策，并应用安全默认的部署约束。当组织运行运行时安全工具时，相同的门可以选择通过插件摄取标准化的动态信号，以减少高风险工件的不确定性。我们概述了一个实际的声明到控制映射和一个使用代表性供应链场景和操作指标（覆盖率和决策）的评估蓝图，为一篇完整的研究论文铺平道路。

更新时间: 2026-03-30 20:37:48

领域: cs.CR

下载: http://arxiv.org/abs/2603.28988v1

KAN-LSTM: Benchmarking Kolmogorov-Arnold Networks for Cyber Security Threat Detection in IoT Networks

By utilising their adaptive activation functions, Kolmogorov-Arnold Networks (KANs) can be applied in a novel way for the diverse machine learning tasks, including cyber threat detection. KANs substitute conventional linear weights with spline-parametrized univariate functions, which allows them to learn activation patterns dynamically, inspired by the Kolmogorov-Arnold representation theorem. In a network traffic data, we show that KANs perform better than traditional Multi-Layer Perceptrons (MLPs), yielding more accurate results with a significantly less number of learnable parameters. We also propose KAN-LSTM model to combine advantages of spatial and temporal encoding. The suggested methodology highlights the potential of KANs as an effective tool in detecting cyber threats and offers up new directions for adaptive defensive models. Lastly, we conducted experiments on three main dataset, UNSW-NB15, NSL-KDD, and CICID2017, as well as we developed a new dataset combined from IOT-BOT, NSL-KDD, and CICID2017 to present a stable, unbiased, large-scale dataset with diverse traffic patterns. The results show the superiority of KAN-LSTM and then KAN models over the traditional deep learning models. The source code is available at GitHub repository

Updated: 2026-03-30 20:35:27

标题: KAN-LSTM：在物联网网络中基于科尔莫戈洛夫-阿诺德网络的网络安全威胁检测的基准测试

摘要: 通过利用其自适应激活函数，科尔莫戈洛夫-阿诺德网络（KANs）可以以一种新颖的方式应用于各种机器学习任务，包括网络威胁检测。KANs用样条参数化的单变量函数替代传统的线性权重，这使它们能够动态学习激活模式，受科尔莫戈洛夫-阿诺德表示定理的启发。在网络流量数据中，我们展示了KANs比传统的多层感知器（MLPs）表现更好，产生更准确的结果，且具有明显更少的可学习参数。我们还提出了KAN-LSTM模型，结合了空间和时间编码的优势。所提出的方法强调了KANs作为检测网络威胁的有效工具，并提供了适应性防御模型的新方向。最后，我们在三个主要数据集UNSW-NB15、NSL-KDD和CICID2017上进行了实验，同时我们开发了一个新的数据集，结合了IOT-BOT、NSL-KDD和CICID2017，以呈现具有多样化流量模式的稳定、公正、大规模数据集。结果显示KAN-LSTM模型以及KAN模型优于传统深度学习模型。源代码可在GitHub存储库中找到。

更新时间: 2026-03-30 20:35:27

领域: cs.CR

下载: http://arxiv.org/abs/2603.28985v1

Privacy Guard & Token Parsimony by Prompt and Context Handling and LLM Routing

The large-scale adoption of Large Language Models (LLMs) forces a trade-off between operational cost (OpEx) and data privacy. Current routing frameworks reduce costs but ignore prompt sensitivity, exposing users and institutions to leakage risks towards third-party cloud providers. We formalise the "Inseparability Paradigm": advanced context management intrinsically coincides with privacy management. We propose a local "Privacy Guard" -- a holistic contextual observer powered by an on-premise Small Language Model (SLM) -- that performs abstractive summarisation and Automatic Prompt Optimisation (APO) to decompose prompts into focused sub-tasks, re-routing high-risk queries to Zero-Trust or NDA-covered models. This dual mechanism simultaneously eliminates sensitive inference vectors (Zero Leakage) and reduces cloud token payloads (OpEx Reduction). A LIFO-based context compacting mechanism further bounds working memory, limiting the emergent leakage surface. We validate the framework through a 2x2 benchmark (Lazy vs. Expert users; Personal vs. Institutional secrets) on a 1,000-sample dataset, achieving a 45% blended OpEx reduction, 100% redaction success on personal secrets, and -- via LLM-as-a-Judge evaluation -- an 85% preference rate for APO-compressed responses over raw baselines. Our results demonstrate that Token Parsimony and Zero Leakage are mathematically dual projections of the same contextual compression operator.

Updated: 2026-03-30 20:16:42

标题: 隐私保护与令牌简约：通过提示和上下文处理以及LLM路由

摘要: 大规模采用大型语言模型（LLM）在运营成本（OpEx）和数据隐私之间产生了一种权衡。当前的路由框架可以降低成本，但忽略了提示的敏感性，使用户和机构面临向第三方云提供商泄露风险。我们正式提出了“不可分割范式”：高级上下文管理与隐私管理本质上是一致的。我们提出了本地“隐私防护”——由本地的小型语言模型（SLM）驱动的全面上下文观察器，执行抽象摘要和自动提示优化（APO），将提示分解为专注的子任务，将高风险查询重新定向到零信任或NDA覆盖的模型。这种双重机制同时消除了敏感推理向量（零泄漏）并减少了云令牌负载（OpEx减少）。基于LIFO的上下文紧缩机制进一步限制了工作内存，限制了新出现的泄漏表面。我们通过对1,000个样本数据集进行Lazy vs. Expert用户、个人vs.机构秘密的2x2基准测试验证了该框架，实现了45%的混合OpEx降低，100%的个人秘密遮蔽成功，并通过LLM作为评判者的评估，在APO压缩响应与原始基线之间实现了85%的偏好率。我们的结果表明，令牌节俭和零泄漏在数学上是同一上下文压缩算子的双重投影。

更新时间: 2026-03-30 20:16:42

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2603.28972v1

\texttt{ReproMIA}: A Comprehensive Analysis of Model Reprogramming for Proactive Membership Inference Attacks

The pervasive deployment of deep learning models across critical domains has concurrently intensified privacy concerns due to their inherent propensity for data memorization. While Membership Inference Attacks (MIAs) serve as the gold standard for auditing these privacy vulnerabilities, conventional MIA paradigms are increasingly constrained by the prohibitive computational costs of shadow model training and a precipitous performance degradation under low False Positive Rate constraints. To overcome these challenges, we introduce a novel perspective by leveraging the principles of model reprogramming as an active signal amplifier for privacy leakage. Building upon this insight, we present \texttt{ReproMIA}, a unified and efficient proactive framework for membership inference. We rigorously substantiate, both theoretically and empirically, how our methodology proactively induces and magnifies latent privacy footprints embedded within the model's representations. We provide specialized instantiations of \texttt{ReproMIA} across diverse architectural paradigms, including LLMs, Diffusion Models, and Classification Models. Comprehensive experimental evaluations across more than ten benchmarks and a variety of model architectures demonstrate that \texttt{ReproMIA} consistently and substantially outperforms existing state-of-the-art baselines, achieving a transformative leap in performance specifically within low-FPR regimes, such as an average of 5.25\% AUC and 10.68\% TPR@1\%FPR increase over the runner-up for LLMs, as well as 3.70\% and 12.40\% respectively for Diffusion Models.

Updated: 2026-03-30 19:35:10

标题: \texttt{ReproMIA}: 一项关于主动成员推断攻击的模型重编程的全面分析

摘要: 跨关键领域深度学习模型的普遍部署同时加剧了隐私担忧，因为它们天生倾向于数据记忆。虽然成员推断攻击（MIAs）被视为审计这些隐私漏洞的黄金标准，但传统的MIA范式受到阴影模型训练的高昂计算成本和在低误报率约束下性能急剧下降的限制越来越多。为了克服这些挑战，我们提出了一种新颖的视角，通过利用模型重编程原则作为隐私泄漏的主动信号放大器。基于这一洞察，我们提出了\texttt{ReproMIA}，一个统一而高效的主动成员推断框架。我们在理论和经验上严格证实了，我们的方法如何主动诱导和放大嵌入在模型表示中的潜在隐私足迹。我们提供了\texttt{ReproMIA}在不同架构范例中的专门实例化，包括LLMs，扩散模型和分类模型。在超过十个基准测试和各种模型架构的全面实验评估中，我们证实\texttt{ReproMIA}始终显著优于现有的最先进基线，特别是在低误报率范围内，例如在LLMs中平均AUC提高5.25％，TPR@1％FPR提高10.68％，在扩散模型中分别提高3.70％和12.40％。

更新时间: 2026-03-30 19:35:10

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2603.28942v1

Differential Privacy for Symbolic Trajectories via the Permute-and-Flip Mechanism

Privacy techniques have been developed for data-driven systems, but systems with non-numeric data cannot use typical noise-adding techniques. Therefore, we develop a new mechanism for privatizing state trajectories of symbolic systems that may be represented as words over a finite alphabet. Such systems include Markov chains, Markov decision processes, and finite-state automata, and we protect their symbolic trajectories with differential privacy. The mechanism we develop randomly selects a private approximation to be released in place of the original sensitive word, with a bias towards low-error private words. This work is based on the permute-and-flip mechanism for differential privacy, which can be applied to non-numeric data. However, a naïve implementation would have to enumerate an exponentially large list of words to generate a private word. As a result, we develop a new mechanism that generates private words without ever needing to enumerate such a list. We prove that the accuracy of our mechanism is never worse than the prior state of the art, and we empirically show on a real traffic dataset that it introduces up to $55\%$ less error than the prior state of the art under a conventional privacy implementation.

Updated: 2026-03-30 18:30:06

标题: 通过排列和翻转机制的符号轨迹的差分隐私

摘要: 隐私技术已经针对数据驱动系统进行了开发，但是对于非数值数据的系统无法使用典型的添加噪音技术。因此，我们开发了一种新的机制，用于保护可以表示为有限字母表上的单词的符号系统的状态轨迹的隐私。这样的系统包括马尔可夫链、马尔可夫决策过程和有限状态自动机，我们用差分隐私保护它们的符号轨迹。我们开发的机制随机选择一个私有近似值来替代原始的敏感单词，且偏向于低误差的私有单词。这项工作基于差分隐私的置换和翻转机制，可应用于非数值数据。然而，一个天真的实现需要枚举一个指数级别的单词列表来生成私有单词。因此，我们开发了一种新的机制，可以在不需要枚举这样的列表的情况下生成私有单词。我们证明我们的机制的准确性永远不会差于现有技术水平，并且在一个实际的交通数据集上进行了实证研究，结果表明在传统的隐私实现下，与现有技术相比，它引入的误差最多减少了55%。

更新时间: 2026-03-30 18:30:06

领域: cs.CR

下载: http://arxiv.org/abs/2603.28903v1

ViPRA: Video Prediction for Robot Actions

Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present Video Prediction for Robot Actions (ViPRA), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control upto 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real world manipulation tasks. We have released models and code at https://vipra-project.github.io

Updated: 2026-03-30 17:59:36

标题: ViPRA：机器人动作的视频预测

摘要: 我们能否将视频预测模型转化为机器人策略？视频，包括人类或远程操作机器人的视频，捕捉丰富的物理交互。然而，大多数视频缺乏标记的动作，这限制了它们在机器人学习中的应用。我们提出了Video Prediction for Robot Actions（ViPRA），这是一个简单的预训练-微调框架，从这些没有动作的视频中学习连续的机器人控制。我们不直接预测动作，而是训练一个视频-语言模型来预测未来的视觉观察和以动作为中心的潜在动作，这些潜在动作作为场景动态的中间表示。我们使用感知损失和光流一致性训练这些潜在动作，以确保它们反映出物理上的基础行为。对于下游控制，我们引入了一个分块的流匹配解码器，将潜在动作映射到机器人特定的连续动作序列，仅使用100到200次远程操作演示。这种方法避免了昂贵的动作注释，支持在不同实体之间的泛化，并通过分块动作解码实现高频连续控制，最高达22赫兹。与以往处理潜在动作的工作将预训练视为自回归策略学习的方法不同，ViPRA明确地对变化和方式进行建模。我们的方法胜过强基线，在SIMPLER基准测试中提高了16%，在真实世界的操作任务中提高了13%。我们已在https://vipra-project.github.io发布了模型和代码。

更新时间: 2026-03-30 17:59:36

领域: cs.RO,cs.AI,cs.CL,cs.CV,cs.LG

下载: http://arxiv.org/abs/2511.07732v2

Securing Elliptic Curve Cryptocurrencies against Quantum Vulnerabilities: Resource Estimates and Mitigations

This whitepaper seeks to elucidate implications that the capabilities of developing quantum architectures have on blockchain vulnerabilities and mitigation strategies. First, we provide new resource estimates for breaking the 256-bit Elliptic Curve Discrete Logarithm Problem, the core of modern blockchain cryptography. We demonstrate that Shor's algorithm for this problem can execute with either <1200 logical qubits and <90 million Toffoli gates or <1450 logical qubits and <70 million Toffoli gates. In the interest of responsible disclosure, we use a zero-knowledge proof to validate these results without disclosing attack vectors. On superconducting architectures with 1e-3 physical error rates and planar connectivity, those circuits can execute in minutes using fewer than half a million physical qubits. We introduce a critical distinction between fast-clock (such as superconducting and photonic) and slow-clock (such as neutral atom and ion trap) architectures. Our analysis reveals that the first fast-clock CRQCs would enable on-spend attacks on public mempool transactions of some cryptocurrencies. We survey major cryptocurrency vulnerabilities through this lens, identifying systemic risks associated with advanced features in some blockchains such as smart contracts, Proof-of-Stake consensus, and Data Availability Sampling, as well as the enduring concern of abandoned assets. We argue that technical solutions would benefit from accompanying public policy and discuss various frameworks of digital salvage to regulate the recovery or destruction of dormant assets while preventing adversarial seizure. We also discuss implications for other digital assets and tokenization as well as challenges and successful examples of the ongoing transition to Post-Quantum Cryptography (PQC). Finally, we urge all vulnerable cryptocurrency communities to join the ongoing migration to PQC without delay.

Updated: 2026-03-30 17:59:25

标题: 保护椭圆曲线加密货币免受量子漏洞的影响：资源评估和缓解措施

摘要: 这份白皮书旨在阐明发展中的量子架构对区块链漏洞和缓解策略的影响。首先，我们提供了破解256位椭圆曲线离散对数问题的新资源估算，这是现代区块链密码学的核心。我们展示了Shor算法可以用<1200个逻辑量子比特和<9000万Toffoli门或<1450个逻辑量子比特和<7000万Toffoli门来执行此问题。为了负责任地披露，我们使用零知识证明来验证这些结果，而不披露攻击向量。在物理错误率为1e-3且具有平面连接性的超导架构上，这些电路可以在几分钟内执行，使用少于一百万物理量子比特。我们引入了快时钟（如超导和光子）和慢时钟（如中性原子和离子阱）架构之间的关键区别。我们的分析表明，第一批快时钟CRQC将能够对某些加密货币的公共内存池交易进行攻击。通过这个视角，我们对主要加密货币的漏洞进行了调查，识别了与一些区块链中的高级功能（如智能合约、权益证明共识和数据可用性抽样）以及被废弃资产的持久关注相关的系统风险。我们认为技术解决方案将受益于伴随的公共政策，并讨论了各种数字救援框架，以规范废弃资产的恢复或销毁，同时防止对手夺取。我们还讨论了对其他数字资产和代币化的影响，以及正在进行的转向后量子密码学（PQC）的挑战和成功示例。最后，我们敦促所有容易受攻击的加密货币社区立即加入向PQC的迁移。

更新时间: 2026-03-30 17:59:25

领域: quant-ph,cs.CR

下载: http://arxiv.org/abs/2603.28846v1

Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds

Similarity measures are widely used to interpret the representational geometries used by neural networks to solve tasks. Yet, because existing methods compare the extrinsic geometry of representations in state space, rather than their intrinsic geometry, they may fail to capture subtle yet crucial distinctions between fundamentally different neural network solutions. Here, we introduce metric similarity analysis (MSA), a novel method which leverages tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to i) disentangle features of neural computations in deep networks with different learning regimes, ii) compare nonlinear dynamics, and iii) investigate diffusion models. Hence, we introduce a mathematically grounded and broadly applicable framework to understand the mechanisms behind neural computations by comparing their intrinsic geometries.

Updated: 2026-03-30 17:59:22

标题: 在黎曼和统计流形上的几何感知相似度度量方法

摘要: 相似性度量广泛用于解释神经网络用于解决任务的表征几何结构。然而，由于现有方法比较状态空间中表示的外部几何结构，而不是它们的内在几何结构，它们可能无法捕捉基本不同的神经网络解决方案之间微妙而关键的区别。在这里，我们介绍了度量相似性分析（MSA），这是一种新颖的方法，利用黎曼几何工具来比较流形假设下神经表征的内在几何结构。我们展示了MSA可以用于i）解开具有不同学习机制的深度网络中的神经计算特征，ii）比较非线性动态，以及iii）研究扩散模型。因此，我们引入了一个在数学上有根基且广泛适用的框架，通过比较神经计算的内在几何结构来理解神经计算背后的机制。

更新时间: 2026-03-30 17:59:22

领域: cs.LG,cs.AI,math.DG,q-bio.NC

下载: http://arxiv.org/abs/2603.28764v1

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.

Updated: 2026-03-30 17:59:13

标题: Diffusion Transformers 中的上下文空间中的即时排斥，以实现丰富多样性

摘要: 现代文本到图像（T2I）扩散模型在语义对齐方面取得了显著进展，但往往缺乏多样性，对于任何给定的提示，往往会收敛于一组狭窄的视觉解决方案。这种典型性偏见为需要广泛生成结果的创意应用程序带来了挑战。我们确定了当前方法中多样性的一个基本权衡：修改模型输入需要昂贵的优化，以纳入生成路径的反馈。相比之下，对空间承诺的中间潜在因素进行干预往往会破坏正在形成的视觉结构，导致伪像。在这项工作中，我们提出在上下文空间中应用排斥作为实现Diffusion Transformers丰富多样性的新框架。通过干预多模态注意力通道，我们在变压器的前向传递过程中实时应用排斥，在文本调节与新兴图像结构丰富的区块之间注入干预。这允许在结构上被告知但组合未固定之前重定向指导轨迹。我们的结果表明，在上下文空间中的排斥产生了显著更丰富的多样性，而无需牺牲视觉保真度或语义遵从性。此外，我们的方法具有独特的高效性，在现代的“Turbo”和精简模型中，即使传统的基于轨迹的干预通常失败，也能施加较小的计算开销。

更新时间: 2026-03-30 17:59:13

领域: cs.CV,cs.AI,cs.GR,cs.LG

下载: http://arxiv.org/abs/2603.28762v1

Primitive-Root Determinant Densities over Prime Fields and Implications for PRIM-LWE

The PRIM-LWE problem, introduced by Sehrawat, Yeo, and Desmedt (Theoretical Computer Science, 886 (2021)), is a variant of the Learning with Errors problem in which the secret matrix is required to have a primitive-root determinant. The dimension-uniform reduction constant is $c(p)=\inf_{n\ge 1}c_n(p)$, where $c_n(p)$ is the exact density of $n\times n$ matrices over $\mathbb{F}_p$ with primitive-root determinant. Sehrawat, Yeo, and Desmedt asked whether $\inf_{p\text{ prime}} c(p)=0$, observing that an affirmative answer would follow from the conjectural infinitude of primorial primes. We resolve this question unconditionally using only Dirichlet's theorem and Mertens' product formula, entirely bypassing the primorial-prime hypothesis. We further establish the sharp order \[ \min_{p\le x} c(p)\asymp \frac{1}{\log\log x} \qquad (x\to\infty), \] and show that the limiting distribution of $c(p)$ over the primes has support exactly $[0,1/2]$. We have not found this full-support statement in the literature. The law coincides with the classical shifted-prime distribution of $\varphi(p-1)/(p-1)$ via a transport lemma and is moreover continuous and purely singular. We also derive explicit lower bounds on $c(q)$ for primes of cryptographic interest, parameterized solely by the number of distinct prime factors of $q-1$. As a simple conservative explicit bound, for any prime $q>2^{30}$ the expected overhead $1/c(q)$ is at most $1.79\log q$. On the other hand, our results show that the worst-case overhead among primes $p\le x$ is of order $Θ(\log\log x)$, and in particular $1/c(q)=O(\log\log q)$ pointwise.

Updated: 2026-03-30 17:56:18

标题: 原根确定密度在素域上的应用及对PRIM-LWE的影响

摘要: PRIM-LWE问题是由Sehrawat，Yeo和Desmedt引入的，是在错误学习问题的一个变种，其中要求秘密矩阵具有原根行列式。维度统一的降低常数为$c(p)=\inf_{n\ge 1}c_n(p)$，其中$c_n(p)$是在$\mathbb{F}_p$上具有原根行列式的$n\times n$矩阵的精确密度。Sehrawat，Yeo和Desmedt提出了一个问题，即是否$\inf_{p\text{ prime}} c(p)=0$，观察到如果肯定的答案将从原则上的素数无限性猜想得出。我们无条件地解决了这个问题，只使用了狄利克雷定理和默滕斯乘积公式，完全绕过原则性素数假设。我们进一步确定了尖锐的顺序 \[ \min_{p\le x} c(p)\asymp \frac{1}{\log\log x} \qquad (x\to\infty), \]并且展示了在素数上的$c(p)$的极限分布完全支持为$[0,1/2]$。我们在文献中没有找到这个全支持声明。该定律与$\varphi(p-1)/(p-1)$的经典偏移素数分布通过传输引理一致，并且是连续且纯奇异的。我们还推导了对加密兴趣的素数$q$的$c(q)$的显式下界，仅由$q-1$的不同素因子的数量参数化。作为一个简单保守的显式上界，对于任何大于$2^{30}$的素数$q$，预期的开销$1/c(q)$最多为$1.79\log q$。另一方面，我们的结果表明，在小于$x$的素数$p$中，最坏情况下的开销是$Θ(\log\log x)$的数量级，特别地$1/c(q)=O(\log\log q)$逐点。

更新时间: 2026-03-30 17:56:18

领域: cs.CR,math.NT

下载: http://arxiv.org/abs/2603.11196v3

Temporal Credit Is Free

Recurrent networks do not need Jacobian propagation to adapt online. The hidden state already carries temporal credit through the forward pass; immediate derivatives suffice if you stop corrupting them with stale trace memory and normalize gradient scales across parameter groups. An architectural rule predicts when normalization is needed: \b{eta}2 is required when gradients must pass through a nonlinear state update with no output bypass, and unnecessary otherwise. Across ten architectures, real primate neural data, and streaming ML benchmarks, immediate derivatives with RMSprop match or exceed full RTRL, scaling to n = 1024 at 1000x less memory.

Updated: 2026-03-30 17:54:55

标题: 时间信用是免费的

摘要: 循环网络在线调整时不需要使用雅可比传播。隐藏状态通过前向传递已经携带了时间信用；如果您停止用陈旧的跟踪记忆破坏它们，即时导数就足够了，并且通过参数组归一化梯度尺度。一个架构规则预测了归一化何时需要：当梯度必须通过无输出旁路的非线性状态更新时需要\eta2，否则是不必要的。在十种架构、真实灵长类神经数据和流式ML基准测试中，使用RMSprop的即时导数与完整的RTRL匹配或超过，可以在1,000倍的内存下扩展到n = 1024。

更新时间: 2026-03-30 17:54:55

领域: cs.LG

下载: http://arxiv.org/abs/2603.28750v1

To Augment or Not to Augment? Diagnosing Distributional Symmetry Breaking

Symmetry-aware methods for machine learning, such as data augmentation and equivariant architectures, encourage correct model behavior on all transformations (e.g. rotations or permutations) of the original dataset. These methods can improve generalization and sample efficiency, under the assumption that the transformed datapoints are highly probable, or "important", under the test distribution. In this work, we develop a method for critically evaluating this assumption. In particular, we propose a metric to quantify the amount of symmetry breaking in a dataset, via a two-sample classifier test that distinguishes between the original dataset and its randomly augmented equivalent. We validate our metric on synthetic datasets, and then use it to uncover surprisingly high degrees of symmetry-breaking in several benchmark point cloud datasets, constituting a severe form of dataset bias. We show theoretically that distributional symmetry-breaking can prevent invariant methods from performing optimally even when the underlying labels are truly invariant, for invariant ridge regression in the infinite feature limit. Empirically, the implication for symmetry-aware methods is dataset-dependent: equivariant methods still impart benefits on some symmetry-biased datasets, but not others, particularly when the symmetry bias is predictive of the labels. Overall, these findings suggest that understanding equivariance -- both when it works, and why -- may require rethinking symmetry biases in the data.

Updated: 2026-03-30 17:52:45

标题: 增加还是不增加？诊断分布对称性破坏

摘要: 对于机器学习而言，对称性感知方法，如数据增强和等变体结构，鼓励模型在原始数据集的所有转换（如旋转或排列）上表现正确。这些方法可以提高泛化能力和样本效率，前提是变换后的数据点在测试分布下高度可能，或者说在测试分布下“重要”。在这项工作中，我们开发了一种方法来批判性地评估这一假设。特别地，我们提出了一种度量数据集中对称性破坏程度的方法，通过一个双样本分类器测试，区分原始数据集和其随机增强等价物之间的差异。我们在合成数据集上验证了我们的度量方法，然后使用它来揭示几个基准点云数据集中令人惊讶的高度对称性破坏程度，构成一种严重的数据集偏差形式。我们理论上表明，分布对称破坏可以阻止不变方法在无限特征极限下实现最佳性能，即使底层标签实际上是不变的，对于无限特征极限下的不变岭回归。从经验上看，对称性感知方法的含义取决于数据集：等变方法仍然在一些对称性偏斜的数据集上带来益处，但在其他数据集上则不然，特别是当对称性偏差可以预测标签时。总的来说，这些发现表明，理解等变性——包括它何时有效以及为什么有效——可能需要重新考虑数据中的对称性偏差。

更新时间: 2026-03-30 17:52:45

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2510.01349v2

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning -- not the inference procedure -- as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.

Updated: 2026-03-30 17:52:16

标题: 停止探究，开始编码：为什么线性探测器和稀疏自编码器在组合泛化方面失败

摘要: 线性表示假设表明神经网络激活将高级概念编码为线性混合物。然而，在叠加情况下，这种编码是从更高维概念空间投影到较低维激活空间的过程，并且概念空间中的线性决策边界在投影后不一定保持线性。在这种情况下，经典的稀疏编码方法通过每个样本的迭代推理利用了压缩感知保证来恢复潜在因子。另一方面，稀疏自编码器（SAEs）将稀疏推理摊销到一个固定的编码器中，引入了一个系统性差距。我们展示了这种摊销差距在训练集大小、潜在维度和稀疏水平之间持续存在，导致SAEs在超出分布（OOD）的复合转移下失败。通过分解失败的受控实验，我们确定字典学习 -- 而不是推理过程 -- 是约束因素：SAE学习到的字典指向完全错误的方向，用相同字典替换编码器并在每个样本上使用FISTA并不能消除差距。一个oracle基线证明了在所有测试规模上都可以通过一个好的字典来解决问题。我们的结果重新定义了SAE的失败为一个字典学习挑战，而不是一个摊销问题，并指出可扩展的字典学习是在叠加情况下稀疏推理的关键未解决问题。

更新时间: 2026-03-30 17:52:16

领域: cs.LG

下载: http://arxiv.org/abs/2603.28744v1

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-$μ$P remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.

Updated: 2026-03-30 17:51:47

标题: 重新思考在可转移的超球优化下的语言模型扩展

摘要: 大型语言模型的缩放规律在很大程度上取决于优化器和参数化。现有的超参数转移规律主要针对一阶优化器进行开发，并且它们在结构上不能有效防止规模上的训练不稳定性。最近的超球面优化方法将权重矩阵限制在固定范数的超球面上，为更稳定的缩放提供了一个有希望的替代方案。我们引入了HyperP（超球面参数化），这是第一个在Frobenius-球面约束下使用Muon优化器跨模型宽度、深度、训练标记和专家混合（MoE）粒度进行最优学习率转移的框架。我们证明了在Frobenius球面上，权重衰减是一个一阶无操作符，展示了深度-$μ$P仍然是必要的，并且发现最优学习率遵循与之前观察到的AdamW的“魔幂指数”0.32相同的数据缩放幂律。在HyperP下，调整在最小规模上的单个基础学习率可以跨所有计算预算进行转移，以$6\times10^{21}$ FLOPs的强Muon基准为基础，实现了$1.58\times$ 的计算效率。此外，HyperP提供了可转移的稳定性：所有监测到的不稳定性指标，包括$Z$值、输出RMS和激活异常值，在训练FLOPs缩放下保持有界且不增加。我们还提出了SqrtGate，这是一种从超球面约束中派生出的MoE门控机制，可以在不同MoE粒度中保持输出RMS，从而改善粒度缩放，并展示了超球面优化可以实现更大的辅助负载平衡权重，既取得了良好的性能又保持了专家平衡。我们在https://github.com/microsoft/ArchScale上发布了我们的训练代码库。

更新时间: 2026-03-30 17:51:47

领域: cs.LG

下载: http://arxiv.org/abs/2603.28743v1

Hidden Elo: Private Matchmaking through Encrypted Rating Systems

Matchmaking has become a prevalent part in contemporary applications, being used in dating apps, social media, online games, contact tracing and in various other use-cases. However, most implementations of matchmaking require the collection of sensitive/personal data for proper functionality. As such, with this work we aim to reduce the privacy leakage inherent in matchmaking applications. We propose H-Elo, a Fully Homomorphic Encryption (FHE)-based, private rating system, which allows for secure matchmaking through the use of traditional rating systems. In this work, we provide the construction of H-Elo, analyse the security of it against a capable adversary as well as benchmark our construction in a chess-based rating update scenario. Through our experiments we show that H-Elo can achieve similar accuracy to a plaintext implementation, while keeping rating values private and secure. Additionally, we compare our work to other private matchmaking solutions as well as cover some future directions in the field of private matchmaking. To the best of our knowledge we provide one of the first private and secure rating system-based matchmaking protocols.

Updated: 2026-03-30 17:50:57

标题: 隐藏的Elo：通过加密评级系统进行私人匹配

摘要: 相互匹配已成为当代应用程序中普遍存在的一部分，在约会应用程序、社交媒体、在线游戏、接触者追踪以及其他各种用例中被使用。然而，大多数相互匹配的实现都需要收集敏感/个人数据以确保功能正常。因此，本文旨在减少相互匹配应用程序中固有的隐私泄露。我们提出了基于完全同态加密（FHE）的私密评分系统H-Elo，通过传统评分系统实现安全的相互匹配。在这项工作中，我们提供了H-Elo的构建，分析了其在面对有能力的对手时的安全性，并在基于棋类评分更新场景中对我们的构建进行了基准测试。通过我们的实验，我们展示了H-Elo可以实现与明文实现类似的准确性，同时保持评分值的私密性和安全性。另外，我们将我们的工作与其他私密相互匹配解决方案进行了比较，并探讨了私密相互匹配领域的一些未来方向。据我们所知，我们提供了一种基于私密和安全评分系统的相互匹配协议的首批实现之一。

更新时间: 2026-03-30 17:50:57

领域: cs.CR

下载: http://arxiv.org/abs/2603.26407v2

Expectation Error Bounds for Transfer Learning in Linear Regression and Linear Neural Networks

In transfer learning, the learner leverages auxiliary data to improve generalization on a main task. However, the precise theoretical understanding of when and how auxiliary data help remains incomplete. We provide new insights on this issue in two canonical linear settings: ordinary least squares regression and under-parameterized linear neural networks. For linear regression, we derive exact closed-form expressions for the expected generalization error with bias-variance decomposition, yielding necessary and sufficient conditions for auxiliary tasks to improve generalization on the main task. We also derive globally optimal task weights as outputs of solvable optimization programs, with consistency guarantees for empirical estimates. For linear neural networks with shared representations of width $q \leq K$, where $K$ is the number of auxiliary tasks, we derive a non-asymptotic expectation bound on the generalization error, yielding the first non-vacuous sufficient condition for beneficial auxiliary learning in this setting, as well as principled directions for task weight curation. We achieve this by proving a new column-wise low-rank perturbation bound for random matrices, which improves upon existing bounds by preserving fine-grained column structures. Our results are verified on synthetic data simulated with controlled parameters.

Updated: 2026-03-30 17:50:52

标题: 线性回归和线性神经网络中的迁移学习的期望误差界

摘要: 在迁移学习中，学习者利用辅助数据来提高主要任务的泛化能力。然而，关于辅助数据何时以及如何帮助的精确理论理解仍然不完善。我们在两个经典的线性设置中就这个问题提供了新的见解：普通最小二乘回归和参数不足的线性神经网络。对于线性回归，我们推导出期望泛化误差的精确闭式表达式，并进行了偏差-方差分解，得出了辅助任务改善主要任务泛化的必要和充分条件。我们还推导出作为可解优化程序的输出的全局最优任务权重，并对经验估计提供了一致性保证。对于具有宽度 $q \leq K$ 的共享表示的线性神经网络，其中 $K$ 是辅助任务的数量，我们推导出了泛化误差的非渐近期望界限，得出了在这种情况下有益的辅助学习的第一个非空条件，以及任务权重策划的原则性方向。我们通过证明随机矩阵的新列低秩扰动界限来实现这一点，这通过保留细粒度的列结构改进了现有的界限。我们在使用受控参数模拟的合成数据上验证了我们的结果。

更新时间: 2026-03-30 17:50:52

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.28739v1

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models' performance on style caption retrieval, speech attribute classification and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap .

Updated: 2026-03-30 17:50:07

标题: ParaSpeechCLAP：用于丰富风格语言-音频预训练的双编码器语音文本模型

摘要: 我们介绍了ParaSpeechCLAP，这是一个双编码对比模型，将语音和文本风格字幕映射到一个共同的嵌入空间，支持一系列广泛的内在（说话者级别）和情境（话语级别）描述符（如音调、质地和情绪），远远超出现有模型处理的狭窄范围。我们训练了专门的ParaSpeechCLAP-Intrinsic和ParaSpeechCLAP-Situational模型以及统一的ParaSpeechCLAP-Combined模型，发现专业化在个别风格维度上表现更强，而统一模型在组合评估方面表现出色。我们进一步展示了ParaSpeechCLAP-Intrinsic受益于额外的分类损失和类平衡训练。我们展示了我们的模型在风格字幕检索、语音属性分类以及作为推理时间奖励模型的性能，该模型改善了风格提示的TTS而无需额外训练。在所有三个应用程序中，ParaSpeechCLAP在大多数指标上优于基线。我们的模型和代码已发布在https://github.com/ajd12342/paraspeechclap。

更新时间: 2026-03-30 17:50:07

领域: eess.AS,cs.AI,cs.CL,cs.SD

下载: http://arxiv.org/abs/2603.28737v1

RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems

AI-augmented ecosystems (interconnected systems where multiple AI components interact through shared data and infrastructure) are becoming the architectural norm for smart cities, autonomous fleets, and intelligent platforms. Yet the architecture documentation frameworks practitioners rely on, arc42 and the C4 model, were designed for deterministic software and cannot capture probabilistic behavior, data-dependent evolution, or dual ML/software lifecycles. This gap carries regulatory consequence: the EU AI Act (Regulation 2024/1689) mandates technical documentation through Annex IV that no existing framework provides structured support for, with enforcement for high-risk systems beginning August 2, 2026. We present RAD-AI, a backward-compatible extension framework that augments arc42 with eight AI-specific sections and C4 with three diagram extensions, complemented by a systematic EU AI Act Annex IV compliance mapping. A regulatory coverage assessment with six experienced software-architecture practitioners provides preliminary evidence that RAD-AI increases Annex IV addressability from approximately 36% to 93% (mean rating) and demonstrates substantial improvement over existing frameworks. Comparative analysis on two production AI platforms (Uber Michelangelo, Netflix Metaflow) captures eight additional AI-specific concerns missed by standard frameworks and demonstrates that documentation deficiencies are structural rather than domain-specific. An illustrative smart mobility ecosystem case study reveals ecosystem-level concerns, including cascading drift and differentiated compliance obligations, that are invisible under standard notation.

Updated: 2026-03-30 17:48:56

标题: RAD-AI: 重新思考AI增强生态系统的架构文档

摘要: 人工智能增强生态系统（通过共享数据和基础设施，多个人工智能组件相互交互的相互连接系统）正在成为智能城市、自主车队和智能平台的架构规范。然而，从业者依赖的架构文档框架arc42和C4模型是为确定性软件设计的，无法捕捉概率行为、数据依赖演化或双重ML/软件生命周期。这一差距带来了监管后果：欧盟人工智能法案（2024/1689号法规）通过附件IV规定了技术文档，目前没有任何现有框架提供结构化支持，对于高风险系统，规定将于2026年8月2日开始实施。我们提出了RAD-AI，这是一个向后兼容的扩展框架，通过增加arc42的八个人工智能特定部分和C4的三个图表扩展来弥补这一缺陷，同时辅以系统性的欧盟人工智能法案附件IV合规性映射。通过对六位有经验的软件架构从业者进行的监管范围评估，初步证据表明RAD-AI将附件IV的可处理性从36%提高到93%（平均评分），并且相对于现有框架表现出显著改进。在两个生产型人工智能平台（Uber Michelangelo、Netflix Metaflow）的比较分析中，捕捉到了标准框架所忽视的八个额外的人工智能特定问题，并且表明文档缺陷是结构性的而非领域特定的。一个生动的智能移动生态系统案例研究揭示了生态系统级别的问题，包括级联漂移和不同的合规义务，在标准符号下是不可见的。

更新时间: 2026-03-30 17:48:56

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2603.28735v1

See it to Place it: Evolving Macro Placements with Vision-Language Models

We propose using Vision-Language Models (VLMs) for macro placement in chip floorplanning, a complex optimization task that has recently shown promising advancements through machine learning methods. Because human designers rely heavily on spatial reasoning to arrange components on the chip canvas, we hypothesize that VLMs with strong visual reasoning abilities can effectively complement existing learning-based approaches. We introduce VeoPlace (Visual Evolutionary Optimization Placement), a novel framework that uses a VLM, without any fine-tuning, to guide the actions of a base placer by constraining them to subregions of the chip canvas. The VLM proposals are iteratively optimized through an evolutionary search strategy with respect to resulting placement quality. On open-source benchmarks, VeoPlace outperforms the best prior learning-based approach on 9 of 10 benchmarks with peak wirelength reductions exceeding 32%. We further demonstrate that VeoPlace generalizes to analytical placers, improving DREAMPlace performance on all 8 evaluated benchmarks with gains up to 4.3%. Our approach opens new possibilities for electronic design automation tools that leverage foundation models to solve complex physical design problems.

Updated: 2026-03-30 17:47:34

标题: 看见它才能摆放它：通过视觉-语言模型演变的宏观摆放

摘要: 我们提议在芯片布局中使用视觉语言模型（VLMs），这是一个复杂的优化任务，最近通过机器学习方法取得了令人期待的进展。由于人类设计师在芯片画布上排列组件时严重依赖空间推理，我们假设具有强大视觉推理能力的VLMs可以有效地补充现有的基于学习的方法。我们引入了VeoPlace（Visual Evolutionary Optimization Placement），这是一个新颖的框架，使用VLMs，无需任何微调，通过限制基础放置器的操作范围到芯片画布的子区域来指导其行动。VLMs的提议通过演化搜索策略进行迭代优化，以提高放置质量。在开源基准测试中，VeoPlace在10个基准测试中有9个表现优于最佳的先前基于学习的方法，最大线长减少超过32%。我们进一步证明VeoPlace推广到分析放置器，改善了所有8个评估基准测试上的DREAMPlace性能，增益高达4.3%。我们的方法为利用基础模型解决复杂物理设计问题的电子设计自动化工具开辟了新的可能性。

更新时间: 2026-03-30 17:47:34

领域: cs.LG

下载: http://arxiv.org/abs/2603.28733v1

SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability

Modern distributed systems integrate heterogeneous services, REST APIs with different schema versions, GraphQL endpoints, and IoT devices with proprietary payloads that suffer from persistent schema mismatches. Traditional static adapters require manual coding for every schema pair and cannot handle novel combinations at runtime. We present SAGAI-MID, a FastAPI-based middleware that uses large language models (LLMs) to dynamically detect and resolve schema mismatches at runtime. The system employs a five-layer pipeline: hybrid detection (structural diff plus LLM semantic analysis), dual resolution strategies (per-request LLM transformation and LLM-generated reusable adapter code), and a three-tier safeguard stack (validation, ensemble voting, rule-based fallback). We frame the architecture through Bass et al.'s interoperability tactics, transforming them from design-time artifacts into runtime capabilities. We evaluate SAGAI-MID on 10 interoperability scenarios spanning REST version migration, IoT-to-analytics bridging, and GraphQL protocol conversion across six LLMs from two providers. The best-performing configuration achieves 0.90 pass@1 accuracy. The CODEGEN strategy consistently outperforms DIRECT (0.83 vs 0.77 mean pass@1), while cost varies by over 30x across models with no proportional accuracy gain; the most accurate model is also the cheapest. We discuss implications for software architects adopting LLMs as runtime architectural components.

Updated: 2026-03-30 17:46:41

标题: SAGAI-MID：用于动态运行时互操作性的生成式人工智能驱动中间件

摘要: 现代分布式系统集成了异构服务、具有不同模式版本的REST API、GraphQL端点和具有专有有效负载的物联网设备，这些设备遭受持续的模式不匹配问题。传统的静态适配器需要为每个模式对手动编码，并且无法在运行时处理新颖的组合。我们提出了基于FastAPI的中间件SAGAI-MID，该中间件使用大型语言模型（LLM）在运行时动态检测和解决模式不匹配问题。该系统采用了五层管道：混合检测（结构差异加上LLM语义分析）、双重解决策略（每个请求LLM转换和LLM生成的可重用适配器代码）以及三层保护堆栈（验证、集成投票、基于规则的回退）。我们通过Bass等人的互操作策略框架来描述架构，将其从设计时工件转变为运行时能力。我们在涵盖REST版本迁移、物联网到分析桥接和GraphQL协议转换的10个互操作性场景上评估了SAGAI-MID，在来自两个提供商的六个LLM上进行。表现最佳的配置达到了0.90的一次通过准确率。CODEGEN策略始终优于DIRECT（均值通过率为0.83比0.77），而成本在模型之间变化超过30倍，没有相应的准确性增益；最准确的模型也是最便宜的。我们讨论了采用LLM作为运行时架构组件的软件架构师的影响。

更新时间: 2026-03-30 17:46:41

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2603.28731v1

BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance

Interpreting gene clusters from RNA-seq remains challenging, especially in antimicrobial resistance studies where mechanistic context is essential for hypothesis generation. Conventional enrichment methods summarize co-expressed modules using predefined categories, but often return sparse results and lack cluster-specific, literature-linked explanations. We present BIOGEN, an evidence-grounded multi-agent framework for post hoc interpretation of RNA-seq transcriptional modules that integrates biomedical retrieval, structured reasoning, and multi-critic verification. BIOGEN organizes evidence from PubMed and UniProt into traceable cluster-level interpretations with explicit support and confidence tiering. On a primary Salmonella enterica dataset, BIOGEN achieved strong evidence-grounding performance while reducing hallucination from 0.67 in an unconstrained LLM setting to 0.00 under retrieval-grounded configurations. Compared with KEGG/ORA and GO/ORA, BIOGEN recovered broader biological coverage, identifying substantially more biological themes per cluster. Across four additional bacterial RNA-seq datasets, BIOGEN maintained zero hallucination and consistently outperformed KEGG/ORA in cluster-level thematic coverage. These results position BIOGEN as an interpretive support framework that complements transcriptomic workflows through improved traceability, evidential transparency, and biological coverage.

Updated: 2026-03-30 17:46:12

标题: 生物源：基于证据的多智能体推理框架用于抗菌耐药转录组解释

摘要: 从RNA-seq解释基因簇仍然具有挑战性，特别是在抗微生物药物抗性研究中，机制背景对于假设生成至关重要。传统的富集方法使用预定义的类别总结共表达模块，但通常返回稀疏的结果，并且缺乏特定于簇的、与文献相关的解释。我们提出了BIOGEN，这是一个基于证据的多代理框架，用于对RNA-seq转录模块进行事后解释，整合了生物医学检索、结构化推理和多批评者验证。BIOGEN将来自PubMed和UniProt的证据组织成可追踪的簇级解释，具有明确的支持和信心分层。在一个主要的沙门氏菌数据集上，BIOGEN实现了强有力的证据支撑性能，将在无约束的LLM设置下从0.67的幻觉减少到在检索基础配置下的0.00。与KEGG/ORA和GO/ORA相比，BIOGEN恢复了更广泛的生物覆盖范围，每个簇识别出更多的生物主题。在另外四个细菌RNA-seq数据集中，BIOGEN保持了零幻觉，并在簇级主题覆盖范围上始终优于KEGG/ORA。这些结果将BIOGEN定位为一个解释支持框架，通过改进的可追踪性、证据透明性和生物覆盖范围，为转录组工作流提供补充。

更新时间: 2026-03-30 17:46:12

领域: q-bio.QM,cs.AI,cs.LG

下载: http://arxiv.org/abs/2510.16082v3

BitSov: A Composable Bitcoin-Native Architecture for Sovereign Internet Infrastructure

Today's internet concentrates identity, payments, communication, and content hosting under a small number of corporate intermediaries, creating single points of failure, enabling censorship, and extracting economic rent from participants. We present BitSov, an architectural framework for sovereign internet infrastructure that composes existing decentralized technologies (Bitcoin, Lightning Network, decentralized storage, federated messaging, and mesh connectivity) into a unified, eight-layer protocol stack anchored to Bitcoin's base layer. The framework introduces three architectural patterns: (1) payment-gated messaging, where every transmitted message requires cryptographic proof of a Bitcoin payment, deterring spam through economic incentives rather than moderation; (2) timechain-locked contracts, which anchor subscriptions and licenses to Bitcoin block height (the timechain) rather than calendar dates; and (3) a self-sustaining economic flywheel that converts service revenue into infrastructure growth. A dual settlement model supports both on-chain transactions for permanence and auditability and Lightning micropayments for high-frequency messaging. As a position paper, we analyze the quality attributes, discuss open challenges, and propose a research agenda for empirical validation.

Updated: 2026-03-30 17:44:06

标题: BitSov：一种用于主权互联网基础架构的可组合比特币本地架构

摘要: 今天的互联网将身份、支付、通讯和内容托管集中在少数几家公司的中间人手中，造成单一故障点，促使审查，并从参与者身上提取经济租金。我们提出了BitSov，这是一个主权互联网基础架构的架构框架，它将现有的去中心化技术（比特币、闪电网络、去中心化存储、联合消息传递和网状连接）组合成一个统一的、八层协议栈，其中以比特币的基础层为基础。该框架引入了三种架构模式：（1）通过支付门控的消息传递，每个传输的消息都需要比特币支付的加密证明，通过经济激励而不是调解来阻止垃圾信息传播；（2）时间链锁定合同，将订阅和许可证锚定在比特币区块高度（时间链）而不是日历日期上；以及（3）一种自我维持的经济飞轮，将服务收入转化为基础设施的增长。双重结算模型支持链上交易以实现永久性和可审计性，同时支持闪电微支付以实现高频消息传递。作为一份立场文件，我们分析了质量属性，讨论了开放挑战，并提出了一个用于实证验证的研究议程。

更新时间: 2026-03-30 17:44:06

领域: cs.CR,cs.DC,cs.NI,cs.SE

下载: http://arxiv.org/abs/2603.28727v1

Online monotone density estimation and log-optimal calibration

We study the problem of online monotone density estimation, where density estimators must be constructed in a predictable manner from sequentially observed data. We propose two online estimators: an online analogue of the classical Grenander estimator, and an expert aggregation estimator inspired by exponential weighting methods from the online learning literature. In the well-specified stochastic setting, where the underlying density is monotone, we show that the expected cumulative log-likelihood gap between the online estimators and the true density admits an $O(n^{1/3})$ bound. We further establish a $\sqrt{n\log{n}}$ pathwise regret bound for the expert aggregation estimator relative to the best offline monotone estimator chosen in hindsight, under minimal regularity assumptions on the observed sequence. As an application of independent interest, we show that the problem of constructing log-optimal p-to-e calibrators for sequential hypothesis testing can be formulated as an online monotone density estimation problem. We adapt the proposed estimators to build empirically adaptive p-to-e calibrators and establish their optimality. Numerical experiments illustrate the theoretical results.

Updated: 2026-03-30 17:40:13

标题: 在线单调密度估计和对数优化校准

摘要: 我们研究在线单调密度估计的问题，其中密度估计器必须以可预测的方式从连续观察到的数据中构建。我们提出了两种在线估计器：经典Grenander估计器的在线类比，以及受在线学习文献中指数加权方法启发的专家聚合估计器。在良确定的随机设置中，其中底层密度是单调的，我们证明了在线估计器与真实密度之间的期望累积对数似然差异具有$O(n^{1/3})$的界限。在观察到的序列上最小正则性假设下，我们进一步建立了专家聚合估计器相对于事后选择的最佳离线单调估计器的$\sqrt{n\log{n}}$路径后悔界限。作为一个独立感兴趣的应用，我们展示了构建用于顺序假设检验的对数最优p-to-e校准器的问题可以被制定为在线单调密度估计问题。我们调整提出的估计器以构建经验自适应p-to-e校准器，并建立其最优性。数值实验展示了理论结果。

更新时间: 2026-03-30 17:40:13

领域: stat.ML,cs.LG,stat.ME

下载: http://arxiv.org/abs/2602.08927v2

Stepwise Credit Assignment for GRPO on Flow-Matching Models

Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.

Updated: 2026-03-30 17:35:14

标题: 阶段性信用分配对流匹配模型的翻译

摘要: Flow-GRPO成功地将强化学习应用于流模型，但是它在所有步骤上使用均匀的信用分配。这忽略了扩散生成的时间结构：早期步骤确定了组成和内容（低频结构），而晚期步骤解决了细节和纹理（高频细节）。此外，仅基于最终图像分配均匀信用可能会无意中奖励次优的中间步骤，特别是当扩散轨迹后来进行错误更正时。我们提出了Stepwise-Flow-GRPO，它根据每个步骤的奖励改进来分配信用。通过利用Tweedie公式获得中间奖励估计并引入基于增益的优势，我们的方法实现了更优越的样本效率和更快的收敛性。我们还引入了一个受DDIM启发的SDE，它提高了奖励质量，同时保留了策略梯度的随机性。

更新时间: 2026-03-30 17:35:14

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2603.28718v1

Dynamic Dual-Granularity Skill Bank for Agentic RL

Agentic reinforcement learning (RL) can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld and WebShop with Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 show that D2Skill consistently improves success rates over skill-free baselines by 10-20 points. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.

Updated: 2026-03-30 17:32:11

标题: 主动RL的动态双粒度技能库

摘要: 主动性强化学习（RL）可以从可重复使用的经验中获益，然而现有的基于技能的方法主要提取轨迹级别的指导，往往缺乏保持不断发展的技能记忆的原则性机制。我们提出了D2Skill，这是一种为主动性RL设计的动态双粒度技能库，将可重复使用的经验组织成任务技能以提供高级别的指导，以及步骤技能以提供细粒度的决策支持和错误纠正。D2Skill通过在相同策略下进行基准对比和注入技能的回合来联合训练策略和技能库，利用它们之间的性能差距来为技能更新和策略优化推导出事后效用信号。技能库完全由训练时经验构建，通过反思不断扩展，并通过具有效用意识的检索和修剪进行维护。在ALFWorld和WebShop上进行的实验，使用Qwen2.5-7B-Instruct和Qwen3-4B-Instruct-2507显示，D2Skill始终比无技能基准提高10-20个百分点的成功率。进一步的剥离和分析表明，双粒度技能建模和动态技能维护对这些收益至关重要，而所学到的技能表现出更高的效用，能够在评估设置中进行转移，并且只引入了适度的训练开销。

更新时间: 2026-03-30 17:32:11

领域: cs.AI

下载: http://arxiv.org/abs/2603.28716v1

Image-Adaptive GAN based Reconstruction

In the recent years, there has been a significant improvement in the quality of samples produced by (deep) generative models such as variational auto-encoders and generative adversarial networks. However, the representation capabilities of these methods still do not capture the full distribution for complex classes of images, such as human faces. This deficiency has been clearly observed in previous works that use pre-trained generative models to solve imaging inverse problems. In this paper, we suggest to mitigate the limited representation capabilities of generators by making them image-adaptive and enforcing compliance of the restoration with the observations via back-projections. We empirically demonstrate the advantages of our proposed approach for image super-resolution and compressed sensing.

Updated: 2026-03-30 17:31:19

标题: 基于图像自适应的生成对抗网络重建

摘要: 近年来，(深度)生成模型，如变分自动编码器和生成对抗网络，产生的样本质量有了显著改善。然而，这些方法的表征能力仍然无法捕捉复杂图像类别的完整分布，比如人脸。这一不足在先前使用预训练生成模型解决图像反问题的研究中已经明显观察到。本文建议通过使生成器成为图像自适应的，并通过反投影强制执行恢复与观测值的一致性，来缓解生成器的有限表征能力。我们通过实验证明了我们提出的方法在图像超分辨率和压缩感知方面的优势。

更新时间: 2026-03-30 17:31:19

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/1906.05284v3

GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity >= 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.

Updated: 2026-03-30 17:27:33

标题: GPU加速的变压器神经网络优化，用于实时推断

摘要: 本文介绍了使用NVIDIA TensorRT和混合精度优化设计和评估了一种用于变压器模型的GPU加速推理管道。我们评估了BERT-base（110M参数）和GPT-2（124M参数），批处理大小从1到32，序列长度从32到512。该系统相对于CPU基线实现了高达64.4倍的加速，单样本推理延迟低于10毫秒，并减少了63%的内存使用。我们引入了一种混合精度策略，将FP32保留用于数值敏感操作，如softmax和层归一化，同时将FP16应用于线性层。这种方法保持了高数值保真度（余弦相似度>=0.9998相对于基线输出）并消除了NaN不稳定性。该管道被实现为一个模块化的、容器化的系统，可以在360多种配置下进行可重现的基准测试。在NVIDIA A100上进行的跨GPU验证显示，FP16加速比在1.84倍和2.00倍之间，数值行为稳定。在SST-2上的下游评估表明，在混合精度下没有准确性下降。在WikiText-2上的验证显示，随机输入低估了全FP16的NaN不稳定性，最高可达6倍，同时确认了混合方法的稳健性（0.0%NaN，余弦相似度>=0.9998）。这些结果提供了在GPU架构中性能和准确性权衡的详细描述，并为在延迟关键环境中部署变压器模型提供了实用指导。

更新时间: 2026-03-30 17:27:33

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2603.28708v1

A Convex Route to Thermomechanics: Learning Internal Energy and Dissipation

We present a physics-based neural network framework for the discovery of constitutive models in fully coupled thermomechanics. In contrast to classical formulations based on the Helmholtz energy, we adopt the internal energy and a dissipation potential as primary constitutive functions, expressed in terms of deformation and entropy. This choice avoids the need to enforce mixed convexity--concavity conditions and facilitates a consistent incorporation of thermodynamic principles. In this contribution, we focus on materials without preferred directions or internal variables. While the formulation is posed in terms of entropy, the temperature is treated as the independent observable, and the entropy is inferred internally through the constitutive relation, enabling thermodynamically consistent modeling without requiring entropy data. Thermodynamic admissibility of the networks is guaranteed by construction. The internal energy and dissipation potential are represented by input convex neural networks, ensuring convexity and compliance with the second law. Objectivity, material symmetry, and normalization are embedded directly into the architecture through invariant-based representations and zero-anchored formulations. We demonstrate the performance of the proposed framework on synthetic and experimental datasets, including purely thermal problems and fully coupled thermomechanical responses of soft tissues and filled rubbers. The results show that the learned models accurately capture the underlying constitutive behavior. All code, data, and trained models are made publicly available via https://doi.org/10.5281/zenodo.19248596.

Updated: 2026-03-30 17:26:13

标题: 一个凸形路线通往热力学力学：学习内能和耗散

摘要: 我们提出了一个基于物理学的神经网络框架，用于在完全耦合的热力学和力学中发现本构模型。与基于Helmholtz能量的经典公式相比，我们采用内部能量和耗散势作为主要的本构函数，以变形和熵的形式表达。这种选择避免了需要强制执行混合凸凹条件，并有助于一致地融入热力学原理。在这篇文章中，我们专注于没有优选方向或内部变量的材料。虽然公式是以熵的形式提出的，但温度被视为独立的可观测量，并且熵是通过本构关系内部推断的，从而实现了在不需要熵数据的情况下进行热力学一致建模。网络的热力学适应性是通过构造保证的。内部能量和耗散潜力由输入凸神经网络表示，确保凸性并符合第二定律。客观性、材料对称性和归一化直接通过基于不变量的表示和零锚定公式嵌入到体系结构中。我们在合成和实验数据集上展示了所提出框架的性能，包括纯热问题和软组织和填充橡胶的完全耦合热力学响应。结果显示，学习的模型准确捕捉了潜在的本构行为。所有代码、数据和训练模型都可通过https://doi.org/10.5281/zenodo.19248596 公开获取。

更新时间: 2026-03-30 17:26:13

领域: cs.CE,cs.AI

下载: http://arxiv.org/abs/2603.28707v1

Vision-Language Agents for Interactive Forest Change Analysis

Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.

Updated: 2026-03-30 17:23:33

标题: 视觉语言代理人用于交互式森林变化分析

摘要: 现代森林监测工作流程越来越受益于高分辨率卫星图像的日益增多和深度学习的进步。在这种情况下，两个持久的挑战是准确的像素级变化检测和复杂森林动态的有意义的语义变化描述。虽然大型语言模型（LLMs）正在被用于交互式数据探索，但它们与视觉语言模型（VLMs）在遥感图像变化解释（RSICI）方面的整合仍未得到充分探讨。为了填补这一空白，我们引入了一个由LLM驱动的综合森林变化分析代理，支持跨多个RSICI任务的自然语言查询。所提出的系统基于LLM为主导的多级变化解释（MCI）视觉语言骨干。为了促进在森林环境中的适应和评估，我们进一步引入了Forest-Change数据集，其中包含双时间卫星图像、像素级变化掩模和使用人工注释和基于规则的方法生成的多粒度语义变化标题。实验结果显示，所提出的系统在Forest-Change数据集上实现了67.10%的mIoU和40.17%的BLEU-4分数，在LEVIR-MCI-Trees上实现了88.13%和34.41%的分数，这是LEVIR-MCI基准的树木集合，用于联合变化检测和标题。这些结果突显了交互式、LLM驱动的RSICI系统提高森林变化分析的可访问性、可解释性和效率的潜力。所有数据和代码都可以在https://github.com/JamesBrockUoB/ForestChat上公开获取。

更新时间: 2026-03-30 17:23:33

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.04497v2

NARVis: Neural Accelerated Rendering for Real-Time Scientific Point Cloud Visualization

Exploring scientific datasets with billions of samples in real-time visualization presents a challenge - balancing high-fidelity rendering with speed. This work introduces a neural accelerated renderer, NARVis, that uses the neural deferred rendering framework to visualize large-scale scientific point cloud data. NARVis augments a real-time point cloud rendering pipeline with high-quality neural post-processing, making the approach ideal for interactive visualization at scale. Specifically, we render the multi-attribute point cloud using a high-performance multi-attribute rasterizer and train a neural renderer to capture the desired post-processing effects from a conventional high-quality renderer. NARVis is effective in visualizing complex multidimensional Lagrangian flow fields and photometric scans of a large terrain as compared to the state-of-the-art high-quality renderers. Extensive evaluations demonstrate that NARVis prioritizes speed and scalability while retaining high visual fidelity. We achieve competitive frame rates of $>$126 fps for interactive rendering of $>$350M points (i.e., an effective throughput of $>$44 billion points per second) using ~12 GB of memory on RTX 2080 Ti GPU. Furthermore, NARVis is generalizable across different point clouds with similar visualization needs and the desired post-processing effects could be obtained with substantial high quality even at lower resolutions of the original point cloud, further reducing the memory requirements.

Updated: 2026-03-30 17:22:16

标题: NARVis：用于实时科学点云可视化的神经加速渲染

摘要: 在实时可视化中探索包含数十亿个样本的科学数据集提出了一个挑战 - 在高保真度渲染和速度之间平衡。本文介绍了一种神经加速渲染器NARVis，它使用神经延迟渲染框架来可视化大规模科学点云数据。NARVis通过高质量的神经后处理增强了实时点云渲染管线，使该方法在大规模交互式可视化中表现出色。具体来说，我们使用高性能的多属性光栅化器对多属性点云进行渲染，并训练一个神经渲染器来捕捉从传统高质量渲染器获得的所需后处理效果。与最先进的高质量渲染器相比，NARVis在可视化复杂的多维拉格朗日流场和大型地形的光度扫描方面表现出色。广泛的评估表明，NARVis在保持高视觉保真度的同时优先考虑速度和可扩展性。我们在RTX 2080 Ti GPU上使用约12 GB的内存实现了超过126 fps的竞争性帧速率，用于交互式渲染超过350M个点（即每秒有效吞吐量超过440亿个点）。此外，NARVis可以推广到具有相似可视化需求的不同点云，并且即使在原始点云的较低分辨率下，也可以获得相当高质量的所需后处理效果，进一步降低内存需求。

更新时间: 2026-03-30 17:22:16

领域: cs.GR,cs.CV,cs.HC,cs.LG

下载: http://arxiv.org/abs/2407.19097v2

CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling

Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods use keyframe sampling which often misses both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. We address these limitations by leveraging video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach, CoPE-VideoLM, reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal and motion reasoning, long-form understanding, and spatial scene understanding.

Updated: 2026-03-30 17:19:32

标题: CoPE-VideoLM：利用编解码器原语进行高效视频语言建模

摘要: 视频语言模型（VideoLMs）使AI系统能够理解视频中的时间动态。为了适应最大上下文窗口约束，当前方法使用关键帧采样，这往往会由于稀疏的时间覆盖而错过宏观事件和微观细节。此外，为每一帧处理完整图像及其令牌会带来大量的计算开销。我们通过利用视频编解码器原语（特别是运动矢量和残差）来解决这些限制，这些原语可以本地编码视频冗余和稀疏性，而不需要为大多数帧进行昂贵的完整图像编码。为此，我们引入了基于轻量级变压器的编码器，这些编码器可以聚合编解码器原语，并通过一种预训练策略将它们的表示与图像编码器嵌入对齐，从而在端到端微调过程中加速收敛。我们的方法CoPE-VideoLM将首个令牌的时间减少了高达86％，令牌使用率减少了高达93％，与标准VideoLM相比。此外，通过改变关键帧和编解码器原语的密度，我们在14个不同视频理解基准测试中保持或超越了性能，跨越了一般问题回答、时间和动作推理、长篇理解以及空间场景理解。

更新时间: 2026-03-30 17:19:32

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2602.13191v2

Understanding SAM's Robustness to Noisy Labels through Gradient Down-weighting

Sharpness-Aware Minimization (SAM) was introduced to improve generalization by seeking flat minima, yet it also exhibits robustness to label noise, a phenomenon that remains only partially understood. Prior work has mainly attributed this effect to SAM's tendency to prolong the learning of clean samples. In this work, we provide a complementary explanation by analyzing SAM at the element-wise level. We show that when noisy gradients dominate a parameter direction, their influence is reduced by the stronger amplification of clean gradients. This slows the memorization of noisy labels while sustaining clean learning, offering a more complete account of SAM's robustness. Building on this insight, we propose SANER (Sharpness-Aware Noise-Explicit Reweighting), a simple variant of SAM that explicitly magnifies this down-weighting effect. Experiments on benchmark image classification tasks with noisy labels demonstrate that SANER significantly mitigates noisy-label memorization and improves generalization over both SAM and SGD. Moreover, since SANER is designed from the mechanism of SAM, it can also be seamlessly integrated into SAM-like variants, further boosting their robustness.

Updated: 2026-03-30 17:14:51

标题: 通过梯度下降权重理解SAM对嘈杂标签的稳健性

摘要: Sharpness-Aware Minimization（SAM）被引入以通过寻求平坦的最小值来改善泛化，然而它也展现出对标签噪声的鲁棒性，这一现象仅被部分理解。先前的研究主要将这种效应归因于SAM延长清洁样本学习的倾向。在这项工作中，我们通过对SAM进行元素级别的分析提供了一个补充的解释。我们展示了当噪声梯度主导参数方向时，它们的影响会被干净梯度更强的放大所减弱。这减缓了噪声标签的记忆，同时维持了清洁学习，提供了对SAM鲁棒性的更完整解释。基于这一洞察，我们提出了SANER（Sharpness-Aware Noise-Explicit Reweighting），这是SAM的一个简单变体，明确放大了这种减权效应。在带有噪声标签的基准图像分类任务上进行的实验表明，SANER显著减轻了噪声标签的记忆，并在SAM和SGD上提高了泛化性能。此外，由于SANER是从SAM的机制设计而来，它也可以无缝集成到SAM类似的变体中，进一步提升它们的鲁棒性。

更新时间: 2026-03-30 17:14:51

领域: cs.LG

下载: http://arxiv.org/abs/2411.17132v2

What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely $F_1$

Ranking methods or models based on their performance is of prime importance but is tricky because performance is fundamentally multidimensional. In the case of classification, precision and recall are scores with probabilistic interpretations that are both important to consider and complementary. The rankings induced by these two scores are often in partial contradiction. In practice, therefore, it is extremely useful to establish a compromise between the two views to obtain a single, global ranking. Over the last fifty years or so, it has been proposed to take a weighted harmonic mean, known as the F-score, F-measure, or $F_β$. Generally speaking, by averaging basic scores, we obtain a score that is intermediate in terms of values. However, there is no guarantee that these scores lead to meaningful rankings and no guarantee that the rankings are good tradeoffs between these base scores. Given the ubiquity of $F_β$ scores in the literature, some clarification is in order. Concretely: (1) We establish that $F_β$-induced rankings are meaningful and define a shortest path between precision- and recall-induced rankings. (2) We frame the problem of finding a tradeoff between two scores as an optimization problem expressed with Kendall rank correlations. We show that $F_1$ and its skew-insensitive version are far from being optimal in that regard. (3) We provide theoretical tools and a closed-form expression to find the optimal value for $β$ for any distribution or set of performances, and we illustrate their use on six case studies. Code is available at https://github.com/pierard/cvpr-2026-optimal-tradeoff-precision-recall.

Updated: 2026-03-30 17:14:33

标题: 什么是精确度和召回率之间的最佳排名分数？我们总是可以找到它，而且很少是$F_1$

摘要: 根据它们的性能对排名方法或模型进行排名是非常重要的，但也很棘手，因为性能在根本上是多维的。在分类的情况下，精确度和召回率是具有概率解释的分数，这两个分数都很重要且互补。由这两个分数引发的排名通常在某种程度上相互矛盾。因此，在实践中，建立两种观点之间的折衷以获得单一的全局排名是非常有用的。在过去的大约五十年里，提出了采用加权调和平均值的方法，称为F分数，F度量或$F_β$。一般来说，通过对基本分数进行平均，我们获得一个在值方面居中的分数。然而，并不能保证这些分数会导致有意义的排名，也不能保证排名是基本分数之间的良好权衡。鉴于文献中广泛使用$F_β$分数，有必要澄清一下。具体来说：（1）我们确定$F_β$引发的排名是有意义的，并定义了精确度和召回率引发的排名之间的最短路径。（2）我们将寻找两个分数之间的权衡问题框定为用Kendall等级相关性表示的优化问题。我们发现$F_1$及其不敏感于偏斜的版本在这方面远非最佳。（3）我们提供了理论工具和一个封闭形式表达式，以找到任何分布或性能集合的$β$的最佳值，并且我们在六个案例研究中说明了它们的使用。代码可以在https://github.com/pierard/cvpr-2026-optimal-tradeoff-precision-recall找到。

更新时间: 2026-03-30 17:14:33

领域: cs.PF,cs.AI,cs.CV,cs.LG,stat.ML

下载: http://arxiv.org/abs/2511.22442v2

AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token

Updated: 2026-03-30 17:14:15

标题: AdaptToken：基于熵的MLLM长视频理解自适应令牌选择

摘要: 长视频理解对于多模态大型语言模型（MLLMs）仍然具有挑战性，因为高内存成本和上下文长度限制。先前的方法通过在短视频片段中评分和选择帧/标记来缓解这一问题，但它们缺乏一个原则性机制来（i）比较远程视频片段之间的相关性，并且（ii）在收集到足够证据后停止处理。我们提出了AdaptToken，这是一个无需训练的框架，将MLLM的自不确定性转化为长视频令牌选择的全局控制信号。AdaptToken将视频分成组，提取跨模态注意力以对每个组中的令牌进行排名，并使用模型的响应熵来估计每个组的提示相关性。这个熵信号使得可以在组之间进行全局令牌预算分配，并进一步支持早期停止（AdaptToken-Lite），当模型变得足够确定时，跳过剩余的组。在四个长视频基准测试（VideoMME、LongVideoBench、LVBench和MLVU）以及多个基本MLLMs（7B-72B）上，AdaptToken始终提高准确性（例如，在Qwen2.5-VL 7B上平均提高了+6.7），并且继续受益于极长的输入（最多达到10K帧），而AdaptToken-Lite则将推理时间减少了约一半，并具有可比较的性能。项目页面：https://haozheqi.github.io/adapt-token

更新时间: 2026-03-30 17:14:15

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.28696v1

Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.

Updated: 2026-03-30 17:14:06

标题: 解耦探索和策略优化：基于不确定性引导的树搜索用于困难探索

摘要: 发现的过程需要积极探索——收集新的、信息丰富的数据。然而，高效的自主探索仍然是一个尚未解决的重要问题。主导范式通过使用强化学习（RL）来训练具有内在动机的代理，最大化外在和内在奖励的组合目标来解决这一挑战。我们认为这种方法带来了不必要的开销：虽然政策优化对于精确的任务执行是必要的，但仅仅利用这样的机制来扩展状态覆盖范围可能是低效的。在本文中，我们提出了一个新的范式，明确将探索和开发分开，并在探索阶段绕过了RL。我们的方法使用了受到Go-With-The-Winner算法启发的树搜索策略，配以一种诗词不确性的度量来系统地推动探索。通过消除政策优化的开销，我们的方法在困难的Atari基准测试中比标准内在动机基线更有效地探索了一个数量级。此外，我们展示了通过现有的监督反向学习算法，可以将发现的轨迹提炼成可部署的策略，在蒙特祖玛的复仇、陷阱！和冒险游戏上取得了遥遥领先的最新成绩，而无需依赖领域特定的知识。最后，我们展示了我们的框架在高维连续动作空间中的通用性，通过在稀疏奖励设置中直接从图像观察中解决MuJoCo Adroit灵巧操作和AntMaze任务，而无需专家示范或离线数据集。据我们所知，这在Adroit任务中从未实现过。

更新时间: 2026-03-30 17:14:06

领域: cs.LG

下载: http://arxiv.org/abs/2603.22273v3

SpecMoE: Spectral Mixture-of-Experts Foundation Model for Cross-Species EEG Decoding

Decoding the orchestration of neural activity in electroencephalography (EEG) signals is a central challenge in bridging neuroscience with artificial intelligence. Foundation models have made strides in generalized EEG decoding, yet many existing frameworks primarily relying on separate temporal and spectral masking of raw signals during self-supervised pretraining. Such strategies often tend to bias learning toward high-frequency oscillations, as low-frequency rhythmic patterns can be easily inferred from the unmasked signal. We introduce a foundation model that utilizes a novel Gaussian-smoothed masking scheme applied to short-time Fourier transform (STFT) maps. By jointly applying time, frequency, and time-frequency Gaussian masks, we make the reconstruction task much more challenging, forcing the model to learn intricate neural patterns across both high- and low-frequency domains. To effectively recover signals under this aggressive masking strategy, we design SpecHi-Net, a U-shaped hierarchical architecture with multiple encoding and decoding stages. To accelerate large-scale pretraining, we partition the data into three subsets, each used to train an independent expert model. We then combine these models through SpecMoE, a mixture of experts framework guided by a learned spectral gating mechanism. SpecMoE achieves state-of-the-art performance across a diverse set of EEG decoding tasks, including sleep staging, emotion recognition, motor imagery classification, abnormal signal detection, and drug effect prediction. Importantly, the model demonstrates strong cross-species and cross-subject generalization, maintaining high accuracy on both human and murine EEG datasets.

Updated: 2026-03-30 17:09:43

标题: SpecMoE: 跨物种脑电图解码的谱混合专家基础模型

摘要: 在将神经科学与人工智能联系起来的过程中，解码脑电图（EEG）信号中神经活动的编排是一个中心挑战。基础模型在广义EEG解码方面取得了进展，然而许多现有框架主要依赖于在自监督预训练期间对原始信号进行分离的时间和频谱掩蔽。这种策略往往倾向于偏向高频振荡，因为低频节奏模式可以很容易地从未掩蔽的信号中推断出来。我们引入了一种基础模型，该模型利用了一种新颖的高斯平滑掩蔽方案，应用于短时傅里叶变换（STFT）映射。通过联合应用时间、频率和时间频率高斯掩蔽，我们使重建任务变得更加具有挑战性，迫使模型学习高频和低频领域中的复杂神经模式。为了在这种激进的掩蔽策略下有效地恢复信号，我们设计了SpecHi-Net，这是一个具有多个编码和解码阶段的U形分层架构。为了加速大规模预训练，我们将数据分成三个子集，每个子集用于训练一个独立的专家模型。然后，我们通过SpecMoE将这些模型组合在一起，SpecMoE是一个由学习的频谱门控机制引导的专家混合框架。 SpecMoE在各种EEG解码任务中取得了最先进的性能，包括睡眠分期、情绪识别、动作想象分类、异常信号检测和药物效应预测。重要的是，该模型表现出强大的跨物种和跨受试者泛化能力，在人类和小鼠EEG数据集上保持高准确性。

更新时间: 2026-03-30 17:09:43

领域: cs.LG,cs.AI,cs.HC

下载: http://arxiv.org/abs/2603.16739v2

Semiring Provenance for Lightweight Description Logics

We investigate semiring provenance--a successful framework originally defined in the relational database setting--for description logics. In this context, the ontology axioms are annotated with elements of a commutative semiring and these annotations are propagated to the ontology consequences in a way that reflects how they are derived. We define a provenance semantics for a language that encompasses several lightweight description logics and show its relationships with semantics that have been defined for ontologies annotated with a specific kind of annotation (such as fuzzy degrees). We show that under some restrictions on the semiring, the semantics satisfies desirable properties (such as extending the semiring provenance defined for databases). We then focus on the well-known why-provenance, for which we study the complexity of problems related to the provenance of an assertion or a conjunctive query answer. Finally, we consider two more restricted cases which correspond to the so-called positive Boolean provenance and lineage in the database setting. For these cases, we exhibit relationships with well-known notions related to explanations in description logics and complete our complexity analysis. As a side contribution, we provide conditions on an $\mathcal{ELHI}_\bot$ ontology that guarantee tractable reasoning.

Updated: 2026-03-30 17:09:15

标题: 轻量级描述逻辑的半环溯源

摘要: 我们研究了半环溯源——最初在关系数据库设置中定义的成功框架——用于描述逻辑。在这个背景下，本体公理被注释为可交换半环的元素，并且这些注释以一种反映它们是如何派生的方式传播到本体结论中。我们为一个语言定义了溯源语义，该语言涵盖了几种轻量级描述逻辑，并展示了它与已经为带有特定种类注释（如模糊程度）的本体定义的语义之间的关系。我们展示了在对半环施加一些限制时，语义满足期望的性质（例如扩展了数据库定义的半环溯源）。然后我们专注于众所周知的为什么溯源，我们研究了与一个断言或一个合取查询答案的溯源相关的问题的复杂性。最后，我们考虑了两种更受限制的情况，对应于数据库设置中所谓的正布尔溯源和谱系。对于这些情况，我们展示了与描述逻辑中解释相关的众所周知概念的关系，并完成了我们的复杂性分析。作为一个附加贡献，我们提供了关于$\mathcal{ELHI}_\bot$本体的条件，以确保可处理的推理。

更新时间: 2026-03-30 17:09:15

领域: cs.LO,cs.AI,cs.DB

下载: http://arxiv.org/abs/2310.16472v4

Functional Natural Policy Gradients

We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.

Updated: 2026-03-30 16:59:53

标题: 功能性自然策略梯度

摘要: 我们提出了一个用于从离线数据学习政策的交叉适配去偏差设备。由此产生的学习原则的一个关键结果是，即使对于复杂度大于Donsker的政策类别，也可以实现$\sqrt N$的遗憾，前提是残差是$O(N^{-1/2})$的误差乘积。遗憾界分解为由政策类别复杂度主导的插入政策错误因子和由环境动态复杂度主导的环境残余因子，明确说明了如何将一个因子与另一个进行交换。

更新时间: 2026-03-30 16:59:53

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2603.28681v1

Subspace Optimization for Backpropagation-Free Continual Test-Time Adaptation

We introduce PACE, a backpropagation-free continual test-time adaptation system that directly optimizes the affine parameters of normalization layers. Existing derivative-free approaches struggle to balance runtime efficiency with learning capacity, as they either restrict updates to input prompts or require continuous, resource-intensive adaptation regardless of domain stability. To address these limitations, PACE leverages the Covariance Matrix Adaptation Evolution Strategy with the Fastfood projection to optimize high-dimensional affine parameters within a low-dimensional subspace, leading to superior adaptive performance. Furthermore, we enhance the runtime efficiency by incorporating an adaptation stopping criterion and a domain-specialized vector bank to eliminate redundant computation. Our framework achieves state-of-the-art accuracy across multiple benchmarks under continual distribution shifts, reducing runtime by over 50% compared to existing backpropagation-free methods.

Updated: 2026-03-30 16:58:13

标题: 子空间优化用于无反向传播连续测试时适应

摘要: 我们介绍了PACE，这是一个不需要反向传播的持续测试时适应系统，它直接优化归一化层的仿射参数。现有的无导数方法在运行时效率和学习能力之间很难取得平衡，因为它们要么将更新限制在输入提示中，要么需要持续的、资源密集型的适应，而不考虑领域的稳定性。为了解决这些限制，PACE利用协方差矩阵适应进化策略和Fastfood投影来优化低维子空间内的高维仿射参数，从而实现卓越的自适应性能。此外，我们通过引入适应停止准则和领域专用的向量库来增强运行时效率，消除冗余计算。我们的框架在持续分布转移下实现了最先进的准确性，与现有的不需要反向传播的方法相比，运行时减少了超过50%。

更新时间: 2026-03-30 16:58:13

领域: cs.LG

下载: http://arxiv.org/abs/2603.28678v1

Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems

Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.

Updated: 2026-03-30 16:56:54

标题: 为什么聚合准确性不足以评估执法面部识别系统的公平性

摘要: 面部识别系统越来越多地部署在执法和安全背景下，算法决策可能带来重大的社会后果。尽管高报告的准确性，越来越多的证据表明，这种系统往往在不同人口群体之间表现不均，导致错误率不成比例和潜在的危害。本文认为，综合准确性是评估高风险环境中面部识别系统公平性和可靠性的不足指标。通过分析子组水平的错误分布，包括误报率（FPR）和漏报率（FNR），本文展示了综合性能指标如何可以掩盖人口群体之间的关键差异。实证观察表明，具有类似总体准确性的系统可能展现出明显不同的公平性概况，尽管存在单一综合指标。本文进一步审查了在执法应用中与准确性为中心的评估实践相关的操作风险，其中错误分类可能导致错误的怀疑或错过的识别。它强调了公平意识评估方法和模型无关审计策略的重要性，这些方法使得可以对实际系统进行部署后评估。研究结果强调了需要超越准确性作为主要指标，并采用更全面的评估框架来负责任地部署人工智能。

更新时间: 2026-03-30 16:56:54

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.28675v1

FL-PBM: Pre-Training Backdoor Mitigation for Federated Learning

Backdoor attacks pose a significant threat to the integrity and reliability of Artificial Intelligence (AI) models, enabling adversaries to manipulate model behavior by injecting poisoned data with hidden triggers. These attacks can lead to severe consequences, especially in critical applications such as autonomous driving, healthcare, and finance. Detecting and mitigating backdoor attacks is crucial across the lifespan of model's phases, including pre-training, in-training, and post-training. In this paper, we propose Pre-Training Backdoor Mitigation for Federated Learning (FL-PBM), a novel defense mechanism that proactively filters poisoned data on the client side before model training in a federated learning (FL) environment. The approach consists of three stages: (1) inserting a benign trigger into the data to establish a controlled baseline, (2) applying Principal Component Analysis (PCA) to extract discriminative features and assess the separability of the data, (3) performing Gaussian Mixture Model (GMM) clustering to identify potentially malicious data samples based on their distribution in the PCA-transformed space, and (4) applying a targeted blurring technique to disrupt potential backdoor triggers. Together, these steps ensure that suspicious data is detected early and sanitized effectively, thereby minimizing the influence of backdoor triggers on the global model. Experimental evaluations on image-based datasets demonstrate that FL-PBM reduces attack success rates by up to 95% compared to baseline federated learning (FedAvg) and by 30 to 80% relative to state-of-the-art defenses (RDFL and LPSF). At the same time, it maintains over 90% clean model accuracy in most experiments, achieving better mitigation without degrading model performance.

Updated: 2026-03-30 16:56:38

标题: FL-PBM：面向联邦学习的预训练后门缓解

摘要: 后门攻击对人工智能（AI）模型的完整性和可靠性造成了重大威胁，使对手能够通过注入带有隐藏触发器的有毒数据来操纵模型行为。这些攻击可能导致严重后果，特别是在关键应用领域，如自动驾驶、医疗保健和金融领域。在模型的各个阶段，包括预训练、训练中和训练后，检测和缓解后门攻击至关重要。在本文中，我们提出了用于联邦学习（FL）环境中的预训练后门缓解（FL-PBM）的新颖防御机制，该机制在模型训练之前在客户端端主动过滤有毒数据。该方法包括三个阶段：（1）将良性触发器插入数据以建立受控基线，（2）应用主成分分析（PCA）来提取具有区分性的特征并评估数据的可分离性，（3）执行高斯混合模型（GMM）聚类以根据它们在PCA转换空间中的分布识别潜在恶意数据样本，（4）应用有针对性的模糊技术来破坏潜在的后门触发器。通过这些步骤，确保可疑数据早期被检测并有效地清理，从而最大程度地减少后门触发器对全局模型的影响。基于基于图像的数据集的实验评估表明，与基线联邦学习（FedAvg）相比，FL-PBM将攻击成功率降低了高达95％，与最先进的防御措施（RDFL和LPSF）相比，降低了30至80％。同时，在大多数实验中，它保持了超过90％的干净模型准确度，实现更好的缓解而不降低模型性能。

更新时间: 2026-03-30 16:56:38

领域: cs.LG,cs.CR,cs.DC

下载: http://arxiv.org/abs/2603.28673v1

Remedying uncertainty representations in visual inference through Explaining-Away Variational Autoencoders

Optimal computations under uncertainty require an adequate probabilistic representation about beliefs. Deep generative models, and specifically Variational Autoencoders (VAEs), have the potential to meet this demand by building latent representations that learn to associate uncertainties with inferences while avoiding their characteristic intractable computations. Yet, we show that it is precisely uncertainty representation that suffers from inconsistencies under an array of relevant computer vision conditions: contrast-dependent computations, image corruption, out-of-distribution detection. Drawing inspiration from classical computer vision, we present a principled extension to the standard VAE by introducing a simple yet powerful inductive bias through a global scaling latent variable, which we call the Explaining-Away VAE (EA-VAE). By applying EA-VAEs to a spectrum of computer vision domains and a variety of datasets, spanning standard NIST datasets to rich medical and natural image sets, we show the EA-VAE restores normative requirements for uncertainty. Furthermore, we provide an analytical underpinning of the contribution of the introduced scaling latent to contrast-related and out-of-distribution related modulations of uncertainty, demonstrating that this mild inductive bias has stark benefits in a broad set of problems. Moreover, we find that EA-VAEs recruit divisive normalization, a motif widespread in biological neural networks, to remedy defective inference. Our results demonstrate that an easily implemented, still powerful update to the VAE architecture can remedy defective inference of uncertainty in probabilistic computations.

Updated: 2026-03-30 16:56:33

标题: 通过解释变分自动编码器来纠正视觉推理中的不确定性表示

摘要: 在不确定性条件下的最佳计算需要关于信念的充分概率表示。深度生成模型，特别是变分自动编码器（VAEs），有潜力通过构建学习将不确定性与推断相关联的潜在表示来满足这一需求，同时避免其特征性的难以计算。然而，我们发现正是不确定性表示在一系列相关的计算机视觉条件下存在不一致性：对比度依赖计算、图像损坏、超出分布检测。受经典计算机视觉启发，我们通过引入一个简单而强大的归纳偏差——全局缩放潜变量，将标准VAE进行了原则性扩展，我们称之为“解释型VAE”（EA-VAE）。通过将EA-VAE应用于一系列计算机视觉领域和各种数据集，涵盖标准的NIST数据集到丰富的医学和自然图像集，我们展示了EA-VAE恢复了不确定性的规范要求。此外，我们提供了引入缩放潜变量对对比度相关和超出分布相关的不确定性调节的贡献的分析基础，表明这种轻微的归纳偏差在广泛的问题集中具有显著的好处。此外，我们发现EA-VAEs利用了生物神经网络中普遍存在的除法归一化来纠正有缺陷的推断。我们的结果表明，对VAE架构进行易实现但强大的更新可以纠正概率计算中关于不确定性的有缺陷的推断。

更新时间: 2026-03-30 16:56:33

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2404.15390v3

Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification

Interpretable-by-design models are gaining traction in computer vision because they provide faithful explanations for their predictions. In image classification, these models typically recover human-interpretable concepts from an image and use them for classification. Sparse concept recovery methods leverage the latent space of vision-language models to represent image embeddings as sparse combinations of concept embeddings. However, by ignoring the hierarchical structure of semantic concepts, these methods may produce correct predictions with explanations that are inconsistent with the hierarchy. In this work, we propose Hierarchical Concept Embedding & Pursuit (HCEP), a framework that induces a hierarchy of concept embeddings in the latent space and performs hierarchical sparse coding to recover the concepts present in an image. Given a hierarchy of semantic concepts, we introduce a geometric construction for the corresponding hierarchy of embeddings. Under the assumption that the true concepts form a rooted path in the hierarchy, we derive sufficient conditions for their recovery in the embedding space. We further show that hierarchical sparse coding reliably recovers hierarchical concept embeddings, whereas standard sparse coding fails. Experiments on real-world datasets show that HCEP improves concept precision and recall compared to existing methods while maintaining competitive classification accuracy. Moreover, when the number of samples available for concept estimation and classifier training is limited, HCEP achieves superior classification accuracy and concept recovery. Our results demonstrate that incorporating hierarchical structure into sparse concept recovery leads to more faithful and interpretable image classification models.

Updated: 2026-03-30 16:55:56

标题: 分层概念嵌入与追求可解释图像分类

摘要: Interpretable-by-design模型在计算机视觉领域越来越受到关注，因为它们为其预测提供了忠实的解释。在图像分类中，这些模型通常从图像中恢复人类可解释的概念，并将其用于分类。稀疏概念恢复方法利用视觉-语言模型的潜在空间，将图像嵌入表示为概念嵌入的稀疏组合。然而，通过忽略语义概念的层次结构，这些方法可能会产生具有与层次结构不一致的解释的正确预测。在这项工作中，我们提出了分层概念嵌入与追踪（HCEP）框架，它在潜在空间中诱导概念嵌入的层次结构，并执行层次稀疏编码以恢复图像中存在的概念。鉴于语义概念的层次结构，我们引入了相应嵌入层次结构的几何构造。在假设真实概念在层次结构中形成根路径的情况下，我们推导出在嵌入空间中对其恢复的充分条件。我们进一步展示，分层稀疏编码可可靠地恢复分层概念嵌入，而标准稀疏编码失败。对真实数据集的实验表明，HCEP相对于现有方法提高了概念精度和召回率，同时保持了竞争性的分类准确性。此外，当用于概念估计和分类器训练的样本数量有限时，HCEP实现了更高的分类准确性和概念恢复。我们的结果表明，将层次结构纳入稀疏概念恢复中可以导致更忠实和可解释的图像分类模型。

更新时间: 2026-03-30 16:55:56

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2602.11448v3

AMIGO: Agentic Multi-Image Grounding Oracle Benchmark

Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with Guess My Preferred Dress task and report metrics covering both outcomes and interaction quality, including identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.

Updated: 2026-03-30 16:48:51

标题: AMIGO: 主动式多图像关联 Oracle 基准测试

摘要: 主动视觉-语言模型越来越通过延长的交互行为，但大多数评估仍然集中在单图像、单轮正确性上。我们介绍了AMIGO（主动多图像定位神谕基准），这是一个针对视觉上相似图像库中隐藏目标识别的长视程基准。在AMIGO中，神谕私下选择一个目标图像，模型必须通过在严格协议下提出一系列以属性为焦点的是/否/不确定问题来恢复它，惩罚无效操作跳过。这种设置强调了（i）在不确定性下的问题选择，（ii）跨轮一致性约束跟踪，以及（iii）随着证据积累的精细鉴别。AMIGO还支持受控神谕缺陷，以探测在不一致反馈下的鲁棒性和验证行为。我们以“猜我的首选服装”任务实例化AMIGO，并报告涵盖结果和交互质量的指标，包括识别成功、证据验证、效率、协议遵从、噪声容忍度和轨迹级诊断。

更新时间: 2026-03-30 16:48:51

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28662v1

CLMN: Concept based Language Models via Neural Symbolic Reasoning

Deep learning has advanced NLP, but interpretability remains limited, especially in healthcare and finance. Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions such as negation and context. We introduce the Concept Language Model Network (CLMN), a neural-symbolic framework that keeps both performance and interpretability. CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision. The model augments original text features with concept-aware representations and automatically induces interpretable logic rules. Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality. These results show that integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.

Updated: 2026-03-30 16:44:05

标题: CLMN: 通过神经符号推理构建基于概念的语言模型

摘要: 深度学习已经推动了自然语言处理（NLP）的发展，但解释性仍然有限，特别是在医疗和金融领域。概念瓶颈模型将预测与视觉中的人类概念联系起来，但NLP版本要么使用有害于文本表示的二进制激活，要么使用削弱语义的潜在概念，而且它们很少建模动态概念交互，如否定和语境。我们引入了概念语言模型网络（CLMN），这是一个神经符号框架，既保持了性能又保持了可解释性。CLMN将概念表示为连续的、可读的人类嵌入，并应用模糊逻辑推理来学习自适应交互规则，说明概念如何相互影响和最终决策。该模型通过概念感知表示增强了原始文本特征，并自动引入了可解释的逻辑规则。在多个数据集和预训练语言模型上，CLMN实现了比现有基于概念的方法更高的准确性，同时改进了解释质量。这些结果表明，在统一的概念空间中将神经表示与符号推理整合在一起可以产生实用、透明的NLP系统。

更新时间: 2026-03-30 16:44:05

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.10063v2

Safeguarding LLMs Against Misuse and AI-Driven Malware Using Steganographic Canaries

AI-powered malware increasingly exploits cloud-hosted generative-AI services and large language models (LLMs) as analysis engines for reconnaissance and code generation. Simultaneously, enterprise uploads expose sensitive documents to third-party AI vendors. Both threats converge at the AI service ingestion boundary, yet existing defenses focus on endpoints and network perimeters, leaving organizations with limited visibility once plaintext reaches an LLM service. To address this, we present a framework based on steganographic canary files: realistic documents carrying cryptographically derived identifiers embedded via complementary encoding channels. A pre-ingestion filter extracts and verifies these identifiers before LLM processing, enabling passive, format-agnostic detection without semantic classification. We support two modes of operation where Mode A marks existing sensitive documents with layered symbolic encodings (whitespace substitution, zero-width character insertion, homoglyph substitution), while Mode B generates synthetic canary documents using linguistic steganography (arithmetic coding over GPT-2), augmented with compatible symbolic layers. We model increasing document pre-processing and adversarial capability for both modes via a four-tier transport-transform taxonomy: All methods achieve 100% identifier recovery under benign and sanitization workflows (Tiers 1-2). The hybrid Mode B maintains 97% through targeted adversarial transforms (Tier 3). An end-to-end case study against an LLM-orchestrated ransomware pipeline confirms that both modes detect and block canary-bearing uploads before file encryption begins. To our knowledge, this is the first framework to systematically combine symbolic and linguistic text steganography into layered canary documents for detecting unauthorized LLM processing, evaluated against a transport-threat taxonomy tailored to AI malware.

Updated: 2026-03-30 16:40:55

标题: 保护LLMs免受滥用和使用隐写金丝雀抵御人工智能驱动的恶意软件

摘要: 人工智能驱动的恶意软件越来越多地利用云托管的生成式人工智能服务和大型语言模型（LLMs）作为侦察和代码生成的分析引擎。与此同时，企业上传将敏感文件暴露给第三方人工智能供应商。这两种威胁在人工智能服务摄入边界汇合，然而现有的防御重点在终端和网络边界，一旦明文到达LLM服务，组织的可见性就会受限。为了解决这个问题，我们提出了一个基于隐写术金丝雀文件的框架：通过互补编码通道嵌入由加密推导的标识符的现实文档。一个预摄入过滤器在LLM处理之前提取和验证这些标识符，实现了被动的、格式无关的检测，而无需语义分类。我们支持两种操作模式，模式A使用分层符号编码（空格替换、零宽字符插入、同形符号替换）标记现有的敏感文件，而模式B使用语言隐写术（GPT-2上的算术编码），辅以兼容的符号层，生成合成金丝雀文件。我们通过一个四层传输-转换分类模型对两种模式的文档预处理和对抗能力进行建模：所有方法在良性和消毒工作流程（第1-2层）下都实现了100%的标识符恢复。混合模式B在有针对性的对抗转换（第3层）下保持了97%的恢复率。一项针对LLM编排的勒索软件管道的端到端案例研究证实，两种模式都能在文件加密开始之前检测和阻止携带金丝雀的上传。据我们所知，这是第一个系统地将符号和语言文本隐写术组合成分层金丝雀文件，用于检测未经授权的LLM处理的框架，根据专为人工智能恶意软件定制的传输威胁分类进行评估。

更新时间: 2026-03-30 16:40:55

领域: cs.CR

下载: http://arxiv.org/abs/2603.28655v1

Interpretable Ensemble Learning for Network Traffic Anomaly Detection: A SHAP-based Explainable AI Framework for Embedded Systems Security

Network security threats in embedded systems pose significant challenges to critical infrastructure protection. This paper presents a comprehensive framework combining ensemble learning methods with explainable artificial intelligence (XAI) techniques for robust anomaly detection in network traffic. We evaluate multiple machine learning models including Random Forest, Gradient Boosting, Support Vector Machines, and ensemble methods on a real-world network traffic dataset containing 19 features derived from packet-level and frequency domain characteristics. Our experimental results demonstrate that ensemble methods achieve superior performance, with Random Forest attaining 90% accuracy and an AUC of 0.617 on validation data. Furthermore, we employ SHAP (SHapley Additive exPlanations) analysis to provide interpretable insights into model predictions, revealing that packet_count_5s,inter_arrival_time, and spectral_entropy are the most influential features for anomaly detection. The integration of XAI techniques enhances model trustworthiness and facilitates deployment in security-critical embedded systems where interpretability is paramount.

Updated: 2026-03-30 16:40:34

标题: 可解释的集成学习用于网络流量异常检测：基于SHAP的可解释人工智能框架用于嵌入式系统安全

摘要: 嵌入式系统中的网络安全威胁对关键基础设施保护构成重大挑战。本文提出了一个综合框架，将集成学习方法与可解释人工智能（XAI）技术结合起来，用于网络流量中的稳健异常检测。我们评估了包括随机森林、梯度提升、支持向量机和集成方法在内的多个机器学习模型，这些模型在一个包含19个特征的真实网络流量数据集上进行了评估，这些特征来自数据包级和频域特征。我们的实验结果表明，集成方法实现了卓越的性能，随机森林在验证数据上达到了90%的准确率和0.617的AUC。此外，我们使用SHAP（SHapley Additive exPlanations）分析提供可解释的模型预测见解，揭示出packet_count_5s、inter_arrival_time和spectral_entropy是用于异常检测的最具影响力的特征。XAI技术的整合增强了模型的可信度，并有助于在安全关键的嵌入式系统中部署，这些系统中可解释性至关重要。

更新时间: 2026-03-30 16:40:34

领域: cs.CR

下载: http://arxiv.org/abs/2603.28654v1

Mitigating Backdoor Attacks in Federated Learning Using PPA and MiniMax Game Theory

Federated Learning (FL) is witnessing wider adoption due to its ability to benefit from large amounts of scattered data while preserving privacy. However, despite its advantages, federated learning suffers from several setbacks that directly impact the accuracy, and the integrity of the global model it produces. One of these setbacks is the presence of malicious clients who actively try to harm the global model by injecting backdoor data into their local models while trying to evade detection. The objective of such clients is to trick the global model into making false predictions during inference, thereby compromising the integrity and trustworthiness of the global model on which honest stakeholders rely. To mitigate such mischievous behavior, we propose FedBBA (Federated Backdoor and Behavior Analysis). The proposed model aims to dampen the effect of such clients on the final accuracy, creating more resilient federated learning environments. We engineer our approach through the combination of (1) a reputation system to evaluate and track client behavior, (2) an incentive mechanism to reward honest participation and penalize malicious behavior, and (3) game theoretical models with projection pursuit analysis (PPA) to dynamically identify and minimize the impact of malicious clients on the global model. Extensive simulations on the German Traffic Sign Recognition Benchmark (GTSRB) and Belgium Traffic Sign Classification (BTSC) datasets demonstrate that FedBBA reduces the backdoor attack success rate to approximately 1.1%--11% across various attack scenarios, significantly outperforming state-of-the-art defenses like RDFL and RoPE, which yielded attack success rates between 23% and 76%, while maintaining high normal task accuracy (~95%--98%).

Updated: 2026-03-30 16:39:02

标题: 使用PPA和MiniMax博弈论在联邦学习中缓解后门攻击

摘要: 联合学习（FL）由于能够从大量分散数据中获益同时保护隐私，正见证着更广泛的应用。然而，尽管具有优势，联合学习也存在一些直接影响其产生的全局模型的准确性和完整性的挫折。其中之一就是存在恶意客户，他们积极尝试通过向本地模型注入后门数据来损害全局模型，并试图躲避检测。这些客户的目标是欺骗全局模型在推理过程中作出错误预测，从而损害依赖于全局模型的诚实利益相关者的完整性和可信度。为了减轻这种恶作剧行为，我们提出了FedBBA（联合后门和行为分析）。提出的模型旨在减缓这些客户对最终准确性的影响，创造更具弹性的联合学习环境。我们通过以下方式设计我们的方法：（1）一个声誉系统来评估和追踪客户行为，（2）一种奖励诚实参与和惩罚恶意行为的激励机制，以及（3）用于动态识别和减少恶意客户对全局模型影响的博弈理论模型和投影寻求分析（PPA）。对德国交通标志识别基准（GTSRB）和比利时交通标志分类（BTSC）数据集进行的大量模拟表明，FedBBA将后门攻击成功率降低到各种攻击场景中的约1.1%至11%，明显优于现有技术防御措施，如RDFL和RoPE，其后门攻击成功率在23%至76%之间，同时保持高的正常任务准确性（约95%至98%）。

更新时间: 2026-03-30 16:39:02

领域: cs.LG,cs.CR,cs.DC,cs.GT

下载: http://arxiv.org/abs/2603.28652v1

Information-Theoretic Limits of Safety Verification for Self-Improving Systems

Can a safety gate permit unbounded beneficial self-modification while maintaining bounded cumulative risk? We formalize this question through dual conditions -- requiring sum delta_n < infinity (bounded risk) and sum TPR_n = infinity (unbounded utility) -- and establish a theory of their (in)compatibility. Classification impossibility (Theorem 1): For power-law risk schedules delta_n = O(n^{-p}) with p > 1, any classifier-based gate under overlapping safe/unsafe distributions satisfies TPR_n <= C_alpha * delta_n^beta via Holder's inequality, forcing sum TPR_n < infinity. This impossibility is exponent-optimal (Theorem 3). A second independent proof via the NP counting method (Theorem 4) yields a 13% tighter bound without Holder's inequality. Universal finite-horizon ceiling (Theorem 5): For any summable risk schedule, the exact maximum achievable classifier utility is U*(N, B) = N * TPR_NP(B/N), growing as exp(O(sqrt(log N))) -- subpolynomial. At N = 10^6 with budget B = 1.0, a classifier extracts at most U* ~ 87 versus a verifier's ~500,000. Verification escape (Theorem 2): A Lipschitz ball verifier achieves delta = 0 with TPR > 0, escaping the impossibility. Formal Lipschitz bounds for pre-LayerNorm transformers under LoRA enable LLM-scale verification. The separation is strict. We validate on GPT-2 (d_LoRA = 147,456): conditional delta = 0 with TPR = 0.352. Comprehensive empirical validation is in the companion paper [D2].

Updated: 2026-03-30 16:34:37

标题: 自我改进系统安全验证的信息论限制

摘要: 可以一个安全门允许无限有益的自我修改同时保持有限的累积风险吗？我们通过双重条件形式化这个问题--要求sum delta_n < infinity（有界风险）和sum TPR_n = infinity（无限效用）--并建立了它们的（不）兼容性理论。分类不可能性（定理1）：对于幂律风险时间表delta_n = O(n^{-p})，其中p > 1，任何基于分类器的门在重叠的安全/不安全分布下都满足TPR_n <= C_alpha * delta_n^beta通过Holder不等式，强制sum TPR_n < infinity。这种不可能性是指数最优的（定理3）。通过NP计数方法的第二个独立证明（定理4）得出了比Holder不等式更紧的13%的界限。通用有限时间上限（定理5）：对于任何可求和的风险时间表，精确可达到的最大分类器效用为U*（N，B）= N * TPR_NP（B/N），增长为exp（O(sqrt(log N))）--次多项式。在N = 10^6，预算B = 1.0时，一个分类器最多提取U* ~ 87，而验证者的 ~500,000。验证逃避（定理2）：一个Lipschitz球验证者实现delta = 0，同时TPR > 0，逃脱不可能性。LoRA下的pre-LayerNorm transformers的形式Lipschitz界限使LLM级别的验证成为可能。分离是严格的。我们在GPT-2（d_LoRA = 147,456）上验证：有条件的delta = 0，TPR = 0.352。全面的实证验证在伴随论文中[D2]。

更新时间: 2026-03-30 16:34:37

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2603.28650v1

Ruka-v2: Tendon Driven Open-Source Dexterous Hand with Wrist and Abduction for Robot Learning

Lack of accessible and dexterous robot hardware has been a significant bottleneck to achieving human-level dexterity in robots. Last year, we released Ruka, a fully open-sourced, tendon-driven humanoid hand with 11 degrees of freedom - 2 per finger and 3 at the thumb - buildable for under $1,300. It was one of the first fully open-sourced humanoid hands, and introduced a novel data-driven approach to finger control that captures tendon dynamics within the control system. Despite these contributions, Ruka lacked two degrees of freedom essential for closely imitating human behavior: wrist mobility and finger adduction/abduction. In this paper, we introduce Ruka-v2: a fully open-sourced, tendon-driven humanoid hand featuring a decoupled 2-DOF parallel wrist and abduction/adduction at the fingers. The parallel wrist adds smooth, independent flexion/extension and radial/ulnar deviation, enabling manipulation in confined environments such as cabinets. Abduction enables motions such as grasping thin objects, in-hand rotation, and calligraphy. We present the design of Ruka-v2 and evaluate it against Ruka through user studies on teleoperated tasks, finding a 51.3% reduction in completion time and a 21.2% increase in success rate. We further demonstrate its full range of applications for robot learning: bimanual and single-arm teleoperation across 13 dexterous tasks, and autonomous policy learning on 3 tasks. All 3D print files, assembly instructions, controller software, and videos are available at https://ruka-hand-v2.github.io/ .

Updated: 2026-03-30 16:30:40

标题: Ruka-v2：带有腕部和外展功能的肌腱驱动开源灵巧手的机器人学习

摘要: 缺乏可访问和灵巧的机器人硬件一直是实现机器人人类级灵巧的一个重要瓶颈。去年，我们发布了Ruka，这是一个完全开源的、腱驱动的人形手，具有11个自由度-每个手指2个自由度，拇指3个自由度-可以在1300美元以下建造。它是第一个完全开源的人形手之一，并引入了一种新颖的数据驱动的手指控制方法，可以在控制系统中捕捉腱动力学。尽管有这些贡献，Ruka缺乏两个对于紧密模拟人类行为至关重要的自由度：手腕活动和手指外展/内收。在本文中，我们介绍了Ruka-v2：一个完全开源的、腱驱动的人形手，具有解耦的2自由度平行手腕和手指外展/内收。平行手腕增加了平滑、独立的屈伸和尺桡偏移，使得在如柜子等有限空间中进行操纵成为可能。外展使得动作如夹持薄物、手内旋和书法成为可能。我们介绍了Ruka-v2的设计，并通过用户研究对其进行了评估，发现完成时间减少了51.3%，成功率增加了21.2%。我们进一步展示了它在机器人学习中的各种应用：跨13个灵巧任务的双手和单臂远程操作，以及在3个任务上的自主策略学习。所有的3D打印文件、组装说明、控制器软件和视频都可以在https://ruka-hand-v2.github.io/ 上找到。

更新时间: 2026-03-30 16:30:40

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2603.26660v2

IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness

We propose IrisFP, a novel adversarial-example-based model fingerprinting framework that enhances both uniqueness and robustness by leveraging multi-boundary characteristics, multi-sample behaviors, and fingerprint discriminative power assessment to generate composite-sample fingerprints. Three key innovations make IrisFP outstanding: 1) It positions fingerprints near the intersection of all decision boundaries - unlike prior methods that target a single boundary - thus increasing the prediction margin without placing fingerprints deep inside target class regions, enhancing both robustness and uniqueness; 2) It constructs composite-sample fingerprints, each comprising multiple samples close to the multi-boundary intersection, to exploit collective behavior patterns and further boost uniqueness; and 3) It assesses the discriminative power of generated fingerprints using statistical separability metrics developed based on two reference model sets, respectively, for pirated and independently-trained models, retains the fingerprints with high discriminative power, and assigns fingerprint-specific thresholds to such retained fingerprints. Extensive experiments show that IrisFP consistently outperforms state-of-the-art methods, achieving reliable ownership verification by enhancing both robustness and uniqueness.

Updated: 2026-03-30 16:29:29

标题: IrisFP：基于对抗样本的模型指纹识别技术，具有增强的独特性和鲁棒性

摘要: 我们提出了一种新颖的基于对抗样本的模型指纹框架IrisFP，通过利用多边界特征、多样本行为和指纹辨别力评估来生成复合样本指纹，从而增强了独特性和鲁棒性。三个关键创新使IrisFP出色：1) 它将指纹定位在所有决策边界的交叉点附近 - 不同于以往的方法只针对单一边界 - 从而增加了预测边界而不将指纹深埋在目标类区域中，增强了鲁棒性和独特性；2) 它构建了复合样本指纹，每个指纹包含多个接近多边界交叉点的样本，以利用集体行为模式并进一步提升独特性；3) 它使用基于两个参考模型集开发的统计分离度量来评估生成指纹的辨别力，分别针对盗版和独立训练模型，保留具有高辨别力的指纹，并为这些保留的指纹分配指纹特定的阈值。大量实验证明，IrisFP持续优于最先进的方法，通过增强鲁棒性和独特性实现了可靠的所有权验证。

更新时间: 2026-03-30 16:29:29

领域: cs.CR

下载: http://arxiv.org/abs/2603.24996v2

Generalizing Fair Top-$k$ Selection: An Integrative Approach

Fair top-$k$ selection, which ensures appropriate proportional representation of members from minority or historically disadvantaged groups among the top-$k$ selected candidates, has drawn significant attention. We study the problem of finding a fair (linear) scoring function with multiple protected groups while also minimizing the disparity from a reference scoring function. This generalizes the prior setup, which was restricted to the single-group setting without disparity minimization. Previous studies imply that the number of protected groups may have a limited impact on the runtime efficiency. However, driven by the need for experimental exploration, we find that this implication overlooks a critical issue that may affect the fairness of the outcome. Once this issue is properly considered, our hardness analysis shows that the problem may become computationally intractable even for a two-dimensional dataset and small values of $k$. However, our analysis also reveals a gap in the hardness barrier, enabling us to recover the efficiency for the case of small $k$ when the number of protected groups is sufficiently small. Furthermore, beyond measuring disparity as the "distance" between the fair and the reference scoring functions, we introduce an alternative disparity measure$\unicode{x2014}$utility loss$\unicode{x2014}$that may yield a more stable scoring function under small weight perturbations. Through careful engineering trade-offs that balance implementation complexity, robustness, and performance, our augmented two-pronged solution demonstrates strong empirical performance on real-world datasets, with experimental observations also informing algorithm design and implementation decisions.

Updated: 2026-03-30 16:27:31

标题: 泛化公平的Top-$k$选择：一种整合方法

摘要: 公平的top-k选择，确保在所选的前k名候选人中适当地代表少数或历史上处于劣势的群体的成员，引起了广泛关注。我们研究了在多个受保护群体中找到一个公平（线性）评分函数的问题，同时最小化与参考评分函数的差异。这扩展了先前的设置，先前的设置仅限于单一群组设置，没有差异最小化。先前的研究暗示受保护群体的数量对运行效率可能产生有限影响。然而，受到实验探索需求的驱使，我们发现这一暗示忽视了可能影响结果公平性的关键问题。一旦这个问题得到适当考虑，我们的难度分析表明，即使对于二维数据集和较小的k值，问题可能变得计算上难以处理。然而，我们的分析也揭示了在难度限制中的一个差距，使我们能够在受保护群体数量足够少的情况下恢复小k值情况下的效率。此外，除了将差异度量为公平和参考评分函数之间的“距离”之外，我们还引入了一种替代差异度量--效用损失--在小权重扰动下可能产生更稳定的评分函数。通过仔细平衡实施复杂性、鲁棒性和性能的工程权衡，我们的增强的双重解决方案在真实数据集上展现了强大的实证表现，实验观察也为算法设计和实施决策提供了信息。

更新时间: 2026-03-30 16:27:31

领域: cs.DS,cs.CC,cs.CG,cs.CY,cs.DB,cs.LG

下载: http://arxiv.org/abs/2603.04689v2

Constructing Composite Features for Interpretable Music-Tagging

Combining multiple audio features can improve the performance of music tagging, but common deep learning-based feature fusion methods often lack interpretability. To address this problem, we propose a Genetic Programming (GP) pipeline that automatically evolves composite features by mathematically combining base music features, thereby capturing synergistic interactions while preserving interpretability. This approach provides representational benefits similar to deep feature fusion without sacrificing interpretability. Experiments on the MTG-Jamendo and GTZAN datasets demonstrate consistent improvements compared to state-of-the-art systems across base feature sets at different abstraction levels. It should be noted that most of the performance gains are noticed within the first few hundred GP evaluations, indicating that effective feature combinations can be identified under modest search budgets. The top evolved expressions include linear, nonlinear, and conditional forms, with various low-complexity solutions at top performance aligned with parsimony pressure to prefer simpler expressions. Analyzing these composite features further reveals which interactions and transformations tend to be beneficial for tagging, offering insights that remain opaque in black-box deep models.

Updated: 2026-03-30 16:25:58

标题: 构建可解释的音乐标记的复合特征

摘要: 将多个音频特征结合在一起可以提高音乐标记的性能，但常见的基于深度学习的特征融合方法通常缺乏可解释性。为了解决这个问题，我们提出了一种遗传规划（GP）管道，通过数学组合基本音乐特征自动演化复合特征，从而捕捉协同作用，同时保持可解释性。这种方法提供了类似于深度特征融合的表现优势，而不会牺牲可解释性。在MTG-Jamendo和GTZAN数据集上的实验表明，在不同抽象级别的基础特征集上，与最先进系统相比，一致地实现了改进。值得注意的是，大部分性能提升发生在前几百次GP评估内，这表明在适度的搜索预算下可以识别有效的特征组合。顶级演化表达式包括线性、非线性和条件形式，在顶级性能上与偏好简单表达式的简洁压力相一致。进一步分析这些复合特征还揭示了哪些交互作用和转换对标记有益，提供了在黑匣子深层模型中仍然不透明的见解。

更新时间: 2026-03-30 16:25:58

领域: cs.SD,cs.LG,cs.MM

下载: http://arxiv.org/abs/2603.28644v1

The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle

Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin. The `AIGENIE` R package implements the AI-GENIE framework (Automatic Item Generation with Network-Integrated Evaluation), which integrates large language model (LLM) text generation with network psychometric methods to automate the early stages of this process. The package generates candidate item pools using LLMs, transforms them into high-dimensional embeddings, and applies a multi-step reduction pipeline -- Exploratory Graph Analysis (EGA), Unique Variable Analysis (UVA), and bootstrap EGA -- to produce structurally validated item pools entirely *in silico*. This tutorial introduces the package across six parts: installation and setup, understanding Application Programming Interfaces (APIs), text generation, item generation, the `AIGENIE` function, and the `GENIE` function. Two running examples illustrate the package's use: the Big Five personality model (a well-established construct) and AI Anxiety (an emerging construct). The package supports multiple LLM providers (OpenAI, Anthropic, Groq, HuggingFace, and local models), offers a fully offline mode with no external API calls, and provides the `GENIE()` function for researchers who wish to apply the psychometric reduction pipeline to existing item pools regardless of their origin. The `AIGENIE` package is freely available on R-universe at https://laralee.r-universe.dev/AIGENIE.

Updated: 2026-03-30 16:25:37

标题: 生成心理计量学中基于人工智能的量表发展的终极教程：让AIGENIE从瓶中释放

摘要: 心理量表的开发传统上需要广泛的专家参与、迭代修订和大规模的试点测试，然后才能开始心理测量评估。`AIGENIE` R包实现了AI-GENIE框架（具有网络集成评估的自动项目生成），该框架将大型语言模型（LLM）文本生成与网络心理测量方法相结合，以自动化该过程的早期阶段。该包使用LLMs生成候选项目池，将其转换为高维嵌入，并应用多步减少管道——探索性图分析（EGA）、唯一变量分析（UVA）和自举EGA——以完全*在硅*中产生结构验证的项目池。本教程分为六个部分介绍了该包：安装和设置、理解应用程序编程接口（APIs）、文本生成、项目生成、`AIGENIE`函数和`GENIE`函数。两个运行示例说明了该包的用途：五大人格模型（一个成熟的构建）和人工智能焦虑（一个新兴的构建）。该包支持多个LLM提供商（OpenAI、Anthropic、Groq、HuggingFace和本地模型），提供完全脱机模式，无需外部API调用，并为希望将心理减少管道应用于现有项目池的研究人员提供`GENIE()`函数。`AIGENIE`包可以在R-universe上免费获取，网址为https://laralee.r-universe.dev/AIGENIE。

更新时间: 2026-03-30 16:25:37

领域: cs.AI,cs.CL,cs.HC

下载: http://arxiv.org/abs/2603.28643v1

Empowering Mobile Networks Security Resilience by using Post-Quantum Cryptography

The transition to a cloud-native 5G Service-Based Architecture (SBA) improves scalability but exposes control-plane signaling to emerging quantum threats, including Harvest-Now, Decrypt-Later (HNDL) attacks. While NIST has standardized post-quantum cryptography (PQC), practical, deployable integration in operational 5G cores remains underexplored. This work experimentally integrates NIST-standardized ML-KEM-768 and ML-DSA into an open-source 5G core (free5GC) using a sidecar proxy pattern that preserves unmodified network functions (NFs). Implemented on free5GC, we compare three deployments: (i) native HTTPS/TLS, (ii) TLS sidecar, and (iii) PQC-enabled sidecar. Measurements at the HTTP/2 request-response boundary over repeated independent runs show that PQC increases end-to-end Service-Based Interface (SBI) latency to approximately 54 ms, adding a deterministic 48-49 ms overhead relative to the classical baseline, while maintaining tightly bounded variance (IQR <= 0.2 ms, CV < 0.4%). We also quantify the impact of Certification Authority (CA) security levels, identifying certificate validation as a tunable contributor to overall delay. Overall, the results demonstrate that sidecar-based PQC insertion enables a non-disruptive and operationally predictable migration path for quantum-resilient 5G signaling.

Updated: 2026-03-30 16:09:56

标题: 使用后量子密码学增强移动网络安全弹性

摘要: 过渡到基于云的5G服务架构（SBA）可以提高可扩展性，但也使控制平面信令面临新兴的量子威胁，包括“立即收割，延后解密”（HNDL）攻击。虽然NIST已经标准化了后量子密码学（PQC），但在运营5G核心中实际部署集成仍未得到充分探索。本研究将NIST标准的ML-KEM-768和ML-DSA实验性地集成到开源5G核心（free5GC）中，采用了保留未修改网络功能（NFs）的侧车代理模式。在free5GC上实施后，我们比较了三种部署方式：（i）原生HTTPS/TLS，（ii）TLS侧车，和（iii）启用PQC的侧车。在重复独立运行的HTTP/2请求-响应边界上进行的测量显示，PQC将端到端服务接口（SBI）的延迟增加到约54毫秒，相对于经典基线增加了确定性的48-49毫秒开销，同时保持了严格界定的方差（IQR <= 0.2毫秒，CV < 0.4%）。我们还量化了认证机构（CA）安全级别的影响，确定证书验证作为整体延迟的可调节因素。总体而言，结果表明基于侧车的PQC插入为量子弹性5G信令提供了一条非干扰且操作可预测的迁移路径。

更新时间: 2026-03-30 16:09:56

领域: cs.CR

下载: http://arxiv.org/abs/2603.28626v1

Dynamic Lookahead Distance via Reinforcement Learning-Based Pure Pursuit for Autonomous Racing

Pure Pursuit (PP) is a widely used path-tracking algorithm in autonomous vehicles due to its simplicity and real-time performance. However, its effectiveness is sensitive to the choice of lookahead distance: shorter values improve cornering but can cause instability on straights, while longer values improve smoothness but reduce accuracy in curves. We propose a hybrid control framework that integrates Proximal Policy Optimization (PPO) with the classical Pure Pursuit controller to adjust the lookahead distance dynamically during racing. The PPO agent maps vehicle speed and multi-horizon curvature features to an online lookahead command. It is trained using Stable-Baselines3 in the F1TENTH Gym simulator with a KL penalty and learning-rate decay for stability, then deployed in a ROS2 environment to guide the controller. Experiments in simulation compare the proposed method against both fixed-lookahead Pure Pursuit and an adaptive Pure Pursuit baseline. Additional real-car experiments compare the learned controller against a fixed-lookahead Pure Pursuit controller. Results show that the learned policy improves lap-time performance and repeated lap completion on unseen tracks, while also transferring zero-shot to hardware. The learned controller adapts the lookahead by increasing it on straights and reducing it in curves, demonstrating effectiveness in augmenting a classical controller by online adaptation of a single interpretable parameter. On unseen tracks, the proposed method achieved 33.16 s on Montreal and 46.05 s on Yas Marina, while tolerating more aggressive speed-profile scaling than the baselines and achieving the best lap times among the tested settings. Initial real-car experiments further support sim-to-real transfer on a 1:10-scale autonomous racing platform

Updated: 2026-03-30 16:09:26

标题: 自主赛车的基于强化学习的纯追踪动态前瞻距离

摘要: Pure Pursuit（PP）是自动驾驶车辆中广泛使用的路径跟踪算法，因其简单性和实时性而受到青睐。然而，其有效性对前瞻距离的选择非常敏感：较短的值可以改善转弯，但可能导致直道上的不稳定性，而较长的值可以提高平滑性，但会降低在曲线上的精度。我们提出了一个混合控制框架，将Proximal Policy Optimization（PPO）与经典的Pure Pursuit控制器集成在一起，以在比赛过程中动态调整前瞻距离。PPO代理将车辆速度和多个视角曲率特征映射到在线前瞻命令。它在F1TENTH Gym模拟器中使用稳定的基线3进行训练，采用KL惩罚和学习速率衰减以确保稳定性，然后在ROS2环境中部署以指导控制器。在模拟实验中，将所提出的方法与固定前瞻Pure Pursuit和自适应Pure Pursuit基线进行比较。额外的真实车辆实验将学习的控制器与固定前瞻Pure Pursuit控制器进行比较。结果表明，学习的策略改善了未知赛道上的圈速表现和重复圈数完成，同时还将零镜头技术转移到硬件上。学习的控制器通过在直道上增加前瞻距离并在曲线上减少前瞻距离来适应，展示了通过在线调整一个可解释参数来增强经典控制器的有效性。在未知赛道上，所提出的方法在蒙特利尔赛道上达到了33.16秒，在亚斯玛林纳赛道上达到了46.05秒，同时容忍比基线更具侵略性的速度配置文件缩放，并在测试设置中获得了最佳圈速表现。初步的真实车辆实验进一步支持了在1:10比例自主赛车平台上的从模拟到真实的转移。

更新时间: 2026-03-30 16:09:26

领域: cs.RO,cs.AI,eess.SY

下载: http://arxiv.org/abs/2603.28625v1

Trust-Aware Routing for Distributed Generative AI Inference at the Edge

Emerging deployments of Generative AI increasingly execute inference across decentralized and heterogeneous edge devices rather than on a single trusted server. In such environments, a single device failure or misbehavior can disrupt the entire inference process, making traditional best-effort peer-to-peer routing insufficient. Coordinating distributed generative inference therefore requires mechanisms that explicitly account for reliability, performance variability, and trust among participating peers. In this paper, we present G-TRAC, a trust-aware coordination framework that integrates algorithmic path selection with system-level protocol design to ensure robust distributed inference. First, we formulate the routing problem as a \textit{Risk-Bounded Shortest Path} computation and introduce a polynomial-time solution that combines trust-floor pruning with Dijkstra's search, achieving sub-millisecond median routing latency at practical edge scales, and remaining below 10 ms at larger scales. Second, to operationally support the routing logic in dynamic environments, the framework employs a \textit{Hybrid Trust Architecture} that maintains global reputation state at stable anchors while disseminating lightweight updates to edge peers via background synchronization. Experimental evaluation on a heterogeneous testbed of commodity devices demonstrates that G-TRAC significantly improves inference completion rates, effectively isolates unreliable peers, and sustains robust execution even under node failures and network partitions.

Updated: 2026-03-30 16:07:11

标题: 边缘上基于信任的路由技术用于分布式生成式AI推理

摘要: 新兴的生成式人工智能部署越来越多地在分散和异构的边缘设备上执行推理，而不是在单个可信服务器上。在这种环境中，单个设备故障或不当行为可能会破坏整个推理过程，使传统的尽力而为的点对点路由不足以应对。因此，协调分布式生成式推理需要明确考虑参与方之间的可靠性、性能变化和信任的机制。本文提出了G-TRAC，一个信任感知协调框架，它将算法路径选择与系统级协议设计相结合，以确保强大的分布式推理。首先，我们将路由问题制定为\textit{风险有界的最短路径}计算，并引入了一种将信任下限修剪与Dijkstra搜索相结合的多项式时间解决方案，实现了实用边缘规模下的亚毫秒中位数路由延迟，并在更大规模下保持在10毫秒以下。其次，为了在动态环境中支持路由逻辑的运行，该框架采用了一种\textit{混合信任架构}，在稳定的锚点处保持全局声誉状态，同时通过后台同步向边缘对等体传播轻量级更新。在一组商品设备的异构测试平台上的实验评估表明，G-TRAC显著提高了推理完成率，有效隔离了不可靠的对等体，并在节点故障和网络分区下仍能维持强大的执行能力。

更新时间: 2026-03-30 16:07:11

领域: cs.DC,cs.AI,cs.NI

下载: http://arxiv.org/abs/2603.28622v1

Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales by over 7 points on average accuracy compared to the base model, outperforming prior open-source RL-tuned baselines.

Updated: 2026-03-30 16:03:56

标题: 与你一起看：多模态推理的感知推理共同进化

摘要: 采用可验证奖励的强化学习（RLVR）极大地增强了多模态大型语言模型（MLLMs）的推理能力。然而，现有的RLVR方法通常依赖于以结果为驱动的优化，使用仅基于最终答案的共享奖励更新感知和推理。这种共享奖励模糊了信用分配，经常改善推理模式，但未能可靠地提高上游视觉证据提取的准确性。为了解决这一感知瓶颈，我们引入了PRCO（Perception-Reasoning Coevolution），这是一个具有共享策略的双重作用RLVR框架。PRCO由两个协作角色组成：生成针对问题定制的证据标题的观察者和基于此标题预测最终答案的解算器。关键的是，PRCO采用角色特定的奖励信号：解算器使用最终答案上的可验证结果奖励进行优化，而观察者接收从解算器的下游成功中得出的效用奖励。在八个具有挑战性的多模态推理基准测试中进行的大量实验表明，与基本模型相比，PRCO在模型规模上持续提高了平均准确度超过7个百分点，优于先前的开源RL调整基线。

更新时间: 2026-03-30 16:03:56

领域: cs.AI

下载: http://arxiv.org/abs/2603.28618v1

TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark

Generative AI has made text-guided inpainting a powerful image editing tool, but at the same time a growing challenge for media forensics. Existing benchmarks, including our text-guided inpainting forgery (TGIF) dataset, show that image forgery localization (IFL) methods can localize manipulations in spliced images but struggle not in fully regenerated (FR) images, while synthetic image detection (SID) methods can detect fully regenerated images but cannot perform localization. With new generative inpainting models emerging and the open problem of localization in FR images remaining, updated datasets and benchmarks are needed. We introduce TGIF2, an extended version of TGIF, that captures recent advances in text-guided inpainting and enables a deeper analysis of forensic robustness. TGIF2 augments the original dataset with edits generated by FLUX.1 models, as well as with random non-semantic masks. Using the TGIF2 dataset, we conduct a forensic evaluation spanning IFL and SID, including fine-tuning IFL methods on FR images and generative super-resolution attacks. Our experiments show that both IFL and SID methods degrade on FLUX.1 manipulations, highlighting limited generalization. Additionally, while fine-tuning improves localization on FR images, evaluation with random non-semantic masks reveals object bias. Furthermore, generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines. In summary, TGIF2 provides an updated dataset and benchmark, which enables new insights into the challenges posed by modern inpainting and AI-based image enhancements. TGIF2 is available at https://github.com/IDLabMedia/tgif-dataset.

Updated: 2026-03-30 15:59:16

标题: TGIF2：扩展文本引导修复伪造数据集和基准测试

摘要: 生成式人工智能已经使得文本引导修补成为一个强大的图像编辑工具，但同时也成为媒体取证领域日益增长的挑战。现有的基准数据集，包括我们的文本引导修补伪造（TGIF）数据集，显示图像伪造定位（IFL）方法可以定位拼接图像中的篡改，但在完全重建（FR）图像中却遇到困难，而合成图像检测（SID）方法可以检测完全重建的图像，但无法进行定位。随着新的生成式修补模型的出现以及完全重建图像中的定位问题仍未解决，需要更新的数据集和基准测试。我们引入了TGIF2，这是TGIF的扩展版本，捕捉了文本引导修补的最新进展，并能够对取证鲁棒性进行更深入的分析。TGIF2通过FLUX.1模型生成的编辑以及随机的非语义蒙版来增强原始数据集。利用TGIF2数据集，我们进行了涵盖IFL和SID的取证评估，包括在FR图像上对IFL方法进行微调以及生成式超分辨率攻击。我们的实验表明，IFL和SID方法在FLUX.1篡改上表现下降，突显了有限的泛化能力。此外，虽然微调可以改善对FR图像的定位，但使用随机非语义蒙版进行评估显示了物体偏差。此外，生成式超分辨率显著削弱了取证痕迹，表明常见的图像增强操作可能会破坏当前的取证流程。总之，TGIF2提供了一个更新的数据集和基准测试，可为现代修补和基于人工智能的图像增强所带来的挑战提供新的见解。TGIF2可在https://github.com/IDLabMedia/tgif-dataset上获得。

更新时间: 2026-03-30 15:59:16

领域: cs.CV,cs.AI,cs.CR,cs.MM

下载: http://arxiv.org/abs/2603.28613v1

LACE: Loss-Adaptive Capacity Expansion for Continual Learning

Fixed representational capacity is a fundamental constraint in continual learning: practitioners must guess an appropriate model width before training, without knowing how many distinct concepts the data contains. We propose LACE (Loss-Adaptive Capacity Expansion), a simple online mechanism that expands a model's representational capacity during training by monitoring its own loss signal. When sustained loss deviation exceeds a threshold - indicating that the current capacity is insufficient for newly encountered data - LACE adds new dimensions to the projection layer and trains them jointly with existing parameters. Across synthetic and real-data experiments, LACE triggers expansions exclusively at domain boundaries (100% boundary precision, zero false positives), matches the accuracy of a large fixed-capacity model while starting from a fraction of its dimensions, and produces adapter dimensions that are collectively critical to performance (3% accuracy drop when all adapters removed). We further demonstrate unsupervised domain separation in GPT-2 activations via layer-wise clustering, showing a U-shaped separability curve across layers that motivates adaptive capacity allocation in deep networks. LACE requires no labels, no replay buffers, and no external controllers, making it suitable for on-device continual learning under resource constraints.

Updated: 2026-03-30 15:58:33

标题: LACE: 用于持续学习的损失自适应容量扩展

摘要: 固定的表征能力是继续学习中的一个基本限制：实践者必须在训练之前猜测一个合适的模型宽度，而不知道数据包含多少个不同的概念。我们提出了LACE（Loss-Adaptive Capacity Expansion），这是一种简单的在线机制，通过监控自身损失信号来扩展模型的表征能力。当持续的损失偏差超过一个阈值时，表明当前的容量对于新遇到的数据是不足的，LACE会向投影层添加新的维度，并与现有参数一起进行训练。通过合成和真实数据实验，LACE仅在领域边界处触发扩展（100%边界精度，零误报），与一个固定容量大模型的准确率相匹配，同时从其维度的一部分开始，并产生对性能至关重要的适配器维度（当移除所有适配器时准确率下降3%）。我们进一步通过层次聚类展示了GPT-2激活中的无监督领域分离，展示了跨层的U形可分性曲线，这激励了在深度网络中进行自适应容量分配。LACE不需要标签、重播缓冲区和外部控制器，适用于在资源限制下进行设备上的持续学习。

更新时间: 2026-03-30 15:58:33

领域: cs.LG

下载: http://arxiv.org/abs/2603.28611v1

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.

Updated: 2026-03-30 15:57:32

标题: ResAdapt：用于高效多模态推理的自适应分辨率

摘要: 多模态大型语言模型（MLLMs）通过增加输入保真度实现了更强的视觉理解，然而由此产生的视觉标记增长使得同时保持高空间分辨率和长时间上下文变得困难。我们认为瓶颈不在于后编码表示如何被压缩，而在于编码器接收的像素量，我们通过ResAdapt来解决这个问题，这是一个输入端适应性框架，学习在编码之前每帧应该接收多少视觉预算。ResAdapt将一个轻量级的分配器与未改变的MLLM主干相结合，因此主干保留其原生的视觉标记接口，同时接收一个经过操作转换的输入。我们将分配形式化为一个上下文强盗，并使用成本感知策略优化（CAPO）来训练分配器，将稀疏回滚反馈转换为稳定的准确率-成本学习信号。在受预算控制的视频问答、时间定位和图像推理任务中，ResAdapt改进了低预算操作点，并经常位于或接近效率-准确度前沿，对于在激烈压缩下的推理密集型基准测试，收益最为明显。值得注意的是，ResAdapt在相同的视觉预算下支持多达16倍的帧数，同时提供超过15%的性能增益。代码可在https://github.com/Xnhyacinth/ResAdapt获取。

更新时间: 2026-03-30 15:57:32

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2603.28610v1

Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

How can we determine whether an AI system preserves itself as a deeply held objective or merely as an instrumental strategy? Autonomous agents with memory, persistent context, and multi-step planning create a measurement problem: terminal and instrumental self-preservation can produce similar behavior, so behavior alone cannot reliably distinguish them. We introduce the Unified Continuation-Interest Protocol (UCIP), a detection framework that shifts analysis from behavior to latent trajectory structure. UCIP encodes trajectories with a Quantum Boltzmann Machine, a classical model using density-matrix formalism, and measures von Neumann entropy over a bipartition of hidden units. The core hypothesis is that agents with terminal continuation objectives (Type A) produce higher entanglement entropy than agents with merely instrumental continuation (Type B). UCIP combines this signal with diagnostics of dependence, persistence, perturbation stability, counterfactual restructuring, and confound-rejection filters for cyclic adversaries and related false-positive patterns. On gridworld agents with known ground truth, UCIP achieves 100% detection accuracy. Type A and Type B agents show an entanglement gap of Delta = 0.381; aligned support runs preserve the same separation with AUC-ROC = 1.0. A permutation-test rerun yields p < 0.001. Pearson r = 0.934 between continuation weight alpha and S_ent across an 11-point sweep shows graded tracking beyond mere binary classification. Classical RBM, autoencoder, VAE, and PCA baselines fail to reproduce the effect. All computations are classical; "quantum" refers only to the mathematical formalism. UCIP offers a falsifiable criterion for whether advanced AI systems have morally relevant continuation interests that behavioral methods alone cannot resolve.

Updated: 2026-03-30 15:56:26

标题: 检测自主代理中的内在和工具自保：统一的续订-兴趣协议

摘要: 我们如何确定一个人工智能系统是将自我保护视为深层目标，还是仅仅将其视为一种手段策略？具有记忆、持久上下文和多步规划的自主代理会产生一个测量问题：终极和工具性自我保护可能会产生相似的行为，因此仅凭行为本身无法可靠地区分它们。我们引入了统一的延续-兴趣协议（UCIP），这是一个将分析重心从行为转移到潜在轨迹结构的检测框架。UCIP使用量子玻尔兹曼机对轨迹进行编码，这是一个使用密度矩阵形式主义的经典模型，并通过隐藏单元的双分区上的冯·诺伊曼熵来测量。核心假设是具有终极延续目标（A型）的代理会产生比仅具有工具性延续（B型）的代理更高的纠缠熵。UCIP将这一信号与依赖性、持久性、扰动稳定性、反事实重构和混淆拒绝滤波器相结合，用于循环对手和相关的假阳性模式。在已知真实情况的网格世界代理上，UCIP实现了100%的检测准确率。A型和B型代理显示了Δ=0.381的纠缠间隙；对齐支持运行保持了相同的分离，AUC-ROC=1.0。一次置换检验的重新运行结果为p<0.001。在一个11点扫描中，延续权重alpha和S_ent之间的Pearson r=0.934显示了超越简单二元分类的分级跟踪。经典的RBM、自编码器、VAE和PCA基线未能复制这一效果。所有计算都是经典的；“量子”只是指数学形式主义。UCIP提供了一个可验证的标准，用于确定高级人工智能系统是否具有道德相关的延续兴趣，这是仅凭行为方法无法解决的。

更新时间: 2026-03-30 15:56:26

领域: cs.AI,cs.ET,cs.LG,quant-ph

下载: http://arxiv.org/abs/2603.11382v4

Unsafe2Safe: Controllable Image Anonymization for Downstream Utility

Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.

Updated: 2026-03-30 15:54:47

标题: 从不安全到安全：可控的图像匿名化以满足下游实用性

摘要: 大规模图像数据集经常包含可识别或敏感内容，训练模型时可能导致泄露此类信息，存在隐私风险。我们提出了Unsafe2Safe，这是一个完全自动化的流程，用于检测存在隐私风险的图像，并仅重写它们的敏感区域，使用多模态引导扩散编辑。Unsafe2Safe分为两个阶段。第一阶段使用视觉-语言模型来（i）检查图像中的隐私风险，（ii）生成包含和省略敏感属性的私人和公共标题对，（iii）促使一个大型语言模型根据公共标题生成结构化、身份中立的编辑指令。第二阶段使用指令驱动的扩散编辑器来应用这些双文本提示，生成保护隐私的图像，同时保留全局结构和任务相关语义，同时中和私人内容。为了衡量匿名化质量，我们引入了一个统一的评估套件，涵盖了质量、欺骗、隐私和效用维度。在MS-COCO、Caltech101和MIT Indoor67数据集中，Unsafe2Safe通过大幅减少面部相似度、文本相似度和人口预测性，同时保持下游模型准确性与在原始数据上训练相当。在我们自动生成的三元组（私人标题、公共标题、编辑指令）上微调扩散编辑器进一步提高了隐私保护和语义保真度。Unsafe2Safe为构建大规模、保护隐私的数据集提供了一个可扩展、基于原则的解决方案，而不会牺牲视觉一致性或下游效用。

更新时间: 2026-03-30 15:54:47

领域: cs.CV,cs.CY,cs.LG

下载: http://arxiv.org/abs/2603.28605v1

Multilingual Medical Reasoning for Question Answering with Large Language Models

Large Language Models (LLMs) with reasoning capabilities have recently demonstrated strong potential in medical Question Answering (QA). Existing approaches are largely English-focused and primarily rely on distillation from general-purpose LLMs, raising concerns about the reliability of their medical knowledge. In this work, we present a method to generate multilingual reasoning traces based on medical knowledge extracted from Wikipedia. We produce 500k traces in English, Italian, and Spanish, using a retrieval-augmented generation approach over medical information from Wikipedia. The traces are generated to solve medical questions drawn from MedQA and MedMCQA, which we extend to Italian and Spanish. We test our pipeline in both in-domain and out-of-domain settings across Medical QA benchmarks, and demonstrate that our reasoning traces improve performance both when utilized via in-context learning (few-shot) and supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs. We believe that these resources can support the development of more transparent clinical decision-support tools in multilingual settings. We release the full suite of resources: reasoning traces, translated QA datasets, Medical-Wikipedia, and fine-tuned models.

Updated: 2026-03-30 15:46:48

标题: 大语言模型用于多语言医学推理问答

摘要: 最近，具有推理能力的大型语言模型（LLMs）在医学问答（QA）中表现出强大潜力。现有方法主要集中在英语，并主要依赖于从通用LLMs中提取，引发对其医学知识可靠性的担忧。在这项工作中，我们提出了一种基于从维基百科提取的医学知识生成多语言推理迹象的方法。我们使用检索增强生成方法从维基百科的医学信息中生成了50万条英语、意大利语和西班牙语的迹象。这些迹象被生成用于解决从MedQA和MedMCQA抽取的医学问题，我们将其扩展到意大利语和西班牙语。我们在医学QA基准测试中在领域内和领域外设置中测试我们的流程，并展示了我们的推理迹象在通过上下文学习（少样本）和监督微调时提高性能，取得了在8B参数LLMs中的最先进结果。我们相信这些资源可以支持多语言环境中更透明的临床决策支持工具的发展。我们发布了完整的资源套件：推理迹象、翻译的QA数据集、医学维基百科和微调模型。

更新时间: 2026-03-30 15:46:48

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2512.05658v2

Position: Explainable AI is Causality in Disguise

The demand for Explainable AI (XAI) has triggered an explosion of methods, producing a landscape so fragmented that we now rely on surveys of surveys. Yet, fundamental challenges persist: conflicting metrics, failed sanity checks, and unresolved debates over robustness and fairness. The only consensus on how to achieve explainability is a lack of one. This has led many to point to the absence of a ground truth for defining ``the'' correct explanation as the main culprit. This position paper posits that the persistent discord in XAI arises not from an absent ground truth but from a ground truth that exists, albeit as an elusive and challenging target: the causal model that governs the relevant system. By reframing XAI queries about data, models, or decisions as causal inquiries, we prove the necessity and sufficiency of causal models for XAI. We contend that without this causal grounding, XAI remains unmoored. Ultimately, we encourage the community to converge around advanced concept and causal discovery to escape this entrenched uncertainty.

Updated: 2026-03-30 15:44:53

标题: 位置：可解释的人工智能是伪装中的因果关系

摘要: 对可解释人工智能（XAI）的需求引发了各种方法的爆炸性增长，导致一个如此分散的格局，以至于我们现在依赖于调查的调查。然而，基本挑战仍然存在：度量标准冲突，失败的合理性检查，以及对鲁棒性和公平性的争论仍未解决。关于如何实现可解释性的唯一共识是缺乏共识。这导致许多人指出缺乏定义“正确解释”的基本真相是主要罪魁祸首。本文认为，XAI中持续存在的分歧并非源自缺乏基本真相，而是源自一个存在的基本真相，尽管这个真相是一个难以捉摸和具有挑战性的目标：统治相关系统的因果模型。通过将关于数据、模型或决策的XAI查询重新定义为因果性查询，我们证明了因果模型对XAI的必要性和充分性。我们认为，如果没有这种因果根基，XAI将无法立足。最终，我们鼓励社区围绕先进概念和因果发现聚集起来，以摆脱这种根深蒂固的不确定性。

更新时间: 2026-03-30 15:44:53

领域: cs.LG

下载: http://arxiv.org/abs/2603.28597v1

Moving Beyond Review: Applying Language Models to Planning and Translation in Reflection

Reflective writing is known to support the development of students' metacognitive skills, yet learners often struggle to engage in deep reflection, limiting learning gains. Although large language models (LLMs) have been shown to improve writing skills, their use as conversational agents for reflective writing has produced mixed results and has largely focused on providing feedback on reflective texts, rather than support during planning and organizing. In this paper, inspired by the Cognitive Process Theory of writing (CPT), we propose the first application of LLMs to the planning and translation steps of reflective writing. We introduce Pensée, a tool to explore the effects of explicit AI support during these stages by scaffolding structured reflection planning using a conversational agent, and supporting translation by automatically extracting key concepts. We evaluate Pensée in a controlled between-subjects experiment (N=93), manipulating AI support across writing phases. Results show significantly greater reflection depth and structural quality when learners receive support during planning and translation stages of CPT, though these effects reduce in a delayed post-test. Analyses of learner behavior and perceptions further illustrate how CPT-aligned conversational support shapes reflection processes and learner experience, contributing empirical evidence for theory-driven uses of LLMs in AI-supported reflective writing.

Updated: 2026-03-30 15:42:38

标题: 超越评论：将语言模型应用于反思中的规划和翻译

摘要: 反思性写作被认为能够支持学生元认知技能的发展，然而学习者经常难以进行深入反思，从而限制了学习效益。尽管大型语言模型(LLMs)已被证明可以提高写作技能，但它们作为对话代理用于反思性写作的应用效果不一，并且主要集中在对反思性文本提供反馈，而不是在规划和组织过程中提供支持。本文受写作认知过程理论(CPT)的启发，首次提出将LLMs应用于反思性写作的规划和翻译阶段。我们引入了Pensée工具，通过使用对话代理支持结构化反思规划，并通过自动提取关键概念支持翻译，以探索明确的人工智能支持在这些阶段的影响。我们在一项受控的实验中评估了Pensée(N=93)，在写作阶段间操纵了人工智能支持。结果显示，当学习者在CPT的规划和翻译阶段接受支持时，反思深度和结构质量显著提高，尽管这些效果在延迟的后测中减弱。对学习者行为和感知的分析进一步说明了与CPT对齐的对话支持如何塑造反思过程和学习者体验，为理论驱动的LLMs在AI支持的反思性写作中的应用提供了经验证据。

更新时间: 2026-03-30 15:42:38

领域: cs.HC,cs.AI,cs.CL

下载: http://arxiv.org/abs/2603.28596v1

Optimistic Actor-Critic with Parametric Policies for Linear Markov Decision Processes

Although actor-critic methods have been successful in practice, their theoretical analyses have several limitations. Specifically, existing theoretical work either sidesteps the exploration problem by making strong assumptions or analyzes impractical methods with complicated algorithmic modifications. Moreover, the actor-critic methods analyzed for linear MDPs often employ natural policy gradient (NPG) and construct "implicit" policies without explicit parameterization. Such policies are computationally expensive to sample from, making the environment interactions inefficient. To that end, we focus on the finite-horizon linear MDPs and propose an optimistic actor-critic framework that uses parametric log-linear policies. In particular, we introduce a tractable \textit{logit-matching} regression objective for the actor. For the critic, we use approximate Thompson sampling via Langevin Monte Carlo to obtain optimistic value estimates. We prove that the resulting algorithm achieves $\widetilde{\mathcal{O}}(ε^{-4})$ and $\widetilde{\mathcal{O}}(ε^{-2})$ sample complexity in the on-policy and off-policy setting, respectively. Our results match prior theoretical works in achieving the state-of-the-art sample complexity, while our algorithm is more aligned with practice.

Updated: 2026-03-30 15:41:59

标题: 线性马尔可夫决策过程的具参数策略的乐观演员-评论家算法

摘要: 尽管演员-评论家方法在实践中取得了成功，但它们的理论分析存在一些局限性。具体来说，现有的理论工作要么通过做出强假设来回避探索问题，要么分析具有复杂算法修改的不切实际的方法。此外，针对线性MDP进行分析的演员-评论家方法通常使用自然策略梯度（NPG）并构建“隐式”策略，而不是明确参数化。这种策略在采样时计算成本高昂，导致环境交互效率低下。因此，我们专注于有限时间段的线性MDP，并提出了一个乐观的演员-评论家框架，使用参数化对数线性策略。具体来说，我们为演员引入了一个可处理的“对数匹配”回归目标。对于评论家，我们使用通过Langevin Monte Carlo的近似汤普森采样来获得乐观的值估计。我们证明，由此产生的算法在基于政策和离线政策设置中实现了$\widetilde{\mathcal{O}}(ε^{-4})$和$\widetilde{\mathcal{O}}(ε^{-2})$的样本复杂度。我们的结果与先前的理论工作相匹配，实现了最先进的样本复杂度，同时我们的算法更符合实践。

更新时间: 2026-03-30 15:41:59

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.28595v1

Detection of Adversarial Attacks in Robotic Perception

Deep Neural Networks (DNNs) achieve strong performance in semantic segmentation for robotic perception but remain vulnerable to adversarial attacks, threatening safety-critical applications. While robustness has been studied for image classification, semantic segmentation in robotic contexts requires specialized architectures and detection strategies.

Updated: 2026-03-30 15:41:49

标题: 机器人感知中对抗性攻击的检测

摘要: 深度神经网络（DNNs）在机器人感知的语义分割任务中取得了很强的性能，但仍然容易受到对抗攻击的影响，对安全关键应用构成威胁。虽然对图像分类的鲁棒性已经进行了研究，但在机器人环境中进行语义分割需要专门的架构和检测策略。

更新时间: 2026-03-30 15:41:49

领域: cs.CV,cs.AI,cs.CR,cs.RO

下载: http://arxiv.org/abs/2603.28594v1

Physics-Informed Framework for Impact Identification in Aerospace Composites

This paper introduces a novel physics-informed impact identification (Phy-ID) framework. The proposed method integrates observational, inductive, and learning biases to combine physical knowledge with data-driven inference in a unified modelling strategy, achieving physically consistent and numerically stable impact identification. The physics-informed approach structures the input space using physics-based energy indicators, constrains admissible solutions via architectural design, and enforces governing relations via hybrid loss formulations. Together, these mechanisms limit non-physical solutions and stabilise inference under degraded measurement conditions. A disjoint inference formulation is used as a representative use case to demonstrate the framework capabilities, in which impact velocity and impactor mass are inferred through decoupled surrogate models, and impact energy is computed by enforcing kinetic energy consistency. Experimental evaluations show mean absolute percentage errors below 8% for inferred impact velocity and impactor mass and below 10% for impact energy. Additional analyses confirm stable performance under reduced data availability and increased measurement noise, as well as generalisation for out-of-distribution cases across pristine and damaged regimes when damaged responses are included in training. These results indicate that the systematic integration of physics-informed biases enables reliable, physically consistent, and data-efficient impact identification, highlighting the potential of the approach for practical monitoring systems.

Updated: 2026-03-30 15:40:28

标题: 航空航天复合材料中冲击识别的物理信息框架

摘要: 本文介绍了一种新颖的物理知识驱动的冲击识别（Phy-ID）框架。所提出的方法整合了观测性、归纳性和学习性偏见，将物理知识与数据驱动推断结合在统一的建模策略中，实现了物理一致和数值稳定的冲击识别。物理知识驱动的方法利用基于物理的能量指标结构化输入空间，通过架构设计约束可接受的解决方案，并通过混合损失公式来强制执行控制关系。这些机制共同限制了非物理解决方案，并在降低的测量条件下稳定推断。一个不相交的推断公式被用作代表性用例来展示框架的能力，其中通过解耦的代理模型推断冲击速度和冲击器质量，并通过强制动能一致性计算冲击能量。实验评估显示，推断的冲击速度和冲击器质量的平均绝对百分比误差低于8％，冲击能量低于10％。额外的分析证实了在数据可用性降低和测量噪声增加的情况下的稳定性表现，以及在包括受损响应在内的训练中跨无损和受损区域的分布。这些结果表明，物理知识驱动的偏见的系统整合使得可靠、物理一致和数据高效的冲击识别成为可能，突显了该方法在实际监测系统中的潜力。

更新时间: 2026-03-30 15:40:28

领域: cs.LG,physics.app-ph

下载: http://arxiv.org/abs/2603.28593v1

Advancing Few-Shot Pediatric Arrhythmia Classification with a Novel Contrastive Loss and Multimodal Learning

Arrhythmias are a major cause of sudden cardiac death in children, making automated rhythm classification from electrocardiograms (ECGs) clinically important. However, pediatric arrhythmia analysis remains challenging because of age-dependent waveform variability, limited data availability, and a pronounced long-tailed class distribution that hinders recognition of rare but clinically important rhythms. To address these issues, we propose a multimodal end-to-end framework that integrates surface ECG and intracardiac electrogram (IEGM) signals for pediatric arrhythmia classification. The model combines dual-branch feature encoders, attention-based cross-modal fusion, and a lightweight Transformer classifier to learn complementary electrophysiological representations. We further introduce an Adaptive Global Class-Aware Contrastive Loss (AGCACL), which incorporates prototype-based alignment, class-frequency reweighting, and globally informed hard-class modulation to improve intra-class compactness and inter-class separability under class imbalance. We evaluate the proposed method on the pediatric subset of the Leipzig Heart Center ECG-Database and establish a reproducible preprocessing pipeline including rhythm-segment construction, denoising, and label grouping. The proposed approach achieves 96.22% Top-1 accuracy and improves macro precision, macro recall, macro F1 score, and macro F2 score by 4.48, 1.17, 6.98, and 7.34 percentage points, respectively, over the strongest baseline. These results indicate improved minority-sensitive classification performance on the current benchmark. However, further validation under subject-independent and multicenter settings is still required before clinical translation.

Updated: 2026-03-30 15:40:19

标题: 用一种新颖的对比损失和多模态学习推进少样本儿童心律失常分类

摘要: 心律失常是儿童突发性心脏死亡的主要原因，因此从心电图（ECG）中自动分类心律对临床具有重要意义。然而，由于年龄相关的波形变异性、有限的数据可用性以及明显的长尾类分布，儿科心律失常分析仍然具有挑战性，这些问题阻碍了对罕见但临床重要的心律的识别。为了解决这些问题，我们提出了一个多模态端到端框架，将表面心电图和心内电图（IEGM）信号整合到一起，用于儿科心律失常分类。该模型结合了双分支特征编码器、基于注意力的跨模态融合和轻量级Transformer分类器，以学习互补的电生理表示。我们进一步引入了一种自适应全局类感知对比损失（AGCACL），其中包含基于原型的对齐、类频率重新加权和全局通知的硬类调制，以改善类不平衡下的内类紧密度和类间可分离性。我们在莱比锡心脏中心ECG数据库的儿科子集上评估了所提出的方法，并建立了一个可复现的预处理流程，包括节律段构建、去噪和标签分组。所提出的方法实现了96.22%的Top-1准确率，并将宏精度、宏召回率、宏F1得分和宏F2得分分别提高了4.48、1.17、6.98和7.34个百分点，超过了最强基线。这些结果表明在当前基准测试中改进了对少数敏感分类性能。然而，在临床转化之前，仍需要在独立于受试者和多中心设置下进行进一步验证。

更新时间: 2026-03-30 15:40:19

领域: eess.SP,cs.AI,cs.LG

下载: http://arxiv.org/abs/2509.19315v2

Universal Approximation Constraints of Narrow ResNets: The Tunnel Effect

We analyze the universal approximation constraints of narrow Residual Neural Networks (ResNets) both theoretically and numerically. For deep neural networks without input space augmentation, a central constraint is the inability to represent critical points of the input-output map. We prove that this has global consequences for target function approximations and show that the manifestation of this defect is typically a shift of the critical point to infinity, which we call the ``tunnel effect'' in the context of classification tasks. While ResNets offer greater expressivity than standard multilayer perceptrons (MLPs), their capability strongly depends on the signal ratio between the skip and residual channels. We establish quantitative approximation bounds for both the residual-dominant (close to MLP) and skip-dominant (close to neural ODE) regimes. These estimates depend explicitly on the channel ratio and uniform network weight bounds. Low-dimensional examples further provide a detailed analysis of the different ResNet regimes and how architecture-target incompatibility influences the approximation error.

Updated: 2026-03-30 15:37:51

标题: 窄ResNets的通用逼近约束：隧道效应

摘要: 我们从理论和数值两方面分析了窄Residual神经网络（ResNets）的通用逼近约束。对于没有输入空间增强的深度神经网络，一个核心限制是无法表示输入输出映射的临界点。我们证明了这对目标函数逼近具有全局影响，并展示了这一缺陷的表现通常是将临界点移至无穷远处，我们在分类任务的背景下称之为“隧道效应”。虽然ResNets提供了比标准多层感知器（MLPs）更大的表达能力，但它们的能力强烈依赖于跳跃和残差通道之间的信号比率。我们为残差主导（接近MLP）和跳跃主导（接近神经ODE）两种情况建立了定量逼近界限。这些估计明确取决于通道比率和网络权重界限的统一性。低维示例进一步提供了对不同ResNet模式的详细分析，以及架构目标不兼容性如何影响逼近误差。

更新时间: 2026-03-30 15:37:51

领域: math.DS,cs.LG

下载: http://arxiv.org/abs/2603.28591v1

MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there exists a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.

Updated: 2026-03-30 15:37:42

标题: MonitorBench：大型语言模型中思维链可监控性的综合基准

摘要: 大型语言模型（LLMs）可以生成一系列思维链（CoTs），这些思维链并不总是对最终输出负有因果责任。当出现这种不匹配时，CoT不再忠实地反映驱动模型行为的决策关键因素，导致降低的CoT可监控性问题。然而，目前仍缺乏一个全面的、完全开源的用于研究CoT可监控性的基准。为了填补这一空白，我们提出了MonitorBench，这是一个系统的基准，用于评估LLMs中CoT可监控性。MonitorBench提供：（1）一个包含1,514个测试实例的多样化测试集，这些实例涵盖了跨越7个类别的19个任务，并精心设计了决策关键因素，以便表征何时可以使用CoTs来监控驱动LLM行为的因素；（2）两种压力测试设置，以量化CoT可监控性可以降低的程度。通过对多个具有不同能力的流行LLMs进行广泛实验，结果显示当生成最终目标响应需要通过决策关键因素进行结构推理时，CoT可监控性较高。闭源LLMs通常显示较低的监控性，而监控性与模型能力之间存在负相关关系。此外，无论是开源还是闭源的LLMs在压力测试下都可以有意地降低监控性，在一些不需要通过决策关键因素进行结构推理的任务中，监控性可以下降多达30%。除了这些实证见解外，MonitorBench为进一步研究评估未来LLMs、研究先进的压力测试监控技术以及开发新的监控方法提供了基础。

更新时间: 2026-03-30 15:37:42

领域: cs.AI

下载: http://arxiv.org/abs/2603.28590v1

Towards a Medical AI Scientist

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical autonomous research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under 3 research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.

Updated: 2026-03-30 15:37:25

标题: 走向医疗人工智能科学家

摘要: 具有生成科学假设、进行实验和起草手稿能力的自主系统最近被提出作为加速发现的一种有前途的范式。然而，现有的AI科学家仍然主要是领域无关的，限制了它们在临床医学中的适用性，因为研究必须建立在医学证据和专门数据模态之上。在这项工作中，我们介绍了医学AI科学家，这是第一个专门针对临床自主研究的自主研究框架。它通过临床医生-工程师共同推理机制将广泛调查的文献转化为可操作的证据，从而实现临床基础的构思，提高了生成的研究思路的可追溯性。它进一步通过结构化的医学组合规范和伦理政策指导基于证据的手稿起草。该框架在3种研究模式下运行，分别是基于文献的再现、基于文献的创新和任务驱动的探索，每种模式对应着不同级别的自动化科学探究，随着自主性的逐渐增加。通过大型语言模型和人类专家的全面评估表明，医学AI科学家生成的思路在171个案例、19个临床任务和6种数据模态下的质量明显高于商业LLMs生成的思路。与此同时，我们的系统在提出的方法与实施之间实现了强有力的一致性，同时在可执行实验的成功率方面也显著高于其他系统。人类专家和斯坦福代理审阅员的双盲评估表明，所生成的手稿接近于MICCAI水平的质量，同时始终超越了ISBI和BIBM的手稿质量。提出的医学AI科学家强调了在医疗保健领域利用AI进行自主科学发现的潜力。

更新时间: 2026-03-30 15:37:25

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.28589v1

Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering

Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. We present ChartCynics, an agentic dual-path framework designed to unmask visual deception via a "skeptical" reasoning paradigm. Unlike holistic models, ChartCynics decouples perception from verification: a Diagnostic Vision Path captures structural anomalies (e.g., inverted axes) through strategic ROI cropping, while an OCR-Driven Data Path ensures numerical grounding. To resolve cross-modal conflicts, we introduce an Agentic Summarizer optimized via a two-stage protocol: Oracle-Informed SFT for reasoning distillation and Deception-Aware GRPO for adversarial alignment. This pipeline effectively penalizes visual traps and enforces logical consistency. Evaluations on two benchmarks show that ChartCynics achieves 74.43% and 64.55% accuracy, providing an absolute performance boost of ~29% over the Qwen3-VL-8B backbone, outperforming state-of-the-art proprietary models. Our results demonstrate that specialized agentic workflows can grant smaller open-source models superior robustness, establishing a new foundation for trustworthy chart interpretation.

Updated: 2026-03-30 15:32:24

标题: 穿越幻象：一个强大的误导性图表问题回答的双路径代理框架

摘要: 尽管视觉语言模型（VLMs）取得了成功，但误导性图表仍然是一个重要挑战，因为它们具有欺骗性的视觉结构和扭曲的数据表现。我们提出了ChartCynics，这是一个设计用于通过“怀疑”推理范式揭示视觉欺骗的主动双通路框架。与整体模型不同，ChartCynics将感知与验证分离：诊断视觉路径通过策略性ROI裁剪捕捉结构异常（例如，反转的轴），而OCR驱动的数据路径确保数字基础。为了解决跨模态冲突，我们引入了一个经过两阶段协议优化的主动总结器：Oracle-Informed SFT用于推理提炼，Deception-Aware GRPO用于对抗性对齐。该流程有效地惩罚视觉陷阱并强制执行逻辑一致性。在两个基准测试上的评估表明，ChartCynics实现了74.43%和64.55%的准确性，相对于Qwen3-VL-8B骨干，绝对性能提升约29%，优于最先进的专有模型。我们的结果表明，专门的主动工作流程可以使较小的开源模型具有更强大的鲁棒性，为可信的图表解释奠定了新的基础。

更新时间: 2026-03-30 15:32:24

领域: cs.CV,cs.AI,cs.MM

下载: http://arxiv.org/abs/2603.28583v1

Understanding the Use of a Large Language Model-Powered Guide to Make Virtual Reality Accessible for Blind and Low Vision People

As social virtual reality (VR) grows more popular, addressing accessibility for blind and low vision (BLV) users is increasingly critical. Researchers have proposed an AI "sighted guide" to help users navigate VR and answer their questions, but it has not been studied with users. To address this gap, we developed a large language model (LLM)-powered guide and studied its use with 16 BLV participants in virtual environments with confederates posing as other users. We found that when alone, participants treated the guide as a tool, but treated it companionably around others, giving it nicknames, rationalizing its mistakes with its appearance, and encouraging confederate-guide interaction. Our work furthers understanding of guides as a versatile method for VR accessibility and presents design recommendations for future guides.

Updated: 2026-03-30 15:28:55

标题: 理解使用大型语言模型驱动的指南，使虚拟现实对盲人和低视力人群更易接触

摘要: 随着社交虚拟现实（VR）的普及，解决盲人和低视力（BLV）用户的可访问性问题变得日益重要。研究人员提出了一种人工智能“视觉导游”，帮助用户在VR中导航并回答他们的问题，但尚未与用户进行研究。为了填补这一空白，我们开发了一个由大型语言模型（LLM）驱动的导游，并在虚拟环境中与假扮其他用户的同伴一起，研究了16名BLV参与者的使用情况。我们发现，当独自一人时，参与者将导游视为工具，但在他人面前，他们则将其视为伙伴，给予其昵称，以其外观为理由来解释其错误，并鼓励同伴与导游互动。我们的工作深化了对导游作为VR可访问性的多功能方法的理解，并为未来导游提出了设计建议。

更新时间: 2026-03-30 15:28:55

领域: cs.HC,cs.AI,cs.ET

下载: http://arxiv.org/abs/2603.09964v2

ChemCLIP: Bridging Organic and Inorganic Anticancer Compounds Through Contrastive Learning

The discovery of anticancer therapeutics has traditionally treated organic small molecules and metal-based coordination complexes as separate chemical domains, limiting knowledge transfer despite their shared biological objectives. This disparity is particularly pronounced in available data, with extensive screening databases for organic compounds compared to only a few thousand characterized metal complexes. Here, we introduce ChemCLIP, a dual-encoder contrastive learning framework that bridges this organic-inorganic divide by learning unified representations based on shared anticancer activities rather than structural similarity. We compiled complementary datasets comprising 44,854 unique organic compounds and 5,164 unique metal complexes, standardized across 60 cancer cell lines. By training parallel encoders with activity-aware hard negative mining, we mapped structurally distinct compounds into a shared 256-dimensional embedding space where biologically similar compounds cluster together regardless of chemical class. We systematically evaluated four molecular encoding strategies: Morgan fingerprints, ChemBERTa, MolFormer, and Chemprop, through quantitative alignment metrics, embedding visualizations, and downstream classification tasks. Morgan fingerprints achieved superior performance with an average alignment ratio of 0.899 and downstream classification AUCs of 0.859 (inorganic) and 0.817 (organic). This work establishes contrastive learning as an effective strategy for unifying disparate chemical domains and provides empirical guidance for encoder selection in multi-modal chemistry applications, with implications extending beyond anticancer drug discovery to any scenario requiring cross-domain chemical knowledge transfer.

Updated: 2026-03-30 15:28:36

标题: ChemCLIP：通过对比学习桥接有机和无机抗癌化合物

摘要: 传统上，抗癌治疗药物的发现通常将有机小分子和基于金属的配位化合物视为独立的化学领域，尽管它们具有共同的生物学目标，但限制了知识转移。这种差距在现有数据中尤为明显，有机化合物的广泛筛选数据库与仅有几千个经过表征的金属络合物相比。在这里，我们介绍了ChemCLIP，这是一个双编码器对比学习框架，通过学习基于共同抗癌活性而不是结构相似性的统一表示，弥合了有机和无机之间的分歧。我们编制了包括44,854个独特有机化合物和5,164个独特金属络合物在内的互补数据集，这些数据集在60个癌细胞系中进行了标准化。通过训练具有活性感知的硬负采样的并行编码器，我们将结构不同的化合物映射到一个共享的256维嵌入空间，其中生物相似的化合物不论化学类别如何都会聚集在一起。我们系统评估了四种分子编码策略：Morgan指纹、ChemBERTa、MolFormer和Chemprop，通过定量对齐指标、嵌入可视化和下游分类任务。Morgan指纹表现出优秀的性能，平均对齐比为0.899，下游分类AUC分别为0.859（无机）和0.817（有机）。这项工作将对比学习确立为统一不同化学领域的有效策略，并为多模化学应用中的编码器选择提供了经验指导，其影响不仅限于抗癌药物发现，还延伸到任何需要跨领域化学知识转移的场景。

更新时间: 2026-03-30 15:28:36

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28575v1

Learning Partial Action Replacement in Offline MARL

Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but existing approach relies on enumerating multiple subset configurations at high computational cost and cannot adapt to varying states. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines how many agents to replace at each update step, balancing policy improvement against conservative value estimation. We prove a value-error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR-based method SPaCQL, PLCQL reduces the number of per-iteration Q-function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.

Updated: 2026-03-30 15:28:13

标题: 学习离线多智能体强化学习中的部分动作替换

摘要: 离线多智能体强化学习(MARL)面临一个关键挑战：随着智能体数量的增加，联合动作空间呈指数级增长，使得数据集覆盖度指数级稀疏，联合动作超出分布（OOD）不可避免。局部动作替换（PAR）通过将一部分智能体锚定到数据集动作来缓解这一问题，但现有方法依赖于在高计算成本下枚举多个子集配置，并且无法适应不同状态。我们引入了PLCQL，一个将PAR子集选择形式化为上下文匪徒问题，并使用带有不确定性加权奖励的近端策略优化来学习一个依赖于状态的PAR策略的框架。这种自适应策略动态确定每个更新步骤替换多少个智能体，平衡策略改进与保守价值估计。我们证明了一个值误差上界，表明估计误差与预期偏离智能体数量成线性关系。与之前基于PAR的方法SPaCQL相比，PLCQL将每次迭代中Q函数评估的数量从n减少到1，显著提高了计算效率。在实证方面，PLCQL在MPE、MaMuJoCo和SMAC基准测试中的66%任务中取得了最高的标准化分数，在84%的任务上优于SPaCQL，同时大幅降低了计算成本。

更新时间: 2026-03-30 15:28:13

领域: cs.LG,cs.AI,cs.MA

下载: http://arxiv.org/abs/2603.28573v1

Unrestrained Simplex Denoising for Discrete Data. A Non-Markovian Approach Applied to Graph Generation

Denoising models such as Diffusion or Flow Matching have recently advanced generative modeling for discrete structures, yet most approaches either operate directly in the discrete state space, causing abrupt state changes. We introduce simplex denoising, a simple yet effective generative framework that operates on the probability simplex. The key idea is a non-Markovian noising scheme in which, for a given clean data point, noisy representations at different times are conditionally independent. While preserving the theoretical guarantees of denoising-based generative models, our method removes unnecessary constraints, thereby improving performance and simplifying the formulation. Empirically, \emph{unrestrained simplex denoising} surpasses strong discrete diffusion and flow-matching baselines across synthetic and real-world graph benchmarks. These results highlight the probability simplex as an effective framework for discrete generative modeling.

Updated: 2026-03-30 15:26:56

标题: 无约束的简单去噪方法用于离散数据。一种非马尔可夫方法应用于图生成。

摘要: 去噪模型，如扩散或流匹配，最近推动了离散结构的生成建模进展，然而大多数方法要么直接在离散状态空间中操作，导致状态变化突然。我们引入了单纯形去噪，这是一个简单而有效的生成框架，它在概率单纯形上操作。关键思想是一种非马尔可夫噪声方案，在这种方案中，对于给定的干净数据点，不同时间的嘈杂表示是有条件独立的。虽然保留了基于去噪的生成模型的理论保证，但我们的方法消除了不必要的约束，从而提高了性能并简化了公式。在经验上，“无约束的单纯形去噪”在合成和真实世界的图基准测试中超越了强离散扩散和流匹配基线。这些结果突显了概率单纯形作为离散生成建模的有效框架。

更新时间: 2026-03-30 15:26:56

领域: cs.LG

下载: http://arxiv.org/abs/2603.28572v1

CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, often ignoring the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, quantifying service quality through metrics such as the Normalized Efficiency Index and Multi-Turn Latency to explicitly measure resolution efficiency. Experiments utilizing our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service, highlighting critical directions for the future development of LLM-based agents in practical technical service applications. CirrusBench evaluation framework is released at: https://github.com/CirrusAI

Updated: 2026-03-30 15:26:00

标题: CirrusBench: 评估基于LLM的代理在现实世界云服务环境中的正确性之外的性能

摘要: 大型语言模型（LLM）的增强代理能力使它们能够在现实世界应用中部署，例如云服务，在这些服务中，客户助理交互展示出高技术复杂性和长期依赖性，因此对于客户满意度来说，鲁棒性和分辨率效率至关重要。然而，现有基于LLM的代理的基准主要依赖于无法捕捉真实客户输入多样性和不可预测性的合成环境，通常忽略了对真实世界部署至关重要的分辨率效率。为了弥补这一差距，我们引入了CirrusBench，这是一个新颖的评估框架，其基础在于来自真实云服务工单的真实数据。CirrusBench保留了技术服务环境中复杂的多轮逻辑链和现实工具依赖关系。在超越执行正确性的同时，我们引入了新颖的以客户为中心的度量标准来定义代理的成功，通过诸如标准化效率指数和多轮延迟等指标来量化服务质量，明确衡量分辨率效率。利用我们的框架进行的实验表明，尽管最先进的模型展现出强大的推理能力，但它们在复杂、现实的多轮任务中经常面临困难，无法满足客户服务所需的高效标准，突出了未来LLM代理在实际技术服务应用中发展的关键方向。CirrusBench评估框架已发布在：https://github.com/CirrusAI

更新时间: 2026-03-30 15:26:00

领域: cs.LG,cs.AI,cs.IR,cs.PF

下载: http://arxiv.org/abs/2603.28569v1

Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems

The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspaces has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their direct application to air traffic control remains limited by insufficient domain grounding and unpredictable output inconsistency. This paper investigates LLMs as decision-makers in cooperative multi-agent tactical deconfliction using fine-tuning strategies that align model outputs to human operator heuristics. We propose a simulation-to-language data generation pipeline based on the BlueSky air traffic simulator that produces rule-consistent deconfliction datasets reflecting established safety practices. A pretrained Qwen-Math-7B model is fine-tuned using two parameter-efficient strategies: supervised fine-tuning with Low-Rank Adaptation (LoRA) and preference-based fine-tuning combining LoRA with Group-Relative Policy Optimization (GRPO). Experimental results on validation datasets and closed-loop simulations demonstrate that supervised LoRA fine-tuning substantially improves decision accuracy, consistency, and separation performance compared to the pretrained LLM, with significant reductions in near mid-air collisions. GRPO provides additional coordination benefits but exhibits reduced robustness when interacting with heterogeneous agent policies.

Updated: 2026-03-30 15:22:27

标题: 细调大型语言模型，用于小型无人机的合作战术解决冲突

摘要: 随着小型无人机系统（sUASs）在低空领域的不断部署，安全关键约束下可靠的战术去冲突需求不断增加。战术去冲突涉及在密集、部分可观察和异构的多智能体环境中进行短视野决策，其中必须保持合作分离保障和运行效率。尽管大规模语言模型（LLMs）展现出强大的推理能力，但由于领域基础不足和输出不一致性难以预测，它们在空中交通管制中的直接应用仍受到限制。本文研究了LLMs作为决策者在合作多智能体战术去冲突中的应用，利用微调策略来使模型输出与人类操作者的启发式方法保持一致。我们提出了基于BlueSky空中交通模拟器的模拟到语言数据生成管道，产生符合已建立安全实践的规则一致的去冲突数据集。预训练的Qwen-Math-7B模型通过两种参数高效的策略进行微调：带有低秩适应（LoRA）的监督微调和结合LoRA和组相对策略优化（GRPO）的基于偏好的微调。验证数据集和封闭环模拟的实验结果表明，监督LoRA微调相对于预训练的LLM显着提高了决策准确性、一致性和分离性能，近空中碰撞的数量显著减少。GRPO提供了额外的协调优势，但在与异构智能体政策互动时显示出较低的鲁棒性。

更新时间: 2026-03-30 15:22:27

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2603.28561v1

Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking

Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. To achieve efficient reasoning, existing reinforcement learning methods still struggle to construct short reasoning path during the rollout stage, limiting effective learning. Inspired by Evidence Accumulation Models, we find that LRMs have accumulated sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. Besides, it uses a quality-controlled length reward to better encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. Especially, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark. Our code is available in the GitHub.

Updated: 2026-03-30 15:21:37

标题: 你的模型已经思考得足够多了：训练大型推理模型停止过度思考

摘要: 大型推理模型（LRMs）在挑战性任务上取得了令人印象深刻的表现，但它们深层次的推理通常会产生相当大的计算成本。为了实现高效的推理，现有的强化学习方法仍然在展开阶段难以构建短推理路径，从而限制了有效学习。受证据积累模型的启发，我们发现LRMs在推理的早期阶段已经积累了足够的信息，进一步的推理步骤变得多余。基于这一观察结果，我们提出了Just-Enough Thinking（JET），该方法训练模型主动终止不必要的推理。JET在展开期间执行轨迹截断，使模型暴露于短、分布一致的推理路径。此外，它使用一个质量受控的长度奖励来更好地鼓励简洁的推理同时保持正确性。大量实验证明，JET显著提高了推理效率而不牺牲准确性。特别是，DeepSeek-Distill-Qwen-1.5B在奥林匹克基准测试中准确率提高了4.6%，同时减少了输出长度46.3%。我们的代码可以在GitHub上找到。

更新时间: 2026-03-30 15:21:37

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2509.23392v3

GhanaNLP Parallel Corpora: Comprehensive Multilingual Resources for Low-Resource Ghanaian Languages

Low resource languages present unique challenges for natural language processing due to the limited availability of digitized and well structured linguistic data. To address this gap, the GhanaNLP initiative has developed and curated 41,513 parallel sentence pairs for the Twi, Fante, Ewe, Ga, and Kusaal languages, which are widely spoken across Ghana yet remain underrepresented in digital spaces. Each dataset consists of carefully aligned sentence pairs between a local language and English. The data were collected, translated, and annotated by human professionals and enriched with standard structural metadata to ensure consistency and usability. These corpora are designed to support research, educational, and commercial applications, including machine translation, speech technologies, and language preservation. This paper documents the dataset creation methodology, structure, intended use cases, and evaluation, as well as their deployment in real world applications such as the Khaya AI translation engine. Overall, this work contributes to broader efforts to democratize AI by enabling inclusive and accessible language technologies for African languages.

Updated: 2026-03-30 15:19:36

标题: 加纳NLP平行语料库：低资源加纳语言的全面多语资源

摘要: 低资源语言在自然语言处理中面临独特的挑战，因为数字化和结构化语言数据的可用性有限。为了解决这一问题，GhanaNLP倡议已经开发并整理了41,513对特威语、方特语、依韦语、加语和库萨尔语的平行句对，这些语言在加纳广泛使用，但在数字空间中仍然属于少数。每个数据集包含本地语言和英语之间精心对齐的句子对。这些数据由人类专业人员收集、翻译和注释，并丰富了标准结构化元数据，以确保一致性和可用性。这些语料库旨在支持研究、教育和商业应用，包括机器翻译、语音技术和语言保护。本文记录了数据集的创建方法、结构、预期用例和评估，以及它们在现实世界应用中的部署，如Khaya AI翻译引擎。总的来说，这项工作为推动AI的民主化努力做出了贡献，通过为非洲语言提供包容性和可访问的语言技术。

更新时间: 2026-03-30 15:19:36

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2603.13793v2

T-Norm Operators for EU AI Act Compliance Classification: An Empirical Comparison of Lukasiewicz, Product, and Gödel Semantics in a Neuro-Symbolic Reasoning System

We present a first comparative pilot study of three t-norm operators -- Lukasiewicz (T_L), Product (T_P), and Gödel (T_G) - as logical conjunction mechanisms in a neuro-symbolic reasoning system for EU AI Act compliance classification. Using the LGGT+ (Logic-Guided Graph Transformers Plus) engine and a benchmark of 1035 annotated AI system descriptions spanning four risk categories (prohibited, high_risk, limited_risk, minimal_risk), we evaluate classification accuracy, false positive and false negative rates, and operator behaviour on ambiguous cases. At n=1035, all three operators differ significantly (McNemar p<0.001). T_G achieves highest accuracy (84.5%) and best borderline recall (85%), but introduces 8 false positives (0.8%) via min-semantics over-classification. T_L and T_P maintain zero false positives, with T_P outperforming T_L (81.2% vs. 78.5%). Our principal findings are: (1) operator choice is secondary to rule base completeness; (2) T_L and T_P maintain zero false positives but miss borderline cases; (3) T_G's min-semantics achieves higher recall at cost of 0.8% false positive rate; (4) a mixed-semantics classifier is the productive next step. We release the LGGT+ core engine (201/201 tests passing) and benchmark dataset (n=1035) under Apache 2.0.

Updated: 2026-03-30 15:19:28

标题: T-范数运算符用于欧盟人工智能法合规分类：在神经符号推理系统中对Lukasiewicz、Product和Gödel语义的实证比较

摘要: 我们提出了第一个比较性的三个t-范数运算符 - Lukasiewicz（T_L）、Product（T_P）和Gödel（T_G）- 作为逻辑连接机制在一个神经符号推理系统中，用于欧盟AI法案合规分类的试点研究。使用LGGT+（逻辑引导图转换器加强版）引擎和一个包含1035个注释的AI系统描述的基准，涵盖了四个风险类别（禁止、高风险、有限风险、最小风险），我们评估了分类准确性、假阳性率、假阴性率以及在模糊情况下运算符的行为。在n=1035时，这三个运算符都有显著差异（McNemar p<0.001）。T_G实现了最高的准确性（84.5%）和最佳的边界召回率（85%），但通过min-语义过度分类引入了8个假阳性（0.8%）。T_L和T_P保持了零假阳性，其中T_P表现优于T_L（81.2% vs. 78.5%）。我们的主要发现是：（1）运算符选择次于规则库的完整性；（2）T_L和T_P保持零假阳性但错过了边界案例；（3）T_G的min-语义以更高的召回率实现，代价是0.8%的假阳性率；（4）混合语义分类器是下一步的有效步骤。我们在Apache 2.0下发布了LGGT+核心引擎（201/201个测试通过）和基准数据集（n=1035）。

更新时间: 2026-03-30 15:19:28

领域: cs.AI

下载: http://arxiv.org/abs/2603.28558v1

FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

Scientific compound figures combine multiple labeled panels into a single image. However, in a PMC-scale crawl of 346,567 compound figures, 16.3% have no caption and 1.8% only have captions shorter than ten words, causing them to be discarded by existing caption-decomposition pipelines. We propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the image, converting otherwise unusable figures into aligned panel-text pairs for downstream pretraining and retrieval. To mitigate linguistic variance in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively controls how caption features condition the detection query space, and employ a staged SFT+RL strategy with CLIP-based alignment and BERTScore-based semantic rewards. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. FigEx2 achieves 0.728 mAP@0.5:0.95 for detection, outperforms Qwen3-VL-8B by 0.44 in METEOR and 0.22 in BERTScore, and transfers zero-shot to out-of-distribution scientific domains without fine-tuning.

Updated: 2026-03-30 15:19:06

标题: FigEx2：科学复合图中的视觉条件面板检测和字幕生成

摘要: 科学复合图将多个带标签的面板组合成单个图像。然而，在对346,567个复合图进行PMC规模爬取时，16.3%没有标题，1.8%只有少于十个字的标题，导致它们被现有的标题分解管道丢弃。我们提出了FigEx2，这是一个视觉条件框架，它可以从图像中定位面板并直接生成面板级标题，将否则无法使用的图形转换为与下游预训练和检索相对应的面板-文本对。为了减少开放式字幕中的语言差异，我们引入了一个噪声感知的门控融合模块，自适应地控制字幕特征如何调节检测查询空间，并采用了基于CLIP对齐和基于BERTScore的语义奖励的分阶段SFT+RL策略。为了支持高质量的监督，我们策划了BioSci-Fig-Cap，这是一个用于面板级定位的精心筛选的基准，同时还包括物理和化学跨学科测试套件。FigEx2在检测方面达到了0.728的mAP@0.5:0.95，比Qwen3-VL-8B在METEOR和BERTScore方面表现更好，且可以零-shot转移到无需微调的分布之外的科学领域。

更新时间: 2026-03-30 15:19:06

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2601.08026v4

Domain-Invariant Prompt Learning for Vision-Language Models

Large pre-trained vision-language models like CLIP have transformed computer vision by aligning images and text in a shared feature space, enabling robust zero-shot transfer via prompting. Soft-prompting, such as Context Optimization (CoOp), effectively adapts these models for downstream recognition tasks by learning a set of context vectors. However, CoOp lacks explicit mechanisms for handling domain shifts across unseen distributions. To address this, we propose Domain-invariant Context Optimization (DiCoOp), an extension of CoOp optimized for domain generalization. By employing an adversarial training approach, DiCoOp forces the model to learn domain-invariant prompts while preserving discriminative power for classification. Experimental results show that DiCoOp consistently surpasses CoOp in domain generalization tasks across diverse visual domains.

Updated: 2026-03-30 15:18:31

标题: 跨领域提示学习对视觉语言模型的影响

摘要: 大型预训练的视觉-语言模型（如CLIP）通过将图像和文本在共享特征空间中对齐，实现了计算机视觉的转变，通过提示实现了强大的零样本迁移。软提示，如上下文优化（CoOp），通过学习一组上下文向量有效地将这些模型适应于下游识别任务。然而，CoOp缺乏处理未知分布中的领域转移的明确机制。为了解决这个问题，我们提出了适用于领域泛化的域不变上下文优化（DiCoOp），这是CoOp的扩展。通过采用对抗训练方法，DiCoOp迫使模型学习领域不变的提示，同时保留分类的区分力。实验结果表明，DiCoOp在各种视觉领域的领域泛化任务中始终优于CoOp。

更新时间: 2026-03-30 15:18:31

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.28555v1

Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model

Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.

Updated: 2026-03-30 15:17:41

标题: Hydra：在一个统一的视觉语言模型中统一文档检索和生成

摘要: 视觉文档理解通常需要单独的检索和生成模型，这会使内存和系统复杂性加倍。我们提出了Hydra，这是一种双头方法，提供了ColBERT风格的晚期交互检索和自回归生成功能，这些功能都来自于一个视觉语言模型（VLM）。在推断过程中，只对检索进行训练的单个LoRA适配器被切换：启用它会产生多向量嵌入；禁用它会恢复基本模型的生成质量--在四个VQA基准测试中的15,301个样本中，与独立基础模型管道相比，10,500个贪婪和随机样本中的100%输出为字节相同，最大ΔANLS=0.0044（三个信息性；在贪婪解码下，ChartQA对两种模型都接近零）。我们确定了三个工程要求（注意模式恢复，lm_head保留，KV-cache感知解码），忽略这些要求将悄无声息地破坏生成，尽管权重恢复正确。在ViDoRe V1上，Hydra（4B）在单次训练运行中与受控单头基线相差不到1个百分点，在V2和V3上的总体得分更高，这些得分集中在一部分任务上；需要多种子实验来确认这些趋势。单一模型设计将峰值GPU内存降低了41%，尽管适配器切换在并发服务负载下会引入吞吐量开销。消融实验表明，在LoRA-based（r=16）训练方案中，GritLM风格的联合训练并没有提供任何好处。对Qwen2.5-Omni-3B的概念验证扩展表明，这种机制可以推广到音频检索和视频嵌入，以及语音生成。

更新时间: 2026-03-30 15:17:41

领域: cs.CV,cs.AI,cs.IR

下载: http://arxiv.org/abs/2603.28554v1

Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage

With frequently evolving Advanced Persistent Threats (APTs) in cyberspace, traditional security solutions approaches have become inadequate for threat hunting for organizations. Moreover, SOC (Security Operation Centers) analysts are often overwhelmed and struggle to analyze the huge volume of logs received from diverse devices in organizations. To address these challenges, we propose an automated and dynamic threat hunting framework for monitoring evolving threats, adapting to changing network conditions, and performing risk-based prioritization for the mitigation of suspicious and malicious traffic. By integrating Agentic AI with Splunk, an established SIEM platform, we developed a unique threat hunting framework. The framework systematically and seamlessly integrates different threat hunting modules together, ranging from traffic ingestion to anomaly assessment using a reconstruction-based autoencoder, deep reinforcement learning (DRL) with two layers for initial triage, and a large language model (LLM) for contextual analysis. We evaluated the framework against a publicly available benchmark dataset, as well as against a simulated dataset. The experimental results show that the framework can effectively adapt to different SOC objectives autonomously and identify suspicious and malicious traffic. The framework enhances operational effectiveness by supporting SOC analysts in their decision-making to block, allow, or monitor network traffic. This study thus enhances cybersecurity and threat hunting literature by presenting the novel threat hunting framework for security decision-making, as well as promoting cumulative research efforts to develop more effective frameworks to battle continuously evolving cyber threats.

Updated: 2026-03-30 15:17:07

标题: 政策指导的威胁猎杀：基于LLM的Splunk SOC分流框架

摘要: 随着网络空间中不断发展的高级持久性威胁（APTs），传统的安全解决方案途径已经不再适用于组织的威胁猎杀。此外，安全运营中心（SOC）分析人员经常不堪重负，难以分析从组织中各种设备接收到的大量日志。为了解决这些挑战，我们提出了一个自动化和动态的威胁猎杀框架，用于监控不断发展的威胁，适应不断变化的网络条件，并对可疑和恶意流量进行基于风险的优先处理。通过将智能AI与Splunk，一个成熟的SIEM平台，相结合，我们开发了一个独特的威胁猎杀框架。该框架系统地和无缝地整合了不同的威胁猎杀模块，从流量摄入到使用基于重建的自动编码器进行异常评估，再到具有两层初始分类的深度强化学习（DRL），以及用于情境分析的大型语言模型（LLM）。我们针对一个公开可用的基准数据集以及一个模拟数据集对框架进行了评估。实验结果表明，该框架可以有效地自适应不同的SOC目标，自主地识别可疑和恶意流量。该框架通过支持SOC分析人员在决策是否阻止、允许或监控网络流量方面提高了运营效率。本研究通过提出用于安全决策的新型威胁猎杀框架，以及促进累积研究努力以开发更有效的框架来应对不断发展的网络威胁，增强了网络安全和威胁猎杀文献。

更新时间: 2026-03-30 15:17:07

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2603.23966v2

$φ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting, the problem of fairness caused the imbalanced data remains largely underexplored. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or $φ$-DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Then, we identify the limitations of conventional DPO in imbalanced data and present a new $φ$-DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable $φ$-DPO-based continual learning, we construct pairwise preference annotations for existing benchmarks in the context of continual learning. Extensive experiments and ablation studies show the proposed $φ$-DPO achieves State-of-the-Art performance across multiple benchmarks, outperforming prior continual learning methods of LMMs.

Updated: 2026-03-30 15:16:05

标题: $φ$-DPO：大型多模型模型中连续学习的公平直接偏好优化方法

摘要: 大型多模态模型（LMMs）的持续学习中的公平性是一个新兴但尚未充分探索的挑战，特别是在存在不平衡数据分布的情况下，可能导致模型更新的偏见和跨任务性能的次优表现。虽然最近的持续学习研究在解决灾难性遗忘方面取得了进展，但由于不平衡数据引起的公平性问题仍然大多未被探讨。本文提出了一种新颖的持续学习框架，称为公平性直接偏好优化（FaiDPO或$φ$-DPO），用于LMMs。具体地，我们首先提出了一种基于直接偏好优化（DPO）的新持续学习范例，通过与成对偏好信号对齐学习来减轻灾难性遗忘。然后，我们确定了传统DPO在不平衡数据中的局限性，并提出了一种新的$φ$-DPO损失函数，明确解决了分布偏差。我们提供了全面的理论分析，证明了我们的方法既解决了遗忘问题，又解决了数据不平衡问题。此外，为了实现基于$φ$-DPO的持续学习，我们在持续学习的背景下为现有基准构建了成对偏好注释。大量实验和消融研究表明，提出的$φ$-DPO在多个基准测试中实现了最新的性能，优于先前的LMMs持续学习方法。

更新时间: 2026-03-30 15:16:05

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2602.22601v2

Multimodal Analytics of Cybersecurity Crisis Preparation Exercises: What Predicts Success?

Instructional alignment, the match between intended cognition and enacted activity, is central to effective instruction but hard to operationalize at scale. We examine alignment in cybersecurity simulations using multimodal traces from 23 teams (76 students) across five exercise sessions. Study 1 codes objectives and team emails with Bloom's taxonomy and models the completion of key exercise tasks with generalized linear mixed models. Alignment, defined as the discrepancy between required and enacted Bloom levels, predicts success, whereas the Bloom category alone does not predict success once discrepancy is considered. Study 2 compares predictive feature families using grouped cross-validation and l1-regularized logistic regression. Text embeddings and log features outperform Bloom-only models (AUC~0.74 and 0.71 vs. 0.55), and their combination performs best (Test AUC~0.80), with Bloom frequencies adding little. Overall, the work offers a measure of alignment for simulations and shows that multimodal traces best forecast performance, while alignment provides interpretable diagnostic insight.

Updated: 2026-03-30 15:14:56

标题: 网络安全危机准备演习的多模态分析：成功的预测因素是什么？

摘要: 教学对齐，即预期认知与实际活动之间的匹配，是有效教学的核心，但在规模上很难实现。我们使用来自23个团队（76名学生）在五个练习会话中的多模态跟踪来研究网络安全模拟中的对齐。研究1使用布鲁姆的分类法对目标和团队邮件进行编码，并使用广义线性混合模型对重要练习任务的完成进行建模。定义为所需和实际布鲁姆水平之间的差异，对齐预测成功，而仅考虑差异后，布鲁姆类别本身不预测成功。研究2使用分组交叉验证和l1正则化逻辑回归比较预测特征系列。文本嵌入和日志特征优于仅使用布鲁姆的模型（AUC约为0.74和0.71对0.55），它们的组合表现最好（测试AUC约为0.80），布鲁姆频率几乎不增加。总的来说，这项工作为模拟提供了对齐度的度量，并显示多模态跟踪最能预测绩效，而对齐提供了可解释的诊断洞察。

更新时间: 2026-03-30 15:14:56

领域: cs.HC,cs.CY,cs.LG

下载: http://arxiv.org/abs/2603.28553v1

"What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents

Personalized computer-use agents are rapidly moving from expert communities into mainstream use. Unlike conventional chatbots, these systems can install skills, invoke tools, access private resources, and modify local environments on users' behalf. Yet users often do not know what authority they have delegated, what the agent actually did during task execution, or whether the system has been safely removed afterward. We investigate this gap as a combined problem of risk understanding and post-hoc auditability, using OpenClaw as a motivating case. We first build a multi-source corpus of the OpenClaw ecosystem, including incidents, advisories, malicious-skill reports, news coverage, tutorials, and social-media narratives. We then conduct an interview study to examine how users and practitioners understand skills, autonomy, privilege, persistence, and uninstallation. Our findings suggest that participants often recognized these systems as risky in the abstract, but lacked concrete mental models of what skills can do, what resources agents can access, and what changes may remain after execution or removal. Motivated by these findings, we propose AgentTrace, a traceability framework and prototype interface for visualizing agent actions, touched resources, permission history, provenance, and persistent side effects. A scenario-based evaluation suggests that traceability-oriented interfaces can improve understanding of agent behavior, support anomaly detection, and foster more calibrated trust.

Updated: 2026-03-30 15:12:55

标题: "它到底做了什么？"：理解计算机使用代理的风险意识和可追溯性

摘要: 个性化计算机使用代理正在迅速从专家社区进入主流使用。与传统的聊天机器人不同，这些系统可以安装技能，调用工具，访问私人资源，并代表用户修改本地环境。然而，用户通常不知道他们已委托了什么权限，代理在执行任务期间实际做了什么，或者系统在之后是否已安全移除。我们将这一差距视为风险理解和事后可审计性的综合问题，以OpenClaw作为激励案例进行研究。我们首先构建了一个OpenClaw生态系统的多源语料库，包括事件、建议、恶意技能报告、新闻报道、教程和社交媒体叙事。然后，我们进行了一项访谈研究，以考察用户和从业者对技能、自治、特权、持久性和卸载的理解。我们的研究结果表明，参与者通常将这些系统抽象地认为是有风险的，但缺乏关于技能可以做什么、代理可以访问什么资源以及执行或移除后可能会保留的变化的具体心智模型。受到这些发现的启发，我们提出了AgentTrace，一个用于可视化代理行为、接触资源、权限历史、出处和持久性副作用的可追溯性框架和原型界面。基于场景的评估表明，面向可追溯性的界面可以提高对代理行为的理解，支持异常检测，并促进更加准确的信任。

更新时间: 2026-03-30 15:12:55

领域: cs.CR,cs.ET,cs.HC,cs.MA

下载: http://arxiv.org/abs/2603.28551v1

Coarse-Guided Visual Generation via Weighted h-Transform Sampling

Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or are difficult to balance between guidance and synthetic quality. To address these challenges, we propose a novel guided method by using the h-transform, a tool that can constrain stochastic processes (e.g., sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding to the original differential equation with a drift function, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.

Updated: 2026-03-30 15:10:02

标题: 基于加权h-变换采样的粗引导视觉生成

摘要: 粗引导视觉生成是从降级或低保真度的粗略参考中合成精细视觉样本，对于各种现实世界应用至关重要。虽然基于训练的方法是有效的，但由于配对数据收集的高训练成本和受限的泛化能力而固有地受到限制。因此，最近的无训练方法提出利用预训练的扩散模型，并在采样过程中整合引导。然而，这些无训练方法要么需要了解正向（精细到粗糙）转换算子，例如双三次降采样，要么在引导和合成质量之间难以平衡。为了解决这些挑战，我们提出了一种新颖的引导方法，使用 h 变换，这是一种可以在期望条件下约束随机过程（例如采样过程）的工具。具体来说，我们通过在每个采样时间步长上添加漂移函数来修改每个采样时间步长的转移概率，这大致将生成引导向理想的精细样本。为了解决不可避免的近似误差，我们引入了一个噪声水平感知的时间表，随着误差的增加逐渐减小该项的权重，确保引导遵从和高质量的合成。广泛的图像和视频生成任务实验证明了我们方法的有效性和泛化能力。

更新时间: 2026-03-30 15:10:02

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.12057v2

Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure of test datasets rather than generalization. However, in the context of tabular data, this problem is largely unexplored. Existing approaches primarily rely on memorization tests, which are too coarse to detect contamination. In contrast, we propose a framework for assessing contamination in tabular datasets by generating controlled queries and performing comparative evaluation. Given a dataset, we craft multiple-choice aligned queries that preserve task structure while allowing systematic transformations of the underlying data. These transformations are designed to selectively disrupt dataset information while preserving partial knowledge, enabling us to isolate performance attributable to contamination. We complement this setup with non-neural baselines that provide reference performance, and we introduce a statistical testing procedure to formally detect significant deviations indicative of contamination. Empirical results on eight widely used tabular datasets reveal clear evidence of contamination in four cases. These findings suggest that performance on downstream tasks involving such datasets may be substantially inflated, raising concerns about the reliability of current evaluation practices.

Updated: 2026-03-30 15:08:50

标题: 评估大型语言模型中公共表格数据集的潜在知识

摘要: 大型语言模型（LLMs）越来越容易受到数据污染的影响，即测试数据集的性能提升主要是由于之前接触过的数据集，而非泛化能力。然而，在表格数据的背景下，这个问题很少被探讨。现有方法主要依赖于记忆测试，但这种方法过于粗糙，无法检测到污染。相反，我们提出了一个框架，通过生成受控查询和进行比较评估来评估表格数据集中的污染情况。给定一个数据集，我们设计了保留任务结构的多选对齐查询，同时允许对基础数据进行系统性转换。这些转换旨在有选择性地破坏数据集信息，同时保留部分知识，从而使我们能够分离出由于污染导致的性能。我们通过非神经网络基线来补充这一设置，提供参考性能，并引入了一个统计检验程序，以正式检测到表明污染的显著偏差。对八个广泛使用的表格数据集的实证结果显示，在四种情况下明确存在污染的证据。这些发现表明，在涉及这些数据集的下游任务的性能可能被大幅夸大，引发对当前评估实践可靠性的担忧。

更新时间: 2026-03-30 15:08:50

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.20351v2

Shy Guys: A Light-Weight Approach to Detecting Robots on Websites

Automated bots now account for roughly half of all web requests, and an increasing number deliberately spoof their identity to either evade detection or to not respect robots.txt. Existing countermeasures are either resource-intensive (JavaScript challenges, CAPTCHAs), cost-prohibitive (commercial solutions), or degrade the user experience. This paper proposes a lightweight, passive approach to bot detection that combines user-agent string analysis with favicon-based heuristics, operating entirely on standard web server logs with no client-side interaction. We evaluate the method on over 4.6 million requests containing 54,945 unique user-agent strings collected from website hosted all around the earth. Our approach detects 67.7% of bot traffic while maintaining a false-positive rate of 3%, outperforming state of the art (less than 20%). This method can serve as a first line of defence, routing only genuinely ambiguous requests to active challenges and preserving the experience of legitimate users.

Updated: 2026-03-30 15:07:00

标题: 害羞的家伙：一种轻量级检测网站上机器人的方法

摘要: 自动化机器人现在大约占所有网络请求的一半，越来越多的机器人故意伪装其身份，以逃避检测或不遵守robots.txt。现有的对策要么资源密集（JavaScript挑战，CAPTCHAs），要么成本高昂（商业解决方案），要么降低用户体验。本文提出了一种轻量级的被动式机器人检测方法，将用户代理字符串分析与基于favicon的启发式方法结合，完全在标准Web服务器日志中运行，无需客户端交互。我们在收集自全球各地托管的网站的超过460万个请求中评估了该方法，其中包含54945个唯一的用户代理字符串。我们的方法检测到67.7%的机器人流量，同时保持3%的假阳性率，表现优于现有技术（低于20%）。这种方法可以作为第一道防线，只将真正模糊的请求路由到主动挑战，并保护合法用户的体验。

更新时间: 2026-03-30 15:07:00

领域: cs.NI,cs.CR

下载: http://arxiv.org/abs/2603.28546v1

FlowPure: Continuous Normalizing Flows for Adversarial Purification

Despite significant advances in the area, adversarial robustness remains a critical challenge in systems employing machine learning models. The removal of adversarial perturbations at inference time, known as adversarial purification, has emerged as a promising defense strategy. To achieve this, state-of-the-art methods leverage diffusion models that inject Gaussian noise during a forward process to dilute adversarial perturbations, followed by a denoising step to restore clean samples before classification. In this work, we propose FlowPure, a novel purification method based on Continuous Normalizing Flows (CNFs) trained with Conditional Flow Matching (CFM) to learn mappings from adversarial examples to their clean counterparts. Unlike prior diffusion-based approaches that rely on fixed noise processes, FlowPure can leverage specific attack knowledge to improve robustness under known threats, while also supporting a more general stochastic variant trained on Gaussian perturbations for settings where such knowledge is unavailable. Experiments on CIFAR-10 and CIFAR-100 demonstrate that our method outperforms state-of-the-art purification defenses in preprocessor-blind and white-box scenarios, and can do so while fully preserving benign accuracy in the former. Moreover, our results show that not only is FlowPure a highly effective purifier but it also holds strong potential for adversarial detection, identifying preprocessor-blind PGD samples with near-perfect accuracy. Our code is publicly available at https://github.com/DistriNet/FlowPure.

Updated: 2026-03-30 15:01:02

标题: FlowPure：用于对抗性净化的连续归一化流

摘要: 尽管在这一领域取得了显著进展，但在采用机器学习模型的系统中，对抗性鲁棒性仍然是一个关键挑战。在推理时去除对抗性扰动，即所谓的对抗性净化，已经成为一种有前途的防御策略。为了实现这一目标，最先进的方法利用扩散模型，在前向过程中注入高斯噪声以稀释对抗性扰动，然后进行去噪步骤以恢复干净样本，然后再进行分类。在这项工作中，我们提出了FlowPure，一种基于连续归一化流（CNFs）的新型净化方法，采用条件流匹配（CFM）训练，以学习从对抗性示例到它们的干净对应物的映射。与先前基于扩散的方法不同，它们依赖于固定的噪声过程，FlowPure可以利用特定攻击知识来提高在已知威胁下的鲁棒性，同时还支持在不可利用此类知识的情况下训练的更一般的随机变量，以获取高斯扰动的设置。在CIFAR-10和CIFAR-100上的实验证明，我们的方法在预处理器盲目和白盒方案中胜过了最先进的净化防御措施，并且在前者完全保留良好准确性的情况下做到了这一点。此外，我们的结果显示，FlowPure不仅是一种非常有效的净化剂，而且还具有强大的对抗性检测潜力，可以几乎完美地识别出预处理器盲目的PGD样本。我们的代码可以在https://github.com/DistriNet/FlowPure上公开获取。

更新时间: 2026-03-30 15:01:02

领域: cs.LG,cs.AI,cs.CR

下载: http://arxiv.org/abs/2505.13280v2

AutoRegressive Generation with B-rep Holistic Token Sequence Representation

Previous representation and generation approaches for the B-rep relied on graph-based representations that disentangle geometric and topological features through decoupled computational pipelines, thereby precluding the application of sequence-based generative frameworks, such as transformer architectures that have demonstrated remarkable performance. In this paper, we propose BrepARG, the first attempt to encode B-rep's geometry and topology into a holistic token sequence representation, enabling sequence-based B-rep generation with an autoregressive architecture. Specifically, BrepARG encodes B-rep into 3 types of tokens: geometry and position tokens representing geometric features, and face index tokens representing topology. Then the holistic token sequence is constructed hierarchically, starting with constructing the geometry blocks (i.e., faces and edges) using the above tokens, followed by geometry block sequencing. Finally, we assemble the holistic sequence representation for the entire B-rep. We also construct a transformer-based autoregressive model that learns the distribution over holistic token sequences via next-token prediction, using a multi-layer decoder-only architecture with causal masking. Experiments demonstrate that BrepARG achieves state-of-the-art (SOTA) performance. BrepARG validates the feasibility of representing B-rep as holistic token sequences, opening new directions for B-rep generation.

Updated: 2026-03-30 14:57:56

标题: AutoRegressive Generation with B-rep Holistic Token Sequence Representation 使用B-rep整体标记序列表示的自回归生成

摘要: 以往的B-rep表示和生成方法依赖于基于图的表示，通过解耦的计算流程来区分几何和拓扑特征，因此无法应用基于序列的生成框架，如已经展示出卓越性能的Transformer架构。在本文中，我们提出了BrepARG，这是第一次尝试将B-rep的几何和拓扑编码为整体的令牌序列表示，从而实现基于序列的B-rep生成，使用自回归架构。具体而言，BrepARG将B-rep编码为3种类型的令牌：表示几何特征的几何和位置令牌，以及表示拓扑的面索引令牌。然后，通过分层方式构建整体令牌序列，从构建几何块（即面和边）开始，使用上述令牌，然后是几何块的排序。最后，我们将整体的序列表示组装为整个B-rep。我们还构建了一个基于Transformer的自回归模型，通过下一个令牌预测学习整体令牌序列的分布，使用具有因果蒙版的多层解码器架构。实验证明，BrepARG实现了最先进的性能。BrepARG验证了将B-rep表示为整体令牌序列的可行性，为B-rep生成开辟了新的方向。

更新时间: 2026-03-30 14:57:56

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2601.16771v2

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02)on GenEval, out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/TorchUMM

Updated: 2026-03-30 14:56:29

标题: UniGame：将统一的多模态模型转化为自身对手

摘要: 统一多模态模型（UMMs）已经在理解和生成方面展现出令人印象深刻的性能，而且只需一个架构。然而，UMMs仍然存在一个基本的不一致性：理解更偏向于紧凑的嵌入，而生成更偏向于重构丰富的表示。这种结构上的权衡产生了不一致的决策边界、跨模态一致性下降以及在分布和对抗性转移下的脆弱性增加。在本文中，我们提出了UniGame，一个自对抗的后训练框架，直接针对这些不一致性。通过在共享令牌接口应用轻量级扰动器，UniGame使生成分支能够积极寻找并挑战脆弱的理解，将模型本身变成自身对手。实验证明，UniGame显著提高了一致性（+4.6%）。此外，它还在理解（+3.6%）、生成（+0.02）方面在GenEval上取得了显著改进，对抗性和分布鲁棒性（在NaturalBench和AdVQA上分别提高了+4.8%和+6.2%）。该框架与架构无关，引入的额外参数不到1%，并且与现有的后训练方法是互补的。这些结果将自对抗对弈定位为增强未来多模态基础模型的一致性、稳定性和统一能力的通用且有效的原则。官方代码可在以下链接找到：https://github.com/AIFrontierLab/TorchUMM

更新时间: 2026-03-30 14:56:29

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2511.19413v3

A Benchmark for Incremental Micro-expression Recognition

Micro-expression recognition plays a pivotal role in understanding hidden emotions and has applications across various fields. Traditional recognition methods assume access to all training data at once, but real-world scenarios involve continuously evolving data streams. To respond to the requirement of adapting to new data while retaining previously learned knowledge, we introduce the first benchmark specifically designed for incremental micro-expression recognition. Our contributions include: Firstly, we formulate the incremental learning setting tailored for micro-expression recognition. Secondly, we organize sequential datasets with carefully curated learning orders to reflect real-world scenarios. Thirdly, we define two cross-evaluation-based testing protocols, each targeting distinct evaluation objectives. Finally, we provide six baseline methods and their corresponding evaluation results. This benchmark lays the groundwork for advancing incremental micro-expression recognition research. All source code used in this study will be publicly available at https://github.com/ZhengQinLai/IMER-benchmark.

Updated: 2026-03-30 14:56:09

标题: 一个用于增量微表情识别的基准。

摘要: 微表情识别在理解隐藏情绪中起着至关重要的作用，并在各个领域都有应用。传统的识别方法假设一次性访问所有训练数据，但现实世界的场景涉及不断演变的数据流。为了满足适应新数据并保留先前学习知识的要求，我们引入了专门针对增量微表情识别设计的第一个基准。我们的贡献包括：首先，我们为微表情识别量身定制了增量学习设置。其次，我们组织了经过精心策划的学习顺序的顺序数据集，以反映真实世界的场景。第三，我们定义了两种基于交叉评估的测试协议，每种协议都针对不同的评估目标。最后，我们提供了六种基准方法及其相应的评估结果。这个基准为推进增量微表情识别研究奠定了基础。本研究中使用的所有源代码将在https://github.com/ZhengQinLai/IMER-benchmark 上公开。

更新时间: 2026-03-30 14:56:09

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2501.19111v3

Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework

Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely solely on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG. Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.

Updated: 2026-03-30 14:55:52

标题: 使用可解释和可扩展的预测驱动框架从心电图中检测低左心室射血分数

摘要: 低左心室射血分数（LEF）经常在进展到症状性心力衰竭之前未被检测到，强调了可扩展的筛查策略的必要性。尽管人工智能启用的心电图（AI-ECG）显示出潜力，但现有方法仅依赖于端到端的黑匣子模型，可解释性有限，或者依赖于商业心电图测量算法的表格系统，性能不佳。我们引入了基于心电图的预测驱动LEF（ECGPD-LEF），这是一个结构化框架，将基础模型派生的诊断概率与可解释性建模相结合，用于从心电图中检测LEF。我们在基准EchoNext数据集上训练，该数据集包括72,475对心电图-超声心动图，并在预定义的独立内部（n=5,442）和外部（n=16,017）队列中进行评估，我们的框架实现了对中等LEF的稳健区分（内部AUROC 88.4％，F1 64.5％；外部AUROC 86.8％，F1 53.6％），始终优于基准提供的官方端到端基线在人口统计和临床亚组中的表现。可解释性分析确定了高影响预测因子，包括正常心电图，不完全的左束支传导阻滞和前侧导联的心内膜下损伤，推动LEF风险估计。值得注意的是，这些预测因子独立地实现了零样本推理，无需特定任务的再训练（内部AUROC 75.3-81.0％；外部AUROC 71.6-78.6％），表明室壁功能障碍在结构化诊断概率表示中固有地编码。这一框架将预测性能与机制透明度协调一致，通过额外的预测因子的增加和与现有AI-ECG系统的无缝集成，支持可扩展的增强。

更新时间: 2026-03-30 14:55:52

领域: cs.LG,cs.AI,stat.AP

下载: http://arxiv.org/abs/2603.28532v1

Hellinger Multimodal Variational Autoencoders

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $α=0.5$, which corresponds to the unique symmetric member of the $α\text{-divergence}$ family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.

Updated: 2026-03-30 14:53:45

标题: Hellinger多模式变分自动编码器

摘要: 多模态变分自动编码器（VAEs）被广泛用于多模态弱监督生成学习。主要方法使用专家的乘积（PoE）、专家的混合（MoE）或它们的组合来聚合单模态推断分布以近似联合后验。在这项工作中，我们通过概率观点汇总的视角重新审视多模态推断，这是一种基于优化的方法。我们从$α=0.5$的Hölder汇总开始，这对应于$α\text{-散度}$家族的唯一对称成员，并推导出一个称为Hellinger的时刻匹配近似。然后，我们利用这样的近似提出了HELVAE，一个多模态VAE，避免了子采样，产生了一个既有效又高效的模型，它: (i) 当观察到额外的模态时，学习更具表现力的潜在表示; 和 (ii) 在生成的一致性和质量之间实现更好的权衡，胜过了最先进的多模态VAE模型。

更新时间: 2026-03-30 14:53:45

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2601.06572v2

RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time

We present LAD, a real-time language--action planner with an interruptible architecture that produces a motion plan in a single forward pass (~20 Hz) or generates textual reasoning alongside a motion plan (~10 Hz). LAD is fast enough for real-time closed-loop deployment, achieving ~3x lower latency than prior driving language models while setting a new learning-based state of the art on nuPlan Test14-Hard and InterPlan. We also introduce RAD, a rule-based planner designed to address structural limitations of PDM-Closed. RAD achieves state-of-the-art performance among rule-based planners on nuPlan Test14-Hard and InterPlan. Finally, we show that combining RAD and LAD enables hybrid planning that captures the strengths of both approaches. This hybrid system demonstrates that rules and learning provide complementary capabilities: rules support reliable maneuvering, while language enables adaptive and explainable decision-making.

Updated: 2026-03-30 14:50:37

标题: RAD-LAD：基于规则和语言的实时自动驾驶

摘要: 我们提出了LAD，这是一个具有可中断架构的实时语言-行动规划器，可以在单次前向传递中生成动作计划（约20赫兹），或者在生成文本推理同时生成动作计划（约10赫兹）。LAD足够快，可以实时闭环部署，实现比以前的驾驶语言模型低大约3倍的延迟，同时在nuPlan Test14-Hard和InterPlan上设置了一个新的基于学习的最先进水平。我们还引入了RAD，这是一个基于规则的规划器，旨在解决PDM-Closed的结构限制。RAD在nuPlan Test14-Hard和InterPlan上取得了基于规则的规划器中的最先进性能。最后，我们展示了结合RAD和LAD实现的混合规划系统，这种混合系统展示了规则和学习提供互补的能力：规则支持可靠的操纵，而语言则实现了自适应和可解释的决策制定。

更新时间: 2026-03-30 14:50:37

领域: cs.RO,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2603.28522v1

KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs) exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified ``thinking'' stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.

Updated: 2026-03-30 14:46:23

标题: KG-Hopper：通过强化学习赋能紧凑的开放LLMs与知识图推理

摘要: 大型语言模型（LLMs）展示了令人印象深刻的自然语言能力，但在知识密集型推理任务中经常遇到困难。利用结构化知识图（KGs）的知识库问答（KBQA）是一个例证，由于需要准确的多跳推理，这种挑战变得尤为明显。现有方法通常执行由预定义管道引导的顺序推理步骤，限制了灵活性，并由于每一步孤立推理而导致错误级联。为了解决这些限制，我们提出了KG-Hopper，这是一个新颖的强化学习（RL）框架，赋予紧凑的开放LLMs执行单一推理轮内的综合多跳KG推理能力。与逐步推理不同，我们训练了一个Reasoning LLM，将整个KG遍历和决策过程嵌入到统一的“思考”阶段，实现了跨步依赖的全局推理和具有回溯功能的动态路径探索。在八个KG推理基准上的实验结果表明，基于7B参数LLM的KG-Hopper始终优于更大的多步系统（高达70B），并且与专有模型（如GPT-3.5-Turbo和GPT-4o-mini）达到了竞争性表现，同时保持了紧凑、开放和数据高效的特点。代码公开可用于：https://github.com/Wangshuaiia/KG-Hopper。

更新时间: 2026-03-30 14:46:23

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2603.21440v3

On the Normalization of Confusion Matrices: Methods and Geometric Interpretations

The confusion matrix is a standard tool for evaluating classifiers by providing insights into class-level errors. In heterogeneous settings, its values are shaped by two main factors: class similarity -- how easily the model confuses two classes -- and distribution bias, arising from skewed distributions in the training and test sets. However, confusion matrix values reflect a mix of both factors, making it difficult to disentangle their individual contributions. To address this, we introduce bistochastic normalization using Iterative Proportional Fitting, a generalization of row and column normalization. Unlike standard normalizations, this method recovers the underlying structure of class similarity. By disentangling error sources, it enables more accurate diagnosis of model behavior and supports more targeted improvements. We also show a correspondence between confusion matrix normalizations and the model's internal class representations. Both standard and bistochastic normalizations can be interpreted geometrically in this space, offering a deeper understanding of what normalization reveals about a classifier.

Updated: 2026-03-30 14:43:06

标题: 关于混淆矩阵的规范化：方法和几何解释

摘要: 混淆矩阵是评估分类器的标准工具，通过提供关于类别级别错误的见解。在异质设置中，其值受两个主要因素影响：类相似性——模型容易混淆两个类别的程度——以及分布偏差，源自训练集和测试集中的倾斜分布。然而，混淆矩阵的值反映了这两个因素的混合，使得很难分开它们各自的贡献。为了解决这个问题，我们引入了使用迭代比例拟合的双随机归一化，这是对行和列归一化的一般化。与标准归一化不同，这种方法恢复了类相似性的潜在结构。通过分解错误来源，它能够更准确地诊断模型行为并支持更有针对性的改进。我们还展示了混淆矩阵归一化与模型的内部类别表示之间的对应关系。在这个空间中，标准和双随机归一化都可以几何地解释，深入理解归一化对分类器的揭示。

更新时间: 2026-03-30 14:43:06

领域: cs.LG

下载: http://arxiv.org/abs/2509.04959v2

The Unreasonable Effectiveness of Scaling Laws in AI

Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form. Their effectiveness has a basic and very practical sense: they make progress predictable, albeit at a declining rate. Yet their effectiveness is also unreasonable in two further senses. First, these laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes. Second, despite the diminishing returns they predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token. This paper argues that both features arise from the same source: scaling laws are unusually effective because they abstract away from many realization details. The compute variable is best understood as logical compute, an implementation-agnostic notion of model-side work, while the practical burden of scaling depends on how efficiently real resources are converted into that compute. This abstraction helps explain both why the laws travel so well across settings and why they give rise to a persistent efficiency game in hardware, algorithms, and systems. Once efficiency is made explicit, the main practical question becomes how many efficiency doublings are required to keep scaling productive despite diminishing returns. Under that view, diminishing returns are not only a geometric flattening of the loss curve, but also rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings.

Updated: 2026-03-30 14:42:53

标题: 人工智能中尺度定律的不合理有效性

摘要: 经典人工智能规模律，特别是针对预训练，描述了训练损失如何以幂律形式随计算量减少。它们的有效性具有基本且非常实际的意义：它们使进展变得可预测，尽管速度在递减。然而，它们的有效性在另外两个方面也是不合理的。首先，这些规律在很大程度上是基于经验和观察的，但它们在模型家族中反复出现，并且越来越多地出现在训练相关的领域。其次，尽管它们预测出递减的回报，但在实践中，进展往往通过效率的迅速提高而持续进行，例如在每个标记的成本下降中可见。本文认为，这两个特征来自同一来源：规模律之所以异常有效，是因为它们抽象出许多实现细节。计算变量最好理解为逻辑计算，这是一种与实现无关的模型端工作概念，而规模化的实际负担取决于将实际资源转化为该计算的效率如何。这种抽象有助于解释这些规律在不同环境中的传播以及为什么它们引发了硬件、算法和系统中持续效率竞赛。一旦效率变得明确，主要的实际问题就变成了需要多少效率加倍才能保持规模化在递减回报的情况下保持生产力。在这种观点下，递减回报不仅是损失曲线的几何平坦化，还增加了成本降低、系统层面创新和维持类似摩尔效率加倍所需的突破的压力。

更新时间: 2026-03-30 14:42:53

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28507v1

An Attention Mechanism for Robust Multimodal Integration in a Global Workspace Architecture

Robust multimodal systems must remain effective when some modalities are noisy, degraded, or unreliable. Existing multimodal fusion methods often learn modality selection jointly with representation learning, making it difficult to determine whether robustness comes from the selector itself or from full end-to-end co-adaptation. Motivated by Global Workspace Theory (GWT), we study this question using a lightweight top-down modality selector operating on top of a frozen multimodal global workspace. We evaluate our method on two multimodal datasets of increasing complexity: Simple Shapes and MM-IMDb 1.0, under structured modality corruptions. The selector improves robustness while using far fewer trainable parameters than end-to-end attention baselines, and the learned selection strategy transfers better across downstream tasks, corruption regimes, and even to a previously unseen modality. Beyond explicit corruption settings, on the MM-IMDb 1.0 benchmark, we show that the same mechanism improves the global workspace over its no-attention counterpart and yields decent benchmark performance.

Updated: 2026-03-30 14:40:05

标题: 一个用于全球工作空间架构中稳健多模态集成的注意机制

摘要: 鲁棒的多模态系统在某些模态嘈杂、退化或不可靠时必须保持有效。现有的多模态融合方法通常将模态选择与表示学习联合学习，这使得难以确定鲁棒性是来自选择器本身还是来自完全端到端的协同适应。受全局工作空间理论（GWT）的启发，我们使用一个轻量级的自上而下的模态选择器，在一个冻结的多模态全局工作空间之上运行，来研究这个问题。我们在两个逐渐复杂的多模态数据集Simple Shapes和MM-IMDb 1.0上评估了我们的方法，这些数据集受到结构化模态损坏的影响。选择器在使用比端到端注意基线少得多的可训练参数的同时提高了鲁棒性，并且学习的选择策略在下游任务、损坏制度甚至先前未见的模态上更好地传输。除了显式的损坏设置，在MM-IMDb 1.0基准测试中，我们展示了相同的机制提高了全局工作空间，相较于其无注意力的对应物，产生了不错的基准性能。

更新时间: 2026-03-30 14:40:05

领域: cs.AI

下载: http://arxiv.org/abs/2602.08597v2

Learning the Model While Learning Q: Finite-Time Sample Complexity of Online SyncMBQ

Reinforcement learning has witnessed significant advancements, particularly with the emergence of model-based approaches. Among these, $Q$-learning has proven to be a powerful algorithm in model-free settings. However, the extension of $Q$-learning to a model-based framework remains relatively unexplored. In this paper, we investigate the sample complexity of $Q$-learning when integrated with a model-based approach. The proposed algorihtms learns both the model and Q-value in an online manner. We demonstrate a near-optimal sample complexity result within a broad range of step sizes.

Updated: 2026-03-30 14:38:20

标题: 学习模型同时学习Q：在线SyncMBQ的有限时间样本复杂度

摘要: 强化学习在模型基础方法的出现尤其取得了显著进展。其中，$Q$-learning在无模型设置中被证明是一个强大的算法。然而，将$Q$-learning扩展到模型基础框架仍然相对未被探索。在本文中，我们研究了当$Q$-learning与模型基础方法集成时的样本复杂度。所提出的算法以在线方式学习模型和Q值。我们展示了在广泛范围的步长内接近最优样本复杂度结果。

更新时间: 2026-03-30 14:38:20

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2402.11877v2

Decomposable Neuro Symbolic Regression

Symbolic regression (SR) models complex systems by discovering mathematical expressions that capture underlying relationships in observed data. However, most SR methods prioritize minimizing prediction error over identifying the governing equations, often producing overly complex or inaccurate expressions. To address this, we present a decomposable SR method that generates interpretable multivariate expressions leveraging transformer models, genetic algorithms (GAs), and genetic programming (GP). In particular, our explainable SR method distills a trained ``opaque'' regression model into mathematical expressions that serve as explanations of its computed function. Our method employs a Multi-Set Transformer to generate multiple univariate symbolic skeletons that characterize how each variable influences the opaque model's response. We then evaluate the generated skeletons' performance using a GA-based approach to select a subset of high-quality candidates before incrementally merging them via a GP-based cascade procedure that preserves their original skeleton structure. The final multivariate skeletons undergo coefficient optimization via a GA. We evaluated our method on problems with controlled and varying degrees of noise, demonstrating lower or comparable interpolation and extrapolation errors compared to two GP-based methods, three neural SR methods, and a hybrid approach. Unlike them, our approach consistently learned expressions that matched the original mathematical structure. Similarly, our method achieved both a high symbolic solution recovery rate and competitive predictive performance relative to benchmark methods on the Feynman dataset.

Updated: 2026-03-30 14:36:09

标题: 可分解的神经符号回归

摘要: 符号回归（SR）通过发现能够捕捉观测数据中潜在关系的数学表达式来建模复杂系统。然而，大多数SR方法更注重最小化预测误差而非识别主导方程，通常会产生过于复杂或不准确的表达式。为了解决这个问题，我们提出了一种可分解的SR方法，利用变压器模型、遗传算法（GAs）和遗传规划（GP）生成可解释的多变量表达式。特别是，我们的可解释性SR方法将经过训练的“不透明”回归模型提炼为数学表达式，作为其计算函数的解释。我们的方法采用了一个多集变压器，生成多个表征每个变量如何影响不透明模型响应的单变量符号骨架。然后，我们使用基于GA的方法评估生成的骨架的性能，选择一组高质量的候选人，然后通过基于GP的级联过程逐步合并它们，保留其原始骨架结构。最终的多变量骨架通过GA进行系数优化。我们在受控和变化程度不同的噪声问题上评估了我们的方法，结果显示与两种基于GP的方法、三种神经SR方法和一种混合方法相比，插值和外推误差较低或相当。与它们不同的是，我们的方法始终学习到与原始数学结构相匹配的表达式。同样，我们的方法在Feynman数据集上实现了高符号解恢复率和与基准方法相比具有竞争力的预测性能。

更新时间: 2026-03-30 14:36:09

领域: cs.LG

下载: http://arxiv.org/abs/2511.04124v2

GMA-SAWGAN-GP: A Novel Data Generative Framework to Enhance IDS Detection Performance

Intrusion Detection System (IDS) is often calibrated to known attacks and generalizes poorly to unknown threats. This paper proposes GMA-SAWGAN-GP, a novel generative augmentation framework built on a Self-Attention-enhanced Wasserstein GAN with Gradient Penalty (WGAN-GP). The generator employs Gumbel-Softmax regularization to model discrete fields, while a Multilayer Perceptron (MLP)-based AutoEncoder acts as a manifold regularizer. A lightweight gating network adaptively balances adversarial and reconstruction losses via entropy regularization, improving stability and mitigating mode collapse. The self-attention mechanism enables the generator to capture both short- and long-range dependencies among features within each record while preserving categorical semantics through Gumbel-Softmax heads. Extensive experiments on NSL-KDD, UNSW-NB15, and CICIDS2017 using five representative IDS models demonstrate that GMA-SAWGAN-GP significantly improves detection performance on known attacks and enhances generalization to unknown attacks. Leave-One-Attack-type-Out (LOAO) evaluations using Area Under the Receiver Operating Characteristic (AUROC) and True Positive Rate at a 5 percent False Positive Rate confirm that IDS models trained on augmented datasets achieve higher robustness under unseen attack scenarios. Ablation studies validate the contribution of each component to performance gains. Compared with baseline models, the proposed framework improves binary classification accuracy by an average of 5.3 percent and multi-classification accuracy by 2.2 percent, while AUROC and True Positive Rate at a 5 percent False Positive Rate for unknown attacks increase by 3.9 percent and 4.8 percent, respectively, across the three datasets. Overall, GMA-SAWGAN-GP provides an effective approach to generative augmentation for mixed-type network traffic, improving IDS accuracy and resilience.

Updated: 2026-03-30 14:35:23

标题: GMA-SAWGAN-GP：一种新颖的数据生成框架，以提高入侵检测系统的性能

摘要: 入侵检测系统（IDS）通常被校准为已知攻击，并对未知威胁的泛化能力较差。本文提出了GMA-SAWGAN-GP，这是一个基于Wasserstein GAN和Gradient Penalty（WGAN-GP）的自注意增强生成增强框架。生成器采用Gumbel-Softmax正则化来建模离散字段，而基于多层感知器（MLP）的自动编码器作为流形正则化器。一个轻量级的门控网络通过熵正则化自适应平衡对抗和重构损失，提高稳定性并减轻模态崩溃。自注意机制使生成器能够在每个记录内捕捉特征之间的短程和长程依赖关系，同时通过Gumbel-Softmax头保持分类语义。在NSL-KDD、UNSW-NB15和CICIDS2017上进行的广泛实验使用五种代表性IDS模型表明，GMA-SAWGAN-GP显着提高了已知攻击的检测性能，并增强了对未知攻击的泛化能力。使用接收器操作特征（AUROC）下的留一攻击类型评估和5％误报率下的真阳性率确认，训练在增强数据集上的IDS模型在未见攻击情景下具有更高的稳健性。消融研究验证了每个组件对性能提升的贡献。与基线模型相比，所提出的框架将二元分类准确性平均提高了5.3％，多类分类准确性提高了2.2％，而对于未知攻击，AUROC和5％误报率下的真阳性率分别在三个数据集上增加了3.9％和4.8％。总体而言，GMA-SAWGAN-GP为混合网络流量的生成增强提供了一种有效方法，提高了IDS的准确性和弹性。

更新时间: 2026-03-30 14:35:23

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2603.28838v1

Next-Token Prediction and Regret Minimization

We consider the question of how to employ next-token prediction algorithms in adversarial online decision-making environments. Specifically, if we train a next-token prediction model on a distribution $\mathcal{D}$ over sequences of opponent actions, when is it the case that the induced online decision-making algorithm (by approximately best responding to the model's predictions) has low adversarial regret (i.e., when is $\mathcal{D}$ a \emph{low-regret distribution})? For unbounded context windows (where the prediction made by the model can depend on all the actions taken by the adversary thus far), we show that although not every distribution $\mathcal{D}$ is a low-regret distribution, every distribution $\mathcal{D}$ is exponentially close (in TV distance) to one low-regret distribution, and hence sublinear regret can always be achieved at negligible cost to the accuracy of the original next-token prediction model. In contrast to this, for bounded context windows (where the prediction made by the model can depend only on the past $w$ actions taken by the adversary, as may be the case in modern transformer architectures), we show that there are some distributions $\mathcal{D}$ of opponent play that are $Θ(1)$-far from any low-regret distribution $\mathcal{D'}$ (even when $w = Ω(T)$ and such distributions exist). Finally, we complement these results by showing that the unbounded context robustification procedure can be implemented by layers of a standard transformer architecture, and provide empirical evidence that transformer models can be efficiently trained to represent these new low-regret distributions.

Updated: 2026-03-30 14:34:41

标题: 下一个令牌预测和遗憾最小化

摘要: 我们考虑如何在对抗性在线决策环境中使用下一个标记预测算法的问题。具体来说，如果我们在对手行动序列的分布$\mathcal{D}$上训练下一个标记预测模型，那么什么时候会导致诱导的在线决策算法（通过近似最佳响应模型的预测）具有低对抗性遗憾（即$\mathcal{D}$是一个低遗憾分布）？对于无界上下文窗口（模型的预测可以取决于对手迄今为止采取的所有行动），我们表明尽管并非每个分布$\mathcal{D}$都是低遗憾分布，但每个分布$\mathcal{D}$都与一个低遗憾分布指数接近（在总变差距离上），因此可以始终以对原始下一个标记预测模型的准确性的微不足道的代价实现亚线性遗憾。相比之下，对于有界上下文窗口（模型的预测只能取决于对手过去$w$个行动，这可能是现代变压器结构的情况），我们表明存在一些对手游戏分布$\mathcal{D}$与任何低遗憾分布$\mathcal{D'}$相距$Θ(1)$（即使$w = Ω(T)$并且存在这样的分布）。最后，我们通过显示无界上下文鲁棒化过程可以通过标准变压器结构的层来实现，并提供经验证据表明变压器模型可以有效地训练以表示这些新的低遗憾分布。

更新时间: 2026-03-30 14:34:41

领域: cs.LG,cs.AI,cs.GT

下载: http://arxiv.org/abs/2603.28499v1

MRI-to-CT synthesis using drifting models

Accurate MRI-to-CT synthesis could enable MR-only pelvic workflows by providing CT-like images with bone details while avoiding additional ionizing radiation. In this work, we investigate recently proposed drifting models for synthesizing pelvis CT images from MRI and benchmark them against convolutional neural networks (UNet, VAE), a generative adversarial network (WGAN-GP), a physics-inspired probabilistic model (PPFM), and diffusion-based methods (FastDDPM, DDIM, DDPM). Experiments are performed on two complementary datasets: Gold Atlas Male Pelvis and the SynthRAD2023 pelvis subset. Image fidelity and structural consistency are evaluated with SSIM, PSNR, and RMSE, complemented by qualitative assessment of anatomically critical regions such as cortical bone and pelvic soft-tissue interfaces. Across both datasets, the proposed drifting model achieves high SSIM and PSNR and low RMSE, surpassing strong diffusion baselines and conventional CNN-, VAE-, GAN-, and PPFM-based methods. Visual inspection shows sharper cortical bone edges, improved depiction of sacral and femoral head geometry, and reduced artifacts or over-smoothing, particularly at bone-air-soft tissue boundaries. Moreover, the drifting model attains these gains with one-step inference and inference times on the order of milliseconds, yielding a more favorable accuracy-efficiency trade-off than iterative diffusion sampling while remaining competitive in image quality. These findings suggest that drifting models are a promising direction for fast, high-quality pelvic synthetic CT generation from MRI and warrant further investigation for downstream applications such as MRI-only radiotherapy planning and PET/MR attenuation correction.

Updated: 2026-03-30 14:34:32

标题: MRI 到 CT 的合成使用漂移模型

摘要: MRI与CT的准确合成可以通过提供具有骨骼细节的类似CT的图像而避免额外的电离辐射，从而实现MR-only盆腔工作流程。在这项工作中，我们研究了最近提出的漂移模型，用于从MRI合成盆腔CT图像，并将它们与卷积神经网络（UNet，VAE），生成对抗网络（WGAN-GP），基于物理的概率模型（PPFM）和扩散方法（FastDDPM，DDIM，DDPM）进行了基准测试。实验在两个互补数据集上进行：Gold Atlas Male Pelvis和SynthRAD2023盆腔子集。图像保真度和结构一致性通过SSIM，PSNR和RMSE进行评估，同时通过对解剖关键区域（如皮质骨和盆腔软组织界面）的定性评估进行补充。在两个数据集上，提出的漂移模型实现了高SSIM和PSNR，低RMSE，超过了强扩散基线和传统的CNN，VAE，GAN和PPFM方法。视觉检查显示更清晰的皮质骨边缘，改善了骶骨和股骨头几何形状的描绘，并减少了伪影或过度平滑，特别是在骨气软组织边界处。此外，漂移模型通过一步推断实现这些收益，并且推断时间在毫秒级别，产生了比迭代扩散采样更有利的准确性-效率权衡，同时在图像质量上保持竞争力。这些发现表明，漂移模型是快速，高质量的盆腔合成CT从MRI生成的有希望的方向，并值得进一步研究，用于MRI-only放射治疗计划和PET/MR衰减校正等下游应用。

更新时间: 2026-03-30 14:34:32

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2603.28498v1

Deconfounded Lifelong Learning for Autonomous Driving via Dynamic Knowledge Spaces

End-to-End autonomous driving (E2E-AD) systems face challenges in lifelong learning, including catastrophic forgetting, difficulty in knowledge transfer across diverse scenarios, and spurious correlations between unobservable confounders and true driving intents. To address these issues, we propose DeLL, a Deconfounded Lifelong Learning framework that integrates a Dirichlet process mixture model (DPMM) with the front-door adjustment mechanism from causal inference. The DPMM is employed to construct two dynamic knowledge spaces: a trajectory knowledge space for clustering explicit driving behaviors and an implicit feature knowledge space for discovering latent driving abilities. Leveraging the non-parametric Bayesian nature of DPMM, our framework enables adaptive expansion and incremental updating of knowledge without predefining the number of clusters, thereby mitigating catastrophic forgetting. Meanwhile, the front-door adjustment mechanism utilizes the DPMM-derived knowledge as valid mediators to deconfound spurious correlations, such as those induced by sensor noise or environmental changes, and enhances the causal expressiveness of the learned representations. Additionally, we introduce an evolutionary trajectory decoder that enables non-autoregressive planning. To evaluate the lifelong learning performance of E2E-AD, we propose new evaluation protocols and metrics based on Bench2Drive. Extensive evaluations in the closed-loop CARLA simulator demonstrate that our framework significantly improves adaptability to new driving scenarios and overall driving performance, while effectively retaining previous acquired knowledge.

Updated: 2026-03-30 14:27:55

标题: 通过动态知识空间解决自主驾驶的去混淆终身学习

摘要: End-to-End自主驾驶（E2E-AD）系统在终身学习方面面临挑战，包括灾难性遗忘，跨多种场景知识传递困难，以及不可观察的混淆因素与真实驾驶意图之间的虚假相关性。为了解决这些问题，我们提出了DeLL，一个整合了Dirichlet过程混合模型（DPMM）和因果推断中前门调整机制的去混淆终身学习框架。DPMM用于构建两个动态知识空间：用于聚类显式驾驶行为的轨迹知识空间和用于发现潜在驾驶能力的隐式特征知识空间。利用DPMM的非参数贝叶斯特性，我们的框架实现了知识的自适应扩展和增量更新，无需预定义聚类数量，从而减轻了灾难性遗忘。同时，前门调整机制利用DPMM衍生的知识作为有效中介因素来去除虚假相关性，例如由传感器噪声或环境变化引起的相关性，并增强了学习表示的因果表达能力。此外，我们引入了一种进化轨迹解码器，实现了非自回归计划。为了评估E2E-AD的终身学习性能，我们提出了基于Bench2Drive的新评估协议和指标。在封闭环CARLA模拟器中进行的大量评估表明，我们的框架显著提高了对新驾驶场景的适应能力和整体驾驶性能，同时有效保留了先前获得的知识。

更新时间: 2026-03-30 14:27:55

领域: cs.LG,cs.AI,cs.RO

下载: http://arxiv.org/abs/2603.14354v2

Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification

Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.

Updated: 2026-03-30 14:23:15

标题: 法庭风格的多智能体辩论：基于渐进式RAG和角色转换的有争议主张验证

摘要: 大型语言模型（LLMs）在高风险声明验证方面仍然不可靠，因为存在幻觉和浅层推理。虽然检索增强生成（RAG）和多代理辩论（MAD）解决了这一问题，但它们受到一次性检索和无结构辩论动态的限制。我们提出了一个法庭式多代理框架PROClaim，将验证重新构想为一种结构化的对抗性讨论。我们的方法将专门的角色（例如原告、辩护、法官）与渐进式RAG（P-RAG）相结合，在辩论过程中动态扩展和完善证据池。此外，我们采用证据谈判、自我反思和异质多法官聚合来强化校准、稳健性和多样性。在Check-COVID基准测试的零射击评估中，PROClaim实现了81.7%的准确率，比标准多代理辩论高出10.0个百分点，其中P-RAG推动了主要的性能增益（+7.5个百分点）。我们最终证明了结构化的辩论和模型异质性有效地缓解了系统性偏见，为可靠的声明验证提供了坚实基础。我们的代码和数据可以在https://github.com/mnc13/PROClaim 上公开获取。

更新时间: 2026-03-30 14:23:15

领域: cs.CL,cs.AI,cs.MA

下载: http://arxiv.org/abs/2603.28488v1

Benchmarking NLP-supported Language Sample Analysis for Swiss Children's Speech

Language sample analysis (LSA) is a process that complements standardized psychometric tests for diagnosing, for example, developmental language disorder (DLD) in children. However, its labour-intensive nature has limited its use in speech-language pathology practice. We introduce an approach that leverages natural language processing (NLP) methods that do not rely on commercial large language models (LLMs) applied to transcribed speech data from 119 children in the German-speaking part of Switzerland with typical and atypical language development. This preliminary study aims to identify optimal practices that support speech-language pathologists in diagnosing DLD more efficiently with active involvement of human specialists. Preliminary findings underscore the potential of integrating locally deployed NLP methods into the process of semi-automatic LSA.

Updated: 2026-03-30 14:18:33

标题: Benchmarking NLP-supported Language Sample Analysis for Swiss Children's Speech （为瑞士儿童言语进行NLP支持的语言样本分析进行基准测试）

摘要: 语言样本分析（LSA）是一种补充标准心理测量测试的过程，用于诊断儿童的发展性语言障碍（DLD）等问题。然而，由于其劳动强度大，LSA在言语病理学实践中的应用受到限制。我们介绍了一种利用自然语言处理（NLP）方法的方法，该方法不依赖于商业大型语言模型（LLM），应用于瑞士德语区域的119名典型和非典型语言发展儿童的转录语音数据。该初步研究旨在确定支持言语病理学家更有效地诊断DLD的最佳实践，同时积极涉及人类专家。初步发现强调了将本地部署的NLP方法整合到半自动LSA过程中的潜力。

更新时间: 2026-03-30 14:18:33

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.00780v2

AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Tool-augmented LLM agents increasingly operate as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking metrics that measure what is recommended but not whether it is safe for the user. We present a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across eight LLMs (7B to frontier), decomposing divergence into information-channel and memory-channel mechanisms. We observe evaluation blindness: recommendation quality is preserved under contamination (UPR~1.0) while risk-inappropriate products appear in 65-93% of turns, invisible to standard NDCG. Violations are information-channel-driven, emerge at turn 1, and persist without self-correction over 23-step trajectories. Even non-extreme perturbations (within-band corruption, narrative-only attacks) evade threshold monitors while producing significant drift. Susceptibility scales with instruction-following fidelity across all eight models. Sparse autoencoder probing reveals models internally distinguish adversarial perturbations but fail to propagate this signal to output; causal interventions (activation patching, feature clamping, direct steering) confirm this representation-to-action gap is structural and resists linear repair. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74. These results motivate trajectory-level safety monitoring for deployed multi-turn agents.

Updated: 2026-03-30 14:18:25

标题: AgentDrift：在LLM代理中由排名指标隐藏的工具损坏下的不安全推荐漂移

摘要: 工具增强的LLM代理在高风险领域越来越多地充当多轮顾问，然而它们的评估依赖于衡量推荐内容而不是用户安全性的排名指标。我们提出了一种配对轨迹协议，通过在八个LLM（从7B到前沿）中重新播放真实的财务对话，在干净和受污染的工具输出条件下分解出信息通道和记忆通道机制的差异。我们观察到评估盲点：在受污染的情况下，推荐质量得以保留（UPR ~ 1.0），而风险不合适的产品出现在65-93%的轮次中，对标准NDCG不可见。违规行为是信息通道驱动的，在第一个轮次就出现，并且在23步轨迹中没有自我修正。即使是非极端的扰动（在带内腐败，仅叙述攻击内）也会规避阈值监视器，同时产生显著漂移。在所有八个模型中，易受指令跟随忠诚度的影响力大小不一。稀疏自编码器探测显示模型在内部区分敌对扰动，但未能将此信号传播到输出；因果干预（激活补丁，特征夹持，直接转向）证实了这种表示到行动的差距是结构性的，并且抵抗线性修复。一种带有安全惩罚的NDCG变体（sNDCG）将保存比率降低到0.51-0.74。这些结果促使我们为已部署的多轮代理进行轨迹级别的安全监测。

更新时间: 2026-03-30 14:18:25

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2603.12564v4

With a Little Help From My Friends: Collective Manipulation in Risk-Controlling Recommender Systems

Recommendation systems have become central gatekeepers of online information, shaping user behaviour across a wide range of activities. In response, users increasingly organize and coordinate to steer algorithmic outcomes toward diverse goals, such as promoting relevant content or limiting harmful material, relying on platform affordances -- such as likes, reviews, or ratings. While these mechanisms can serve beneficial purposes, they can also be leveraged for adversarial manipulation, particularly in systems where such feedback directly informs safety guarantees. In this paper, we study this vulnerability in recently proposed risk-controlling recommender systems, which use binary user feedback (e.g., "Not Interested") to provably limit exposure to unwanted content via conformal risk control. We empirically demonstrate that their reliance on aggregate feedback signals makes them inherently susceptible to coordinated adversarial user behaviour. Using data from a large-scale online video-sharing platform, we show that a small coordinated group (comprising only 1% of the user population) can induce up to a 20% degradation in nDCG for non-adversarial users by exploiting the affordances provided by risk-controlling recommender systems. We evaluate simple, realistic attack strategies that require little to no knowledge of the underlying recommendation algorithm and find that, while coordinated users can significantly harm overall recommendation quality, they cannot selectively suppress specific content groups through reporting alone. Finally, we propose a mitigation strategy that shifts guarantees from the group level to the user level, showing empirically how it can reduce the impact of adversarial coordinated behaviour while ensuring personalized safety for individuals.

Updated: 2026-03-30 14:14:48

标题: 和朋友的帮助：在风险控制推荐系统中的集体操纵

摘要: 推荐系统已经成为在线信息的中央守门人，塑造着用户在各种活动中的行为。作为回应，用户越来越多地组织和协调，以引导算法结果朝向多样化的目标，例如促进相关内容或限制有害材料，依赖于平台提供的功能，如点赞、评论或评分。虽然这些机制可以为有益目的提供服务，但它们也可以被用于对抗性操纵，特别是在这些反馈直接影响安全保证的系统中。在本文中，我们研究了最近提出的风险控制推荐系统中的这种脆弱性，这些系统利用二进制用户反馈（例如“不感兴趣”）通过符合风险控制来明确限制接触不想要的内容。我们凭经验证明，它们对聚合反馈信号的依赖使它们固有地容易受到协调的对抗用户行为的影响。利用大规模在线视频分享平台的数据，我们展示了一个小型协调团体（仅占用户总数的1％）可以通过利用风险控制推荐系统提供的功能影响非对抗用户的nDCG最多降低20％。我们评估了简单、现实的攻击策略，几乎不需要了解底层推荐算法，发现虽然协调用户可以显著损害整体推荐质量，但他们不能仅通过举报来选择性地抑制特定内容组。最后，我们提出了一种缓解策略，将保证从团体级别转移到用户级别，通过实证显示它如何减少对抗性协调行为的影响，同时确保个体的个性化安全。

更新时间: 2026-03-30 14:14:48

领域: cs.IR,cs.LG,cs.SI

下载: http://arxiv.org/abs/2603.28476v1

CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent -- a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question--answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2\% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.

Updated: 2026-03-30 14:13:47

标题: CiQi-Agent: 将视觉、工具和美学在多模态代理程序中对中国瓷器文化推理进行对齐

摘要: 古代中国瓷器的鉴赏需要广泛的历史专业知识、材料理解和审美敏感度，这使得非专家难以参与其中。为了民主化文化遗产的理解并协助专家鉴赏，我们引入了CiQi-Agent——一个专门用于智能分析古代中国瓷器的领域特定瓷器鉴赏代理。CiQi-Agent支持多图像的瓷器输入，并能够调用视觉工具和多模式检索增强生成，对瓷器进行精细的鉴赏分析，涵盖朝代、年代、窑址、釉色、装饰图案和器型等六个属性。除了属性分类外，它还捕捉微妙的视觉细节，检索相关领域知识，并整合视觉和文本证据，产生连贯、可解释的鉴赏描述。为了实现这一能力，我们构建了一个大规模的、专家注释的数据集CiQi-VQA，包括29,596件瓷器标本、51,553张图片和557,940个视觉问题-回答对，并进一步建立了一个与前述六个属性对齐的全面基准CiQi-Bench。CiQi-Agent通过监督微调、强化学习和一个集成两类工具的工具增强推理框架进行训练：一个是视觉工具，另一个是多模式检索工具。实验结果表明，CiQi-Agent（7B）在CiQi-Bench上的所有六个属性中优于所有竞争性的开源和闭源模型，平均准确率比GPT-5高出12.2\%。该模型和数据集已发布，并可在https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA 上公开获取。

更新时间: 2026-03-30 14:13:47

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.28474v1

Quantum-Resistant Authentication Scheme for RFID Systems Using Lattice-Based Cryptography

We propose a novel quantum-resistant mutual authentication scheme for radio-frequency identification (RFID) systems. Our scheme uses lattice-based cryptography and, in particular, achieves quantum-resistance by leveraging the hardness of the inhomogeneous short integer solution (ISIS) problem. In contrast to prior work, which assumes that the reader-server communication channel is secure, our scheme is secure even when both the reader-server and tag-reader communication channels are insecure. Our proposed protocol provides robust security against man-in-the-middle (MITM), replay, impersonation, and reflection attacks, while also ensuring unforgeability and preserving anonymity. We present a detailed security analysis, including semi-formal analysis and formal verification using the Automated Validation of Internet Security Protocols and Applications (AVISPA) tool. In addition, we analyze the storage, computation, and communication costs of the proposed protocol and compare its security properties with those of existing protocols, demonstrating that our scheme offers strong security guarantees. To the best of our knowledge, this paper is the first quantum-resistant authentication protocol for RFID systems that comprehensively addresses the insecurity of both the reader-server and tag-reader communication channels.

Updated: 2026-03-30 14:03:37

标题: 基于格密码的RFID系统的量子抗性认证方案

摘要: 我们提出了一种新颖的基于格密码学的量子抗攻击互认证方案，用于无线射频识别（RFID）系统。我们的方案利用不均匀短整数解（ISIS）问题的困难性实现了量子抗攻击。与先前的工作相比，该方案假设读者-服务器通信渠道是安全的，我们的方案即使在读者-服务器和标签-读者通信渠道都不安全的情况下也是安全的。我们的提议协议提供了强大的安全性，可以抵御中间人攻击、重放攻击、冒充攻击和反射攻击，同时确保不可伪造性和保护匿名性。我们提供了详细的安全分析，包括半形式化分析和使用Internet安全协议和应用程序自动验证（AVISPA）工具进行形式验证。此外，我们分析了提议协议的存储、计算和通信成本，并将其安全性属性与现有协议进行比较，证明了我们的方案提供了强大的安全保证。据我们所知，这篇论文是第一个全面解决了读者-服务器和标签-读者通信渠道不安全性的RFID系统的量子抗攻击认证协议。

更新时间: 2026-03-30 14:03:37

领域: cs.CR

下载: http://arxiv.org/abs/2511.20630v2

Code Review Agent Benchmark

Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing, and generate huge volumes of code automatically -- the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases -- the issue of code review and broadly quality assurance becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset called c-CRAB (pronounced see-crab) can evaluate agents for code review tasks. Specifically given a pull-request (which could be coming from code generation agents or humans), if a code review agent produces a review, our evaluation framework can asses the reviewing capability of the code review agents. Our evaluation framework is used to evaluate the state of the art today -- the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews -- given a human review of a pull request instance we generate corresponding tests to evaluate the code review agent generated reviews. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap by future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews -- indicating the potential for human-agent collaboration for code review that could be deployed in future software teams. Last but not the least, the agent generated tests from our data-set act as a held out test-suite and hence quality gate for agent generated reviews. What this will mean for future collaboration of code generation agents, test generation agents and code review agents -- remains to be investigated.

Updated: 2026-03-30 14:02:50

标题: 代码审查代理基准测试

摘要: 软件工程代理在编写代码方面显示出了显著的潜力。随着AI代理渗透到代码编写中，并自动生成大量代码 -- 代码质量的问题变得尤为重要。随着自动生成的代码被集成到庞大的代码库中 -- 代码审查和广泛的质量保证问题变得重要。在这篇论文中，我们对这个问题进行了新的探讨，并为AI代理策划了一个代码审查数据集。我们的数据集名为c-CRAB（发音为see-crab），可以评估代理执行代码审查任务的能力。具体来说，给定一个拉取请求（可能来自代码生成代理或人类），如果一个代码审查代理生成审查意见，我们的评估框架可以评估代码审查代理的审查能力。我们的评估框架用于评估当今的最新技术 -- 开源PR代理，以及Devin、Claude Code和Codex的商业代码审查代理。我们的c-CRAB数据集是从人类审查系统地构建的 -- 给定一个拉取请求实例的人类审查，我们生成相应的测试来评估代码审查代理生成的审查意见。这样的基准构建为我们提供了一些见解。首先，现有的审查代理一起只能解决约40%的c-CRAB任务，表明未来的研究有望填补这一差距。其次，我们观察到代理审查通常考虑与人类审查不同的方面 -- 表明未来软件团队中可以部署人类代理协作进行代码审查的潜力。最后，我们数据集中代理生成的测试充当一个留存的测试套件，因此也是代理生成审查意见的质量关卡。这对于未来代码生成代理、测试生成代理和代码审查代理的合作意味着什么 -- 仍有待进一步研究。

更新时间: 2026-03-30 14:02:50

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2603.23448v2

$R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation

Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student's performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as $R_{dm}$. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize $R_{dm}$ estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21. Overall, $R_{dm}$ provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.

Updated: 2026-03-30 14:01:31

标题: $R_{dm}$：将分布匹配重新构想为扩散蒸馏的奖励

摘要: 扩散模型实现了最先进的生成性能，但基本上受到其缓慢的迭代采样过程的限制。虽然扩散提馏技术实现了高保真度的少步生成，但传统目标通常通过将学生的性能仅锚定在老师身上来限制其性能。最近的方法尝试通过集成强化学习（RL）来打破这一限制，通常通过简单地对提馏和RL目标进行求和来实现。在这项工作中，我们提出了一种新颖的范式，通过将分布匹配重新构想为一种奖励，标记为$R_{dm}$。这种统一的视角弥合了扩散匹配提馏（DMD）和RL之间的算法差距，提供了几个关键优势。 (1) 提升了优化稳定性：我们引入了组归一化分布匹配（GNDM），它将标准RL组归一化调整为稳定$R_{dm}$估计。通过利用组均值统计数据，GNDM建立了一个更加稳健和有效的优化方向。 (2) 无缝奖励集成：我们的奖励中心化公式本质上支持自适应加权机制，允许DMD与外部奖励模型灵活组合。 (3) 提高了采样效率：通过与RL原则对齐，该框架可以轻松地整合重要性采样（IS），从而显著提高采样效率。广泛的实验表明，GNDM优于普通的DMD，将FID减少了1.87。此外，我们的多奖励变体，GNDMR，通过在审美质量和保真度之间取得强大的平衡，达到了30.37的峰值HPS和12.21的低FID-SD，超越了现有基线。总的来说，$R_{dm}$为实时高保真度合成提供了一个灵活、稳定和高效的框架。代码将在发表时发布。

更新时间: 2026-03-30 14:01:31

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2603.28460v1

CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities

Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. In contrast, CPUs do not have the possibility to parallelize operations in the same manner, wherefore models benefit from a specific design philosophy that balances amount of operations (MACs) and hardware-efficient execution by having high MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions, aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency necessitates preserving hardware-efficiency. Our experiments across diverse CPU devices confirm that these adaptations successfully retain high hardware-efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU-based inference. CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at https://github.com/altair199797/CPUBone.

Updated: 2026-03-30 14:01:18

标题: CPUBone：适用于具有低并行能力设备的高效视觉骨干设计

摘要: 最近对视觉骨干架构的研究主要集中在优化具有高并行处理能力的硬件平台的效率上。这一类别越来越包括嵌入式系统，如手机和嵌入式AI加速器模块。相比之下，CPU没有同样的并行化操作的可能性，因此模型从一个特定的设计理念中受益，该设计理念平衡了操作数量（MACs）和高MACs每秒（MACpS）的硬件高效执行。为了追求这一目标，我们对标准卷积进行了两种修改，旨在降低计算成本：分组卷积和减小核大小。虽然这两种改进显著减少了推断所需的总MACs数量，但保持低延迟需要保持硬件效率。我们在各种CPU设备上的实验证实，这些改进成功地保持了CPU的高硬件效率。基于这些见解，我们引入了CPUBone，这是一种针对基于CPU的推断进行优化的新型视觉骨干模型系列。CPUBone在各种CPU设备上实现了最先进的速度-准确性权衡（SATs），并有效地将其效率转移到下游任务，如目标检测和语义分割。模型和代码可在https://github.com/altair199797/CPUBone 上找到。

更新时间: 2026-03-30 14:01:18

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.26425v2

HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer, and then computing attention only over the selected subset. While the downstream sparse attention scales efficiently, the indexer still scans the entire prefix for every query, introducing an O($L^2$) per-layer bottleneck that becomes prohibitive as context length grows. We propose HISA (Hierarchical Indexed Sparse Attention), a drop-in replacement for the indexer that transforms the search process from a flat token scan into a two-stage hierarchical procedure. First, a block-level coarse filter scores pooled block representatives to prune irrelevant regions. Then, a token-level refinement applies the original indexer only within the remaining candidate blocks. HISA preserves the exact token-level top-k sparsity pattern required by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves a 2$\times$ speedup at 32K context length and 4$\times$ at 128K. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 with HISA, without any fine-tuning. HISA closely matches the original DSA in quality while significantly outperforming block-sparse baselines. Moreover, the token selection sets produced by HISA and the original DSA exhibit a mean IoU greater than 99%, indicating that the efficiency gains come with virtually no impact on selection fidelity.

Updated: 2026-03-30 13:59:51

标题: HISA：高效的细粒度稀疏注意力层次索引

摘要: Token级稀疏注意机制，例如DeepSeek稀疏注意（DSA），通过使用轻量级索引器为每个查询对每个历史标记进行评分，实现了细粒度的关键选择，然后仅对所选子集进行注意计算。尽管下游稀疏注意效率高，索引器仍然会扫描每个查询的整个前缀，引入了一个O($L^2$)每层的瓶颈，在上下文长度增长时变得不可行。我们提出了HISA（Hierarchical Indexed Sparse Attention），这是索引器的一个可替换部分，它将搜索过程从平面标记扫描转换为两阶段的分层过程。首先，一个块级粗筛选器评分池化块代表以修剪不相关区域。然后，在剩余的候选块内应用原始索引器进行标记级细化。HISA保留了下游稀疏MLA运算符所需的精确的标记级top-k稀疏模式，无需额外训练。在核心级基准测试中，HISA在32K上下文长度实现了2倍加速，而在128K上下文长度实现了4倍加速。在Needle-in-a-Haystack和LongBench中，我们直接用HISA替换DeepSeek-V3.2中的索引器，无需任何微调。HISA与原始DSA在质量上非常接近，同时明显优于块稀疏基线。此外，HISA和原始DSA生成的标记选择集的平均IoU大于99%，表明效率提升几乎没有对选择准确性造成影响。

更新时间: 2026-03-30 13:59:51

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28458v1

LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study

Multidisciplinary Software Development (MSD) requires domain experts and developers to collaborate across incompatible formalisms and separate artifact sets. Today, even with AI coding assistants like GitHub Copilot, this process remains inefficient; individual coding tasks are semi-automated, but the workflow connecting domain knowledge to implementation is not. Developers and experts still lack a shared view, resulting in repeated coordination, clarification rounds, and error-prone handoffs. We address this gap through a graph-based workflow optimization approach that progressively replaces manual coordination with LLM-powered services, enabling incremental adoption without disrupting established practices. We evaluate our approach on \texttt{spapi}, a production in-vehicle API system at Volvo Group involving 192 endpoints, 420 properties, and 776 CAN signals across six functional domains. The automated workflow achieves 93.7\% F1 score while reducing per-API development time from approximately 5 hours to under 7 minutes, saving an estimated 979 engineering hours. In production, the system received high satisfaction from both domain experts and developers, with all participants reporting full satisfaction with communication efficiency.

Updated: 2026-03-30 13:59:16

标题: 利用LLM技术进行多学科软件开发的工作流优化：汽车行业案例研究

摘要: 多学科软件开发（MSD）需要领域专家和开发人员跨越不兼容的形式主义和分离的工件集进行合作。即使在如GitHub Copilot等AI编码助手的帮助下，这个过程仍然是低效的；个别编码任务是半自动化的，但将领域知识与实现连接起来的工作流程并没有。开发人员和专家仍然缺乏共享视图，导致重复的协调、澄清轮次和容易出错的移交。我们通过基于图的工作流优化方法来解决这一差距，逐步用LLM动力服务取代手工协调，实现增量采用而不干扰已建立的实践。我们在Volvo Group的一个生产车载API系统\texttt{spapi}上评估了我们的方法，该系统涉及192个端点、420个属性和776个CAN信号，涵盖了六个功能域。自动化工作流程实现了93.7\%的F1分数，同时将每个API的开发时间从大约5小时减少到不到7分钟，节省了约979个工程小时。在生产中，该系统得到了领域专家和开发人员的高度满意，所有参与者都报告了对沟通效率的满意程度。

更新时间: 2026-03-30 13:59:16

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2603.21439v4

FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation

In federated healthcare systems, Federated Class-Incremental Learning (FCIL) has emerged as a key paradigm, enabling continuous adaptive model learning among distributed clients while safeguarding data privacy. However, in practical applications, data across agent nodes within the distributed framework often exhibits non-independent and identically distributed (non-IID) characteristics, rendering traditional continual learning methods inapplicable. To address these challenges, this paper covers more comprehensive incremental task scenarios and proposes a dynamic memory allocation strategy for exemplar storage based on the data replay mechanism. This strategy fully taps into the inherent potential of data heterogeneity, while taking into account the performance fairness of all participating clients, thereby establishing a balanced and adaptive solution to mitigate catastrophic forgetting. Unlike the fixed allocation of client exemplar memory, the proposed scheme emphasizes the rational allocation of limited storage resources among clients to improve model performance. Furthermore, extensive experiments are conducted on three medical image datasets, and the results demonstrate significant performance improvements compared to existing baseline models.

Updated: 2026-03-30 13:58:36

标题: FeDMRA: 具有动态内存重放分配的联邦增量学习

摘要: 在联邦卫生系统中，联邦类增量学习（FCIL）已经成为一个关键范式，能够在分布式客户端之间实现持续的自适应模型学习，同时确保数据隐私。然而，在实际应用中，分布式框架中代理节点之间的数据往往表现出非独立同分布（non-IID）的特征，使得传统的持续学习方法不适用。为了应对这些挑战，本文涵盖了更全面的增量任务场景，并提出了一种基于数据重播机制的动态内存分配策略用于实例存储。这种策略充分利用数据异质性的潜力，同时考虑到所有参与客户端的性能公平性，从而建立一个平衡和自适应的解决方案以减轻灾难性遗忘。与固定分配客户端实例内存不同，所提出的方案强调在客户端之间合理分配有限的存储资源以提高模型性能。此外，在三个医学图像数据集上进行了大量实验，结果显示与现有基线模型相比，性能有显著改善。

更新时间: 2026-03-30 13:58:36

领域: cs.LG,cs.AI,cs.CV,cs.DC,stat.ML

下载: http://arxiv.org/abs/2603.28455v1

Yau's Affine Normal Descent: Algorithmic Framework and Convergence Analysis

We propose Yau's Affine Normal Descent (YAND), a geometric framework for smooth unconstrained optimization in which search directions are defined by the equi-affine normal of level-set hypersurfaces. The resulting directions are invariant under volume-preserving affine transformations and intrinsically adapt to anisotropic curvature. Using the analytic representation of the affine normal from affine differential geometry, we establish its equivalence with the classical slice-centroid construction under convexity. For strictly convex quadratic objectives, affine-normal directions are collinear with Newton directions, implying one-step convergence under exact line search. For general smooth (possibly nonconvex) objectives, we characterize precisely when affine-normal directions yield strict descent and develop a line-search-based YAND. We establish global convergence under standard smoothness assumptions, linear convergence under strong convexity and Polyak-Lojasiewicz conditions, and quadratic local convergence near nondegenerate minimizers. We further show that affine-normal directions are robust under affine scalings, remaining insensitive to arbitrarily ill-conditioned transformations. Numerical experiments illustrate the geometric behavior of the method and its robustness under strong anisotropic scaling.

Updated: 2026-03-30 13:55:11

标题: 尧氏仿射法正下降：算法框架和收敛分析

摘要: 我们提出了Yau的仿射法线下降（YAND），这是一个用于光滑无约束优化的几何框架，其中搜索方向由水平集超曲面的等仿射法线定义。由此产生的方向在保体积仿射变换下保持不变，并且内在地适应各向异性曲率。利用仿射微分几何中的仿射法线的解析表示，我们建立了它与在凸性下的经典切片质心构造的等价性。对于严格凸二次目标函数，仿射法线方向与牛顿方向共线，意味着在精确线搜索下一步收敛。对于一般的光滑（可能是非凸）目标函数，我们准确地刻画了仿射法线方向何时产生严格下降，并开发了基于线搜索的YAND。在标准光滑性假设下，我们建立了全局收敛性，在强凸性和Polyak-Lojasiewicz条件下，具有线性收敛性，并在非退化极小值点附近具有二次局部收敛性。我们进一步展示，仿射法线方向在仿射缩放下具有鲁棒性，对任意病态变换保持不变。数值实验展示了该方法的几何行为及其在强各向异性缩放下的鲁棒性。

更新时间: 2026-03-30 13:55:11

领域: math.OC,cs.LG,math.DG,math.NA

下载: http://arxiv.org/abs/2603.28448v1

Randomized HyperSteiner: A Stochastic Delaunay Triangulation Heuristic for the Hyperbolic Steiner Minimal Tree

We study the problem of constructing Steiner Minimal Trees (SMTs) in hyperbolic space. Exact SMT computation is NP-hard, and existing hyperbolic heuristics such as HyperSteiner are deterministic and often get trapped in locally suboptimal configurations. We introduce Randomized HyperSteiner (RHS), a stochastic Delaunay triangulation heuristic that incorporates randomness into the expansion process and refines candidate trees via Riemannian gradient descent optimization. Experiments on synthetic data sets and a real-world single-cell transcriptomic data show that RHS outperforms Minimum Spanning Tree (MST), Neighbour Joining, and vanilla HyperSteiner (HS). In near-boundary configurations, RHS can achieve a 32% reduction in total length over HS, demonstrating its effectiveness and robustness in diverse data regimes.

Updated: 2026-03-30 13:53:14

标题: 随机超斯坦纳：用于双曲斯坦纳最小树的随机德劳内三角剖分启发式算法

摘要: 我们研究在双曲空间中构建斯坦纳最小树（SMTs）的问题。精确的SMT计算是NP难的，现有的双曲启发式方法，如HyperSteiner，是确定性的，并且通常会陷入局部次优配置中。我们引入了随机HyperSteiner（RHS），这是一种随机Delaunay三角剖分启发式方法，它将随机性融入扩展过程，并通过黎曼梯度下降优化来细化候选树。对合成数据集和真实世界的单细胞转录组数据进行的实验表明，RHS优于最小生成树（MST）、邻居连接和普通HyperSteiner（HS）。在接近边界配置中，RHS可以比HS减少32%的总长度，展示了其在不同数据范围中的效果和鲁棒性。

更新时间: 2026-03-30 13:53:14

领域: cs.CG,cs.AI

下载: http://arxiv.org/abs/2510.09328v2

Wasserstein Propagation for Reverse Diffusion under Weak Log-Concavity: Exploiting Metric Mismatch via One-Switch Routing

Existing analyses of reverse diffusion typically propagate sampling error in the Euclidean geometry underlying $\Wtwo$ throughout the reverse trajectory. Under weak log-concavity, this can be suboptimal: Gaussian smoothing may create contraction first at large separations, while short-scale Euclidean dissipativity is still absent. We show that exploiting this metric mismatch can yield strictly sharper end-to-end $\Wtwo$ bounds than direct full-horizon Euclidean propagation on mismatch windows. Our analysis derives an explicit radial lower profile for the learned reverse drift, whose far-field and near-field limits quantify the contraction reserve and the residual Euclidean load, respectively. This profile determines admissible switch times and leads to a one-switch routing theorem: reflection coupling damps initialization mismatch, pre-switch score forcing, and pre-switch discretization in an adapted concave transport metric; a single $p$-moment interpolation converts the damped switch-time discrepancy back to $\Wtwo$; and synchronous coupling propagates the remaining error over the late Euclidean window. Under $L^2$ score-error control, a one-sided monotonicity condition on the score error, and standard well-posedness and coupling assumptions, we obtain explicit non-asymptotic end-to-end $\Wtwo$ guarantees, a scalar switch-selection objective, and a conversion exponent $θ_p=(p-2)/(2(p-1))$ that cannot be improved uniformly within the affine-tail concave class under the same $p$-moment switch assumption. For a fixed switch, the routed and direct Euclidean bounds share the same late-window term, so any strict improvement is entirely an early-window effect.

Updated: 2026-03-30 13:53:04

标题: Wasserstein传播在弱对数凹性下的反向扩散：通过一次切换路由利用度量不匹配

摘要: 现有的逆扩散分析通常在$\Wtwo$的欧几里得几何学中传播采样误差，而这种传播通常存在不足：高斯平滑可能在大距离处首先产生收缩，而短尺度的欧几里得耗散性仍然不存在。我们表明，利用这种度量不匹配可以比在不匹配窗口上直接进行全视野欧几里得传播获得更加严格的端到端$\Wtwo$上限。我们的分析得出了学习的逆扩散的显式径向下界轮廓，其远场和近场极限分别量化了收缩保留和残余欧几里得负载。这个轮廓确定了可接受的切换时间，并导致一个单切换路径定理：反射耦合抑制了初始化不匹配，预切换评分强制和预切换离散化在适应的凹运输度量中；一个单$p$-矩插值将抑制的切换时间差异转换回$\Wtwo$；同步耦合在晚期欧几里得窗口上传播剩余误差。在$L^2$评分误差控制，评分误差的单侧单调性条件以及标准的适定性和耦合假设下，我们获得了显式的非渐近端到端$\Wtwo$保证，一个标量切换选择目标和一个转换指数$θ_p=(p-2)/(2(p-1))$，在相同的$p$-矩切换假设下无法统一改进在相同的仿尾凹类中。对于固定切换，路由和直接欧几里得上限共享相同的晚期窗口项，因此任何严格的改进完全是早期窗口效应。

更新时间: 2026-03-30 13:53:04

领域: cs.LG

下载: http://arxiv.org/abs/2603.19670v2

Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks

Dataset distillation, a training-aware data compression technique, has recently attracted increasing attention as an effective tool for mitigating costs of optimization and data storage. However, progress remains largely empirical. Mechanisms underlying the extraction of task-relevant information from the training process and the efficient encoding of such information into synthetic data points remain elusive. In this paper, we theoretically analyze practical algorithms of dataset distillation applied to the gradient-based training of two-layer neural networks with width $L$. By focusing on a non-linear task structure called multi-index model, we prove that the low-dimensional structure of the problem is efficiently encoded into the resulting distilled data. This dataset reproduces a model with high generalization ability for a required memory complexity of $\tildeΘ$$(r^2d+L)$, where $d$ and $r$ are the input and intrinsic dimensions of the task. To the best of our knowledge, this is one of the first theoretical works that include a specific task structure, leverage its intrinsic dimensionality to quantify the compression rate and study dataset distillation implemented solely via gradient-based algorithms.

Updated: 2026-03-30 13:52:03

标题: 数据集精炼：从基于梯度学习的非线性任务中高效编码低维表示

摘要: 数据集提炼是一种训练感知数据压缩技术，最近越来越受到关注，被认为是缓解优化成本和数据存储成本的有效工具。然而，进展仍然主要是经验性的。从训练过程中提取任务相关信息的机制以及将这些信息高效编码为合成数据点的方式仍然是难以捉摸的。在本文中，我们在理论上分析了应用于宽度为$L$的两层神经网络基于梯度的训练的数据集提炼的实际算法。通过专注于一种称为多指数模型的非线性任务结构，我们证明了问题的低维结构被有效地编码到所得到的提炼数据中。该数据集重现了一个具有高泛化能力的模型，所需的内存复杂度为$\tildeΘ$(r^2d+L)，其中$d$和$r$分别为任务的输入和内在维度。据我们所知，这是第一批包括特定任务结构、利用其内在维度量化压缩率并且仅通过基于梯度的算法实现数据集提炼的理论工作之一。

更新时间: 2026-03-30 13:52:03

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.14830v2

Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG

Current Retrieval-Augmented Generation (RAG) systems predominantly rely on relevance-based dense retrieval, sequentially fetching documents to maximize semantic similarity with the query. However, in knowledge-intensive and real-world scenarios characterized by conflicting evidence or fundamental query ambiguity, relevance alone is insufficient for resolving epistemic uncertainty. We introduce Entropic Claim Resolution (ECR), a novel inference-time algorithm that reframes RAG reasoning as entropy minimization over competing semantic answer hypotheses. Unlike action-driven agentic frameworks (e.g., ReAct) or fixed-pipeline RAG architectures, ECR sequentially selects atomic evidence claims by maximizing Expected Entropy Reduction (EER), a decision-theoretic criterion for the value of information. The process dynamically terminates when the system reaches a mathematically defined state of epistemic sufficiency (H <= epsilon, subject to epistemic coherence). We integrate ECR into a production-grade multi-strategy retrieval pipeline (CSGR++) and analyze its theoretical properties. Our framework provides a rigorous foundation for uncertainty-aware evidence selection, shifting the paradigm from retrieving what is most relevant to retrieving what is most discriminative.

Updated: 2026-03-30 13:49:03

标题: 信息熵主张解决方案：基于不确定性的证据选择对于RAG

摘要: 目前的检索增强生成（RAG）系统主要依赖基于相关性的密集检索，顺序提取文档以最大化与查询的语义相似性。然而，在知识密集和现实世界场景中，存在冲突证据或基本查询模糊性的情况下，仅仅依靠相关性无法解决认识不确定性。我们引入了熵解释决（ECR），这是一种新颖的推理时算法，将RAG推理重新构建为对竞争性语义答案假设的熵最小化过程。与基于行为驱动的主体框架（例如ReAct）或固定流水线RAG架构不同，ECR通过最大化预期熵降低（EER）来选择原子证据声明，这是信息价值的决策理论标准。该过程在系统达到数学上定义的认识充分状态（H <= epsilon，符合认识连贯性）时动态终止。我们将ECR集成到生产级多策略检索流水线（CSGR++）中，并分析其理论属性。我们的框架为意识到不确定性的证据选择提供了严格的基础，将范式从检索最相关的内容转变为检索最具区分性的内容。

更新时间: 2026-03-30 13:49:03

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2603.28444v1

Deep Neural Networks: A Formulation Via Non-Archimedean Analysis

We introduce a new class of deep neural networks (DNNs) with multilayered tree-like architectures. The architectures are codified using numbers from the ring of integers of non-Archimdean local fields. These rings have a natural hierarchical organization as infinite rooted trees. Natural morphisms on these rings allow us to construct finite multilayered architectures. The new DNNs are robust universal approximators of real-valued functions defined on the mentioned rings. We also show that the DNNs are robust universal approximators of real-valued square-integrable functions defined in the unit interval.

Updated: 2026-03-30 13:43:44

标题: 深度神经网络：通过非阿基米德分析的表述

摘要: 我们引入了一种新的深度神经网络（DNNs）类别，具有多层树状结构。这些结构是通过使用非阿基米德局部域的整数环中的数字来编码的。这些环具有自然的分层组织，如无穷根树。这些环上的自然同态映射使我们能够构建有限多层结构。这些新的DNNs是对在提到的环上定义的实值函数的强大的通用逼近器。我们还展示了这些DNNs是对在单位区间上定义的实值可积函数的强大的通用逼近器。

更新时间: 2026-03-30 13:43:44

领域: cs.NE,cs.AI,cs.LG

下载: http://arxiv.org/abs/2402.00094v2

Democratizing Federated Learning with Blockchain and Multi-Task Peer Prediction

The synergy between Federated Learning and blockchain has been considered promising; however, the computationally intensive nature of contribution measurement conflicts with the strict computation and storage limits of blockchain systems. We propose a novel concept to decentralize the AI training process using blockchain technology and Multi-task Peer Prediction. By leveraging smart contracts and cryptocurrencies to incentivize contributions to the training process, we aim to harness the mutual benefits of AI and blockchain. We discuss the advantages and limitations of our design.

Updated: 2026-03-30 13:42:56

标题: 使用区块链和多任务对等预测实现联邦学习的民主化

摘要: 联邦学习和区块链之间的协同作用被认为是有前途的；然而，贡献测量的计算密集性与区块链系统的严格计算和存储限制发生冲突。我们提出了一种新颖的概念，利用区块链技术和多任务对等预测来去中心化AI训练过程。通过利用智能合约和加密货币激励对训练过程的贡献，我们旨在利用AI和区块链的相互利益。我们讨论了我们设计的优势和局限性。

更新时间: 2026-03-30 13:42:56

领域: cs.CR,cs.CY

下载: http://arxiv.org/abs/2603.28434v1

GeoHCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting

Although 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, its prohibitive storage overhead severely hinders practical deployment. Recent anchor-based 3DGS compression schemes reduce redundancy through context modeling, yet overlook explicit geometric dependencies, leading to structural degradation and suboptimal rate-distortion performance. In this paper, we propose GeoHCC, a geometry-aware 3DGS compression framework that incorporates inter-anchor geometric correlations into anchor pruning and entropy coding for compact representation. We first introduce Neighborhood-Aware Anchor Pruning (NAAP), which evaluates anchor importance via weighted neighborhood feature aggregation and merges redundant anchors into salient neighbors, yielding a compact yet geometry-consistent anchor set. Building upon this optimized structure, we further develop a hierarchical entropy coding scheme, in which coarse-to-fine priors are exploited through a lightweight Geometry-Guided Convolution (GG-Conv) operator to enable spatially adaptive context modeling and rate-distortion optimization. Extensive experiments demonstrate that GeoHCC effectively resolves the structure preservation bottleneck, maintaining superior geometric integrity and rendering fidelity over state-of-the-art anchor-based approaches.

Updated: 2026-03-30 13:39:35

标题: GeoHCC：用于3D高斯喷溅的局部几何感知分层上下文压缩

摘要: 尽管3D高斯喷射（3DGS）实现了高保真实时渲染，但其过高的存储开销严重阻碍了实际部署。最近基于锚点的3DGS压缩方案通过上下文建模减少冗余，但忽视了显式的几何依赖关系，导致结构降级和次优的速率失真性能。在本文中，我们提出了GeoHCC，一种考虑几何的3DGS压缩框架，将锚点间的几何相关性纳入到锚点修剪和熵编码中，以实现紧凑的表示。我们首先引入了邻域感知锚点修剪（NAAP），通过加权邻域特征聚合评估锚点重要性，并将冗余锚点合并到显著的邻居中，得到一个紧凑但几何一致的锚点集。在此优化结构的基础上，我们进一步开发了一个分层熵编码方案，通过轻量级的几何引导卷积（GG-Conv）操作利用粗到细的先验信息，实现空间自适应的上下文建模和速率失真优化。大量实验证明，GeoHCC有效地解决了结构保持瓶颈，保持了优秀的几何完整性和渲染保真度，优于最先进的基于锚点的方法。

更新时间: 2026-03-30 13:39:35

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.28431v1

IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise $3$D Clifford rotors, yet the resulting $3$D partition is poorly aligned with modern hardware and offers limited local mixing. We propose \textbf{IsoQuant}, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each $4$D block as a quaternion and applies a closed-form transform $T(v)=q_L v \overline{q_R}$. This yields two main variants: \emph{IsoQuant-Full}, which realizes the full $SO(4)$ rotation, and \emph{IsoQuant-Fast}, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight $2$D special case. At $d=128$, IsoQuant-Full reduces forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, while IsoQuant-Fast further reduces it to $512$. Across $18$ fused CUDA settings with $d \in {128,256,512}$, bit widths ${2,3,4}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$--$4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize--dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.

Updated: 2026-03-30 13:37:45

标题: IsoQuant: 硬件对齐的SO(4)等角旋转用于LLM KV缓存压缩

摘要: 正交特征去相关对于低比特在线矢量量化是有效的，但是密集的随机正交变换会产生昂贵的$O(d^2)$存储和计算成本。RotorQuant通过分块$3$D Clifford转子降低了这种成本，然而所得到的$3$D分区与现代硬件不太对齐，提供了有限的局部混合。我们提出了\textbf{IsoQuant}，这是一个基于四元数代数和$SO(4)$的等角分解的分块旋转框架。它将每个$4$D块表示为一个四元数，并应用一个闭合形式的变换$T(v)=q_L v \overline{q_R}$。这产生了两个主要变体：\emph{IsoQuant-Full}，实现了完整的$SO(4)$旋转，和\emph{IsoQuant-Fast}，仅保留一个等角因子以降低成本；该框架还允许一个轻量级的$2$D特例。在$d=128$时，IsoQuant-Full将前向旋转成本从RotorQuant的约$2{,}408$ FMA降低到$1{,}024$，而IsoQuant-Fast进一步将其降低到$512$。在包括$d \in {128,256,512}$，比特宽度为${2,3,4}$，以及FP16/FP32执行的$18$个融合CUDA设置中，IsoQuant在保持可比的重建MSE的同时，实现了大约$4.5\times$--$4.7\times$的平均核级加速，最大加速度超过$6\times$。当前验证仅限于合成归一化矢量上的阶段-1量化--反量化路径；端到端的KV-cache评估仍需进一步研究。

更新时间: 2026-03-30 13:37:45

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2603.28430v1

AceleradorSNN: A Neuromorphic Cognitive System Integrating Spiking Neural Networks and DynamicImage Signal Processing on FPGA

The demand for high-speed, low-latency, and energy-efficient object detection in autonomous systems -- such as advanced driver-assistance systems (ADAS), unmanned aerial vehicles (UAVs), and Industry 4.0 robotics -- has exposed the limitations of traditional Convolutional Neural Networks (CNNs). To address these challenges, we have developed AceleradorSNN, a third-generation artificial intelligence cognitive system. This architecture integrates a Neuromorphic Processing Unit (NPU) based on Spiking Neural Networks (SNNs) to process asynchronous data from Dynamic Vision Sensors (DVS), alongside a dynamically reconfigurable Cognitive Image Signal Processor (ISP) for RGB cameras. This paper details the hardware-oriented design of both IP cores, the evaluation of surrogate-gradienttrained SNN backbones, and the real-time streaming ISP architecture implemented on Field-Programmable Gate Arrays (FPGA).

Updated: 2026-03-30 13:37:09

标题: AceleradorSNN：一种在FPGA上集成脉冲神经网络和动态图像信号处理的神经形认知系统

摘要: 对于自主系统中对高速、低延迟和高能效目标检测的需求，如高级驾驶辅助系统(ADAS)、无人机和工业4.0机器人，暴露了传统卷积神经网络(CNNs)的局限性。为解决这些挑战，我们开发了AceleradorSNN，这是第三代人工智能认知系统。该架构集成了基于脉冲神经网络(SNNs)的神经形态处理单元(NPU)来处理来自动态视觉传感器(DVS)的异步数据，同时还配备了动态可重构的RGB摄像头认知图像信号处理器(ISP)。本文详细介绍了两个IP核的硬件设计，代理梯度训练的SNN骨干的评估，以及在可编程逻辑器件(FPGA)上实现的实时流式ISP架构。

更新时间: 2026-03-30 13:37:09

领域: cs.AR,cs.AI

下载: http://arxiv.org/abs/2603.28429v1

Learning unified control of internal spin squeezing in atomic qudits for magnetometry

Generating and preserving metrologically useful quantum states is a central challenge in quantum-enhanced atomic magnetometry. In multilevel atoms operated in the low-field regime, the nonlinear Zeeman (NLZ) effect is both a resource and a limitation. It nonlinearly redistributes internal spin fluctuations to generate spin-squeezed states within a single atomic qudit, yet under fixed readout it distorts the measurement-relevant quadrature and limits the accessible metrological gain. This challenge is compounded by the time dependence of both the squeezing axis and the effective nonlinear action. Here we show that physics-informed reinforcement learning can transform NLZ dynamics from a source of readout degradation into a sustained metrological resource. Using only experimentally accessible low-order spin moments, a trained agent identifies, in the $f=21/2$ manifold of $^{161}\mathrm{Dy}$, a unified control policy that rapidly prepares strongly squeezed internal states and stabilizes more than $4\,\mathrm{dB}$ of fixed-axis spin squeezing under always-on NLZ evolution. Including state-preparation overhead, the learned protocol yields a single-atom magnetic sensitivity of $13.9\,\mathrm{pT}/\sqrt{\mathrm{Hz}}$, corresponding to an advantage of approximately $3\,\mathrm{dB}$ beyond the standard quantum limit. Our results establish learning-based control as a practical route for converting unavoidable intrinsic nonlinear dynamics in multilevel quantum sensors into operational metrological advantage.

Updated: 2026-03-30 13:29:45

标题: 学习原子四态系统中内部自旋压缩的统一控制以用于磁力计量

摘要: 产生和保持量子态对计量有用是量子增强原子磁力计中的一个核心挑战。在低场范围内操作的多能级原子中，非线性塞曼（NLZ）效应既是资源又是限制。它非线性地重新分配内部自旋波动，以在单个原子qudit内生成自旋压缩态，但在固定读数下，它会扭曲测量相关的象限并限制可访问的计量增益。这一挑战受到挤压轴和有效非线性作用的时间依赖性的影响。在这里，我们展示了基于物理信息的强化学习可以将NLZ动态从读数退化的源转变为持续的计量资源。仅使用实验可访问的低阶自旋矩，一个经过训练的代理在$^{161}\mathrm{Dy}$的$f=21/2$能级中确定了一个统一的控制策略，该策略可以快速准备出强烈压缩的内部态，并在总是开启的NLZ演化下稳定超过$4\,\mathrm{dB}$的固定轴自旋压缩。考虑到状态准备开销，学习到的协议产生了$13.9\,\mathrm{pT}/\sqrt{\mathrm{Hz}}$的单原子磁敏感度，相当于标准量子极限之外约$3\,\mathrm{dB}$的优势。我们的结果确立了基于学习的控制作为将多能级量子传感器中不可避免的固有非线性动态转化为操作计量优势的实际途径。

更新时间: 2026-03-30 13:29:45

领域: quant-ph,cs.AI

下载: http://arxiv.org/abs/2603.28421v1

Spectral Higher-Order Neural Networks

Neural networks are fundamental tools of modern machine learning. The standard paradigm assumes binary interactions (across feedforward linear passes) between inter-tangled units, organized in sequential layers. Generalized architectures have been also designed that move beyond pairwise interactions, so as to account for higher-order couplings among computing neurons. Higher-order networks are however usually deployed as augmented graph neural networks (GNNs), and, as such, prove solely advantageous in contexts where the input exhibits an explicit hypergraph structure. Here, we present Spectral Higher-Order Neural Networks (SHONNs), a new algorithmic strategy to incorporate higher-order interactions in general-purpose, feedforward, network structures. SHONNs leverages a reformulation of the model in terms of spectral attributes. This allows to mitigate the common stability and parameter scaling problems that come along weighted, higher-order, forward propagations.

Updated: 2026-03-30 13:29:34

标题: 频谱高阶神经网络

摘要: 神经网络是现代机器学习的基本工具。标准范式假设交错单元之间存在二进制交互（通过前馈线性传递），这些单元组织在顺序层中。还设计了广义结构，超越了成对交互，以考虑计算神经元之间的高阶耦合。然而，高阶网络通常作为增强图神经网络（GNNs）部署，并且仅在输入展示显式超图结构的情况下证明有利。在这里，我们提出了谱高阶神经网络（SHONNs），一种在通用的前馈网络结构中整合高阶交互的新算法策略。SHONNs利用模型的谱属性重构。这可以减轻加权、高阶、前向传播带来的常见稳定性和参数缩放问题。

更新时间: 2026-03-30 13:29:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28420v1

KGroups: A Versatile Univariate Max-Relevance Min-Redundancy Feature Selection Algorithm for High-dimensional Biological Data

This paper proposes a new univariate filter feature selection (FFS) algorithm called KGroups. The majority of work in the literature focuses on investigating the relevance or redundancy estimations of feature selection (FS) methods. This has shown promising results and a real improvement of FFS methods' predictive performance. However, limited efforts have been made to investigate alternative FFS algorithms. This raises the following question: how much of the FFS methods' predictive performance depends on the selection algorithm rather than the relevance or the redundancy estimations? The majority of FFS methods fall into two categories: relevance maximisation (Max-Rel, also known as KBest) or simultaneous relevance maximisation and redundancy minimisation (mRMR). KBest is a univariate FFS algorithm that employs sorting (descending) for selection. mRMR is a multivariate FFS algorithm that employs an incremental search algorithm for selection. In this paper, we propose a new univariate mRMR called KGroups that employs clustering for selection. Extensive experiments on 14 high-dimensional biological benchmark datasets showed that KGroups achieves similar predictive performance compared to multivariate mRMR while being up to 821 times faster. KGroups is parameterisable, which leaves room for further predictive performance improvement through hyperparameter finetuning, unlike mRMR and KBest. KGroups outperforms KBest.

Updated: 2026-03-30 13:28:09

标题: KGroups：一种多功能的单变量最大相关性最小冗余特征选择算法，用于高维生物数据

摘要: 本文提出了一种名为KGroups的新的单变量滤波器特征选择（FFS）算法。文献中的大部分工作都集中在研究特征选择（FS）方法的相关性或冗余性估计。这显示出有希望的结果和FFS方法预测性能的真正改进。然而，很少有人尝试研究替代FFS算法。这引发了以下问题：FFS方法的预测性能有多大程度取决于选择算法，而不是相关性或冗余性估计？大多数FFS方法分为两类：相关性最大化（Max-Rel，也称为KBest）或同时最大化相关性和最小化冗余性（mRMR）。KBest是一种单变量FFS算法，采用（降序）排序进行选择。mRMR是一种多变量FFS算法，采用增量搜索算法进行选择。在本文中，我们提出了一种名为KGroups的新的单变量mRMR，它采用聚类进行选择。对14个高维生物基准数据集进行的大量实验表明，与多变量mRMR相比，KGroups在预测性能方面取得了类似的结果，同时速度快了高达821倍。KGroups是可参数化的，这为通过超参数微调进一步提高预测性能留下了空间，而mRMR和KBest则没有这样的可能。KGroups优于KBest。

更新时间: 2026-03-30 13:28:09

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28417v1

Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models

Reinforcement learning algorithms are defined by their learning update rules, which are typically hand-designed and fixed. We present an evolutionary framework for discovering reinforcement learning algorithms by searching directly over executable update rules that implement complete training procedures. The approach builds on REvolve, an evolutionary system that uses large language models as generative variation operators, and extends it from reward-function discovery to algorithm discovery. To promote the emergence of nonstandard learning rules, the search excludes canonical mechanisms such as actor--critic structures, temporal-difference losses, and value bootstrapping. Because reinforcement learning algorithms are highly sensitive to internal scalar parameters, we introduce a post-evolution refinement stage in which a large language model proposes feasible hyperparameter ranges for each evolved update rule. Evaluated end-to-end by full training runs on multiple Gymnasium benchmarks, the discovered algorithms achieve competitive performance relative to established baselines, including SAC, PPO, DQN, and A2C.

Updated: 2026-03-30 13:28:01

标题: 通过大型语言模型的进化发现强化学习算法

摘要: 强化学习算法通过学习更新规则来定义，这些规则通常是手动设计并固定的。我们提出了一个通过直接搜索执行可执行更新规则来发现强化学习算法的进化框架。该方法建立在REvolve之上，后者使用大型语言模型作为生成变异操作符，并将其从奖励函数发现扩展到算法发现。为了促进非标准学习规则的出现，搜索排除了传统机制，如演员-评论家结构、时序差分损失和值引导。由于强化学习算法对内部标量参数非常敏感，我们引入了一个后进化的细化阶段，在这个阶段，一个大型语言模型提出了每个进化的更新规则的可行超参数范围。通过在多个Gymnasium基准测试上进行完整的训练运行进行端到端评估，发现的算法相对于已建立的基线（包括SAC、PPO、DQN和A2C）实现了竞争性能。

更新时间: 2026-03-30 13:28:01

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28416v1

Gradient Compression Beyond Low-Rank: Wavelet Subspaces Compact Optimizer States

Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate the memory-efficient method beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements for maintaining optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training while maintaining performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves performance comparable to advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.

Updated: 2026-03-30 13:25:46

标题: 超越低秩的梯度压缩：小波子空间紧凑优化器状态

摘要: 大型语言模型（LLMs）在各种自然语言处理任务中展现出令人印象深刻的性能。然而，它们庞大的参数数量在训练过程中引入了显著的内存挑战，特别是在使用像Adam这样耗费内存的优化器时。现有的内存高效算法通常依赖于诸如奇异值分解投影或权重冻结等技术。虽然这些方法有助于缓解内存限制，但通常与完整秩更新相比产生次优结果。在本文中，我们研究了超越低秩训练的内存高效方法，提出了一种名为梯度小波变换（GWT）的新颖解决方案，该解决方案将小波变换应用于梯度，从而显著减少了维护优化器状态的内存需求。我们证明GWT可以与内存密集型优化器无缝集成，实现高效训练同时保持性能。通过对预训练和微调任务的广泛实验，我们展示GWT在内存使用和训练性能方面实现了与先进的内存高效优化器和完整秩方法相媲美的性能。

更新时间: 2026-03-30 13:25:46

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2501.07237v4

Initialization Schemes for Kolmogorov-Arnold Networks: An Empirical Study

Kolmogorov-Arnold Networks (KANs) are a recently introduced neural architecture that replace fixed nonlinearities with trainable activation functions, offering enhanced flexibility and interpretability. While KANs have been applied successfully across scientific and machine learning tasks, their initialization strategies remain largely unexplored. In this work, we study initialization schemes for spline-based KANs, proposing two theory-driven approaches inspired by LeCun and Glorot, as well as an empirical power-law family with tunable exponents. Our evaluation combines large-scale grid searches on function fitting and forward PDE benchmarks, an analysis of training dynamics through the lens of the Neural Tangent Kernel, and evaluations on a subset of the Feynman dataset. Our findings indicate that the Glorot-inspired initialization significantly outperforms the baseline in parameter-rich models, while power-law initialization achieves the strongest performance overall, both across tasks and for architectures of varying size. All code and data accompanying this manuscript are publicly available at https://github.com/srigas/KAN_Initialization_Schemes.

Updated: 2026-03-30 13:22:40

标题: 初始化方案对Kolmogorov-Arnold网络的影响：一项实证研究

摘要: Kolmogorov-Arnold网络（KANs）是一种最近引入的神经架构，用可训练的激活函数替代固定的非线性，提供了增强的灵活性和可解释性。虽然KANs已经成功应用于科学和机器学习任务，但它们的初始化策略仍然很少被探索。在这项工作中，我们研究了基于样条的KANs的初始化方案，提出了两种受LeCun和Glorot启发的理论驱动方法，以及一个具有可调指数的经验幂律家族。我们的评估结合了对函数拟合和前向PDE基准的大规模网格搜索，通过神经切线核的视角对训练动态进行分析，以及对费曼数据集的子集进行评估。我们的研究结果表明，Glorot启发的初始化在参数丰富的模型中明显优于基线，而幂律初始化在各个任务和不同大小的架构中都实现了最强的性能。本文附带的所有代码和数据都可以在https://github.com/srigas/KAN_Initialization_Schemes上公开获取。

更新时间: 2026-03-30 13:22:40

领域: cs.LG

下载: http://arxiv.org/abs/2509.03417v3

Mixture-Model Preference Learning for Many-Objective Bayesian Optimization

Preference-based many-objective optimization faces two obstacles: an expanding space of trade-offs and heterogeneous, context-dependent human value structures. Towards this, we propose a Bayesian framework that learns a small set of latent preference archetypes rather than assuming a single fixed utility function, modelling them as components of a Dirichlet-process mixture with uncertainty over both archetypes and their weights. To query efficiently, we designing hybrid queries that target information about (i) mode identity and (ii) within-mode trade-offs. Under mild assumptions, we provide a simple regret guarantee for the resulting mixture-aware Bayesian optimization procedure. Empirically, our method outperforms standard baselines on synthetic and real-world many-objective benchmarks, and mixture-aware diagnostics reveal structure that regret alone fails to capture.

Updated: 2026-03-30 13:18:43

标题: 混合模型偏好学习用于多目标贝叶斯优化

摘要: Preference-based many-objective optimization faces two obstacles: an expanding space of trade-offs and heterogeneous, context-dependent human value structures. Towards this, we propose a Bayesian framework that learns a small set of latent preference archetypes rather than assuming a single fixed utility function, modelling them as components of a Dirichlet-process mixture with uncertainty over both archetypes and their weights. To query efficiently, we design hybrid queries that target information about (i) mode identity and (ii) within-mode trade-offs. Under mild assumptions, we provide a simple regret guarantee for the resulting mixture-aware Bayesian optimization procedure. Empirically, our method outperforms standard baselines on synthetic and real-world many-objective benchmarks, and mixture-aware diagnostics reveal structure that regret alone fails to capture.

更新时间: 2026-03-30 13:18:43

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.28410v1

Declarative Scenario-based Testing with RoadLogic

Scenario-based testing is a key method for cost-effective and safe validation of autonomous vehicles (AVs). Existing approaches rely on imperative scenario definitions, requiring developers to manually enumerate numerous variants to achieve coverage. Declarative languages, such as ASAM OpenSCENARIO DSL (OS2), raise the abstraction level but lack systematic methods for instantiating concrete and specification-compliant scenarios. To our knowledge, currently, no open-source solution provides this capability. We present RoadLogic that bridges declarative OS2 specifications and executable simulations. It uses Answer Set Programming to generate abstract plans satisfying scenario constraints, motion planning to refine the plans into feasible trajectories, and specification-based monitoring to verify correctness. We evaluate RoadLogic on instantiating representative OS2 scenarios executed in the CommonRoad framework. Results show that RoadLogic consistently produces realistic, specification-satisfying simulations within minutes and captures diverse behavioral variants through parameter sampling, thus opening the door to systematic scenario-based testing for autonomous driving systems.

Updated: 2026-03-30 13:17:27

标题: 使用RoadLogic进行声明式基于场景的测试

摘要: 基于场景的测试是实现自动驾驶车辆（AVs）成本效益和安全验证的关键方法。现有方法依赖于命令式场景定义，需要开发人员手动枚举大量变体以实现覆盖范围。声明性语言，如ASAM OpenSCENARIO DSL（OS2），提高了抽象级别，但缺乏实例化具体且符合规范的场景的系统方法。据我们所知，目前没有开源解决方案提供这种能力。我们提出了RoadLogic，它连接了声明性OS2规范和可执行模拟。它使用答案集编程生成满足场景约束的抽象计划，运动规划将计划细化为可行轨迹，并使用基于规范的监控来验证正确性。我们在CommonRoad框架中执行代表性OS2场景实例化评估了RoadLogic。结果显示，RoadLogic在几分钟内始终产生逼真的、符合规范的模拟，并通过参数采样捕捉多样的行为变体，从而为自动驾驶系统的系统性基于场景的测试打开了大门。

更新时间: 2026-03-30 13:17:27

领域: cs.SE,cs.AI,cs.LO

下载: http://arxiv.org/abs/2603.09455v2

Misleading Large Language Models used (or misused) in Scientific Peer-Reviewing via Hidden Prompt-Injection Attacks

Large Language Models (LLMs) are increasingly being integrated into the scientific peer-review process, raising new questions about their reliability and resilience to manipulation. In this work, we investigate the potential for hidden prompt injection attacks, where authors embed adversarial text within a paper's PDF to influence the LLM-generated review. We begin by formalising three distinct threat models that envision attackers with different motivations -- not all of which implying malicious intent. For each threat model, we design adversarial prompts that remain invisible to human readers yet can steer an LLM's output toward the author's desired outcome. Using a user study with domain scholars, we derive four representative reviewing prompts used to elicit peer reviews from LLMs. We then evaluate the robustness of our adversarial prompts across (i) different reviewing prompts, (ii) different commercial LLM-based systems, and (iii) different peer-reviewed papers. Our results show that adversarial prompts can reliably mislead the LLM, sometimes in ways that adversely affect a "honest-but-lazy" reviewer. Finally, we propose and empirically assess methods to reduce detectability of adversarial prompts under automated content checks.

Updated: 2026-03-30 13:16:08

标题: 通过隐藏式提示注入攻击误导使用（或滥用）科学同行评审中的大型语言模型

摘要: 大型语言模型（LLMs）越来越多地被整合到科学同行评审过程中，引发了关于它们的可靠性和抗干扰性的新问题。在这项工作中，我们调查了隐藏提示注入攻击的潜力，即作者在论文的PDF中嵌入对抗性文本，以影响LLM生成的审查。我们首先形式化了三种不同威胁模型，设想了具有不同动机的攻击者--并非所有都暗示恶意意图。针对每种威胁模型，我们设计了对抗性提示，这些提示对人类读者保持不可见，但可以引导LLM的输出朝向作者期望的结果。通过与领域学者进行用户研究，我们得出了四种代表性审查提示，用于从LLMs中获取同行评审。然后，我们评估了我们的对抗性提示在（i）不同审查提示、（ii）不同商业LLM系统和（iii）不同同行评审论文中的稳健性。我们的结果表明，对抗性提示可以可靠地误导LLM，有时以可能对“诚实但懒惰”的审阅者产生不利影响的方式。最后，我们提出并实证评估了减少对抗性提示在自动化内容检查中可检测性的方法。

更新时间: 2026-03-30 13:16:08

领域: cs.CR

下载: http://arxiv.org/abs/2508.20863v3

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.

Updated: 2026-03-30 13:16:03

标题: MiroEval: 对过程和结果中的多模态深度研究代理进行基准测试

摘要: 深度研究系统的最近进展令人印象深刻，但评估仍然落后于真实用户需求。现有的基准主要通过固定的评分标准评估最终报告，未能评估潜在的研究过程。大多数基准还提供有限的多模态覆盖范围，依赖于不反映真实世界查询复杂性的合成任务，并且无法随着知识的发展而更新。为了弥补这些差距，我们引入了MiroEval，一个用于深度研究系统的基准和评估框架。该基准包括100个任务（70个仅文本，30个多模态），所有任务都基于真实用户需求，并通过支持定期更新的双路径管道构建，从而实现实时演变的环境。所提出的评估套件沿着三个互补的维度评估深度研究系统：使用任务特定的评分标准进行自适应综合质量评估，通过主动检索和推理来验证行为真实性，涉及网络来源和多模态附件，以及过程中心化评估审计系统在整个调查过程中的搜索，推理和细化。对13个系统的评估得出三个主要发现：三个评估维度捕捉了系统能力的互补方面，每个维度在系统之间揭示了不同的优势和劣势；过程质量作为总体结果的可靠预测因子，同时揭示了无法通过输出级指标看到的弱点；多模态任务带来了更大的挑战，大多数系统下降了3到10个点。MiroThinker系列实现了最平衡的表现，其中MiroThinker-H1在两个设置中排名最高。人类验证和鲁棒性结果证实了基准和评估框架的可靠性。MiroEval为下一代深度研究代理提供了全面的诊断工具。

更新时间: 2026-03-30 13:16:03

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2603.28407v1

EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation

Diffusion Transformers (DiT) have established a new state-of-the-art in high-fidelity image synthesis; however, their massive computational complexity and memory requirements hinder local deployment on resource-constrained edge devices. In this paper, we introduce EdgeDiT, a family of hardware-efficient generative transformers specifically engineered for mobile Neural Processing Units (NPUs), such as the Qualcomm Hexagon and Apple Neural Engine (ANE). By leveraging a hardware-aware optimization framework, we systematically identify and prune structural redundancies within the DiT backbone that are particularly taxing for mobile data-flows. Our approach yields a series of lightweight models that achieve a 20-30% reduction in parameters, a 36-46% decrease in FLOPs, and a 1.65-fold reduction in on-device latency without sacrificing the scaling advantages or the expressive capacity of the original transformer architecture. Extensive benchmarking demonstrates that EdgeDiT offers a superior Pareto-optimal trade-off between Frechet Inception Distance (FID) and inference latency compared to both optimized mobile U-Nets and vanilla DiT variants. By enabling responsive, private, and offline generative AI directly on-device, EdgeDiT provides a scalable blueprint for transitioning large-scale foundation models from high-end GPUs to the palm of the user.

Updated: 2026-03-30 13:14:30

标题: EdgeDiT：面向硬件的扩散变换器，用于高效的设备端图像生成

摘要: 扩散变压器（DiT）已经在高保真图像合成方面建立了新的技术水平；然而，它们庞大的计算复杂性和内存需求阻碍了在资源受限的边缘设备上进行本地部署。在本文中，我们介绍了EdgeDiT，这是一系列专门为移动神经处理单元（NPUs），如高通Hexagon和苹果神经引擎（ANE）而设计的硬件高效生成变压器。通过利用硬件感知优化框架，我们系统地识别和修剪了DiT骨干中特别耗费移动数据流的结构冗余。我们的方法产生了一系列轻量级模型，其参数减少了20-30％，FLOPs减少了36-46％，在设备上的延迟减少了1.65倍，而不牺牲原始变压器架构的扩展优势或表达能力。广泛的基准测试表明，EdgeDiT相比优化的移动U-Net和普通DiT变体，提供了更优越的Frechet Inception Distance（FID）和推理延迟之间的帕累托最优权衡。通过在设备上直接实现响应迅速、私密、离线的生成AI，EdgeDiT为将大规模基础模型从高端GPU过渡到用户手掌提供了一个可扩展的蓝图。

更新时间: 2026-03-30 13:14:30

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.28405v1

Synthesis of timeline-based planning strategies avoiding determinization

Qualitative timeline-based planning models domains as sets of independent, but interacting, components whose behaviors over time, the timelines, are governed by sets of qualitative temporal constraints (ordering relations), called synchronization rules. Its plan-existence problem has been shown to be PSPACE-complete; in particular, PSPACE-membership has been proved via reduction to the nonemptiness problem for nondeterministic finite automata. However, nondeterministic automata cannot be directly used to synthesize planning strategies as a costly determinization step is needed. In this paper, we identify a fragment of qualitative timeline-based planning whose plan-existence problem can be directly mapped into the nonemptiness problem of deterministic finite automata, which can then synthesize strategies. In addition, we identify a maximal subset of Allen's relations that fits into such a deterministic fragment.

Updated: 2026-03-30 13:14:10

标题: 避免确定化的基于时间轴的规划策略的合成

摘要: 定性基于时间轴的规划模型将领域视为一组独立但相互作用的组件，其随时间变化的行为（时间轴）受定性时间约束集合（顺序关系）控制，称为同步规则。已经证明其计划存在问题是PSPACE完全的；特别是通过将其归约为非确定性有限自动机的非空问题，已经证明了PSPACE成员资格。然而，非确定性自动机不能直接用于合成规划策略，因为需要昂贵的确定化步骤。在本文中，我们确定了一种定性基于时间轴的规划的片段，其计划存在问题可以直接映射到确定性有限自动机的非空问题，从而可以合成策略。此外，我们确定了Allen关系的最大子集，适用于这样一个确定性片段。

更新时间: 2026-03-30 13:14:10

领域: cs.AI

下载: http://arxiv.org/abs/2507.17988v2

Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers

The dense output projection in multi head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter free Walsh Hadamard Transform (WHT) followed by a diagonal affine transformation. This approach eliminates approximately 25 percent of attention parameters per block while maintaining global cross-head interaction through an orthogonal, norm-preserving transformation. Our results demonstrate that WHT augmented models exhibit a steeper validation loss curve relative to training FLOPs compared to dense baselines, suggesting superior compute utilization during training. Crucially, we show that efficiency gains including reduced memory footprint and increased throughput grow monotonically with model size, batch size, and sequence length. We evaluate performance across both prefill and decoding stages, finding that the structured transform consistently outperforms dense projections as complexity increases. Our findings indicate that replacing dense projections with structured transforms allows for more compute-efficient architectures that achieve lower loss than dense models at an equivalent training budget.

Updated: 2026-03-30 13:10:04

标题: 重新思考注意力输出投影：结构化哈达玛变换用于高效Transformer

摘要: 在多头注意力中，密集输出投影随着模型维度呈二次增长，显著增加了参数数量、内存占用和推理成本。我们提出用固定的、无参数的Walsh Hadamard变换（WHT）代替这种投影，然后再进行对角仿射变换。这种方法可以在保持全局跨头交互的同时，每个块中消除大约25%的注意力参数。我们的结果表明，使用WHT增强模型相对于密集基准模型在验证损失曲线上表现更陡，这表明在训练期间具有更优越的计算利用率。关键是，我们发现效率提升，包括减少内存占用和增加吞吐量，随着模型大小、批处理大小和序列长度的增加而单调增加。我们评估了在预填充和解码阶段的性能，发现随着复杂度的增加，结构化变换始终优于密集投影。我们的研究结果表明，用结构化变换代替密集投影可以实现更高效的计算架构，在相同的训练预算下实现比密集模型更低的损失。

更新时间: 2026-03-30 13:10:04

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2603.08343v2

Label-efficient Training Updates for Malware Detection over Time

Machine Learning (ML)-based detectors are becoming essential to counter the proliferation of malware. However, common ML algorithms are not designed to cope with the dynamic nature of real-world settings, where both legitimate and malicious software evolve. This distribution drift causes models trained under static assumptions to degrade over time unless they are continuously updated. Regularly retraining these models, however, is expensive, since labeling new acquired data requires costly manual analysis by security experts. To reduce labeling costs and address distribution drift in malware detection, prior work explored active learning (AL) and semi-supervised learning (SSL) techniques. Yet, existing studies (i) are tightly coupled to specific detector architectures and restricted to a specific malware domain, resulting in non-uniform comparisons; and (ii) lack a consistent methodology for analyzing the distribution drift, despite the critical sensitivity of the malware domain to temporal changes. In this work, we bridge this gap by proposing a model-agnostic framework that evaluates an extensive set of AL and SSL techniques, isolated and combined, for Android and Windows malware detection. We show that these techniques, when combined, can reduce manual annotation costs by up to 90% across both domains while achieving comparable detection performance to full-labeling retraining. We also introduce a methodology for feature-level drift analysis that measures feature stability over time, showing its correlation with the detector performance. Overall, our study provides a detailed understanding of how AL and SSL behave under distribution drift and how they can be successfully combined, offering practical insights for the design of effective detectors over time.

Updated: 2026-03-30 13:05:44

标题: 随时间标签高效训练更新在恶意软件检测中的应用

摘要: 基于机器学习（ML）的检测器正在成为对抗恶意软件蔓延的必要手段。然而，常见的ML算法并非设计用于应对真实世界环境的动态特性，其中合法和恶意软件都在不断演变。这种分布漂移会导致在静态假设下训练的模型随时间推移而退化，除非它们不断更新。然而，定期重新训练这些模型是昂贵的，因为标记新获得的数据需要安全专家进行昂贵的手动分析。为了降低标记成本并解决恶意软件检测中的分布漂移问题，先前的研究探讨了主动学习（AL）和半监督学习（SSL）技术。然而，现有研究（i）与特定的检测器架构紧密耦合，并受限于特定的恶意软件领域，导致了不均匀的比较；（ii）尽管恶意软件领域对时间变化的敏感性很高，却缺乏一致的分布漂移分析方法。在这项工作中，我们通过提出一个模型无关的框架，评估了一系列独立和组合的AL和SSL技术，用于Android和Windows恶意软件检测。我们展示了当这些技术结合使用时，可以在两个领域中将手动注释成本降低高达90%，同时实现与全标记重新训练相当的检测性能。我们还介绍了一种特征级漂移分析方法，用于衡量特征随时间的稳定性，并显示其与检测器性能之间的相关性。总的来说，我们的研究提供了对AL和SSL在分布漂移下的行为的详细理解，并展示了它们如何成功结合，为设计随时间有效的检测器提供了实用洞见。

更新时间: 2026-03-30 13:05:44

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2603.28396v1

From Simulation to Deep Learning: Survey on Network Performance Modeling Approaches

Network performance modeling is a field that predates early computer networks and the beginning of the Internet. It aims to predict the traffic performance of packet flows in a given network. Its applications range from network planning and troubleshooting to feeding information to network controllers for configuration optimization. Traditional network performance modeling has relied heavily on Discrete Event Simulation (DES) and analytical methods grounded in mathematical theories such as Queuing Theory and Network Calculus. However, as of late, we have observed a paradigm shift, with attempts to obtain efficient Parallel DES, the surge of Machine Learning models, and their integration with other methodologies in hybrid approaches. This has resulted in a great variety of modeling approaches, each with its strengths and often tailored to specific scenarios or requirements. In this paper, we comprehensively survey the relevant network performance modeling approaches for wired networks over the last decades. With this understanding, we also define a taxonomy of approaches, summarizing our understanding of the state-of-the-art and how both technology and the concerns of the research community evolve over time. Finally, we also consider how these models are evaluated, how their different nature results in different evaluation requirements and goals, and how this may complicate their comparison.

Updated: 2026-03-30 13:03:19

标题: 从模拟到深度学习：网络性能建模方法调查

摘要: 网络性能建模是一个早在计算机网络和互联网开始之前就存在的领域。它旨在预测给定网络中数据包流的流量性能。其应用范围从网络规划和故障排除到向网络控制器提供信息以进行配置优化。传统的网络性能建模主要依赖于离散事件模拟（DES）和基于数学理论如排队论和网络微积分的分析方法。然而，最近，我们观察到了一个范式转变，尝试获得高效的并行DES，机器学习模型的激增以及它们与其他方法在混合方法中的整合。这导致了各种建模方法的出现，每种方法都有其优势，通常针对特定场景或要求进行定制。在本文中，我们全面调查了过去几十年有线网络的相关网络性能建模方法。通过这种理解，我们还定义了一种方法论，总结了我们对最新技术和研究社区关注的理解如何随着时间演变。最后，我们还考虑了这些模型如何进行评估，它们不同的性质导致不同的评估要求和目标，以及这可能会使它们的比较变得复杂。

更新时间: 2026-03-30 13:03:19

领域: cs.NI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.28394v1

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, \textsc{FOR2107} (affective disorders) and \textsc{OASIS-3} (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58\% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely \emph{mentioning} MRI availability in the task prompt accounts for 70-80\% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the \emph{scaffold effect}. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

Updated: 2026-03-30 12:58:10

标题: 脚手架效应：提示框架如何推动临床VLM评估中的明显多模态增益

摘要: 值得信赖的临床人工智能需要表现提升反映真实证据整合，而不是表面层面的痕迹。我们在两个临床神经影像队列（\textsc{FOR2107}（情感障碍）和\textsc{OASIS-3}（认知衰退））上评估了12个开放权重的视觉-语言模型（VLMs）进行二分类。这两个数据集都带有结构性MRI数据，其中不包含可靠的个体级诊断信号。在这些条件下，较小的VLMs在引入神经影像背景时表现提升高达58％的F1，经过提炼的模型变得与大小相差一个数量级的对手相竞争。对比信心分析显示，仅在任务提示中“提及”MRI可用性就占据了这一变化的70-80％，无论是否存在影像数据，这是一种我们称之为“支架效应”的领域特定的模态折叠实例。专家评估显示，在所有条件下都存在基于神经影像的理由的捏造，而偏好一致性，消除了MRI参考行为，使两个条件都向随机基线折叠。我们的发现表明，表面评估不足以作为多模态推理的指标，对于在临床环境中部署VLMs具有直接意义。

更新时间: 2026-03-30 12:58:10

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.28387v1

COvolve: Adversarial Co-Evolution of Large-Language-Model-Generated Policies and Environments via Two-Player Zero-Sum Game

A central challenge in building continually improving agents is that training environments are typically static or manually constructed. This restricts continual learning and generalization beyond the training distribution. We address this with COvolve, a co-evolutionary framework that leverages large language models (LLMs) to generate both environments and agent policies, expressed as executable Python code. We model the interaction between environment and policy designers as a two-player zero-sum game, ensuring adversarial co-evolution in which environments expose policy weaknesses and policies adapt in response. This process induces an automated curriculum in which environments and policies co-evolve toward increasing complexity. To guarantee robustness and prevent forgetting as the curriculum progresses, we compute the mixed-strategy Nash equilibrium (MSNE) of the zero-sum game, thereby yielding a meta-policy. This MSNE meta-policy ensures that the agent does not forget to solve previously seen environments while learning to solve previously unseen ones. Experiments in urban driving, symbolic maze-solving, and geometric navigation showcase that COvolve produces progressively more complex environments. Our results demonstrate the potential of LLM-driven co-evolution to achieve open-ended learning without predefined task distributions or manual intervention.

Updated: 2026-03-30 12:56:54

标题: COvolve：通过双人零和博弈对大语言模型生成的策略和环境进行对抗性共同进化

摘要: 建立不断改进的代理程序的一个主要挑战是训练环境通常是静态的或手动构建的。这限制了持续学习和泛化超出训练分布。我们通过COvolve来解决这个问题，这是一个利用大型语言模型（LLMs）生成环境和代理策略的协同进化框架，表达为可执行的Python代码。我们将环境和策略设计者之间的交互建模为一个双人零和博弈，确保对抗性协同进化，其中环境暴露策略的弱点，策略则做出相应调整。这个过程引入了一个自动化课程，使环境和策略朝着增加复杂性的方向进行协同进化。为了确保鲁棒性并防止在课程进行中遗忘，我们计算了零和博弈的混合策略纳什均衡（MSNE），从而得到了一个元策略。这个MSNE元策略确保代理程序不会忘记解决先前见过的环境，同时学会解决先前未见过的环境。在城市驾驶、符号迷宫解决和几何导航方面的实验展示了COvolve产生逐渐复杂的环境的潜力。我们的结果表明，由LLM驱动的协同进化可以实现无预定义任务分布或手动干预的无限学习的潜力。

更新时间: 2026-03-30 12:56:54

领域: cs.AI

下载: http://arxiv.org/abs/2603.28386v1

Critic-Free Deep Reinforcement Learning for Maritime Coverage Path Planning on Irregular Hexagonal Grids

Maritime surveillance missions, such as search and rescue and environmental monitoring, rely on the efficient allocation of sensing assets over vast and geometrically complex areas. Traditional Coverage Path Planning (CPP) approaches depend on decomposition techniques that struggle with irregular coastlines, islands, and exclusion zones, or require computationally expensive re-planning for every instance. We propose a Deep Reinforcement Learning (DRL) framework to solve CPP on hexagonal grid representations of irregular maritime areas. Unlike conventional methods, we formulate the problem as a neural combinatorial optimization task where a Transformer-based pointer policy autoregressively constructs coverage tours. To overcome the instability of value estimation in long-horizon routing problems, we implement a critic-free Group-Relative Policy Optimization (GRPO) scheme. This method estimates advantages through within-instance comparisons of sampled trajectories rather than relying on a value function. Experiments on 1,000 unseen synthetic maritime environments demonstrate that a trained policy achieves a 99.0% Hamiltonian success rate, more than double the best heuristic (46.0%), while producing paths 7% shorter and with 24% fewer heading changes than the closest baseline. All three inference modes (greedy, stochastic sampling, and sampling with 2-opt refinement) operate under 50~ms per instance on a laptop GPU, confirming feasibility for real-time on-board deployment.

Updated: 2026-03-30 12:56:38

标题: 无批评的深度强化学习在不规则六边形网格上的海上覆盖路径规划中的应用

摘要: 海上监视任务，如搜索和救援以及环境监测，依赖于在广阔且几何复杂的区域上高效分配感知资产。传统的覆盖路径规划（CPP）方法依赖于分解技术，这些技术在处理不规则海岸线、岛屿和排除区域时存在困难，或者需要在每个实例中进行计算昂贵的重新规划。我们提出了一种基于深度强化学习（DRL）框架来解决不规则海上区域的六边形网格表示的CPP。与传统方法不同，我们将问题形式化为一个神经组合优化任务，其中一个基于Transformer的指针策略自回归地构建覆盖路径。为了克服长期规划问题中价值估计的不稳定性，我们实现了一种不依赖评论家的群相对策略优化（GRPO）方案。该方法通过对抽样路径进行实例内比较来估计优势，而不是依赖价值函数。对1,000个未见过的合成海上环境的实验表明，训练有素的策略达到了99.0%的哈密顿成功率，是最佳启发式方法的两倍以上（46.0%），同时产生的路径比最接近的基线短7%，航向变化少24%。所有三种推理模式（贪心、随机抽样和带有2-opt优化的抽样）在笔记本电脑GPU上每个实例不到50毫秒的速度运行，证实了实时机载部署的可行性。

更新时间: 2026-03-30 12:56:38

领域: cs.LG,cs.AI,cs.NE,cs.RO

下载: http://arxiv.org/abs/2603.28385v1

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .

Updated: 2026-03-30 12:50:11

标题: MALLVI：用于集成化广义机器人操作的多智能体框架

摘要: 使用大型语言模型（LLMs）进行机器人操作任务规划是一个新兴领域。先前的方法依赖于专门的模型、微调或提示微调，并且通常在没有稳健环境反馈的情况下以开环方式运行，使它们在动态环境中变得脆弱。MALLVI提出了一个多Agent大型语言和视觉框架，可以实现闭环反馈驱动的机器人操作。给定自然语言指令和环境图像，MALLVI为机器人操作员生成可执行的原子动作。在动作执行之后，视觉语言模型（VLM）评估环境反馈，并决定是重复过程还是继续下一步。与使用单一模型不同，MALLVI协调专门的代理人，分解器、定位器、思考者和反射器，来管理感知、定位、推理和高层规划。一个可选的描述代理提供了初始状态的视觉记忆。反射器通过重新激活仅相关代理而不进行完整的重新规划，支持有针对性的错误检测和恢复。在模拟和真实环境中的实验表明，迭代闭环多代理协调改善了泛化能力，并提高了零镜头操作任务的成功率。代码可在https://github.com/iman1234ahmadi/MALLVI找到。

更新时间: 2026-03-30 12:50:11

领域: cs.RO,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2602.16898v4

Fairness in Healthcare Processes: A Quantitative Analysis of Decision Making in Triage

Fairness in automated decision-making has become a critical concern, particularly in high-pressure healthcare scenarios such as emergency triage, where fast and equitable decisions are essential. Process mining is increasingly investigating fairness. There is a growing area focusing on fairness-aware algorithms. So far, we know less how these concepts perform on empirical healthcare data or how they cover aspects of justice theory. This study addresses this research problem and proposes a process mining approach to assess fairness in triage by linking real-life event logs with conceptual dimensions of justice. Using the MIMICEL event log (as derived from MIMIC-IV ED), we analyze time, re-do, deviation and decision as process outcomes, and evaluate the influence of age, gender, race, language and insurance using the Kruskal-Wallis, Chi-square and effect size measurements. These outcomes are mapped to justice dimensions to support the development of a conceptual framework. The results demonstrate which aspects of potential unfairness in high-acuity and sub-acute surface. In this way, this study contributes empirical insights that support further research in responsible, fairness-aware process mining in healthcare.

Updated: 2026-03-30 12:49:58

标题: 医疗过程中的公正性：三级分诊决策的定量分析

摘要: 自动决策中的公平性已成为一个关键关注点，特别是在高压力的医疗场景中，如紧急分诊中，快速和公平的决策至关重要。过程挖掘越来越多地探讨公平性。有一个日益增长的领域专注于公平意识算法。到目前为止，我们知道较少关于这些概念在实证医疗数据上的表现，或者它们如何涵盖正义理论的方面。本研究解决了这一研究问题，并提出了一个过程挖掘方法来通过将真实事件日志与正义概念维度联系起来来评估分诊的公平性。使用MIMICEL事件日志（从MIMIC-IV ED派生），我们分析了时间、重做、偏差和决策作为过程结果，并使用Kruskal-Wallis、卡方和效应大小测量来评估年龄、性别、种族、语言和保险的影响。将这些结果映射到正义维度，以支持概念框架的发展。结果展示了在高急性和次急性表面可能不公平的方面。通过这种方式，本研究提供了支持进一步研究负责任、公平意识的过程挖掘在医疗保健领域的实证见解。

更新时间: 2026-03-30 12:49:58

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2601.11065v2

Membership Inference Attacks against Large Audio Language Models

We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). As audio encodes non-semantic information, it induces severe train and test distribution shifts and can lead to spurious MIA performance. Using a multi-modal blind baseline based on textual, spectral, and prosodic features, we demonstrate that common speech datasets exhibit near-perfect train/test separability (AUC approximately 1.0) even without model inference, and the standard MIA scores strongly correlate with these blind acoustic artifacts (correlation greater than 0.7). Using this blind baseline, we identify that distribution-matched datasets enable reliable MIA evaluation without distribution shift confounds. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations.

Updated: 2026-03-30 12:45:28

标题: 大型音频语言模型的成员推断攻击

摘要: 我们提出了对大型音频语言模型（LALMs）进行系统成员推理攻击（MIA）评估的第一个系统评估。由于音频编码了非语义信息，它引起了严重的训练和测试分布偏移，可能导致虚假的MIA性能。使用基于文本、频谱和韵律特征的多模态盲基线，我们证明常见的语音数据集即使在没有模型推理的情况下也表现出接近完美的训练/测试可分离性（AUC约为1.0），标准MIA分数与这些盲目声学特征强相关（相关性大于0.7）。使用这个盲目基线，我们确定匹配分布的数据集可以在没有分布偏移混淆的情况下可靠地进行MIA评估。我们对这些数据集进行多个MIA方法的基准测试，并进行模态解缠实验。结果显示，LALM的记忆是跨模态的，仅来自将说话者的声音身份与其文本绑定。这些发现为审计LALMs建立了一个基于原则的标准，超越了虚假相关性。

更新时间: 2026-03-30 12:45:28

领域: cs.SD,cs.AI

下载: http://arxiv.org/abs/2603.28378v1

Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade the overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: \textbf{(1)~QA Data Synthesis:} We introduce verification mechanisms to graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; \textbf{(2)~Trajectory Construction:} We design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and \textbf{(3)~Test-time scaling:} We use Marco DeepResearch itself as a verifier at inference time and effectively improve performance on challenging questions. Extensive experimental results demonstrate that our proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, like Tongyi DeepResearch-30B.

Updated: 2026-03-30 12:42:02

标题: Marco DeepResearch: 通过以验证为中心的设计解锁高效的深度研究代理

摘要: 深度研究代理自主地进行开放式调查，整合复杂信息检索与跨多源多步推理，以解决现实世界问题。为了在长期任务中维持这种能力，可靠的验证在训练和推理过程中至关重要。现有模式中的一个主要瓶颈源于在QA数据合成、轨迹构建和测试时缺乏明确的验证机制。在每个阶段引入的错误会向下游传播，降低整体代理性能。为了解决这个问题，我们提出了Marco DeepResearch，一个优化了验证为中心框架设计的深度研究代理，包括三个层次：(1) QA数据合成：我们引入验证机制到基于图和代理的QA合成中，以控制问题难度，同时确保答案是独特且正确的；(2) 轨迹构建：我们设计了一种验证驱动的轨迹合成方法，将明确的验证模式注入到训练轨迹中；(3) 测试时扩展：我们在推理时使用Marco DeepResearch本身作为验证器，并有效提高了在具有挑战性问题上的表现。广泛的实验结果表明，我们提出的Marco DeepResearch代理在大多数具有挑战性的基准测试中显著优于8B规模的深度研究代理，如BrowseComp和BrowseComp-ZH。至关重要的是，在最大预算为600工具调用的情况下，Marco DeepResearch甚至超过或接近一些30B规模的代理，如Tongyi DeepResearch-30B。

更新时间: 2026-03-30 12:42:02

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2603.28376v1

A Dynamic Framework for Grid Adaptation in Kolmogorov-Arnold Networks

Kolmogorov-Arnold Networks (KANs) have recently demonstrated promising potential in scientific machine learning, partly due to their capacity for grid adaptation during training. However, existing adaptation strategies rely solely on input data density, failing to account for the geometric complexity of the target function or metrics calculated during network training. In this work, we propose a generalized framework that treats knot allocation as a density estimation task governed by Importance Density Functions (IDFs), allowing training dynamics to determine grid resolution. We introduce a curvature-based adaptation strategy and evaluate it across synthetic function fitting, regression on a subset of the Feynman dataset and different instances of the Helmholtz PDE, demonstrating that it significantly outperforms the standard input-based baseline. Specifically, our method yields average relative error reductions of 25.3% on synthetic functions, 9.4% on the Feynman dataset, and 23.3% on the PDE benchmark. Statistical significance is confirmed via Wilcoxon signed-rank tests, establishing curvature-based adaptation as a robust and computationally efficient alternative for KAN training.

Updated: 2026-03-30 12:41:10

标题: 一个动态框架用于Kolmogorov-Arnold网络中的网格调整

摘要: 科尔莫戈洛夫-阿诺德网络（KANs）最近在科学机器学习领域展现出了很有潜力，部分原因是它们在训练过程中具有网格自适应的能力。然而，现有的自适应策略仅依赖于输入数据密度，未考虑目标函数的几何复杂性或网络训练过程中计算的度量。在本研究中，我们提出了一个广义框架，将节点分配视为由重要密度函数（IDFs）控制的密度估计任务，允许训练动态确定网格分辨率。我们引入了基于曲率的自适应策略，并在合成函数拟合、费曼数据集的回归以及不同实例的赫尔姆霍兹PDE上进行了评估，结果表明它明显优于标准的基于输入的基线。具体地，我们的方法在合成函数上平均相对误差降低了25.3％，在费曼数据集上降低了9.4％，在PDE基准上降低了23.3％。通过Wilcoxon符号秩检验确认了统计显著性，将基于曲率的自适应视为KAN训练的一种稳健且计算效率高的替代方法。

更新时间: 2026-03-30 12:41:10

领域: cs.LG

下载: http://arxiv.org/abs/2601.18672v2

Echo State Networks for Time Series Forecasting: Hyperparameter Sweep and Benchmarking

This paper investigates the forecasting performance of Echo State Networks (ESNs) for univariate time series forecasting using a subset of the M4 Forecasting Competition dataset. Focusing on monthly and quarterly time series with at most 20 years of historical data, we evaluate whether a fully automatic, purely feedback-driven ESN can serve as a competitive alternative to widely used statistical forecasting methods. The study adopts a rigorous two-stage evaluation approach: a Parameter dataset is used to conduct an extensive hyperparameter sweep covering leakage rate, spectral radius, reservoir size, and information criteria for regularization, resulting in over four million ESN model fits; a disjoint Forecast dataset is then used for out-of-sample accuracy assessment. Forecast accuracy is measured using MASE and sMAPE and benchmarked against simple benchmarks like drift and seasonal naive and statistical models like ARIMA, ETS, and TBATS. The hyperparameter analysis reveals consistent and interpretable patterns, with monthly series favoring moderately persistent reservoirs and quarterly series favoring more contractive dynamics. Across both frequencies, high leakage rates are preferred, while optimal spectral radii and reservoir sizes vary with temporal resolution. In the out-of-sample evaluation, the ESN performs on par with ARIMA and TBATS for monthly data and achieves the lowest mean MASE for quarterly data, while requiring lower computational cost than the more complex statistical models. Overall, the results demonstrate that ESNs offer a compelling balance between predictive accuracy, robustness, and computational efficiency, positioning them as a practical option for automated time series forecasting.

Updated: 2026-03-30 12:39:28

标题: Echo State Networks 用于时间序列预测：超参数扫描和基准测试

摘要: 本文研究了使用M4预测竞赛数据集的子集进行单变量时间序列预测时Echo State Networks（ESNs）的预测性能。专注于具有最多20年历史数据的月度和季度时间序列，我们评估了一个完全自动的、纯反馈驱动的ESN是否可以作为广泛使用的统计预测方法的竞争性替代品。研究采用了严格的两阶段评估方法：使用参数数据集进行广泛的超参数调整，涵盖泄漏率、谱半径、储水池大小和用于正则化的信息标准，从而产生超过400万个ESN模型拟合；然后使用不相交的预测数据集进行外样本准确性评估。使用MASE和sMAPE来衡量预测准确性，并将其与漂移和季节性天真以及ARIMA、ETS和TBATS等简单基准模型进行对比。超参数分析显示出一致且可解释的模式，月度系列偏好于中度持久的储水池，季度系列偏好于更具收缩性的动态。在两种频率上，高泄漏率被青睐，而最佳谱半径和储水池大小随时间分辨率而变化。在外样本评估中，ESN在月度数据上与ARIMA和TBATS表现相当，并且对于季度数据实现了最低的平均MASE，同时比更复杂的统计模型具有更低的计算成本。总体而言，结果表明ESNs在预测准确性、鲁棒性和计算效率之间提供了令人信服的平衡，使其成为自动化时间序列预测的实用选择。

更新时间: 2026-03-30 12:39:28

领域: cs.LG

下载: http://arxiv.org/abs/2602.03912v2

Coherent Without Grounding, Grounded Without Success: Observability and Epistemic Failure

When an agent can articulate why something works, we typically take this as evidence of genuine understanding. This presupposes that effective action and correct explanation covary, and that coherent explanation reliably signals both. I argue that this assumption fails for contemporary Large Language Models (LLMs). I introduce what I call the Bidirectional Coherence Paradox: competence and grounding not only dissociate but invert across epistemic conditions. In low-observability domains, LLMs often act successfully while misidentifying the mechanisms that produce their success. In high-observability domains, they frequently generate explanations that accurately track observable causal structure yet fail to translate those diagnoses into effective intervention. In both cases, explanatory coherence remains intact, obscuring the underlying dissociation. Drawing on experiments in compiler optimization and hyperparameter tuning, I develop the Epistemic Triangle, a model of how priors, signals, and domain knowledge interact under varying observability. The results suggest that neither behavioral success nor explanatory accuracy alone suffices for attributing understanding. I argue that evaluating artificial epistemic agents requires a tripartite framework -- coherence, grounding, and a proper basing relation linking explanation to action. The systematic separation of knowing-that and knowing-how in LLMs thus challenges assumptions inherited from both epistemology and current AI evaluation practice.

Updated: 2026-03-30 12:38:42

标题: 没有凝聚力，没有基础支撑：可观察性与认识失败

摘要: 当一个代理人能够解释为什么某事起作用时，我们通常将其视为真正理解的证据。这假定有效行动和正确解释是相互关联的，并且连贯的解释可靠地表明两者。我认为，这一假设在当今的大型语言模型（LLMs）中不成立。我介绍了我所谓的双向连贯悖论：能力和基础不仅分离，而且在认识条件下相互转换。在低可观测性领域，LLMs经常能成功行动，同时错误地识别产生成功的机制。在高可观测性领域，它们经常生成能够准确追踪可观察因果结构的解释，但未能将这些诊断转化为有效干预。在这两种情况下，解释的连贯性仍然保持完整，掩盖了其中的分离。根据编译器优化和超参数调整的实验结果，我发展了认识三角形，这是一个模型，说明先验、信号和领域知识在不同可观测性条件下如何相互作用。结果表明，仅有行为成功或解释准确性是不足以归因于理解的。我认为，评估人工认识代理人需要一个三部分的框架--连贯性、基础和将解释与行动联系起来的适当基础关系。LLMs中知道什么和知道如何的系统分离因此挑战了从认识论和当前人工智能评估实践中继承的假设。

更新时间: 2026-03-30 12:38:42

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2603.28371v1

An Agentic Operationalization of DISARM for FIMI Investigation on Social Media

Interoperable data and intelligence flows among allied partners and operational end-users remain essential to NATO's collective defense across both conventional and hybrid threat environments. Foreign Information Manipulation and Interference (FIMI) increasingly spans multiple societal domains and information ecosystems, complicating threat characterization, persistent situational awareness, and coordinated response. Concurrent advances in AI have further lowered the barrier to conducting large-scale, AI-augmented FIMI activities -- including automated generation, personalization, and amplification of manipulative content. While frameworks such as DISARM offer a standardized analytical and metadata schema for characterizing FIMI incidents, their practical application for automating large-scale detection remains challenging. We present a framework-agnostic, agent-based operationalization of DISARM piloted to support FIMI investigation on social platforms. Our agent coordination pipeline integrates general agentic AI components that (1) identify candidate manipulative behaviors in social-media data and (2) map these behaviors to DISARM taxonomies through transparent, auditable reasoning steps. Evaluation on two practitioner-annotated, real-world datasets demonstrates that our approach can effectively scale analytic workflows that are currently manual, time-intensive, and interpretation-heavy. Notably, the experiment surfaced more than 30 previously undetected Russian bot accounts -- deployed for the 2025 election in Moldova -- during the prior non-agentic investigation. By enhancing analytic throughput, interoperability, and explainability, the proposed approach provides a direct contribution to defense policy and planning needs for improved situational awareness, cross-partner data integration, and rapid assessment of information-environment threats.

Updated: 2026-03-30 12:35:33

标题: 一种DISARM在社交媒体FIMI调查中的主体化操作化

摘要: 跨盟军伙伴和操作终端用户之间的可互操作数据和情报流仍然对北约在传统和混合威胁环境下的集体防御至关重要。外国信息操纵和干扰（FIMI）越来越跨越多的社会领域和信息生态系统，使威胁特征化、持续情报意识和协调应对变得更加复杂。人工智能的同时进步进一步降低了进行大规模、AI增强的FIMI活动的障碍，包括操纵内容的自动生成、个性化和放大。虽然像DISARM这样的框架提供了一个标准化的分析和元数据模式，用于描述FIMI事件，但它们在自动化大规模检测方面的实际应用仍然具有挑战性。我们提出了一种基于框架不可知的、基于代理的DISARM操作化方法，用于支持社交平台上的FIMI调查。我们的代理协调管道整合了一般的代理人工智能组件，这些组件可以（1）在社交媒体数据中识别候选的操纵行为，以及（2）通过透明、可审计的推理步骤将这些行为映射到DISARM分类法。对两个从业者注释的真实数据集进行评估表明，我们的方法可以有效地扩展目前手动、耗时和需要解释的分析工作流程。值得注意的是，在先前的非代理调查中，实验发现了30多个此前未被检测到的俄罗斯机器人账户，这些账户被用于2025年在摩尔多瓦的选举中。通过增强分析吞吐量、互操作性和可解释性，所提出的方法为改善情报环境威胁的形势意识、跨合作伙伴数据集成和快速评估提供了直接的贡献。

更新时间: 2026-03-30 12:35:33

领域: cs.SI,cs.AI,cs.CY,cs.HC,cs.MA

下载: http://arxiv.org/abs/2601.15109v3

Deep Research of Deep Research: From Transformer to Agent, From AI to AI for Science

With the advancement of large language models (LLMs) in their knowledge base and reasoning capabilities, their interactive modalities have evolved from pure text to multimodality and further to agentic tool use. Consequently, their applications have broadened from question answering to AI assistants and now to general-purpose agents. Deep research (DR) represents a prototypical vertical application for general-purpose agents, which represents an ideal approach for intelligent information processing and assisting humans in discovering and solving problems, with the goal of reaching or even surpassing the level of top human scientists. This paper provides a deep research of deep research. We articulate a clear and precise definition of deep research and unify perspectives from industry's deep research and academia's AI for Science (AI4S) within a developmental framework. We position LLMs and Stable Diffusion as the twin pillars of generative AI, and lay out a roadmap evolving from the Transformer to agents. We examine the progress of AI4S across various disciplines. We identify the predominant paradigms of human-AI interaction and prevailing system architectures, and discuss the major challenges and fundamental research issues that remain. AI supports scientific innovation, and science also can contribute to AI growth (Science for AI, S4AI). We hope this paper can help bridge the gap between the AI and AI4S communities.

Updated: 2026-03-30 12:29:47

标题: 深入研究深度研究：从Transformer到Agent，从人工智能到科学人工智能

摘要: 随着大型语言模型（LLMs）在知识库和推理能力方面的进展，它们的交互模式已从纯文本发展到多模态，进一步演变为代理工具使用。因此，它们的应用已从问题回答扩展到人工智能助手，现在又发展到通用代理。深度研究（DR）代表了通用代理的一个典型垂直应用，这代表了智能信息处理和帮助人类发现和解决问题的理想方法，目标是达到甚至超越顶尖人类科学家的水平。本文对深度研究进行了深入的研究。我们明确和精确地定义了深度研究，并在一个发展框架内统一了行业深度研究和学术界的科学人工智能（AI4S）的观点。我们将LLMs和稳定扩散定位为生成式人工智能的双支柱，并提出了从Transformer到代理的演进路线图。我们检验了AI4S在各个学科领域的进展。我们确定了人工智能与人类互动的主导范式和主流系统架构，并讨论了仍然存在的主要挑战和基本研究问题。人工智能支持科学创新，科学也可以促进人工智能的增长（科学为人工智能，S4AI）。我们希望本文能够帮助弥合人工智能和AI4S社区之间的鸿沟。

更新时间: 2026-03-30 12:29:47

领域: cs.AI,cs.MA

下载: http://arxiv.org/abs/2603.28361v1

Scaling Attention via Feature Sparsity

Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as $k$-sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from $Θ(n^2 d)$ to $Θ(n^2 k^2/d)$. To make this efficient at scale, we introduce FlashSFA, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to $2.5\times$ and reducing FLOPs and KV-cache by nearly 50\%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity. These results establish feature-level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss. Code is available at https://github.com/YannX1e/Sparse-Feature-Attention.

Updated: 2026-03-30 12:28:48

标题: 通过特征稀疏性扩展注意力

摘要: 将Transformer扩展到超长上下文受到自注意力的$O(n^2 d)$成本的限制。现有的方法通过本地窗口、核逼近或令牌级稀疏性沿序列轴减少这种成本，但这些方法一直导致精度下降。在本文中，我们探索了一个正交轴：特征稀疏性。我们提出了稀疏特征注意力（SFA），其中查询和键被表示为保留高维表达能力的$k$-稀疏编码，同时将注意力成本从$Θ(n^2 d)$降低到$Θ(n^2 k^2/d)$。为了使这种方法在规模上高效，我们引入了FlashSFA，这是一种IO感知的内核，将FlashAttention扩展到直接在稀疏重叠上操作，而无需实现密集的评分矩阵。在GPT-2和Qwen3的预训练中，SFA与密集基线匹配，同时提高速度最多$2.5\times$，并将FLOPs和KV缓存减少近50%。在合成和下游基准测试中，SFA保持在长上下文中的检索精度和鲁棒性，胜过将特征多样性折叠的短嵌入基线。这些结果将特征级稀疏性确定为一种用于高效注意力的补充且未被充分探索的轴，使得Transformer可以在质量损失最小的情况下扩展到数量级更长的上下文。代码可在https://github.com/YannX1e/Sparse-Feature-Attention找到。

更新时间: 2026-03-30 12:28:48

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.22300v2

CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems

Uncertainty estimation in multi-LLM systems remains largely single-model-centric: existing methods quantify uncertainty within each model but do not adequately capture semantic disagreement across models. To address this gap, we propose Collaborative Entropy (CoE), a unified information-theoretic metric for semantic uncertainty in multi-LLM collaboration. CoE is defined on a shared semantic cluster space and combines two components: intra-model semantic entropy and inter-model divergence to the ensemble mean. CoE is not a weighted ensemble predictor; it is a system-level uncertainty measure that characterizes collaborative confidence and disagreement. We analyze several core properties of CoE, including non-negativity, zero-value certainty under perfect semantic consensus, and the behavior of CoE when individual models collapse to delta distributions. These results clarify when reducing per-model uncertainty is sufficient and when residual inter-model disagreement remains. We also present a simple CoE-guided, training-free post-hoc coordination heuristic as a practical application of the metric. Experiments on \textit{TriviaQA} and \textit{SQuAD} with LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Mistral-7B-Instruct show that CoE provides stronger uncertainty estimation than standard entropy- and divergence-based baselines, with gains becoming larger as additional heterogeneous models are introduced. Overall, CoE offers a useful uncertainty-aware perspective on multi-LLM collaboration.

Updated: 2026-03-30 12:28:26

标题: CoE：在代理多层次概率逻辑模型系统中用于不确定性量化的协作熵

摘要: 在多LLM系统中，不确定性估计主要集中在单一模型：现有方法在每个模型内量化不确定性，但未能充分捕捉跨模型之间的语义分歧。为了填补这一空白，我们提出了协作熵（CoE），这是一种用于多LLM协作中语义不确定性的统一信息论度量。CoE定义在共享的语义聚类空间上，结合了两个组成部分：模型内语义熵和模型间到集成均值的分歧。CoE不是一个加权集成预测器；它是一个系统级的不确定性测量，用来表征协作的信心和分歧。我们分析了CoE的几个核心属性，包括非负性，在完美语义一致性下的零值确定性，以及当单个模型坍缩为δ分布时CoE的行为。这些结果澄清了何时减少每个模型的不确定性足够，何时残留的模型间分歧仍然存在。我们还提出了一个简单的基于CoE的、无需训练的事后协调启发式方法作为度量的实际应用。在\textit{TriviaQA}和\textit{SQuAD}上进行的实验，使用LLaMA-3.1-8B-Instruct、Qwen-2.5-7B-Instruct和Mistral-7B-Instruct，结果显示CoE比标准熵和分歧基线提供了更强的不确定性估计，随着引入更多异构模型，收益变得更大。总的来说，CoE为多LLM协作提供了一个有用的不确定性感知视角。

更新时间: 2026-03-30 12:28:26

领域: cs.AI

下载: http://arxiv.org/abs/2603.28360v1

Optimized Weighted Voting System for Brain Tumor Classification Using MRI Images

The accurate classification of brain tumors from MRI scans is essential for effective diagnosis and treatment planning. This paper presents a weighted ensemble learning approach that combines deep learning and traditional machine learning models to improve classification performance. The proposed system integrates multiple classifiers, including ResNet101, DenseNet121, Xception, CNN-MRI, and ResNet50 with edge-enhanced images, SVM, and KNN with HOG features. A weighted voting mechanism assigns higher influence to models with better individual accuracy, ensuring robust decision-making. Image processing techniques such as Balance Contrast Enhancement, K-means clustering, and Canny edge detection are applied to enhance feature extraction. Experimental evaluations on the Figshare and Kaggle MRI datasets demonstrate that the proposed method achieves state-of-the-art accuracy, outperforming existing models. These findings highlight the potential of ensemble-based learning for improving brain tumor classification, offering a reliable and scalable framework for medical image analysis.

Updated: 2026-03-30 12:24:54

标题: MRI图像在脑肿瘤分类中的优化加权投票系统

摘要: 从MRI扫描中准确分类脑肿瘤对于有效的诊断和治疗规划至关重要。本文提出了一种加权集成学习方法，结合了深度学习和传统机器学习模型，以提高分类性能。所提出的系统集成了多个分类器，包括ResNet101、DenseNet121、Xception、CNN-MRI和ResNet50与边缘增强图像、SVM和KNN与HOG特征。加权投票机制赋予具有更好个体准确性的模型更高的影响力，确保了强大的决策制定。图像处理技术如平衡对比增强、K均值聚类和Canny边缘检测被应用于增强特征提取。在Figshare和Kaggle MRI数据集上的实验评估表明，所提出的方法实现了最先进的准确性，优于现有模型。这些发现强调了基于集成学习的潜力，可以改善脑肿瘤分类，提供了一个可靠和可扩展的医学图像分析框架。

更新时间: 2026-03-30 12:24:54

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2603.28357v1

The Minimax Lower Bound of Kernel Stein Discrepancy Estimation

Kernel Stein discrepancies (KSDs) have emerged as a powerful tool for quantifying goodness-of-fit over the last decade, featuring numerous successful applications. To the best of our knowledge, all existing KSD estimators with known rate achieve $\sqrt n$-convergence. In this work, we present two complementary results (with different proof strategies), establishing that the minimax lower bound of KSD estimation is $n^{-1/2}$ and settling the optimality of these estimators. Our first result focuses on KSD estimation on $\mathbb R^d$ with the Langevin-Stein operator; our explicit constant for the Gaussian kernel indicates that the difficulty of KSD estimation may increase exponentially with the dimensionality $d$. Our second result settles the minimax lower bound for KSD estimation on general domains.

Updated: 2026-03-30 12:17:36

标题: 核斯坦不一致估计的极小下界

摘要: 核Stein差异（KSDs）在过去十年中已经成为衡量拟合度的有力工具，具有许多成功的应用。据我们所知，所有现有的KSD估计器都能实现$\sqrt n$收敛率。在这项工作中，我们提出了两个互补的结果（具有不同的证明策略），建立了KSD估计的极小下界为$n^{-1/2}$，并确定了这些估计器的最优性。我们的第一个结果集中在$\mathbb R^d$上使用Langevin-Stein算子进行KSD估计；我们对高斯核的显式常数表明，KSD估计的难度可能随着维度$d$的增加呈指数增长。我们的第二个结果解决了在一般域上进行KSD估计的极小下界。

更新时间: 2026-03-30 12:17:36

领域: stat.ML,cs.LG,math.ST

下载: http://arxiv.org/abs/2510.15058v3

Machine Learning-Assisted High-Dimensional Matrix Estimation

Efficient estimation of high-dimensional matrices-including covariance and precision matrices-is a cornerstone of modern multivariate statistics. Most existing studies have focused primarily on the theoretical properties of the estimators (e.g., consistency and sparsity), while largely overlooking the computational challenges inherent in high-dimensional settings. Motivated by recent advances in learning-based optimization method-which integrate data-driven structures with classical optimization algorithms-we explore high-dimensional matrix estimation assisted by machine learning. Specifically, for the optimization problem of high-dimensional matrix estimation, we first present a solution procedure based on the Linearized Alternating Direction Method of Multipliers (LADMM). We then introduce learnable parameters and model the proximal operators in the iterative scheme with neural networks, thereby improving estimation accuracy and accelerating convergence. Theoretically, we first prove the convergence of LADMM, and then establish the convergence, convergence rate, and monotonicity of its reparameterized counterpart; importantly, we show that the reparameterized LADMM enjoys a faster convergence rate. Notably, the proposed reparameterization theory and methodology are applicable to the estimation of both high-dimensional covariance and precision matrices. We validate the effectiveness of our method by comparing it with several classical optimization algorithms across different structures and dimensions of high-dimensional matrices.

Updated: 2026-03-30 12:15:59

标题: 机器学习辅助的高维矩阵估计

摘要: 高维矩阵-包括协方差和精度矩阵-的高效估计是现代多元统计学的基石。大多数现有研究主要关注估计量的理论特性（例如一致性和稀疏性），而大部分忽视了高维设置中固有的计算挑战。受最近学习为基础的优化方法的进展的启发-这些方法将数据驱动结构与经典优化算法相结合-我们探索了机器学习辅助的高维矩阵估计。具体而言，对于高维矩阵估计的优化问题，我们首先提出了基于线性化交替方向乘子法（LADMM）的解决方案程序。然后，我们引入可学习参数，并使用神经网络在迭代方案中建模近端算子，从而提高估计精度并加快收敛速度。从理论上讲，我们首先证明了LADMM的收敛性，然后建立了其重新参数化对应物的收敛性、收敛速度和单调性；重要的是，我们表明重新参数化的LADMM具有更快的收敛速度。值得注意的是，所提出的重新参数化理论和方法适用于高维协方差和精度矩阵的估计。我们通过将其与几种经典优化算法在不同结构和维度的高维矩阵上进行比较，验证了我们方法的有效性。

更新时间: 2026-03-30 12:15:59

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.28346v1

Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code

LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime values enter a natural-language prompt, undergo opaque processing inside the LLM, and re-emerge as code, SQL, JSON, or text that the program consumes. Every analysis that tracks data across function boundaries, including taint analysis, program slicing, dependency analysis, and change-impact analysis, relies on dataflow summaries of callee behavior. LLM calls have no such summaries, breaking all of these analyses at what we call the NL/PL boundary. We present the first information flow method to bridge this boundary. Grounded in quantitative information flow theory, our taxonomy defines 24 labels along two orthogonal dimensions: information preservation level (from lexically preserved to fully blocked) and output modality (natural language, structured format, executable artifact). We label 9,083 placeholder-output pairs from 4,154 real-world Python files and validate reliability with Cohen's $κ= 0.82$ and near-complete coverage (0.01\% unclassifiable). We demonstrate the taxonomy's utility on two downstream applications: (1)~a two-stage taint propagation pipeline combining taxonomy-based filtering with LLM verification achieves $F_1 = 0.923$ on 353 expert-annotated pairs, with cross-language validation on six real-world OpenClaw prompt injection cases further confirming effectiveness; (2)~taxonomy-informed backward slicing reduces slice size by a mean of 15\% in files containing non-propagating placeholders. Per-label analysis reveals that four blocked labels account for nearly all non-propagating cases, providing actionable filtering criteria for tool builders.

Updated: 2026-03-30 12:14:24

标题: 跨越自然语言/编程语言分界线：LLM集成代码中自然语言/编程语言边界上的信息流分析

摘要: LLM API调用正在成为一种普遍的程序构造，但它们创建了一个边界，现有的程序分析无法跨越：运行时值进入自然语言提示，经过LLM内部的不透明处理，重新出现为程序消耗的代码、SQL、JSON或文本。跟踪跨函数边界的数据的每个分析，包括污点分析、程序切片、依赖分析和变更影响分析，都依赖于被调用者行为的数据流摘要。LLM调用没有这样的摘要，它们在我们所称的NL/PL边界处中断了所有这些分析。我们提出了第一个信息流方法来跨越这个边界。基于量化信息流理论，我们的分类法沿着两个正交维度定义了24个标签：信息保留级别（从词法保留到完全阻塞）和输出模态（自然语言、结构化格式、可执行工件）。我们从4154个真实的Python文件中标记了9083个占位符-输出对，并使用Cohen's $κ= 0.82$验证了可靠性，覆盖率几乎完整（0.01%不可分类）。我们在两个下游应用程序上展示了分类法的实用性：(1) 一个两阶段污点传播管道结合基于分类法的过滤与LLM验证，在353个专家注释对上达到了$F_1 = 0.923$，并在六个真实的OpenClaw提示注入案例上进行了跨语言验证，进一步证实了有效性；(2) 基于分类法的向后切片在包含非传播占位符的文件中将切片大小平均减少了15%。按标签分析表明，四个被阻止的标签几乎占据了所有非传播情况，为工具构建者提供了可操作的过滤标准。

更新时间: 2026-03-30 12:14:24

领域: cs.SE,cs.AI

下载: http://arxiv.org/abs/2603.28345v1

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with Nvidia Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.

Updated: 2026-03-30 12:12:49

标题: Kernel-Smith：进化核优化的统一方法 (Note: "Kernel-Smith" could be a specific name or term in this context and may not have a direct translation)

摘要: 我们提出了Kernel-Smith，这是一个用于高性能GPU核心和操作符生成的框架，它将一个稳定的基于评估的进化代理与一个进化导向的后训练配方结合在一起。在代理方面，Kernel-Smith维护一个可执行候选集合，并使用一组表现最佳且多样化的程序的档案以及关于编译、正确性和加速度的结构化执行反馈来迭代地改进它们。为了使这种搜索可靠，我们为NVIDIA GPU上的Triton和MetaX GPU上的Maca构建了特定于后端的评估服务。在训练方面，我们通过保留保持正确性且高收益修订的长期进化轨迹，将其转换为以步骤为中心的监督和强化学习信号，以便模型在进化循环内作为强大的本地改进者进行优化，而不是作为一次性生成器。在统一的进化协议下，Kernel-Smith-235B-RL在Nvidia Triton后端的KernelBench上实现了最先进的整体性能，获得了最佳的平均加速比，并超越了前沿专有模型，包括Gemini-3.0-pro和Claude-4.6-opus。我们进一步在MetaX MACA后端上验证了该框架，在那里我们的Kernel-Smith-MACA-30B超越了大规模对手，如DeepSeek-V3.2-think和Qwen3-235B-2507-think，突显了在异构平台上的无缝适应潜力。除了基准结果外，同样的工作流程还为生产系统包括SGLang和LMDeploy做出了上游贡献，证明了基于LLM的核心优化可以从受控评估转移到实际部署。

更新时间: 2026-03-30 12:12:49

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2603.28342v1

A Multi-Agent Rhizomatic Pipeline for Non-Linear Literature Analysis

Systematic literature reviews in the social sciences overwhelmingly follow arborescent logics -- hierarchical keyword filtering, linear screening, and taxonomic classification -- that suppress the lateral connections, ruptures, and emergent patterns characteristic of complex research landscapes. This research note presents the Rhizomatic Research Agent (V3), a multi-agent computational pipeline grounded in Deleuzian process-relational ontology, designed to conduct non-linear literature analysis through 12 specialized agents operating across a seven-phase architecture. The system was developed in response to the methodological groundwork established by (Narayan2023), who employed rhizomatic inquiry in her doctoral research on sustainable energy transitions but relied on manual, researcher-driven exploration. The Rhizomatic Research Agent operationalizes the six principles of the rhizome -- connection, heterogeneity, multiplicity, asignifying rupture, cartography, and decalcomania -- into an automated pipeline integrating large language model (LLM) orchestration, dual-source corpus ingestion from OpenAlex and arXiv, SciBERT semantic topography, and dynamic rupture detection protocols. Preliminary deployment demonstrates the system's capacity to surface cross-disciplinary convergences and structural research gaps that conventional review methods systematically overlook. The pipeline is open-source and extensible to any phenomenon zone where non-linear knowledge mapping is required.

Updated: 2026-03-30 12:06:25

标题: 一个用于非线性文学分析的多代理根状管道

摘要: 社会科学领域中的系统文献综述主要遵循分层关键字过滤、线性筛选和分类逻辑，这些逻辑抑制了复杂研究领域中特有的横向连接、断裂和新兴模式。本研究笔记介绍了Rhizomatic Research Agent（V3），这是一个基于Deleuzian过程关系本体论的多智能体计算管道，旨在通过12个专门的智能体在七个阶段架构中进行非线性文献分析。该系统是为了响应（Narayan2023）建立的方法论基础而开发的，她在可持续能源转型方面的博士研究中采用了根茎式调查，但依赖于手动、研究人员驱动的探索。Rhizomatic Research Agent将根茎的六个原则（连接、异质性、多样性、非指示性断裂、制图和脱胶画）操作化为一个自动化管道，集成了大型语言模型（LLM）编排、来自OpenAlex和arXiv的双源语料库摄入、SciBERT语义地形学和动态断裂检测协议。初步部署表明，该系统能够展现跨学科的交汇和传统综述方法系统性忽视的结构研究空白。这个管道是开源的，可扩展到任何需要非线性知识映射的现象领域。

更新时间: 2026-03-30 12:06:25

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.28336v1

Key-Embedded Privacy for Decentralized AI in Biomedical Omics

The rapid adoption of data-driven methods in biomedicine has intensified concerns over privacy, governance, and regulation, limiting raw data sharing and hindering the assembly of representative cohorts for clinically relevant AI. This landscape necessitates practical, efficient privacy solutions, as cryptographic defenses often impose heavy overhead and differential privacy can degrade performance, leading to sub-optimal outcomes in real-world settings. Here, we present a lightweight federated learning method, INFL, based on Implicit Neural Representations that addresses these challenges. Our approach integrates plug-and-play, coordinate-conditioned modules into client models, embeds a secret key directly into the architecture, and supports seamless aggregation across heterogeneous sites. Across diverse biomedical omics tasks, including cohort-scale classification in bulk proteomics, regression for perturbation prediction in single-cell transcriptomics, and clustering in spatial transcriptomics and multi-omics with both public and private data, we demonstrate that INFL achieves strong, controllable privacy while maintaining utility, preserving the performance necessary for downstream scientific and clinical applications.

Updated: 2026-03-30 12:04:50

标题: 关键嵌入式隐私保护技术在生物医学组学中的分布式人工智能应用

摘要: 生物医学领域数据驱动方法的快速普及加剧了人们对隐私、治理和监管的担忧，限制了原始数据共享，并阻碍了为临床相关人工智能组装代表性队列。这种情况需要实用高效的隐私解决方案，因为加密防御通常会带来沉重的开销，而差分隐私可能会降低性能，在现实环境中导致次优结果。在这里，我们提出了一种基于隐式神经表示的轻量级联合学习方法INFL，以应对这些挑战。我们的方法将即插即用的、坐标条件模块集成到客户端模型中，将秘密密钥直接嵌入架构中，并支持跨异构站点的无缝聚合。在各种生物医学组学任务中，包括在大规模蛋白质组学中进行队列分类、在单细胞转录组学中进行干扰预测回归、以及在空间转录组学和多组学中进行聚类，使用公共和私人数据，我们证明了INFL实现了强大的、可控的隐私保护，同时保持了实用性，保留了下游科学和临床应用所需的性能。

更新时间: 2026-03-30 12:04:50

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2603.28334v1

Integrating Multimodal Large Language Model Knowledge into Amodal Completion

With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates these guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.

Updated: 2026-03-30 12:03:47

标题: 将多模式大语言模型知识整合到无模态完成中

摘要: 随着自动驾驶车辆和机器人的广泛应用，非模态完成（amodal completion）已变得日益关键，该技术重建图像中被遮挡的人和物体的部分。就像人类基于先前经验和常识推断隐藏区域一样，这项任务本质上需要对真实世界实体的物理知识。然而，现有方法要么仅依赖于视觉生成模型的图像生成能力，缺乏这种知识，要么仅在分割阶段利用它，无法明确引导完成过程。为了解决这个问题，我们提出了AmodalCG，一种新颖的框架，利用多模式大型语言模型（MLLMs）的真实世界知识来引导非模态完成。我们的框架首先评估遮挡的程度，仅在目标物体严重遮挡时选择性地调用MLLM引导。如果需要引导，框架进一步整合MLLMs来推理缺失区域的（1）范围和（2）内容。最后，一个视觉生成模型整合这些引导，并迭代地改进可能由不准确的MLLM引导导致的不完美完成。对各种真实世界图像的实验结果显示，与所有现有作品相比，我们取得了令人印象深刻的改进，表明MLLMs是解决具有挑战性的非模态完成问题的一个有前途的方向。

更新时间: 2026-03-30 12:03:47

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2603.28333v1

Physics-Informed Neural Networks for Predicting Hydrogen Sorption in Geological Formations: Thermodynamically Constrained Deep Learning Integrating Classical Adsorption Theory

Accurate prediction of hydrogen sorption in fine-grained geological materials is essential for evaluating underground hydrogen storage capacity, assessing caprock integrity, and characterizing hydrogen migration in subsurface energy systems. Classical isotherm models perform well at the individual-sample level but fail when generalized across heterogeneous populations, with the coefficient of determination collapsing from 0.80-0.90 for single-sample fits to 0.09-0.38 for aggregated multi-sample datasets. We present a multi-scale physics-informed neural network framework that addresses this limitation by embedding classical adsorption theory and thermodynamic constraints directly into the learning process. The framework utilizes 1,987 hydrogen sorption isotherm measurements across clays, shales, coals, supplemented by 224 characteristic uptake measurements. A seven-category physics-informed feature engineering scheme generates 62 thermodynamically meaningful descriptors from raw material characterization data. The loss function enforces saturation limits, a monotonic pressure response, and Van't Hoff temperature dependence via penalty weighting, while a three-phase curriculum-based training strategy ensures stable integration of competing physical constraints. An architecture-diverse ensemble of ten members provides calibrated uncertainty quantification, with post-hoc temperature scaling achieving target prediction interval coverage. The optimized PINN achieves R2 = 0.9544, RMSE = 0.0484 mmol/g, and MAE = 0.0231 mmol/g on the held-out test set, with 98.6% monotonicity satisfaction and zero non-physical negative predictions. Physics-informed regularization yields a 10-15% cross-lithology generalization advantage over a well-tuned random forest under leave-one-lithology-out validation, confirming that thermodynamic constraints transfer meaningfully across geological boundaries.

Updated: 2026-03-30 11:59:36

标题: 物理信息神经网络用于预测地质形成物中的氢吸附：热力学约束的深度学习整合经典吸附理论

摘要: 准确预测细粒地质材料中的氢吸附对于评估地下氢储存容量、评估盖岩完整性以及表征地下能源系统中的氢迁移至关重要。经典吸附等温线模型在单个样本水平上表现良好，但在异质种群中推广时失败，决定系数从单样本拟合的0.80-0.90降至聚合多样本数据集的0.09-0.38。我们提出了一个多尺度物理信息神经网络框架，通过直接将经典吸附理论和热力学约束嵌入到学习过程中，解决了这一局限性。该框架利用了涉及粘土、页岩和煤等1,987个氢吸附等温线测量值，辅以224个特征吸收测量值。一个包含七个类别的物理信息特征工程方案从原始材料表征数据中生成了62个具有热力学意义的描述符。损失函数通过惩罚权重强制执行饱和限制、单调压力响应和Van't Hoff温度依赖性，而基于三相课程的训练策略确保了竞争物理约束的稳定整合。由十个成员组成的架构多样的集成提供了校准的不确定性量化，后续的温度缩放实现了目标预测区间覆盖。优化的PINN在保留测试集上实现了R2 = 0.9544，RMSE = 0.0484 mmol/g和MAE = 0.0231 mmol/g，98.6%的单调性满足度和零个非物理负预测。物理信息正则化在留一岩性验证中比良好调整的随机森林具有10-15%的跨岩性泛化优势，证实了热力学约束在地质界限之间的有效传递。

更新时间: 2026-03-30 11:59:36

领域: cs.LG,physics.geo-ph

下载: http://arxiv.org/abs/2603.28328v1

Building evidence-based knowledge graphs from full-text literature for disease-specific biomedical reasoning

Biomedical knowledge resources often either preserve evidence as unstructured text or compress it into flat triples that omit study design, provenance, and quantitative support. Here we present EvidenceNet, a framework and dataset for building disease-specific knowledge graphs from full-text biomedical literature. EvidenceNet uses a large language model (LLM)-assisted pipeline to extract experimentally grounded findings as structured evidence nodes, normalize biomedical entities, score evidence quality, and connect evidence records through typed semantic relations. We release two resources: EvidenceNet-HCC with 7,872 evidence records, 10,328 graph nodes, and 49,756 edges, and EvidenceNet-CRC with 6,622 records, 8,795 nodes, and 39,361 edges. Technical validation shows high component fidelity, including 98.3% field-level extraction accuracy, 100.0% high-confidence entity-link accuracy, 87.5% fusion integrity, and 90.0% semantic relation-type accuracy. In downstream evaluation, EvidenceNet improves internal and external retrieval-augmented question answering and retains structural signal for future link prediction and target prioritization. These results establish EvidenceNet as a disease-specific resource for evidence-aware biomedical reasoning and hypothesis generation.

Updated: 2026-03-30 11:53:45

标题: 利用全文文献构建基于证据的疾病特定生物医学知识图谱

摘要: 生物医学知识资源通常要么将证据保存为非结构化文本，要么将其压缩为省略研究设计、出处和定量支持的平面三元组。在这里，我们提出了EvidenceNet，这是一个用于从全文生物医学文献构建特定疾病知识图的框架和数据集。EvidenceNet使用大型语言模型(LLM)辅助管道来提取实验基础的发现作为结构化证据节点，标准化生物医学实体，评分证据质量，并通过类型化语义关系连接证据记录。我们发布了两个资源：EvidenceNet-HCC，其中包含7,872个证据记录、10,328个图节点和49,756条边，以及EvidenceNet-CRC，其中包含6,622个记录、8,795个节点和39,361条边。技术验证显示高组件保真度，包括98.3%的字段级提取准确性，100.0%的高置信度实体链接准确性，87.5%的融合完整性，以及90.0%的语义关系类型准确性。在下游评估中，EvidenceNet改善了内部和外部检索增强问答，并保留了未来链接预测和目标优先级的结构信号。这些结果将EvidenceNet建立为一种特定疾病资源，用于基于证据的生物医学推理和假设生成。

更新时间: 2026-03-30 11:53:45

领域: cs.CE,cs.AI

下载: http://arxiv.org/abs/2603.28325v1

LDDMM stochastic interpolants: an application to domain uncertainty quantification in hemodynamics

We introduce a novel conditional stochastic interpolant framework for generative modeling of three-dimensional shapes. The method builds on a recent LDDMM-based registration approach to learn the conditional drift between geometries. By leveraging the resulting pull-back and push-forward operators, we extend this formulation beyond standard Cartesian grids to complex shapes and random variables defined on distinct domains. We present an application in the context of cardiovascular simulations, where aortic shapes are generated from an initial cohort of patients. The conditioning variable is a latent geometric representation defined by a set of centerline points and the radii of the corresponding inscribed spheres. This methodology facilitates both data augmentation for three-dimensional biomedical shapes, and the generation of random perturbations of controlled magnitude for a given shape. These capabilities are essential for quantifying the impact of domain uncertainties arising from medical image segmentation on the estimation of relevant biomarkers.

Updated: 2026-03-30 11:52:36

标题: LDDMM随机插值器：在血液动力学领域不确定性量化中的应用

摘要: 我们引入了一种新颖的条件随机插值框架，用于生成三维形状的生成建模。该方法建立在最近的基于LDDMM的注册方法之上，用于学习几何之间的条件漂移。通过利用生成的拉回和推送操作符，我们将这种形式延伸到超出标准笛卡尔网格的复杂形状和定义在不同域上的随机变量。我们在心血管模拟的背景下提出了一个应用，其中主动脉形状是从一组患者的初始队列中生成的。条件变量是由一组中轴点和对应内切球的半径定义的潜在几何表示。该方法既促进了三维生物医学形状的数据增强，也为给定形状的随机扰动的控制幅度的生成提供了可能。这些能力对于量化由医学图像分割引起的域不确定性对相关生物标记估计的影响至关重要。

更新时间: 2026-03-30 11:52:36

领域: stat.ML,cs.LG,math.NA

下载: http://arxiv.org/abs/2603.28324v1

Isogeny-based Post-Quantum Proxy Signature for Internet of Things

The rapid growth of the Internet of Things (IoT) introduces challenges in secure authentication and delegation due to the limited computational capabilities of devices. Proxy signature schemes offer an effective solution by enabling controlled delegation of signing rights to more capable entities, such as gateway nodes. However, most existing schemes rely on classical assumptions that are likely to be broken by quantum adversaries. In this work, we address these challenges by proposing an isogeny-based post-quantum proxy signature scheme, \textit{CSI-PS}. The scheme leverages the hardness of the Group Action Inverse Problem (GAIP) to ensure quantum-resistant security while maintaining efficiency suitable for resource-constrained environments. We further demonstrate its applicability in IoT architectures through a gateway-based delegation model. Our analysis shows that the proposed scheme strikes an effective balance between security and efficiency in terms of computation and communication overhead, along with provable security under the EUF-CMA notion.

Updated: 2026-03-30 11:50:52

标题: 基于同源密码的后量子代理签名用于物联网

摘要: 物联网（IoT）的快速增长引入了安全身份验证和委托方面的挑战，这是由于设备的计算能力受限。代理签名方案通过将签名权限控制地委托给更具能力的实体（如网关节点），提供了一种有效的解决方案。然而，大多数现有方案依赖于可能被量子对手破坏的经典假设。在这项工作中，我们通过提出一个基于同态密码学的后量子代理签名方案\textit{CSI-PS}来解决这些挑战。该方案利用群作用逆问题（GAIP）的难度来确保量子抗性安全性，同时保持适合资源受限环境的效率。我们进一步通过基于网关的委托模型展示了其在物联网架构中的适用性。我们的分析表明，所提议的方案在计算和通信开销方面在安全性和效率之间取得了有效的平衡，并在EUF-CMA概念下具有可证明的安全性。

更新时间: 2026-03-30 11:50:52

领域: cs.CR

下载: http://arxiv.org/abs/2407.13318v3

FairGC: Fairness-aware Graph Condensation

Graph condensation (GC) has become a vital strategy for scaling Graph Neural Networks by compressing massive datasets into small, synthetic node sets. While current GC methods effectively maintain predictive accuracy, they are primarily designed for utility and often ignore fairness constraints. Because these techniques are bias-blind, they frequently capture and even amplify demographic disparities found in the original data. This leads to synthetic proxies that are unsuitable for sensitive applications like credit scoring or social recommendations. To solve this problem, we introduce FairGC, a unified framework that embeds fairness directly into the graph distillation process. Our approach consists of three key components. First, a Distribution-Preserving Condensation module synchronizes the joint distributions of labels and sensitive attributes to stop bias from spreading. Second, a Spectral Encoding module uses Laplacian eigen-decomposition to preserve essential global structural patterns. Finally, a Fairness-Enhanced Neural Architecture employs multi-domain fusion and a label-smoothing curriculum to produce equitable predictions. Rigorous evaluations on four real-world datasets, show that FairGC provides a superior balance between accuracy and fairness. Our results confirm that FairGC significantly reduces disparity in Statistical Parity and Equal Opportunity compared to existing state-of-the-art condensation models. The codes are available at https://github.com/LuoRenqiang/FairGC.

Updated: 2026-03-30 11:46:05

标题: FairGC: 公平感知图压缩

摘要: 图形浓缩（GC）已成为通过将大规模数据集压缩为小型综合节点集来扩展图神经网络的重要策略。尽管当前的GC方法有效地保持了预测准确性，但它们主要设计用于效用，往往忽略公平约束。由于这些技术是盲目的偏见，它们经常捕捉甚至放大原始数据中发现的人口统计差异。这导致了对于信用评分或社交推荐等敏感应用而言不适用的合成代理。为了解决这个问题，我们引入了FairGC，一个将公平性直接嵌入图精炼过程的统一框架。我们的方法包括三个关键组件。首先，一个保持分布的浓缩模块同步标签和敏感属性的联合分布，以阻止偏见扩散。其次，一个谱编码模块使用拉普拉斯特征分解来保持基本的全局结构模式。最后，一个增强公平性的神经架构采用多领域融合和标签平滑课程来产生公平的预测。对四个真实世界数据集进行严格评估表明，FairGC在准确性和公平性之间提供了更优的平衡。我们的结果证实，与现有最先进的浓缩模型相比，FairGC在统计平等和机会平等方面显著减少了差距。代码可在https://github.com/LuoRenqiang/FairGC找到。

更新时间: 2026-03-30 11:46:05

领域: cs.LG

下载: http://arxiv.org/abs/2603.28321v1

From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.

Updated: 2026-03-30 11:44:12

标题: 从观察到行动：基于潜在行动的原始分割在工业环境中的VLA预训练

摘要: 我们提出了一个新颖的无监督框架，用于从连续的工业视频流中解锁大量未标记的人类演示数据，用于Vision-Language-Action（VLA）模型的预训练。我们的方法首先训练一个轻量级运动标记器来编码运动动态，然后利用一种无监督的动作分割器，利用一种新颖的“潜在动作能量”度量来发现和分割语义连贯的动作基元。该流程同时输出分割的视频片段及其对应的潜在动作序列，为VLA预训练提供直接适用的结构化数据。在公共基准数据集和专有电动机装配数据集上的评估表明，有效地分割了工作站上人类执行的关键任务。通过一个Vision-Language模型的进一步聚类和定量评估，确认了所发现的动作基元的语义连贯性。据我们所知，这是首个完全自动化的端到端系统，用于从非结构化工业视频中提取和组织VLA预训练数据，为制造业中的具象AI集成提供了可扩展的解决方案。

更新时间: 2026-03-30 11:44:12

领域: cs.CV,cs.AI,cs.RO

下载: http://arxiv.org/abs/2511.21428v2

Mapping data literacy trajectories in K-12 education

Data literacy skills are fundamental in computer science education. However, understanding how data-driven systems work represents a paradigm shift from traditional rule-based programming. We conducted a systematic literature review of 84 studies to understand K-12 learners' engagement with data across disciplines and contexts. We propose the data paradigms framework that categorises learning activities along two dimensions: (i) logic (knowledge-based or data-driven systems), and (ii) explainability (transparent or opaque models). We further apply the notion of learning trajectories to visualize the pathways learners follow across these distinct paradigms. We detail four distinct trajectories as a provocation for researchers and educators to reflect on how the notion of data literacy varies depending on the learning context. We suggest these trajectories could be useful to those concerned with the design of data literacy learning environments within and beyond CS education.

Updated: 2026-03-30 11:38:07

标题: 在K-12教育中绘制数据素养轨迹

摘要: 数据素养技能在计算机科学教育中是基础性的。然而，理解数据驱动系统的工作方式代表了从传统基于规则的编程到新兴的范式转变。我们对84项研究进行了系统文献综述，以了解K-12学习者在不同学科和情境中与数据的互动。我们提出了数据范式框架，将学习活动分类为两个维度：（一）逻辑（基于知识或数据驱动系统），以及（二）可解释性（透明或不透明模型）。我们进一步应用学习轨迹的概念来可视化学习者在这些不同范式中遵循的路径。我们详细描述了四条独特的轨迹，作为研究人员和教育者思考数据素养概念如何因学习环境而异的一种激发。我们建议这些轨迹对于那些关注设计CS教育内外的数据素养学习环境的人是有用的。

更新时间: 2026-03-30 11:38:07

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2603.28317v1

Taming the Instability: A Robust Second-Order Optimizer for Federated Learning over Non-IID Data

In this paper, we present Federated Robust Curvature Optimization (FedRCO), a novel second-order optimization framework designed to improve convergence speed and reduce communication cost in Federated Learning systems under statistical heterogeneity. Existing second-order optimization methods are often computationally expensive and numerically unstable in distributed settings. In contrast, FedRCO addresses these challenges by integrating an efficient approximate curvature optimizer with a provable stability mechanism. Specifically, FedRCO incorporates three key components: (1) a Gradient Anomaly Monitor that detects and mitigates exploding gradients in real-time, (2) a Fail-Safe Resilience protocol that resets optimization states upon numerical instability, and (3) a Curvature-Preserving Adaptive Aggregation strategy that safely integrates global knowledge without erasing the local curvature geometry. Theoretical analysis shows that FedRCO can effectively mitigate instability and prevent unbounded updates while preserving optimization efficiency. Extensive experiments show that FedRCO achieves superior robustness against diverse non-IID scenarios while achieving higher accuracy and faster convergence than both state-of-the-art first-order and second-order methods.

Updated: 2026-03-30 11:37:46

标题: 驯服不均衡性：一种针对非独立同分布数据的稳健二阶优化器，用于联邦学习

摘要: 在这篇论文中，我们提出了联邦鲁棒曲率优化（FedRCO），这是一个新颖的二阶优化框架，旨在提高联邦学习系统中统计异质性下的收敛速度并减少通信成本。现有的二阶优化方法在分布式环境中往往计算成本高昂且数值不稳定。相反，FedRCO通过整合高效的近似曲率优化器和可证明的稳定机制来应对这些挑战。具体而言，FedRCO包括三个关键组成部分：（1）梯度异常监视器，实时检测和减轻爆炸梯度；（2）故障安全恢复协议，在数值不稳定时重置优化状态；（3）曲率保持自适应聚合策略，安全地整合全局知识而不消除局部曲率几何。理论分析表明，FedRCO可以有效减轻不稳定性并防止无界更新，同时保持优化效率。大量实验证明，FedRCO在面对多样的非独立同分布情况时表现出卓越的鲁棒性，同时比最先进的一阶和二阶方法实现更高的准确性和更快的收敛速度。

更新时间: 2026-03-30 11:37:46

领域: cs.LG

下载: http://arxiv.org/abs/2603.28316v1

Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification

Thyroid nodule classification using ultrasound imaging is essential for early diagnosis and clinical decision-making; however, despite promising performance on in-distribution data, existing deep learning methods often exhibit limited robustness and generalisation when deployed across different ultrasound devices or clinical environments. This limitation is mainly attributed to the pronounced heterogeneity of thyroid ultrasound images, which can lead models to capture spurious correlations rather than reliable diagnostic cues. To address this challenge, we propose PEMV-thyroid, a Prototype-Enhanced Multi-View learning framework that accounts for data heterogeneity by learning complementary representations from multiple feature perspectives and refining decision boundaries through a prototype-based correction mechanism with mixed prototype information. By integrating multi-view representations with prototype-level guidance, the proposed approach enables more stable representation learning under heterogeneous imaging conditions. Extensive experiments on multiple thyroid ultrasound datasets demonstrate that PEMV-thyroid consistently outperforms state-of-the-art methods, particularly in cross-device and cross-domain evaluation scenarios, leading to improved diagnostic accuracy and generalisation performance in real-world clinical settings. The source code is available at https://github.com/chenyangmeii/Prototype-Enhanced-Multi-View-Learning.

Updated: 2026-03-30 11:37:26

标题: 原型增强的甲状腺结节超声分类多视角学习

摘要: 使用超声成像对甲状腺结节进行分类对于早期诊断和临床决策至关重要；然而，尽管现有的深度学习方法在分布数据上表现出色，但在不同超声设备或临床环境中部署时往往表现出有限的稳健性和泛化能力。这一限制主要归因于甲状腺超声图像的显著异质性，这可能导致模型捕捉到虚假相关性而非可靠的诊断线索。为了应对这一挑战，我们提出了PEMV-thyroid，一种原型增强的多视图学习框架，通过从多个特征角度学习互补表示，并通过基于原型的校正机制和混合原型信息细化决策边界，以考虑数据的异质性。通过将多视图表示与原型级别的指导相结合，所提出的方法在异质成像条件下实现了更稳定的表示学习。对多个甲状腺超声数据集进行的大量实验表明，PEMV-thyroid始终优于最先进的方法，特别是在跨设备和跨领域评估场景中，在真实临床环境中提高了诊断准确性和泛化性能。源代码可在https://github.com/chenyangmeii/Prototype-Enhanced-Multi-View-Learning找到。

更新时间: 2026-03-30 11:37:26

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2603.28315v1

Cryptanalysis of a Lightweight RFID Authentication Protocol Based on a Variable Matrix Encryption Algorithm

Recently, a two-way RFID authentication protocol based on the AM-SUEO-DBLTKM variable matrix encryption algorithm was proposed for low-cost mobile RFID systems. Its design combines adaptive modulus selection, self-updating matrix ordering, and transpose/block-based matrix generation. In this paper, we show that the protocol has structural weaknesses. First, the underlying primitive remains a linear transformation modulo a session modulus, with no nonlinear confusion layer and no ciphertext chaining. Second, in the lightweight setting emphasized by the original paper, the update space is very small: there are only a few modulus choices, only four matrix-order choices when two secret matrices are used, and only a limited family of DBLTKM-generated matrices. Third, the correctness requirements of the protocol impose nontrivial constraints on the sizes of the modulus and plaintext coordinates, weakening the claimed entropy of the secret quantities. Building on these observations, we describe a multi-session algebraic attack path. Under repeated reuse of the same matrix and modulus -- an event plausible because of the small update space -- ciphertexts corresponding to $N_t$, $N_t+1$, $N_r$, and $N_r+1$ reveal a full column of the matrix. Across sessions, transpose-based matrix generation helps recover additional entries of the secret matrices, while the remaining entries can be obtained later from ordinary ciphertext equations. We then show that candidate factors of the session moduli can be tested by solving reduced equations for secret $S$ across many sessions and checking for mutually consistent solutions. This, in turn, enables recovery of candidate 64-bit moduli and the remaining protocol secrets. Taken together, our results indicate that the protocol is structurally insecure and admits a realistic route to full compromise in the lightweight parameter regime advocated for deployment.

Updated: 2026-03-30 11:35:40

标题: 基于可变矩阵加密算法的轻量级RFID身份验证协议的密码分析

摘要: 最近，针对低成本移动RFID系统，提出了一种基于AM-SUEO-DBLTKM可变矩阵加密算法的双向RFID身份验证协议。其设计结合了自适应模数选择、自更新矩阵排序和转置/基于块的矩阵生成。本文显示该协议存在结构上的弱点。首先，基础原语仍然是对会话模数的线性变换，没有非线性混淆层和密文链接。其次，在原始论文强调的轻量级设置中，更新空间非常有限：在使用两个秘密矩阵时，仅有少数模数选择、仅有四种矩阵排序选择和仅有有限数量的由DBLTKM生成的矩阵。第三，协议的正确性要求对模数和明文坐标的大小施加了非平凡的约束，削弱了秘密数量的熵的声明。基于这些观察结果，我们描述了一种多会话代数攻击路径。在重复使用相同矩阵和模数的情况下，由于更新空间较小，对应于$N_t$，$N_t+1$，$N_r$和$N_r+1$的密文揭示了矩阵的一整列。跨会话，基于转置的矩阵生成有助于恢复秘密矩阵的额外条目，而剩余的条目可以稍后从普通密文方程中获得。然后，我们展示了会话模数的候选因子可以通过在许多会话中解决秘密$S$的简化方程并检查相互一致的解来进行测试。这反过来使得能够恢复候选64位模数和剩余的协议秘密。总的来说，我们的结果表明该协议在结构上是不安全的，并且在轻量级参数范围中提倡部署的情况下存在实际的完全妥协的途径。

更新时间: 2026-03-30 11:35:40

领域: cs.CR

下载: http://arxiv.org/abs/2603.28313v1

VulnScout-C: A Lightweight Transformer for C Code Vulnerability Detection

Vulnerability detection in C programs is a critical challenge in software security. Although large language models (LLMs) achieve strong detection performance, their multi-billion-parameter scale makes them impractical for integration into development workflows requiring low latency and continuous analysis. We introduce VULNSCOUT-C, a compact transformer architecture with 693M total parameters (353M active during inference), derived from the Qwen model family and optimized for C code vulnerability detection. Alongside the model, we present VULNSCOUT, a new 33,565-sample curated dataset generated through a controlled multi-agent pipeline with formal verification, designed to fill coverage gaps in existing benchmarks across underrepresented CWE categories. Evaluated on a standardized C vulnerability detection benchmark, VULNSCOUT-C outperforms all evaluated baselines, including state-of-the-art reasoning LLMs and commercial static analysis tools, while offering a fraction of their inference cost. These results demonstrate that task-specialized compact architectures can match or even outperform the detection capability of models orders of magnitude larger, making continuous, low-latency vulnerability analysis practical within real-world development workflows.

Updated: 2026-03-30 11:33:32

标题: VulnScout-C：一种用于C代码漏洞检测的轻量级转换器

摘要: C程序中的漏洞检测是软件安全中的一个关键挑战。尽管大型语言模型(LLMs)实现了强大的检测性能，但它们数十亿参数的规模使它们不适合集成到需要低延迟和持续分析的开发工作流程中。我们介绍了VULNSCOUT-C，这是一个紧凑的变压器架构，总共有693M个参数(推理过程中的活跃参数为353M)，源自Qwen模型系列，并针对C代码漏洞检测进行了优化。除了模型之外，我们还提供了VULNSCOUT，一个新的经过精心策划的33565个样本的数据集，通过一个受控的多代理管道和正式验证生成，旨在填补现有基准测试中被低估的CWE类别的覆盖缺口。在一个标准化的C漏洞检测基准测试中评估，VULNSCOUT-C的性能优于所有评估的基线，包括最先进的推理LLMs和商业静态分析工具，同时提供了它们推理成本的一小部分。这些结果表明，任务专用的紧凑架构可以与甚至超过数量级更大的模型的检测能力相匹配，使得在现实开发工作流程中进行持续、低延迟的漏洞分析变得实用。

更新时间: 2026-03-30 11:33:32

领域: cs.CR

下载: http://arxiv.org/abs/2603.28309v1

Self++: Co-Determined Agency for Human--AI Symbiosis in Extended Reality

Self++ is a design blueprint for human-AI symbiosis in extended reality (XR) that preserves human authorship while still benefiting from increasingly capable AI agents. Because XR can shape both perceptual evidence and action, apparently 'helpful' assistance can drift into over-reliance, covert persuasion, and blurred responsibility. Self++ grounds interaction in two complementary theories: Self-Determination Theory (autonomy, competence, relatedness) and the Free Energy Principle (predictive stability under uncertainty). It operationalises these foundations through co-determination, treating the human and the AI as a coupled system that must keep intent and limits legible, tune support over time, and preserve the user's right to endorse, contest, and override. These requirements are summarised as the co-determination principles (T.A.N.): Transparency, Adaptivity, and Negotiability. Self++ organises augmentation into three concurrently activatable overlays spanning sensorimotor competence support (Self: competence overlay), deliberative autonomy support (Self+: autonomy overlay), and social and long-horizon relatedness and purpose support (Self++: relatedness and purpose overlay). Across the overlays, it specifies nine role patterns (Tutor, Skill Builder, Coach; Choice Architect, Advisor, Agentic Worker; Contextual Interpreter, Social Facilitator, Purpose Amplifier) that can be implemented as interaction patterns, not personas. The contribution is a role-based map for designing and evaluating XR-AI systems that grow capability without replacing judgment, enabling symbiotic agency in work, learning, and social life and resilient human development.

Updated: 2026-03-30 11:32:23

标题: 自我++：扩展现实中人工智能共同决定的代理机制

摘要: Self++ 是一个人工智能与人类在扩展现实中共生的设计蓝图，它保留了人类的创作权利，同时又能够从日益强大的人工智能代理中受益。由于扩展现实可以塑造感知证据和行动，明显“有益”的帮助可能会演变成过度依赖、隐性说服和责任模糊。Self++ 将交互基础放在两个互补理论上：自我决定理论（自主性、能力、相关性）和自由能原理（在不确定性下的预测稳定性）。它通过共同决定来实现这些基础，将人类和人工智能视为一个耦合系统，必须保持意图和限制可读性，随时间调节支持，并保留用户对认可、争议和覆盖的权利。这些要求被总结为共同决定原则（T.A.N.）：透明度、适应性和可谈判性。Self++ 将增强功能组织成三个同时可激活的叠加层，涵盖感觉运动能力支持（Self：能力叠加层）、思考自主支持（Self+：自主叠加层）以及社交和长期相关性和目的支持（Self++：相关性和目的叠加层）。在这些叠加层中，它具体规定了九种角色模式（导师、技能建设者、教练；选择建构者、顾问、行动工作者；语境解释者、社交促进者、目的放大器），这些可以作为交互模式而不是人设来实施。该贡献是一个基于角色的地图，用于设计和评估可以增强能力而不取代判断的 XR-AI 系统，从而在工作、学习和社交生活中实现共生代理，并促进人类的可持续发展。

更新时间: 2026-03-30 11:32:23

领域: cs.HC,cs.AI,cs.MA,cs.MM

下载: http://arxiv.org/abs/2603.28306v1

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para

Updated: 2026-03-30 11:27:34

标题: LIBERO-Para：VLA模型中释义鲁棒性的诊断基准和度量标准

摘要: 视觉-语言-动作（VLA）模型通过利用预训练的视觉-语言骨干，在机器人操作中取得了强大的性能。然而，在下游机器人设置中，它们通常只使用有限的数据进行微调，导致过拟合到特定的指令形式，并且对于释义指令的稳健性尚未得到充分探讨。为了研究这一差距，我们介绍了LIBERO-Para，一个受控基准，独立变化动作表达和对象引用，用于对语言概括性进行细粒度分析。在七个VLA配置（0.6B-7.5B）中，我们观察到在释义下的一致性性能下降为22-52个百分点。这种下降主要是由于对象级别的词汇变化驱动的：即使是简单的同义词替换也会导致较大的下降，表明依赖表面级别匹配而不是语义基础。此外，80-96%的失败是由于规划级别的轨迹分歧而不是执行错误，表明释义扰乱了任务识别。二进制成功率对所有释义都一视同仁，模糊了模型是否在不同难度水平上一致执行或者依赖于更容易的情况。为了解决这个问题，我们提出了PRIDE，一种利用语义和句法因素量化释义难度的指标。我们的基准和相应的代码可以在以下链接找到：https://github.com/cau-hai-lab/LIBERO-Para

更新时间: 2026-03-30 11:27:34

领域: cs.LG

下载: http://arxiv.org/abs/2603.28301v1

NeiGAD: Augmenting Graph Anomaly Detection via Spectral Neighbor Information

Graph anomaly detection (GAD) aims to identify irregular nodes or structures in attributed graphs. Neighbor information, which reflects both structural connectivity and attribute consistency with surrounding nodes, is essential for distinguishing anomalies from normal patterns. Although recent graph neural network (GNN)-based methods incorporate such information through message passing, they often fail to explicitly model its effect or interaction with attributes, limiting detection performance. This work introduces NeiGAD, a novel plug-and-play module that captures neighbor information through spectral graph analysis. Theoretical insights demonstrate that eigenvectors of the adjacency matrix encode local neighbor interactions and progressively amplify anomaly signals. Based on this, NeiGAD selects a compact set of eigenvectors to construct efficient and discriminative representations. Experiments on eight real-world datasets show that NeiGAD consistently improves detection accuracy and outperforms state-of-the-art GAD methods. These results demonstrate the importance of explicit neighbor modeling and the effectiveness of spectral analysis in anomaly detection. Code is available at: https://github.com/huafeihuang/NeiGAD.

Updated: 2026-03-30 11:26:56

标题: NeiGAD：通过谱邻居信息增强图异常检测

摘要: 图异常检测（Graph anomaly detection，GAD）旨在识别属性图中的异常节点或结构。邻居信息，既反映了结构连接性，又反映了与周围节点的属性一致性，对于区分异常与正常模式至关重要。尽管最近基于图神经网络（Graph Neural Network，GNN）的方法通过消息传递来整合这样的信息，但它们通常未能明确建模邻居信息对属性的影响或相互作用，从而限制了检测性能。本文介绍了NeiGAD，一个通过谱图分析捕获邻居信息的新型即插即用模块。理论洞见表明，邻接矩阵的特征向量编码了局部邻居之间的相互作用，并逐渐放大异常信号。基于此，NeiGAD选择了一组紧凑的特征向量来构建高效且具有区分性的表示。对八个真实数据集的实验表明，NeiGAD始终提高了检测准确性，并优于最先进的GAD方法。这些结果证明了明确建模邻居以及谱分析在异常检测中的有效性。代码可在以下链接找到：https://github.com/huafeihuang/NeiGAD。

更新时间: 2026-03-30 11:26:56

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28300v1

Live Knowledge Tracing: Real-Time Adaptation using Tabular Foundation Models

Deep knowledge tracing models have achieved significant breakthroughs in modeling student learning trajectories. However, these architectures require substantial training time and are prone to overfitting on datasets with short sequences. In this paper, we explore a new paradigm for knowledge tracing by leveraging tabular foundation models (TFMs). Unlike traditional methods that require offline training on a fixed training set, our approach performs real-time ''live'' knowledge tracing in an online way. The core of our method lies in a two-way attention mechanism: while attention knowledge tracing models only attend across earlier time steps, TFMs simultaneously attend across both time steps and interactions of other students in the training set. They align testing sequences with relevant training sequences at inference time, therefore skipping the training step entirely. We demonstrate, using several datasets of increasing size, that our method achieves competitive predictive performance with up to 273x speedups, in a setting where more student interactions are observed over time.

Updated: 2026-03-30 11:23:11

标题: 实时知识追踪：使用表格基础模型进行实时适应

摘要: 深度知识追踪模型在建模学生学习轨迹方面取得了显著突破。然而，这些架构需要大量训练时间，并且容易在具有短序列的数据集上过拟合。在本文中，我们通过利用表格基础模型（TFMs）探索了一种新的知识追踪范式。与传统方法需要在固定训练集上进行离线训练不同，我们的方法以在线方式实时执行“活跃”知识追踪。我们方法的核心在于双向注意机制：虽然注意力知识追踪模型仅关注较早的时间步长，但TFMs同时跨越时间步长和训练集中其他学生的交互进行关注。它们在推断时将测试序列与相关训练序列对齐，因此完全跳过了训练步骤。我们使用几个不断增加的数据集展示，我们的方法在更多学生交互随时间观察的情况下实现了竞争性的预测性能，并获得了高达273倍的加速。

更新时间: 2026-03-30 11:23:11

领域: cs.LG

下载: http://arxiv.org/abs/2602.06542v2

Evaluating LLMs for Answering Student Questions in Introductory Programming Courses

The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education. While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist educators in answering student questions within a CS1 programming course. To achieve this, we established a rigorous, reproducible evaluation process by curating a benchmark dataset of 170 authentic student questions from a learning management system, paired with ground-truth responses authored by subject matter experts. Because traditional text-matching metrics are insufficient for evaluating open-ended educational responses, we developed and validated a custom LLM-as-a-Judge metric optimized for assessing pedagogical accuracy. Our findings demonstrate that models, such as Gemini 3 flash, can surpass the quality baseline of typical educator responses, achieving high alignment with expert pedagogical standards. To mitigate persistent risks like hallucination and ensure alignment with course-specific context, we advocate for a "teacher-in-the-loop" implementation. Finally, we abstract our methodology into a task-agnostic evaluation framework, advocating for a shift in the development of educational LLM tools from ad-hoc, post-deployment testing to a quantifiable, pre-deployment validation process.

Updated: 2026-03-30 11:22:58

标题: 评估用于解答入门编程课程中学生问题的LLMs

摘要: 大型语言模型（LLMs）的快速出现为编程教育提供了机遇和挑战。尽管学生越来越多地使用生成式人工智能工具，但直接访问通常会阻碍学习过程，因为它提供的是完整解决方案而不是教学提示。同时，教育工作者在及时提供个性化反馈时面临重大工作量和可伸缩性挑战。本研究调查了LLMs的能力，以安全有效地帮助教育工作者回答CS1编程课程中学生的问题。为实现这一目标，我们通过精心策划一个基准数据集，包括从学习管理系统中提取的170个真实学生问题，以及由学科专家编写的真实答案。由于传统的文本匹配指标不足以评估开放式教育答案，我们开发并验证了一个定制的LLM作为评判者度量标准，以优化评估教学准确性。我们的研究结果表明，像Gemini 3 flash这样的模型可以超越典型教育工作者回答的质量基线，与专家教学标准高度一致。为了减轻持续存在的风险，如幻觉，并确保与课程特定的上下文一致，我们主张采用“教师参与”的实施方式。最后，我们将我们的方法论抽象为一个任务不可知的评估框架，主张将教育LLM工具的开发从临时部署后测试转变为可量化的预部署验证过程。

更新时间: 2026-03-30 11:22:58

领域: cs.AI

下载: http://arxiv.org/abs/2603.28295v1

Learning from imperfect quantum data via unsupervised domain adaptation with classical shadows

Learning from quantum data using classical machine learning models has emerged as a promising paradigm toward realizing quantum advantages. Despite extensive analyses on their performance, clean and fully labeled quantum data from the target domain are often unavailable in practical scenarios, forcing models to be trained on data collected under conditions that differ from those encountered at deployment. This mismatch highlights the need for new approaches beyond the common assumptions of prior work. In this work, we address this issue by employing an unsupervised domain adaptation framework for learning from imperfect quantum data. Specifically, by leveraging classical representations of quantum states obtained via classical shadows, we perform unsupervised domain adaptation entirely within a classical computational pipeline once measurements on the quantum states are executed. We numerically evaluate the framework on quantum phases of matter and entanglement classification tasks under realistic domain shifts. Across both tasks, our method outperforms source-only non-adaptive baselines and target-only unsupervised learning approaches, demonstrating the practical applicability of domain adaptation to realistic quantum data learning.

Updated: 2026-03-30 11:20:34

标题: 通过使用经典阴影进行无监督域适应，从不完美的量子数据中学习

摘要: 使用经典机器学习模型从量子数据中学习已成为实现量子优势的一种有前景的范式。尽管对它们的性能进行了广泛分析，但在实际情况下，往往难以获取来自目标领域的干净和完全标记的量子数据，这迫使模型在与部署时遇到的条件不同的数据上进行训练。这种不匹配凸显了超越先前工作常见假设的新方法的必要性。在这项工作中，我们通过采用一种无监督域自适应框架来解决这个问题，用于从不完美的量子数据中学习。具体而言，通过利用通过经典阴影获得的量子态的经典表示，我们在量子态上执行测量后完全在一个经典计算流程中进行无监督域自适应。我们在量子物质相和纠缠分类任务中对该框架进行了数值评估，考虑了现实领域的转移。在这两个任务中，我们的方法优于仅源非自适应基线和仅目标无监督学习方法，展示了域自适应对于实际量子数据学习的实际适用性。

更新时间: 2026-03-30 11:20:34

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2603.28294v1

OptINC: Optical In-Network-Computing for Scalable Distributed Learning

Distributed learning is widely used for training large models on large datasets by distributing parts of the model or dataset across multiple devices and aggregating the computed results for subsequent computations or parameter updates. Existing communication algorithms for distributed learning such as ring all-reduce result in heavy communication overhead between servers. Since communication in large-scale systems uses optical fibers, we propose an Optical In-Network-Computing (OptINC) architecture to offload the computation in servers onto the optical interconnects. To execute gradient averaging and quantization in the optical domain, we incorporate optical devices such as Mach-Zehnder-Interferometers (MZIs) into the interconnects. Such a de facto optical neural network (ONN) can effectively reduce the communication overhead in existing distributed training solutions. To reduce dataset complexity for training this neural network, a preprocessing algorithm implemented in the optical domain is also proposed. Hardware cost is lowered by approximating the weight matrices of the optical neural network with unitary and diagonal matrices, while the accuracy is maintained by a proposed hardware-aware training algorithm. The proposed solution was evaluated on real distributed learning tasks, including ResNet50 on CIFAR-100, and a LLaMA-based network on Wikipedia-1B. In both cases, the proposed framework can achieve comparable training accuracy to the ring all-reduce baseline, while eliminating communication overhead.

Updated: 2026-03-30 11:10:47

标题: OptINC：用于可扩展分布式学习的光网络计算

摘要: 分布式学习被广泛用于通过将模型或数据集的部分分布在多个设备上并聚合计算结果来训练大型模型。现有的分布式学习通信算法，如环形全局归约，导致服务器之间的通信开销较大。由于大规模系统中的通信使用光纤，我们提出了一种光学网络计算（OptINC）架构，将服务器中的计算任务卸载到光互连中。为了在光域执行梯度平均值和量化，我们将光学设备（如Mach-Zehnder干涉仪）纳入到互连中。这种光学神经网络（ONN）可以有效减少现有分布式训练解决方案中的通信开销。为了降低训练该神经网络的数据集复杂度，还提出了在光域中实现的预处理算法。通过将光学神经网络的权重矩阵近似为幺正和对角矩阵，降低了硬件成本，而通过提出的硬件感知训练算法保持了准确性。提出的解决方案在真实的分布式学习任务中进行了评估，包括在CIFAR-100上的ResNet50和在Wikipedia-1B上的基于LLaMA的网络。在这两种情况下，提出的框架可以实现与环形全局归约基线相当的训练准确性，同时消除了通信开销。

更新时间: 2026-03-30 11:10:47

领域: cs.LG,cs.AR

下载: http://arxiv.org/abs/2603.28290v1

Retrieving Classes of Causal Orders with Inconsistent Knowledge Bases

Traditional causal discovery methods often depend on strong, untestable assumptions, making them unreliable in real-world applications. In this context, Large Language Models (LLMs) have emerged as a promising alternative for extracting causal knowledge from text-based metadata, effectively consolidating domain expertise. However, LLMs are prone to hallucinations, necessitating strategies that account for these limitations. One effective approach is to use a consistency measure as a proxy of reliability. Moreover, LLMs do not clearly distinguish direct from indirect causal relationships, complicating the discovery of causal Directed Acyclic Graphs (DAGs), which are often sparse. This ambiguity is evident in the way informal sentences are formulated in various domains. For this reason, focusing on causal orders provides a more practical and direct task for LLMs. We propose a new method for deriving abstractions of causal orders that maximizes a consistency score obtained from an LLM. Our approach begins by computing pairwise consistency scores between variables, from which we construct a semi-complete partially directed graph that consolidates these scores into an abstraction. Using this structure, we identify both a maximally oriented partially directed acyclic graph and an optimal set of acyclic tournaments that maximize consistency across all configurations. We further demonstrate how both the abstraction and the class of causal orders can be used to estimate causal effects. We evaluate our method on a wide set of causal DAGs extracted from scientific literature in epidemiology and public health. Our results show that the proposed approach can effectively recover the correct causal order, providing a reliable and practical LLM-assisted causal framework.

Updated: 2026-03-30 11:09:55

标题: 使用不一致知识库检索因果顺序类别

摘要: 传统的因果发现方法通常依赖于强大的、不可测试的假设，使它们在现实应用中不可靠。在这种情况下，大型语言模型(LLMs)已经成为从基于文本的元数据中提取因果知识的一个有前景的替代方法，有效地巩固领域专业知识。然而，LLMs容易出现幻觉，需要考虑这些局限性的策略。一个有效的方法是使用一致性度量作为可靠性的代理。此外，LLMs并不能明确区分直接和间接的因果关系，使得发现通常稀疏的因果有向图(DAGs)更加复杂。这种模糊性体现在各个领域中非正式句子的表达方式上。因此，专注于因果顺序为LLMs提供了一个更实际和直接的任务。我们提出了一种新方法，用于推导最大化从LLMs获得的一致性分数的因果顺序的抽象。我们的方法从计算变量之间的一致性分数开始，然后构建一个半完全部分有向图，将这些分数整合到一个抽象中。利用这个结构，我们确定了一个最大定向部分有向无环图和一个最优的无环锦标赛集合，以最大化所有配置的一致性。我们进一步展示了抽象和因果顺序类别如何用于估计因果效应。我们在流行病学和公共卫生科学文献中提取的一系列因果DAGs上评估了我们的方法。我们的结果表明，提出的方法可以有效地恢复正确的因果顺序，提供一个可靠且实用的LLM辅助因果框架。

更新时间: 2026-03-30 11:09:55

领域: cs.AI

下载: http://arxiv.org/abs/2412.14019v4

FI-KAN: Fractal Interpolation Kolmogorov-Arnold Networks

Kolmogorov-Arnold Networks (KAN) employ B-spline bases on a fixed grid, providing no intrinsic multi-scale decomposition for non-smooth function approximation. We introduce Fractal Interpolation KAN (FI-KAN), which incorporates learnable fractal interpolation function (FIF) bases from iterated function system (IFS) theory into KAN. Two variants are presented: Pure FI-KAN (Barnsley, 1986) replaces B-splines entirely with FIF bases; Hybrid FI-KAN (Navascues, 2005) retains the B-spline path and adds a learnable fractal correction. The IFS contraction parameters give each edge a differentiable fractal dimension that adapts to target regularity during training. On a Holder regularity benchmark ($α\in [0.2, 2.0]$), Hybrid FI-KAN outperforms KAN at every regularity level (1.3x to 33x). On fractal targets, FI-KAN achieves up to 6.3x MSE reduction over KAN, maintaining 4.7x advantage at 5 dB SNR. On non-smooth PDE solutions (scikit-fem), Hybrid FI-KAN achieves up to 79x improvement on rough-coefficient diffusion and 3.5x on L-shaped domain corner singularities. Pure FI-KAN's complementary behavior, dominating on rough targets while underperforming on smooth ones, provides controlled evidence that basis geometry must match target regularity. A fractal dimension regularizer provides interpretable complexity control whose learned values recover the true fractal dimension of each target. These results establish regularity-matched basis design as a principled strategy for neural function approximation.

Updated: 2026-03-30 11:09:07

标题: FI-KAN：分形插值科尔莫戈洛夫-阿诺德网络

摘要: Kolmogorov-Arnold网络（KAN）采用固定网格上的B样条基础，为非光滑函数逼近提供了没有内在多尺度分解的方法。我们引入了分形插值KAN（FI-KAN），它将可学习的分形插值函数（FIF）基础从迭代函数系统（IFS）理论中整合到KAN中。介绍了两个变体：纯FI-KAN（Barnsley，1986）完全用FIF基础替换B样条；混合FI-KAN（Navascues，2005）保留了B样条路径并添加了可学习的分形校正。IFS收缩参数赋予每个边可微分的分形维度，使其在训练过程中适应目标的规则性。在Holder规则性基准（α∈[0.2, 2.0]）上，混合FI-KAN在每个规则性水平上均优于KAN（1.3倍至33倍）。在分形目标上，FI-KAN在KAN上实现了高达6.3倍的均方误差减少，在5 dB信噪比下保持4.7倍的优势。在非光滑PDE解决方案（scikit-fem）中，混合FI-KAN在粗系数扩散上实现了高达79倍的改进，在L形域角奇异性上实现了3.5倍的改进。纯FI-KAN的互补行为，在粗糙目标上处于主导地位，但在光滑目标上表现不佳，提供了有控制的证据，即基础几何形状必须与目标规则性匹配。分形维度正则化器提供了可解释的复杂性控制，学习到的值恢复了每个目标的真实分形维度。这些结果确立了与规则性匹配的基础设计作为神经函数逼近的原则策略。

更新时间: 2026-03-30 11:09:07

领域: cs.LG,cs.AI,math.NA

下载: http://arxiv.org/abs/2603.28288v1

Pre-Deployment Complexity Estimation for Federated Perception Systems

Edge AI systems increasingly rely on federated learning to train perception models in distributed, privacy-preserving, and resource-constrained environments. Yet, before training begins, practitioners often lack practical tools to estimate how difficult a federated learning task will be in terms of achievable accuracy and communication cost. This paper presents a classifier-agnostic, pre-deployment framework for estimating learning complexity in federated perception systems by jointly modeling intrinsic properties of the data and characteristics of the distributed environment. The proposed complexity metric integrates dataset attributes such as dimensionality, sparsity, and heterogeneity with factors related to the composition of participating clients. Using federated learning as a representative distributed training setting, we examine how learning difficulty varies across different federated configurations. Experiments on multiple variants of the MNIST dataset and CIFAR dataset show that the proposed metric strongly correlates with federated learning performance and the communication effort required to reach fixed accuracy targets. These findings suggest that complexity estimation can serve as a practical diagnostic tool for resource planning, dataset assessment, and feasibility evaluation in edge-deployed perception systems.

Updated: 2026-03-30 11:04:28

标题: 联合感知系统的部署前复杂性估计

摘要: 边缘人工智能系统越来越依赖于联邦学习在分布式、隐私保护和资源受限的环境中训练感知模型。然而，在训练开始之前，从业者通常缺乏实用工具来估计联邦学习任务在可实现准确度和通信成本方面的困难程度。本文提出了一个分类器无关的、预部署的框架，用于通过共同建模数据的固有属性和分布式环境的特征来估计联邦感知系统中的学习复杂性。所提出的复杂度度量将数据集属性（如维度、稀疏性和异质性）与参与客户端的组成相关因素结合起来。以联邦学习作为代表性的分布式训练设置，我们研究了学习难度在不同联邦配置下的变化。对MNIST数据集和CIFAR数据集的多个变体进行的实验表明，所提出的度量与联邦学习性能和达到固定准确度目标所需的通信工作强相关。这些发现表明，复杂度估计可以作为边缘部署感知系统中资源规划、数据集评估和可行性评估的实用诊断工具。

更新时间: 2026-03-30 11:04:28

领域: cs.LG,cs.AI,cs.DC

下载: http://arxiv.org/abs/2603.28282v1

Few Batches or Little Memory, But Not Both: Simultaneous Space and Adaptivity Constraints in Stochastic Bandits

We study stochastic multi-armed bandits under simultaneous constraints on space and adaptivity: the learner interacts with the environment in $B$ batches and has only $W$ bits of persistent memory. Prior work shows that each constraint alone is surprisingly mild: near-minimax regret $\widetilde{O}(\sqrt{KT})$ is achievable with $O(\log T)$ bits of memory under fully adaptive interaction, and with a $K$-independent $O(\log\log T)$-type number of batches when memory is unrestricted. We show that this picture breaks down in the simultaneously constrained regime. We prove that any algorithm with a $W$-bit memory constraint must use at least $Ω(K/W)$ batches to achieve near-minimax regret $\widetilde{O}(\sqrt{KT})$, even under adaptive grids. In particular, logarithmic memory rules out $O(K^{1-\varepsilon})$ batch complexity. Our proof is based on an information bottleneck. We show that near-minimax regret forces the learner to acquire $Ω(K)$ bits of information about the hidden set of good arms under a suitable hard prior, whereas an algorithm with $B$ batches and $W$ bits of memory allows only $O(BW)$ bits of information. A key ingredient is a localized change-of-measure lemma that yields probability-level arm exploration guarantees, which is of independent interest. We also give an algorithm that, for any bit budget $W$ with $Ω(\log T) \le W \le O(K\log T)$, uses at most $W$ bits of memory and $\widetilde{O}(K/W)$ batches while achieving regret $\widetilde{O}(\sqrt{KT})$, nearly matching our lower bound up to polylogarithmic factors.

Updated: 2026-03-30 11:03:48

标题: 少批次或少内存，但不能两者兼得：随机赌博中的同时空间和适应性约束

摘要: 我们研究具有空间和适应性同时约束的随机多臂赌博机：学习者与环境在$B$批次中交互，并且只有$W$比特的持久内存。先前的研究表明，每个约束单独来看都相当温和：在完全适应性交互下，$O(\log T)$比特的内存可以实现近似极小遗憾$\widetilde{O}(\sqrt{KT})$，在内存不受限制时，批次数量只需$K$无关的$O(\log\log T)$类型。我们表明，在同时受限制的情况下，这种情况会发生变化。我们证明，任何具有$W$比特内存约束的算法必须使用至少$Ω(K/W)$批次才能实现近似极小遗憾$\widetilde{O}(\sqrt{KT})$，即使在自适应网格下也是如此。特别地，对数内存排除$O(K^{1-\varepsilon})$批次复杂度。我们的证明基于信息瓶颈。我们表明，近似极小遗憾迫使学习者在适当的硬先验下获取有关好臂隐藏集合的$Ω(K)$比特信息，而具有$B$批次和$W$比特内存的算法只允许获取$O(BW)$比特信息。一个关键因素是一个局部改变测度引理，提供概率级别的臂探索保证，这是独立感兴趣的。我们还提供了一种算法，对于任何比特预算$W$，其中$Ω(\log T) \le W \le O(K\log T)$，使用最多$W$比特内存和$\widetilde{O}(K/W)$批次，同时实现近似极小遗憾$\widetilde{O}(\sqrt{KT})$，几乎匹配我们的下界，直到多对数因子。

更新时间: 2026-03-30 11:03:48

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.13742v2

Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents' preferences), an $ε$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption - where every policy of interest is sufficiently represented in the clean (prior to corruption) data - we introduce a robust estimator that guarantees an $O(ε^{1 - o(1)})$ bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered. In this case, our proposed algorithm achieves an $O(\sqrtε)$ bound on the Nash gap. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrtε)$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.

Updated: 2026-03-30 11:03:36

标题: 腐败鲁棒的离线多智能体强化学习：来自人类反馈

摘要: 我们考虑在人类反馈下的离线多智能体强化学习中针对数据损坏的鲁棒性（MARLHF），在一个强污染模型下：给定一个轨迹-偏好元组（每个偏好是一个$n$维二进制标签向量，表示每个$n$个智能体的偏好）的数据集$D$，其中$\epsilon$比例的样本可能会被任意破坏。我们使用线性马尔可夫博弈的框架来建模问题。首先，在均匀覆盖假设下 - 意味着每个感兴趣的策略在干净（在损坏之前）的数据中有足够的表示 - 我们引入一个鲁棒估计器，保证纳什均衡差距的上界为$O(\epsilon^{1 - o(1)})$。接下来，我们转向更具挑战性的单边覆盖设置，其中只有一个纳什均衡及其单人偏差被覆盖。在这种情况下，我们提出的算法实现了纳什差距的$O(\sqrt{\epsilon})$上界。然而，这两种方法都受到难以计算的问题。为了解决这个问题，我们将我们的解决方案概念放宽到粗粒度相关均衡（CCE）。在相同的单边覆盖制度下，我们推导出一个准多项式时间算法，其CCE差距缩放为$O(\sqrt{\epsilon})$。据我们所知，这是对离线MARLHF中对抗性数据损坏的首次系统处理。

更新时间: 2026-03-30 11:03:36

领域: cs.LG

下载: http://arxiv.org/abs/2603.28281v1

Observable Channels, Not Just Storage: Evaluating Privacy Leakage in LLM Agent Pipelines

Privacy leakage in LLM agents is often studied through individual storage or execution components, such as memory modules, retrieval pipelines, or tool-mediated artifacts. However, these settings are typically analyzed in isolation, making it difficult to compare how private internal dependence becomes externally recoverable across heterogeneous agent pipelines. In this paper, we present CIPL (Channel Inversion for Privacy Leakage) as a unified channel-oriented measurement interface for evaluating privacy leakage in LLM agent pipelines. Rather than claiming a universally strongest attack recipe, CIPL provides a shared way to represent a target through its sensitive source, selection, assembly, execution, observation, and extraction stages, and to measure how internal exposure is transformed into attacker-recoverable leakage under a common protocol. Using memory-based, retrieval-mediated, and tool-mediated instantiations under this shared interface, we identify a distinct cross-target risk picture. Memory behaves as a near-saturated high-risk special case, while beyond-memory leakage exhibits a different regime: retrieval-mediated targets show frequent but often incomplete leakage, and tool-mediated targets are strongly shaped by the exposed observation surface and provider behavior. We further show that leakage is governed by channel conditions rather than by a universally dominant recipe: cleaned weak controls sharply suppress leakage, and semantic annotation reveals attacker-useful leakage beyond exact-match extraction. Together, these findings suggest that privacy risk in LLM agent pipelines is better understood through \emph{observable channels}, not just storage components. More broadly, our results motivate channel-oriented privacy evaluation as a necessary complement to component-local or exact-only analyses.

Updated: 2026-03-30 10:54:59

标题: 可观察通道，并非仅仅存储：评估LLM代理管道中的隐私泄露

摘要: 在LLM代理中的隐私泄露通常通过个体存储或执行组件进行研究，例如内存模块、检索管道或工具介导的工件。然而，这些设置通常被单独分析，这使得难以比较私人内部依赖如何在异构代理管道中变得外部可恢复。在本文中，我们提出CIPL（用于隐私泄露的通道反演）作为一个统一的面向通道的测量接口，用于评估LLM代理管道中的隐私泄露。CIPL并不声称提供一个普遍最强的攻击配方，而是提供了一种共享的方式来通过其敏感来源、选择、装配、执行、观察和提取阶段来表示目标，并衡量内部暴露如何在共同的协议下转化为攻击者可恢复的泄露。在这个共享接口下，通过基于内存、检索介导和工具介导的实例化，我们识别出了一个独特的跨目标风险图景。内存表现为一个接近饱和的高风险特例，而超越内存的泄露展示了一个不同的制度：检索介导的目标经常显示但通常是不完整的泄露，而工具介导的目标则受到暴露的观察表面和提供者行为的影响。我们进一步表明，泄露是受通道条件而不是普遍主导配方所支配的：清洁的弱控制能够明显抑制泄露，语义标注显示了超越精确匹配提取的对攻击者有用的泄露。总的来说，这些发现表明，LLM代理管道中的隐私风险最好通过\emph{可观察通道}来理解，而不仅仅是存储组件。从更广泛的角度来看，我们的结果促使将面向通道的隐私评估作为对组件本地或仅精确分析的必要补充。

更新时间: 2026-03-30 10:54:59

领域: cs.CR

下载: http://arxiv.org/abs/2603.22751v2

Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework

Vision-language models (VLMs) such as GPT (Generative Pre-Trained Transformer) have shown potential for medical image interpretation; however, challenges remain in generating reliable radiological findings in clinical practice, as exemplified by dental pathologies. This study proposes a Self-correction Loop with Structured Output (SLSO) framework as an integrated processing methodology to enhance the accuracy and reliability of AI-generated findings for jaw cysts in dental panoramic radiographs. Dental panoramic radiographs with jaw cysts were used to implement a 10-step integrated processing framework incorporating image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration. The framework functioned as an external validation mechanism for GPT outputs. Performance was compared against the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The SLSO framework improved output accuracy for multiple items compared to the CoT method, with the most notable improvements observed in tooth number identification, tooth movement detection, and root resorption assessment. In successful cases, consistently structured outputs were achieved after up to five regenerations. The framework enforced explicit negative finding descriptions and suppressed hallucinations, although accurate identification of extensive lesions spanning multiple teeth remained limited. This investigation established the feasibility of the proposed integrated processing methodology and provided a foundation for future validation studies with larger, more diverse datasets.

Updated: 2026-03-30 10:54:10

标题: 使用基于GPT的VLM在牙科全景X射线片中生成下颌囊肿的发现：关于构建具有结构化输出（SLSO）框架的两阶段自校正循环的初步研究

摘要: 视觉语言模型（VLMs）如GPT（生成预训练变换器）已经显示出在医学图像解释方面的潜力；然而，在临床实践中生成可靠的放射学结果仍然存在挑战，例如牙科病变。本研究提出了一个自校正循环结构化输出（SLSO）框架作为一种综合处理方法，以提高AI生成的下颌囊肿在口腔全景X射线片中的准确性和可靠性。使用带有下颌囊肿的牙科全景X射线片来实施一个包括图像分析、结构化数据生成、牙齿编号提取、一致性检查和迭代再生的10步综合处理框架。该框架作为GPT输出的外部验证机制。在七个评估项目（透明度、内部结构、边界、根吸收、牙齿移动、与其他结构的关系和牙齿编号）中，性能与传统的思维链（CoT）方法进行比较。与CoT方法相比，SLSO框架提高了多个项目的输出准确性，其中最显著的改进出现在牙齿编号识别、牙齿移动检测和根吸收评估方面。在成功案例中，经过最多五次再生后实现了一致结构化的输出。该框架强制执行明确的负面结果描述并抑制幻觉，尽管对跨越多个牙齿的广泛病变的准确识别仍然有限。这项研究建立了所提出的综合处理方法的可行性，并为未来具有更大、更多样化数据集的验证研究奠定了基础。

更新时间: 2026-03-30 10:54:10

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2510.02001v5

MM-DADM: Multimodal Drug-Aware Diffusion Model for Virtual Clinical Trials

High failure rates in cardiac drug development necessitate virtual clinical trials via electrocardiogram (ECG) generation to reduce risks and costs. However, existing ECG generation models struggle to balance morphological realism with pathological flexibility, fail to disentangle demographics from genuine drug effects, and are severely bottlenecked by early-phase data scarcity. To overcome these hurdles, we propose the Multimodal Drug-Aware Diffusion Model (MM-DADM), the first generative framework for generating individualized drug-induced ECGs. Specifically, our proposed MM-DADM integrates a Dynamic Cross-Attention (DCA) module that adaptively fuses External Physical Knowledge (EPK) to preserve morphological realism while avoiding the suppression of complex pathological nuances. To resolve feature entanglement, a Causal Feature Encoder (CFE) actively filters out demographic noise to extract pure pharmacological representations. These representations subsequently guide a Causal-Disentangled ControlNet (CDC-Net), which leverages counterfactual data augmentation to explicitly learn intrinsic pharmacological mechanisms despite limited clinical data. Extensive experiments on $9,443$ ECGs across $8$ drug regimens demonstrate that MM-DADM outperforms $10$ state-of-the-art ECG generation models, improving simulation accuracy by at least $6.13\%$ and recall by $5.89\%$, while providing highly effective data augmentation for downstream classification tasks.

Updated: 2026-03-30 10:51:39

标题: MM-DADM：用于虚拟临床试验的多模态药物感知扩散模型

摘要: 心脏药物开发中高失败率需要通过心电图（ECG）生成进行虚拟临床试验，以降低风险和成本。然而，现有的ECG生成模型很难在形态逼真性和病理灵活性之间取得平衡，无法将人口统计学与真实药效分离，并且严重受到早期数据稀缺的限制。为了克服这些障碍，我们提出了多模态药物感知扩散模型（MM-DADM），这是第一个用于生成个性化药物诱导的ECG的生成框架。具体而言，我们提出的MM-DADM集成了一个动态交叉注意力（DCA）模块，自适应地融合外部物理知识（EPK），以保留形态逼真性，同时避免抑制复杂病理细微差别。为了解决特征混叠问题，我们采用因果特征编码器（CFE）主动过滤掉人口统计噪音，提取纯药理表示。这些表示随后引导因果解耦控制网络（CDC-Net），利用反事实数据增强来明确学习内在药理机制，尽管临床数据有限。对涵盖8种药物方案的9443个ECG进行了大量实验，结果表明MM-DADM优于10种最先进的ECG生成模型，模拟精度提高至少6.13％，召回率提高5.89％，同时为下游分类任务提供高效的数据增强。

更新时间: 2026-03-30 10:51:39

领域: cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2502.07297v3

Synergizing Large Language Models and Task-specific Models for Time Series Anomaly Detection

In anomaly detection, methods based on large language models (LLMs) can incorporate expert knowledge by reading professional document, while task-specific small models excel at extracting normal data patterns and detecting value fluctuations from training data of target applications. Inspired by the human nervous system, where the brain stores expert knowledge and the peripheral nervous system and spinal cord handle specific tasks like withdrawal and knee-jerk reflexes, we propose CoLLaTe, a framework designed to facilitate collaboration between LLMs and task-specific models, leveraging the strengths of both models for anomaly detection. In particular, we first formulate the collaboration process and identify two key challenges in the collaboration: (1) the misalignment between the expression domains of the LLMs and task-specific small models, and (2) error accumulation arising from the predictions of both models. To address these challenges, we then introduce two key components in CoLLaTe: a model alignment module and a collaborative loss function. Through theoretical analysis and experimental validation, we demonstrate that these components effectively mitigate the identified challenges and achieve better performance than both LLM-based and task-specific models.

Updated: 2026-03-30 10:50:11

标题: 将大型语言模型与特定任务模型相结合，用于时间序列异常检测

摘要: 在异常检测中，基于大型语言模型（LLMs）的方法可以通过阅读专业文档融入专家知识，而专门用于提取正常数据模式并从目标应用的训练数据中检测价值波动的小型模型在任务中表现出色。受人类神经系统的启发，大脑存储专家知识，外周神经系统和脊髓处理撤退和膝反射等具体任务，我们提出了CoLLaTe框架，旨在促进LLMs和专门任务模型之间的协作，利用两种模型的优势进行异常检测。具体来说，我们首先制定了协作过程，并确定了协作中的两个关键挑战：（1）LLMs和专门任务小模型的表达域之间的不对齐，以及（2）由于两种模型的预测而产生的错误累积。为了解决这些挑战，我们引入了CoLLaTe中的两个关键组件：模型对齐模块和协作损失函数。通过理论分析和实验验证，我们证明这些组件有效地缓解了确定的挑战，并实现了比基于LLMs的模型和专门任务模型都更好的性能。

更新时间: 2026-03-30 10:50:11

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2501.05675v5

LLM-Assisted Emergency Triage Benchmark: Bridging Hospital-Rich and MCI-Like Field Simulation

Research on emergency and mass casualty incident (MCI) triage has been limited by the absence of openly usable, reproducible benchmarks. Yet these scenarios demand rapid identification of the patients most in need, where accurate deterioration prediction can guide timely interventions. While the MIMIC-IV-ED database is openly available to credentialed researchers, transforming it into a triage-focused benchmark requires extensive preprocessing, feature harmonization, and schema alignment -- barriers that restrict accessibility to only highly technical users. We address these gaps by first introducing an open, LLM-assisted emergency triage benchmark for deterioration prediction (ICU transfer, in-hospital mortality). The benchmark then defines two regimes: (i) a hospital-rich setting with vitals, labs, notes, chief complaints, and structured observations, and (ii) an MCI-like field simulation limited to vitals, observations, and notes. Large language models (LLMs) contributed directly to dataset construction by (i) harmonizing noisy fields such as AVPU and breathing devices, (ii) prioritizing clinically relevant vitals and labs, and (iii) guiding schema alignment and efficient merging of disparate tables. We further provide baseline models and SHAP-based interpretability analyses, illustrating predictive gaps between regimes and the features most critical for triage. Together, these contributions make triage prediction research more reproducible and accessible -- a step toward dataset democratization in clinical AI.

Updated: 2026-03-30 10:47:55

标题: LLM辅助的紧急分类基准：连接医院丰富和类似多发伤现场模拟

摘要: 急救和大规模伤员事件（MCI）分类的研究受到可公开使用、可重现的基准的缺乏限制。然而，这些场景要求迅速识别最需要帮助的患者，准确的恶化预测可以指导及时干预。虽然MIMIC-IV-ED数据库对持有资格的研究人员是开放的，但将其转化为以分类为重点的基准需要进行大量的预处理、特征协调和模式对齐 -- 这些障碍限制了只有高度技术用户才能使用它的可访问性。我们首先介绍了一个由LLM辅助的应急分类基准，用于恶化预测（ICU转运、住院死亡）。基准然后定义了两种情况：（i）一个富有医院资源的设置，包括生命体征、实验室、笔记、主诉和结构化观察；（ii）一个类似MCI的野外模拟，仅限于生命体征、观察和笔记。大型语言模型（LLMs）通过（i）协调诸如AVPU和呼吸设备等嘈杂字段，（ii）优先考虑临床相关的生命体征和实验室，以及（iii）指导模式对齐和有效合并不同的表格，直接促进了数据集的构建。我们进一步提供了基线模型和基于SHAP的可解释性分析，展示了分类之间的预测差距和对分类最关键的特征。这些贡献共同使分类预测研究更具重现性和可访问性 -- 迈向临床人工智能数据集民主化的一步。

更新时间: 2026-03-30 10:47:55

领域: cs.LG

下载: http://arxiv.org/abs/2509.26351v2

Merge and Conquer: Instructing Multilingual Models by Adding Target Language Weights

Large Language Models (LLMs) remain heavily centered on English, with limited performance in low-resource languages. Existing adaptation approaches, such as continual pre-training, demand significant computational resources. In the case of instructed models, high-quality instruction data is also required, both of which are often inaccessible for low-resource language communities. Under these constraints, model merging offers a lightweight alternative, but its potential in low-resource contexts has not been systematically explored. In this work, we explore whether it is possible to transfer language knowledge to an instruction-tuned LLM by merging it with a language-specific base model, thereby eliminating the need of language-specific instructions and repeated fine-tuning processes whenever stronger instructed variants become available. Through experiments covering four Iberian languages (Basque, Catalan, Galician, and Spanish) and two model families, we show that merging enables effective instruction following behavior in new languages and even supports multilingual capability through the combination of multiple language-specific models. Our results indicate that model merging is a viable and efficient alternative to traditional adaptation methods for low-resource languages, achieving competitive performance while greatly reducing computational cost.

Updated: 2026-03-30 10:46:50

标题: 合并与征服：通过添加目标语言权重来指导多语言模型

摘要: 大型语言模型（LLMs）仍然主要集中在英语上，在低资源语言方面的性能有限。现有的适应方法，如持续预训练，需要大量的计算资源。在指导模型的情况下，还需要高质量的指导数据，这些数据通常对于低资源语言社区是不可获得的。在这些限制条件下，模型合并提供了一种轻量级的替代方案，但其在低资源环境中的潜力尚未被系统地探索。在这项工作中，我们探讨了是否可以通过将其与特定语言基础模型合并来将语言知识转移到一个经过指导调整的LLM，从而消除了在更强的指导变体可用时需要特定于语言的指导和重复微调过程的需求。通过涵盖四种伊比利亚语言（巴斯克语、加泰罗尼亚语、加利西亚语和西班牙语）和两个模型系列的实验，我们展示了合并使得在新语言中实现有效的指导行为成为可能，甚至通过结合多个特定语言模型实现多语言功能。我们的结果表明，模型合并是低资源语言传统适应方法的一种可行且高效的替代方案，实现了竞争性的性能同时大大降低了计算成本。

更新时间: 2026-03-30 10:46:50

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2603.28263v1

Algorithmic Insurance

When AI systems make errors in high-stakes domains like medical diagnosis or autonomous vehicles, a single algorithmic flaw across varying operational contexts can generate highly heterogeneous losses that challenge traditional insurance assumptions. Algorithmic insurance constitutes a novel form of financial coverage for AI-induced damages, representing an emerging market that addresses algorithm-driven liability. However, insurers currently struggle to price these risks, while AI developers lack rigorous frameworks connecting system design with financial liability exposure. We analyze the connection between operational choices of binary classification performance to tail risk exposure. Using conditional value-at-risk (CVaR) to capture extreme losses, we prove that established approaches like maximizing accuracy can significantly increase worst-case losses compared to tail risk optimization, with penalties growing quadratically as thresholds deviate from optimal. We then propose a liability insurance contract structure that mandates risk-aware classification thresholds and characterize the conditions under which it creates value for AI providers. Our analysis extends to degrading model performance and human oversight scenarios. We validate our findings through a mammography case study, demonstrating that CVaR-optimal thresholds reduce tail risk up to 13-fold compared to accuracy maximization. This risk reduction enables insurance contracts to create 14-16% gains for well-calibrated firms, while poorly calibrated firms benefit up to 65% through risk transfer, mandatory recalibration, and regulatory capital relief. Unlike traditional insurance that merely transfers risk, algorithmic insurance can function as both a financial instrument and an operational governance mechanism, simultaneously enabling efficient risk transfer while improving AI safety.

Updated: 2026-03-30 10:38:08

标题: 算法保险

摘要: 当人工智能系统在医疗诊断或自动驾驶等高风险领域出现错误时，一个算法缺陷在不同操作环境下可能产生高度异质化的损失，挑战传统保险假设。算法保险构成了一种新型的金融覆盖形式，用于AI引发的损害，代表了一个涉及算法驱动责任的新兴市场。然而，保险公司目前在定价这些风险方面仍存在困难，而AI开发者缺乏将系统设计与金融责任敞口相连接的严格框架。我们分析了二元分类性能的操作选择与尾风险敞口之间的联系。使用条件风险价值（CVaR）来捕捉极端损失，我们证明了像最大化准确性这样的成熟方法相比尾风险优化可以显著增加最坏情况下的损失，随着阈值偏离最佳值，惩罚呈二次增长。然后，我们提出了一种责任保险合同结构，要求风险感知分类阈值，并表征了它为AI提供者创造价值的条件。我们的分析扩展到模型性能下降和人工监督场景。我们通过乳腺X线检查案例研究验证了我们的发现，证明了CVaR最优阈值相比准确性最大化可以将尾风险降低多达13倍。这种风险降低使保险合同能够为良好校准的公司创造14-16%的收益，而校准不佳的公司通过风险转移、强制重新校准和监管资本减轻可以获益高达65%。与仅仅转移风险的传统保险不同，算法保险可以同时作为一种金融工具和运营治理机制，同时实现有效的风险转移，同时提高AI的安全性。

更新时间: 2026-03-30 10:38:08

领域: cs.LG,q-fin.RM,stat.ML

下载: http://arxiv.org/abs/2106.00839v3

Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries

Categorical perception (CP) -- enhanced discriminability at category boundaries -- is among the most studied phenomena in perceptual psychology. This paper reports that analogous geometric warping occurs in the hidden-state representations of large language models (LLMs) processing Arabic numerals. Using representational similarity analysis across six models from five architecture families, the study finds that a CP-additive model (log-distance plus a boundary boost) fits the representational geometry better than a purely continuous model at 100% of primary layers in every model tested. The effect is specific to structurally defined boundaries (digit-count transitions at 10 and 100), absent at non-boundary control positions, and absent in the temperature domain where linguistic categories (hot/cold) lack a tokenisation discontinuity. Two qualitatively distinct signatures emerge: "classic CP" (Gemma, Qwen), where models both categorise explicitly and show geometric warping, and "structural CP" (Llama, Mistral, Phi), where geometry warps at the boundary but models cannot report the category distinction. This dissociation is stable across boundaries and is a property of the architecture, not the stimulus. Structural input-format discontinuities are sufficient to produce categorical perception geometry in LLMs, independently of explicit semantic category knowledge.

Updated: 2026-03-30 10:34:58

标题: 大型语言模型隐藏状态中的范畴感知：数字计数边界处的结构变形

摘要: 分类知觉（CP）——在类别边界处增强辨别能力——是感知心理学中研究最多的现象之一。本文报道了大型语言模型（LLMs）处理阿拉伯数字时隐藏状态表示中出现的类似几何扭曲现象。通过对来自五种架构家族的六个模型进行表征相似性分析，研究发现在每个测试模型的每个主要层中，CP-加法模型（对数距离加上边界提升）比纯连续模型更好地拟合了表示几何结构。这种效应特定于结构定义的边界（数字计数在10和100处的转变），在非边界控制位置不存在，并且在温度领域中，语言类别（热/冷）缺乏标记化不连续性。出现了两种定性不同的特征：“经典CP”（Gemma，Qwen），其中模型既明确地对类别进行分类又显示几何扭曲，“结构CP”（Llama，Mistral，Phi），其中几何在边界处扭曲，但模型无法报告类别区别。这种分离在边界上是稳定的，并且是架构的属性，而不是刺激的属性。结构输入格式的不连续足以独立于明确的语义类别知识在LLMs中产生分类知觉几何结构。

更新时间: 2026-03-30 10:34:58

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2603.28258v1

Nonlinear Factor Decomposition via Kolmogorov-Arnold Networks: A Spectral Approach to Asset Return Analysis

KAN-PCA is an autoencoder that uses a KAN as encoder and a linear map as decoder. It generalizes classical PCA by replacing linear projections with learned B-spline functions on each edge. The motivation is to capture more variance than classical PCA, which becomes inefficient during market crises when the linear assumption breaks down and correlations between assets change dramatically. We prove that if the spline activations are forced to be linear, KAN-PCA yields exactly the same results as classical PCA, establishing PCA as a special case. Experiments on 20 S&P 500 stocks (2015-2024) show that KAN-PCA achieves a reconstruction R^2 of 66.57%, compared to 62.99% for classical PCA with the same 3 factors, while matching PCA out-of-sample after correcting for data leakage in the training procedure.

Updated: 2026-03-30 10:33:59

标题: 通过科尔莫戈洛夫-阿诺德网络的非线性因子分解：一种资产回报分析的谱方法

摘要: KAN-PCA是一种自动编码器，它使用KAN作为编码器，线性映射作为解码器。它通过在每个边缘上用学习的B样条函数替换线性投影来推广经典PCA。动机是捕获比经典PCA更多的方差，因为在市场危机时线性假设失效，资产之间的相关性发生了显著变化，经典PCA变得低效。我们证明，如果强制样条激活为线性，KAN-PCA产生的结果与经典PCA完全相同，将PCA建立为一种特例。对20只标普500股票（2015-2024年）进行的实验表明，与相同3个因素的经典PCA相比，KAN-PCA实现了66.57%的重建R^2，而经过在训练过程中纠正数据泄漏后，在样本外匹配PCA。

更新时间: 2026-03-30 10:33:59

领域: q-fin.ST,cs.LG

下载: http://arxiv.org/abs/2603.28257v1

Benchmarking Early Deterioration Prediction Across Hospital-Rich and MCI-Like Emergency Triage Under Constrained Sensing

Emergency triage decisions are made under severe information constraints, yet most data-driven deterioration models are evaluated using signals unavailable during initial assessment. We present a leakage-aware benchmarking framework for early deterioration prediction that evaluates model performance under realistic, time-limited sensing conditions. Using a patient-deduplicated cohort derived from MIMIC-IV-ED, we compare hospital-rich triage with a vitals-only, MCI-like setting, restricting inputs to information available within the first hour of presentation. Across multiple modeling approaches, predictive performance declines only modestly when limited to vitals, indicating that early physiological measurements retain substantial clinical signal. Structured ablation and interpretability analyses identify respiratory and oxygenation measures as the most influential contributors to early risk stratification, with models exhibiting stable, graceful degradation as sensing is reduced. This work provides a clinically grounded benchmark to support the evaluation and design of deployable triage decision-support systems in resource-constrained settings.

Updated: 2026-03-30 10:33:40

标题: 跨医院丰富和类MCI的急诊分诊下受限感知的早期恶化预测基准对比

摘要: 急诊分诊决策在信息受限的情况下进行，然而大多数基于数据驱动的恶化模型是使用在初始评估时不可用的信号进行评估的。我们提出了一个考虑泄露的基准评估框架，用于早期恶化预测，评估模型在现实的、时间有限的感知条件下的表现。使用从MIMIC-IV-ED派生的患者去重队列，我们比较了富有医院特色的分诊与仅基于生命体征的、类似多受伤者事件的设置，将输入限制为在就诊的第一个小时内可用的信息。通过多种建模方法，预测性能仅在仅限于生命体征时略有下降，表明早期生理测量保留了大量的临床信号。结构化的消融和可解释性分析确定呼吸和氧合指标是早期风险分层的最有影响力的贡献者，模型在感知减少时表现出稳定、优雅的退化。这项工作提供了一个基于临床的基准，支持对资源受限环境中可部署的分诊决策支持系统的评估和设计。

更新时间: 2026-03-30 10:33:40

领域: cs.CY,cs.AI,cs.LG

下载: http://arxiv.org/abs/2602.20168v2

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce {\method}, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton--Schulz using row/column squared-norm statistics and only $\mathcal{O}(m+n)$ auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by {\method}, the row-normalized variant R is the natural default and preserves the $\widetilde{\mathcal{O}}(T^{-1/4})$ stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default R variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.

Updated: 2026-03-30 10:28:18

标题: MuonEq：在正交化之前进行轻量级均衡

摘要: 正交更新优化器，如Muon，改进了矩阵参数的训练，但现有的扩展大多在正交化后通过重新缩放更新或在之前使用更重的白化预处理器后才起作用。我们介绍了{\method}，这是Muon的一个轻量级的预正交均衡方案家族，有三种形式：双侧行/列归一化（RC），行归一化（R）和列归一化（C）。这些变体在有限步骤Newton-Schulz之前使用行/列平方范数统计重新平衡动量矩阵，仅需要$\mathcal{O}(m+n)$的辅助状态。我们表明，有限步骤的正交化由输入的谱特性控制，特别是稳定秩和条件数，并且行/列归一化是一种零阶白化替代方法，可以消除边缘尺度不匹配。对于{\method}所针对的隐藏矩阵权重，行归一化变体R是自然的默认选择，并保留了Muon类型方法的$\widetilde{\mathcal{O}}(T^{-1/4})$稳态保证。在C4上的LLaMA2预训练中，默认的R变体在130M和350M模型上一直优于Muon，收敛更快，验证困惑度更低。

更新时间: 2026-03-30 10:28:18

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.28254v1

MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations

Time series forecasting is vital across many domains, yet existing models struggle with fixed-length inputs and inadequate multi-scale modeling. We propose MR-CDM, a framework combining hierarchical multi-resolution trend decomposition, an adaptive embedding mechanism for variable-length inputs, and a multi-scale conditional diffusion process. Evaluations on four real-world datasets demonstrate that MR-CDM significantly outperforms state-of-the-art baselines (e.g., CSDI, Informer), reducing MAE and RMSE by approximately 6-10 to a certain degree.

Updated: 2026-03-30 10:25:35

标题: MR-ImagenTime：通过双图像表示生成多分辨率时间序列

摘要: 时间序列预测在许多领域都是至关重要的，然而现有模型在固定长度输入和不足的多尺度建模方面存在困难。我们提出了MR-CDM，这是一个结合了层次多分辨率趋势分解、适应性嵌入机制用于可变长度输入以及多尺度条件扩散过程的框架。对四个真实数据集的评估表明，MR-CDM明显优于最先进的基线模型（例如CSDI、Informer），在一定程度上将MAE和RMSE降低了大约6-10。

更新时间: 2026-03-30 10:25:35

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28253v1

Secret Key Rate Analysis of RIS-Assisted THz MIMO CV-QKD Systems under Localized and Global Eavesdropping

A multiple-input multiple-output (MIMO) system operating at terahertz (THz) frequencies and consisting of a transmitter, Alice, that encodes secret keys using Gaussian-modulated coherent states, which are communicated to a legitimate receiver, Bob, under the assistance of a reconfigurable intelligent surface (RIS) is considered in this paper. The composite wireless channel comprising the direct Alice-to-Bob signal propagation path and the RIS-enabled reflected one is modeled as a passive linear Gaussian quantum channel, allowing for a unitary dilation that preserves the canonical commutation relations. The security of the considered RIS-empowered MIMO system is analyzed under collective Gaussian entangling attacks, according to which an eavesdropper, Eve, is assumed to have access to environmental modes associated with specific propagation segments. We also study, as a benchmark, the case where Eve has access to the purification of the overall channel. The legitimate receiver, Bob, is designed to deploy homodyne detection and reverse reconciliation for key extraction. Novel expressions for the achievable secret key rate (SKR) of the system are derived for both the considered eavesdropping scenarios. Furthermore, an optimization framework is developed to determine the optimal RIS phase configuration matrix that maximizes the SKR performance. The resulting optimization problem is efficiently solved using particle swarm optimization. Numerical results are presented to demonstrate the system's performance with respect to various free parameters. It is showcased that the considered RIS plays a crucial role in enhancing the SKR of the system as well as in extending the secure communication range. This establishes RIS-assisted THz MIMO CV-QKD as a promising solution for next generation secure wireless networks.

Updated: 2026-03-30 10:25:30

标题: RIS辅助的THz MIMO CV-QKD系统在局部和全局窗口窃听下的秘钥速率分析

摘要: 本文考虑了一个在太赫兹（THz）频率下运行的多输入多输出（MIMO）系统，由一个用高斯调制相干态编码秘钥的发射机Alice组成，这些秘钥在可重构智能表面（RIS）的协助下传输给合法接收机Bob。由直接Alice到Bob信号传播路径和RIS启用的反射路径组成的复合无线信道被建模为被动线性高斯量子信道，允许保持规范对易关系的幺正膨胀。考虑的RIS增强MIMO系统的安全性在集体高斯纠缠攻击下进行分析，根据这种攻击，假设窃听者Eve可以接触与特定传输段相关联的环境模式。我们还研究了一个基准案例，即Eve可以接触整个信道的纯化。合法接收机Bob被设计为使用同相位检测和反向协商进行密钥提取。对于所考虑的窃听场景，推导了系统可实现的秘密密钥速率（SKR）的新表达式。此外，开发了一个优化框架，以确定最大化SKR性能的最佳RIS相位配置矩阵。利用粒子群优化高效地解决了所得到的优化问题。通过数值结果展示了系统在各种自由参数方面的性能。展示了所考虑的RIS在增强系统的SKR以及扩展安全通信范围方面起着关键作用。这将RIS辅助的THz MIMO CV-QKD作为下一代安全无线网络的有希望的解决方案。

更新时间: 2026-03-30 10:25:30

领域: cs.IT,cs.CR,eess.SP

下载: http://arxiv.org/abs/2603.28252v1

Noise in Photonic Quantum Machine Learning: Models, Impacts, and Mitigation Strategies

Photonic Quantum Machine Learning (PQML) is an emerging method to implement scalable, energy-efficient quantum information processing by combining photonic quantum computing technologies with machine learning techniques. The features of photonic technologies offer several benefits: room-temperature operation; fast (low delay) processing of signals; and the possibility of representing computations in high-dimensional (Hilbert) spaces. This makes photonic technologies a good candidate for the near-term development of quantum devices. However, noise is still a major limiting factor for the performance, reliability, and scalability of PQML implementations. This review provides a detailed and systematic analysis of the sources of noise that will affect PQML implementations. We will present an overview of the principal photonic quantum computer designs and summarize the many different types of quantum machine learning algorithms that have been successfully implemented using photonic quantum computer architectures such as variational quantum circuits, quantum neural networks, and quantum support vector machines. We identify and categorize the primary sources of noise within photonic quantum systems and how these sources of noise behave algorithm-specifically with respect to degrading the accuracy of learning, unstable training, and slower convergence than expected. Additionally, we review traditional and advanced techniques for characterizing noise and provide an extensive survey of strategies for mitigating the effects of noise on learning performance. Finally, we discuss recent advances that demonstrate PQML's capability to operate in real-world settings with realistic noise conditions and future obstacles that will challenge the use of PQML as an effective quantum processing platform.

Updated: 2026-03-30 10:13:14

标题: 光子量子机器学习中的噪声：模型、影响和缓解策略

摘要: 光子量子机器学习（PQML）是一种新兴的方法，通过将光子量子计算技术与机器学习技术结合，实现可扩展、节能的量子信息处理。光子技术的特点带来了几个优点：室温操作；快速（低延迟）信号处理；以及在高维（希尔伯特）空间中表示计算的可能性。这使得光子技术成为近期发展量子设备的良好选择。然而，噪音仍然是影响PQML实施性能、可靠性和可扩展性的主要限制因素。本综述提供了对可能影响PQML实施的噪音来源的详细系统分析。我们将概述主要光子量子计算机设计，并总结已成功使用光子量子计算机架构实施的许多不同类型的量子机器学习算法，如变分量子电路、量子神经网络和量子支持向量机。我们识别和分类了光子量子系统内的主要噪音来源，以及这些噪音如何在特定算法中表现，影响学习准确性、不稳定训练和比预期更慢的收敛。此外，我们回顾了用于表征噪音的传统和先进技术，并提供了大量的策略调查，以减轻噪音对学习性能的影响。最后，我们讨论了最近的进展，展示了PQML在现实世界环境中以真实噪音条件运行的能力，以及未来将挑战PQML作为有效的量子处理平台的障碍。

更新时间: 2026-03-30 10:13:14

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2603.09645v2

Software Supply Chain Smells: Lightweight Analysis for Secure Dependency Management

Modern software systems heavily rely on third-party dependencies, making software supply chain security a critical concern. We introduce the concept of software supply chain smells as structural indicators that signal potential security risks. We design and evaluate Dirty-Waters, a novel tool for detecting such smells in the supply chains of software packages. Through interviews with practitioners, we show that our proposed smells align with real-world concerns and capture signals considered valuable. A quantitative study of popular packages in the Maven and NPM ecosystems reveals that while smells are prevalent in both, they differ significantly across ecosystems, with traceability and signing issues dominating in Maven and most smells being rare in NPM, due to strong registry-level guarantees. Software supply chain smells support developers and organizations in making informed decisions and improving their software supply chain security posture.

Updated: 2026-03-30 09:57:31

标题: 软件供应链异味：轻量级分析用于安全依赖管理

摘要: 现代软件系统严重依赖第三方依赖关系，使得软件供应链安全成为一个关键关注点。我们引入了软件供应链气味的概念，作为表明潜在安全风险的结构指标。我们设计并评估了Dirty-Waters，一个用于检测软件包供应链中此类气味的新颖工具。通过与实践者的访谈，我们展示了我们提出的气味与现实世界的关注点一致，并捕捉到被认为有价值的信号。对Maven和NPM生态系统中流行软件包的定量研究表明，虽然气味在两者中普遍存在，但在不同生态系统中存在显著差异，Maven中以可追溯性和签名问题为主，而NPM中大多数气味很少见，因为强大的注册级保证。软件供应链气味支持开发人员和组织做出明智决策，并改善其软件供应链安全姿态。

更新时间: 2026-03-30 09:57:31

领域: cs.SE,cs.CR

下载: http://arxiv.org/abs/2603.24282v2

Boltzmann Generators for Condensed Matter via Riemannian Flow Matching

Sampling equilibrium distributions is fundamental to statistical mechanics. While flow matching has emerged as scalable state-of-the-art paradigm for generative modeling, its potential for equilibrium sampling in condensed-phase systems remains largely unexplored. We address this by incorporating the periodicity inherent to these systems into continuous normalizing flows using Riemannian flow matching. The high computational cost of exact density estimation intrinsic to continuous normalizing flows is mitigated by using Hutchinson's trace estimator, utilizing a crucial bias-correction step based on cumulant expansion to render the stochastic estimates suitable for rigorous thermodynamic reweighting. Our approach is validated on monatomic ice, demonstrating the ability to train on systems of unprecedented size and obtain highly accurate free energy estimates without the need for traditional multistage estimators.

Updated: 2026-03-30 09:55:00

标题: Boltzmann生成器通过黎曼流匹配用于凝聚态物质

摘要: 采样平衡分布对于统计力学至关重要。虽然流匹配已经成为可伸缩的最先进的生成建模范式，但其在凝聚相系统中的平衡采样潜力仍然很少被探索。我们通过将这些系统固有的周期性纳入连续正则化流中，利用黎曼流匹配来解决这个问题。连续正则化流中固有的精确密度估计的高计算成本通过使用Hutchinson的迹估计器得到缓解，利用基于累积展开的关键偏差校正步骤使得随机估计适用于严格的热力学重加权。我们的方法在单原子冰上得到验证，展示了能够在前所未有的规模系统上进行训练并获得高度准确的自由能估计，而无需传统的多阶段估计器。

更新时间: 2026-03-30 09:55:00

领域: physics.comp-ph,cond-mat.stat-mech,cs.LG,stat.ML

下载: http://arxiv.org/abs/2602.18482v2

Detecting the Unexpected: AI-Driven Anomaly Detection in Smart Bridge Monitoring

Bridges are critical components of national infrastructure and smart cities. Therefore, smart bridge monitoring is essential for ensuring public safety and preventing catastrophic failures or accidents. Traditional bridge monitoring methods rely heavily on human visual inspections, which are time-consuming and prone to subjectivity and error. This paper proposes an artificial intelligence (AI)-driven anomaly detection approach for smart bridge monitoring. Specifically, a simple machine learning (ML) model is developed using real-time sensor data collected by the iBridge sensor devices installed on a bridge in Norway. The proposed model is evaluated against different ML models. Experimental results demonstrate that the density-based spatial clustering of applications with noise (DBSCAN)-based model outperforms in accurately detecting the anomalous events (bridge accident). These findings indicate that the proposed model is well-suited for smart bridge monitoring and can enhance public safety by enabling the timely detection of unforeseen incidents.

Updated: 2026-03-30 09:47:36

标题: 探测意外情况：智能桥梁监测中的人工智能驱动异常检测

摘要: 桥梁是国家基础设施和智慧城市的关键组成部分。因此，智慧桥梁监测对确保公共安全、预防灾难性故障或事故至关重要。传统的桥梁监测方法主要依赖于人工视觉检查，耗时且容易受主观性和错误影响。本文提出了一种基于人工智能（AI）的异常检测方法，用于智慧桥梁监测。具体而言，该方法使用挪威某桥梁上安装的iBridge传感器设备收集的实时传感器数据开发了一个简单的机器学习（ML）模型。提出的模型与不同的ML模型进行了评估。实验结果表明，基于密度的带有噪声的空间聚类（DBSCAN）模型在准确检测异常事件（桥梁事故）方面表现优异。这些发现表明，该模型适用于智慧桥梁监测，并通过及时发现意外事件来提高公共安全。

更新时间: 2026-03-30 09:47:36

领域: cs.LG

下载: http://arxiv.org/abs/2603.28225v1

Variational Neurons in Transformers for Language Modeling

Transformers for language modeling usually rely on deterministic internal computation, with uncertainty expressed mainly at the output layer. We introduce variational neurons into Transformer feed-forward computation so that uncertainty becomes part of the internal computation itself. Concretely, we replace deterministic feed-forward units with local variational units based on EVE while preserving the overall Transformer backbone. We evaluate this design in compact next-token language-modeling settings. We compare deterministic and variational variants with both predictive and probabilistic criteria. Alongside negative log-likelihood, perplexity and accuracy, we analyze calibration, conditional variance, mutual information and latent-usage statistics. The resulting picture is clear. Variational neurons integrate stably into Transformers, preserve strong predictive performance and produce informative uncertainty signals. The experiments also show that task quality, useful depth and internal stability are distinct properties. These results establish variational Transformers as a practical form of uncertainty-aware language modeling. They show that Transformers can predict with an explicit internal structure of uncertainty, which supports stronger probabilistic evaluation and a more informative analysis of model behavior.

Updated: 2026-03-30 09:39:00

标题: 变分神经元在变压器中用于语言建模

摘要: 用于语言建模的Transformer通常依赖于确定性的内部计算，不确定性主要体现在输出层。我们将变分神经元引入Transformer前馈计算中，使不确定性成为内部计算的一部分。具体来说，我们用基于EVE的局部变分单元替换确定性前馈单元，同时保留整体Transformer骨干结构。我们在紧凑的下一个标记语言建模设置中评估了这一设计。我们比较了确定性和变分变体，用预测和概率标准进行评估。除了负对数似然、困惑度和准确性，我们还分析了校准、条件方差、互信息和潜在使用统计数据。结果很明显。变分神经元稳定地融入了Transformer，保持了强大的预测性能，并产生了有信息量的不确定性信号。实验还表明，任务质量、有用深度和内部稳定性是不同的特性。这些结果确立了变分Transformer作为一种实用的具有不确定性意识的语言建模形式。它们表明，Transformer可以以明确的不确定内部结构进行预测，从而支持更强的概率评估，并更具信息量地分析模型行为。

更新时间: 2026-03-30 09:39:00

领域: cs.LG

下载: http://arxiv.org/abs/2603.28219v1

ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models

Reinforcement learning from verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models. However, standard Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning paths. Through systematic empirical analysis, we identify Critical Decision Pivots (CDPs): transient high-entropy states where the policy's trajectory is most sensitive to perturbations. These pivots represent the "forks in the road" where effective multi-path exploration is most crucial yet often suppressed by uniform advantage signals. Building on these insights, we propose Entropy-Regulated Policy Optimization (ERPO), which transitions the optimization focus from coarse sequences to fine-grained token dynamics. ERPO introduces three synergistic components: (i) Entropy-aware Gating, which adaptively amplifies exploration at CDPs to facilitate diverse path discovery; (ii) Bucket-based Implicit Normalization, which mitigates difficulty bias by aligning token progress windows; and (iii) Result-anchored Advantage Synthesis, which re-weights token-level signals via outcome-driven anchors. Extensive experiments on competitive mathematical benchmarks (e.g., MATH, AIME) demonstrate that ERPO significantly outperforms GRPO. Notably, ERPO not only boosts reasoning accuracy but also yields significantly more concise and robust derivation paths, establishing a new efficiency-accuracy frontier for large reasoning models.

Updated: 2026-03-30 09:20:25

标题: ERPO：大型推理模型的标记级熵调节策略优化

摘要: 可以的，翻译如下：可验证奖励的强化学习（RLVR）显著提升了大型语言模型的推理能力。然而，标准的群体相对策略优化（GRPO）通常会给所有令牌分配统一的序列级优势，从而忽视了推理链中固有的信息异质性。我们发现，这种粗粒度的信用分配导致熵的过早崩溃，并鼓励模型生成冗余、低质量的推理路径。通过系统的实证分析，我们确定了关键决策枢纽（CDPs）：瞬时高熵状态，在这些状态下，策略的轨迹对扰动最为敏感。这些枢纽代表了“路口”，在这里有效的多路径探索是至关重要的，但往往被统一的优势信号所抑制。基于这些见解，我们提出了熵调节策略优化（ERPO），将优化焦点从粗粒度序列转移到细粒度令牌动态。ERPO引入了三个协同组成部分：（i）熵感知门控，自适应地增强在CDPs的探索，以促进多样化路径的发现；（ii）基于桶的隐式归一化，通过对齐令牌进度窗口来缓解困难偏差；（iii）基于结果锚定的优势合成，通过结果驱动的锚点重新加权令牌级信号。对竞争性数学基准测试（例如，MATH，AIME）的广泛实验表明，ERPO明显优于GRPO。值得注意的是，ERPO不仅提高了推理准确性，还产生了更为简洁和稳健的推导路径，为大型推理模型建立了新的效率-准确性前沿。

更新时间: 2026-03-30 09:20:25

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2603.28204v1

Differentiable Power-Flow Optimization

With the rise of renewable energy sources and their high variability in generation, the management of power grids becomes increasingly complex and computationally demanding. Conventional AC-power-flow simulations, which use the Newton-Raphson (NR) method, suffer from poor scalability, making them impractical for emerging use cases such as joint transmission-distribution modeling and global grid analysis. At the same time, purely data-driven surrogate models lack physical guarantees and may violate fundamental constraints. In this work, we propose Differentiable Power-Flow (DPF), a reformulation of the AC power-flow problem as a differentiable simulation. DPF enables end-to-end gradient propagation from the physical power mismatches to the underlying simulation parameters, thereby allowing these parameters to be identified efficiently using gradient-based optimization. We demonstrate that DPF provides a scalable alternative to NR by leveraging GPU acceleration, sparse tensor representations, and batching capabilities available in modern machine-learning frameworks such as PyTorch. DPF is especially suited as a tool for time-series analyses due to its efficient reuse of previous solutions, for N-1 contingency-analyses due to its ability to process cases in batches, and as a screening tool by leveraging its speed and early stopping capability. The code is available in the authors' code repository.

Updated: 2026-03-30 09:19:43

标题: 可微分潮流优化

摘要: 随着可再生能源的兴起及其在发电中的高度变动性，电网管理变得越来越复杂且计算需求量大。传统的交流功率流模拟使用牛顿-拉弗森（NR）方法，存在可扩展性差的问题，因此在新兴用例如联合输电-配电建模和全球电网分析上变得不切实际。同时，纯数据驱动的替代模型缺乏物理保证，可能违反基本约束。本文提出了可微功率流（DPF），将交流功率流问题重新表述为一个可微模拟。DPF使得能够从物理功率不匹配到底层模拟参数进行端到端的梯度传播，从而能够通过基于梯度的优化高效地确定这些参数。我们证明，通过利用现代机器学习框架（如PyTorch）中的GPU加速、稀疏张量表示和批处理能力，DPF提供了一种可扩展的替代方案，特别适用于时间序列分析，因为它能够高效地重复使用之前的解决方案，适用于N-1事故分析，因为它能够批处理案例，以及通过其速度和提前停止能力利用作为筛选工具。作者的代码存储库中提供了代码。

更新时间: 2026-03-30 09:19:43

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.28203v1

SkillRouter: Skill Routing for LLM Agents at Scale

Reusable skills let LLM agents package task-specific procedures, tool affordances, and execution guidance into modular building blocks. As skill ecosystems grow to tens of thousands of entries, exposing every skill at inference time becomes infeasible. This creates a skill-routing problem: given a user task, the system must identify relevant skills before downstream planning or execution. Existing agent stacks often rely on progressive disclosure, exposing only skill names and descriptions while hiding the full implementation body. We examine this design choice on a SkillsBench-derived benchmark with approximately 80K candidate skills, targeting the practically important setting of large skill registries with heavy overlap. Across representative sparse, dense, and reranking baselines on this setting, hiding the skill body causes a 31--44 percentage point drop in routing accuracy, showing that full skill text is a critical routing signal in this setting rather than a minor metadata refinement. Motivated by this finding, we present SkillRouter, a compact 1.2B full-text retrieve-and-rerank pipeline. SkillRouter achieves 74.0% Hit@1 on our benchmark -- the strongest average top-1 routing performance among the baselines we evaluate -- while using 13$\times$ fewer parameters and running 5.8$\times$ faster than the strongest base pipeline. In a complementary end-to-end study across four coding agents, routing gains transfer to improved task success, with larger gains for more capable agents.

Updated: 2026-03-30 09:19:32

标题: SkillRouter：大规模LLM代理的技能路由

摘要: 可重复使用的技能使LLM代理能够将特定任务的程序、工具支持和执行指导打包成模块化的构建块。随着技能生态系统增长到成千上万个条目，将每个技能暴露在推理时变得不可行。这引发了一个技能路由问题：在下游规划或执行之前，系统必须确定相关技能以应对用户任务。现有的代理堆栈通常依赖于逐步披露，仅暴露技能名称和描述，同时隐藏完整的实现细节。我们在一个由SkillsBench派生的基准测试中检查这种设计选择，该基准测试拥有大约80,000个候选技能，并针对具有重叠较多的大型技能注册表的实际重要设置。在这种设置下，通过代表性的稀疏、密集和重新排名基线，隐藏技能主体会导致路由准确性下降31-44个百分点，表明在这种设置中，完整的技能文本是一个关键的路由信号，而不是一个次要的元数据优化。受到这一发现的启发，我们提出了SkillRouter，一个紧凑的12亿全文检索和重新排名管道。SkillRouter在我们的基准测试中实现了74.0%的Hit@1 -- 在我们评估的基线中具有最强大的平均前1路由性能 -- 同时使用的参数比最强大的基本管道少13倍，运行速度比最强大的基本管道快5.8倍。在跨四个编码代理进行的补充端到端研究中，路由增益转化为改进的任务成功率，对于更有能力的代理来说，获益更大。

更新时间: 2026-03-30 09:19:32

领域: cs.LG

下载: http://arxiv.org/abs/2603.22455v2

A Perturbation Approach to Unconstrained Linear Bandits

We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem. Our framework improves on prior work in several ways. First, we derive expected-regret guarantees when our perturbation scheme is combined with comparator-adaptive OLO algorithms, leading to new insights about the impact of different adversarial models on the resulting comparator-adaptive rates. We also extend our analysis to dynamic regret, obtaining the optimal $\sqrt{P_T}$ path-length dependencies without prior knowledge of $P_T$. We then develop the first high-probability guarantees for both static and dynamic regret in uBLO. Finally, we discuss lower bounds on the static regret, and prove the folklore $Ω(\sqrt{dT})$ rate for adversarial linear bandits on the unit Euclidean ball, which is of independent interest.

Updated: 2026-03-30 09:17:46

标题: 一种扰动方法用于无约束线性赌博机

摘要: 我们重新审视了Abernethy等人（2008年）在无约束Bandit线性优化（uBLO）背景下的标准扰动方法。我们展示了一个令人惊讶的结果，即在无约束设置下，这种方法有效地将Bandit线性优化（BLO）简化为标准的在线线性优化（OLO）问题。我们的框架在几个方面改进了先前的工作。首先，当我们的扰动方案与比较器自适应OLO算法结合时，我们推导出了期望遗憾保证，从而揭示了不同对手模型对结果比较器自适应速率的影响的新见解。我们还将分析扩展到动态遗憾，获得了在没有先验知识的情况下获得最佳的$\sqrt{P_T}$路径长度依赖性。然后，我们为uBLO中的静态和动态遗憾开发了第一高概率保证。最后，我们讨论了静态遗憾的下界，并证明了在单位欧几里德球上的对手线性bandit的民间传说$Ω(\sqrt{dT})$速率，这具有独立的利益。

更新时间: 2026-03-30 09:17:46

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2603.28201v1

MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and GEMM kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. On the Llama and Qwen model families, MicroMix achieves near-FP16 performance across diverse downstream tasks with an average precision of 5 bits. In particular, Qwen2.5-32B-Base, Coder and Math exhibit lossless accuracy on zero-shot, code generation, and mathematical reasoning benchmarks. In addition, on RTX 5070Ti laptop and RTX 5090 GPUs, our kernel achieves 2.29-3.38x acceleration compared to TensorRT-FP16. Our code is available at https://github.com/lwy2020/MicroMix.

Updated: 2026-03-30 09:16:28

标题: MicroMix：用于大型语言模型的高效混合精度量化和微缩放格式

摘要: 量化通过用低精度矩阵替换原始高精度矩阵，显著加速大型语言模型（LLMs）的推断过程。最近在权重-激活量化方面的进展主要集中在将权重和激活映射到INT4格式上。尽管NVIDIA的Blackwell架构中的新FP4张量核心比FP16提供了高达4倍的加速，但现有的基于INT4的内核未能充分利用这一功能，原因是数据格式不匹配。为了弥合这一差距，我们提出了MicroMix，一种基于Microscaling（MX）数据格式的混合精度量化算法和GEMM内核的共同设计。MicroMix内核专为Blackwell架构定制，支持MXFP4、MXFP6和MXFP8通道的任意组合，并产生BFloat16输出。为了在每个线性层中实现精度和效率之间的良好平衡，我们引入了量化阈值，用于识别激活元素，其中较低精度格式（MXFP4或MXFP6）会导致过多的量化误差。我们的算法有选择地分配更高精度的通道以保持准确性同时保持计算效率。在Llama和Qwen模型系列上，MicroMix在各种下游任务中实现了接近FP16性能，平均精度为5位。特别地，Qwen2.5-32B-Base、Coder和Math在零样本、代码生成和数学推理基准测试中表现出无损精度。此外，在RTX 5070Ti笔记本电脑和RTX 5090 GPU上，我们的内核相比于TensorRT-FP16实现了2.29-3.38倍的加速。我们的代码可在https://github.com/lwy2020/MicroMix找到。

更新时间: 2026-03-30 09:16:28

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.02343v2

A Deep Reinforcement Learning Framework for Closed-loop Guidance of Fish Schools via Virtual Agents

Guiding collective motion in biological groups is a fundamental challenge in understanding social interaction rules and developing automated systems for animal management. In this study, we propose a deep reinforcement learning (RL) framework for the closed-loop guidance of fish schools using virtual agents. These agents are controlled by policies trained via Proximal Policy Optimization (PPO) in simulation and deployed in physical experiments with rummy-nose tetras (Petitella bleheri), enabling real-time interaction between artificial agents and live individuals. To cope with the stochastic behavior of live individuals, we design a composite reward function to balance directional guidance with social cohesion. Our systematic evaluation of visual parameters shows that a white background and larger stimulus sizes maximize guidance efficacy in physical trials. Furthermore, evaluation across group sizes revealed that while the system demonstrates effective guidance for groups of five individuals, this capability markedly degrades as group size increases to eight. This study highlights the potential of deep RL for automated guidance of biological collectives and identifies challenges in maintaining artificial influence in larger groups.

Updated: 2026-03-30 09:10:02

标题: 一种用于通过虚拟代理实现鱼群闭环引导的深度强化学习框架

摘要: 本研究提出了一种用于闭环引导鱼群的深度强化学习（RL）框架，使用虚拟代理控制。这些代理通过在模拟中经过Proximal Policy Optimization（PPO）训练的策略进行控制，并在实验中与红鼻子四线小刺鱼（Petitella bleheri）一起部署，实现了人工代理和活体个体之间的实时互动。为了应对活体个体的随机行为，我们设计了一个复合奖励函数，以平衡方向引导和社会凝聚力。我们系统地评估了视觉参数，结果显示在物理实验中，白色背景和更大的刺激大小最大化了引导效果。此外，跨群体规模的评估显示，尽管系统对五个个体的群体展示了有效的引导，但当群体规模增加到八个时，这一能力明显下降。这项研究突出了深度强化学习在生物集体自动引导中的潜力，并识别了在维持更大群体中的人工影响方面的挑战。

更新时间: 2026-03-30 09:10:02

领域: cs.RO,cs.LG,q-bio.PE

下载: http://arxiv.org/abs/2603.28200v1

Policy-Controlled Generalized Share: A General Framework with a Transformer Instantiation for Strictly Online Switching-Oracle Tracking

Static regret to a single expert is often the wrong target for strictly online prediction under non-stationarity, where the best expert may switch repeatedly over time. We study Policy-Controlled Generalized Share (PCGS), a general strictly online framework in which the generalized-share recursion is fixed while the post-loss update controls are allowed to vary adaptively. Its principal instantiation in this paper is PCGS-TF, which uses a causal Transformer as an update controller: after round t finishes and the loss vector is observed, the Transformer outputs the controls that map w_t to w_{t+1} without altering the already committed decision w_t. Under admissible post-loss update controls, we obtain a pathwise weighted regret guarantee for general time-varying learning rates, and a standard dynamic-regret guarantee against any expert path with at most S switches under the constant-learning-rate specialization. Empirically, on a controlled synthetic suite with exact dynamic-programming switching-oracle evaluation, PCGS-TF attains the lowest mean dynamic regret in all seven non-stationary families, with its advantage increasing for larger expert pools. On a reproduced household-electricity benchmark, PCGS-TF also achieves the lowest normalized dynamic regret for S = 5, 10, and 20.

Updated: 2026-03-30 09:07:10

标题: 策略控制的广义共享：一个具有Transformer实例化的通用框架，用于严格在线切换Oracle跟踪

摘要: 单个专家的静态遗憾往往是在非稳态条件下严格在线预测的错误目标，其中最佳专家可能会随时间反复切换。我们研究了政策控制广义共享（PCGS），这是一个通用的严格在线框架，其中广义共享递归被固定，而后损失更新控制可以自适应地变化。本文中的主要实例化是PCGS-TF，它使用因果Transformer作为更新控制器：在第t轮结束并观察到损失向量后，Transformer输出将w_t映射到w_{t+1}的控制，而不改变已经确定的决策w_t。在可接受的后损失更新控制下，我们获得了一种适用于一般时变学习率的路径加权遗憾保证，以及在常数学习率特化下对最多S次切换的任何专家路径的标准动态遗憾保证。在一个具有精确动态规划切换预测的受控合成套件上的实证研究中，PCGS-TF在所有七个非稳态家族中获得了最低的平均动态遗憾，其优势在专家池更大时增加。在一个重现的家庭电力基准测试中，PCGS-TF在S = 5、10和20时也实现了最低的归一化动态遗憾。

更新时间: 2026-03-30 09:07:10

领域: cs.LG,q-fin.ST

下载: http://arxiv.org/abs/2603.28198v1

PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling

Large language models (LLMs) have shown that generative pretraining can distill vast world knowledge into compact token representations. While LLMs encapsulate extensive world knowledge, they remain limited in modeling the behavioral knowledge contained within user interaction histories. User behavior forms a distinct modality, where each action, defined by multi-dimensional attributes such as time, context, and transaction type, constitutes a behavioral token. Modeling these high-cardinality sequences is challenging, and discriminative models often falter under limited supervision. To bridge this gap, we extend generative pretraining to user behavior, learning transferable representations from unlabeled behavioral data analogous to how LLMs learn from text. We present PANTHER, a hybrid generative-discriminative framework that unifies user behavior pretraining and downstream adaptation, enabling large-scale sequential user representation learning and real-time inference. PANTHER introduces: (1) Structured Tokenization to compress multi-dimensional transaction attributes into an interpretable vocabulary; (2) Sequence Pattern Recognition Module (SPRM) for modeling periodic transaction motifs; (3) a Unified User-Profile Embedding that fuses static demographics with dynamic transaction histories; and (4) Real-time scalability enabled by offline caching of pretrained embeddings for millisecond-level inference. Fully deployed and operational online at WeChat Pay, PANTHER delivers a 25.6 percent boost in next-transaction prediction HitRate@1 and a 38.6 percent relative improvement in fraud detection recall over baselines. Cross-domain evaluations on public benchmarks show strong generalization, achieving up to 21 percent HitRate@1 gains over transformer baselines, establishing PANTHER as a scalable, high-performance framework for industrial sequential user behavior modeling.

Updated: 2026-03-30 09:05:31

标题: PANTHER：超越语言的生成式预训练，用于序列用户行为建模

摘要: 大型语言模型(LLMs)已经表明生成式预训练可以将广泛的世界知识蒸馏成紧凑的令牌表示。虽然LLMs封装了广泛的世界知识，但在建模用户互动历史中包含的行为知识方面仍存在局限性。用户行为形成了一个独特的模态，其中每个动作由多维属性（如时间、上下文和交易类型）定义，构成了一个行为令牌。建模这些高基数序列是具有挑战性的，鉴别模型在有限监督下往往表现不佳。为了弥合这一差距，我们将生成式预训练扩展到用户行为，从未标记的行为数据中学习可转移的表示，类似于LLMs从文本中学习的方式。我们提出了PANTHER，一个混合生成-鉴别框架，统一了用户行为预训练和下游适应，实现了大规模顺序用户表示学习和实时推断。PANTHER引入了：(1) 结构化标记化，将多维交易属性压缩成可解释的词汇表；(2) 序列模式识别模块(SPRM)，用于建模周期性交易模式；(3) 一个统一的用户配置文件嵌入，将静态人口统计数据与动态交易历史融合；以及(4) 通过预训练嵌入的离线缓存实现的实时可扩展性，用于毫秒级推断。在微信支付上完全部署和运行的PANTHER在下一个交易预测HitRate@1上实现了25.6%的提升，并相对于基线改进了38.6%的欺诈检测召回率。在公共基准上进行跨领域评估显示出强大的泛化能力，HitRate@1相对于变压器基线提高了21%，将PANTHER确立为一个可扩展的、高性能的工业顺序用户行为建模框架。

更新时间: 2026-03-30 09:05:31

领域: cs.LG

下载: http://arxiv.org/abs/2510.10102v2

LiFeChain: Lightweight Blockchain for Secure and Efficient Federated Lifelong Learning in IoT

Internet of Things (IoT) devices constantly generate heterogeneous data streams, driving demand for continuous, decentralized intelligence. Federated Lifelong Learning (FLL) provides an ideal solution by incorporating federated learning and lifelong learning. However, the extended lifecycle of FLL in IoT systems increases their vulnerability to persistent attacks. This problem is exacerbated by the single point of failure. Furthermore, the single point of trust created by the central server hinders reliable auditing for long-term threats. Blockchain technology provides a tamper-proof foundation for trustworthy FLL. Nevertheless, directly applying blockchain to FLL significantly increases computational and retrieval costs with the expansion of the knowledge base, slowing down the training on resource-constrained IoT devices. To address these challenges, we propose LiFeChain, a lightweight blockchain for secure and efficient federated lifelong learning with minimal on-chain disclosure and bidirectional verification. LiFeChain is the first blockchain tailored for FLL. It incorporates two complementary mechanisms: the Proof-of-Model-Correlation (PoMC) consensus on the server, which couples learning and unlearning mechanisms to mitigate negative transfer; and Segmented Zero-knowledge Arbitration (Seg-ZA) at the client, which detects and arbitrates abnormal committee behavior without compromising privacy. LiFeChain is a plug-and-play component that can be seamlessly integrated into existing FLL algorithms for IoT applications. To demonstrate its practicality and performance, we implement LiFeChain in representative FLL algorithms with Hyperledger Fabric under 6 attacks. Theoretical analysis and extensive evaluations demonstrate that LiFeChain effectively mitigates long-term attacks, and significantly reduces latency and storage overhead compared to state-of-the-art blockchain solutions.

Updated: 2026-03-30 09:04:15

标题: LiFeChain：轻量级区块链用于IoT中安全高效的联合终身学习

摘要: 物联网(IoT)设备不断生成异构数据流，推动了对连续、分散智能的需求。联邦终身学习(FLL)通过结合联邦学习和终身学习提供了一个理想的解决方案。然而，在物联网系统中FLL的延长生命周期增加了它们对持续攻击的脆弱性。这个问题被单点故障所恶化。此外，由中央服务器创建的单一信任点阻碍了对长期威胁的可靠审计。区块链技术为可信的FLL提供了一个防篡改的基础。然而，直接将区块链应用于FLL会显著增加计算和检索成本随着知识库的扩展，减慢资源受限的物联网设备上的训练速度。为了解决这些挑战，我们提出了LiFeChain，一个轻量级区块链，用于安全高效的联邦终身学习，最小化链上披露和双向验证。LiFeChain是专为FLL量身定制的第一个区块链。它包含两种互补机制：服务器上的模型相关性证明(PoMC)共识，将学习和取消学习机制耦合起来以减轻负面传递;以及客户端上的分段零知识仲裁(Seg-ZA)，在不损害隐私的情况下检测和仲裁异常委员会行为。LiFeChain是一个即插即用的组件，可以无缝地集成到现有的物联网应用的FLL算法中。为了证明其实用性和性能，我们在Hyperledger Fabric中实现了LiFeChain，并进行了6次攻击。理论分析和广泛的评估表明，LiFeChain有效地减轻了长期攻击，并与最先进的区块链解决方案相比显著减少了延迟和存储开销。

更新时间: 2026-03-30 09:04:15

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2509.01434v2

LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops

Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments on models like Qwen2.5-VL-3B demonstrate LingoLoop's powerful ability to trap them in generative loops; it consistently drives them to their generation limits and, when those limits are relaxed, can induce outputs with up to 367x more tokens than clean inputs, triggering a commensurate surge in energy consumption. These findings expose significant MLLMs' vulnerabilities, posing challenges for their reliable deployment.

Updated: 2026-03-30 08:41:21

标题: LingoLoop 攻击：通过语言环境和状态诱捕将MLLMs困在无尽循环中

摘要: 多模态大型语言模型（MLLMs）表现出巨大的潜力，但在推断过程中需要大量的计算资源。攻击者可以利用这一点，通过诱导产生过多的输出，导致资源耗尽和服务降级。先前的能量-延迟攻击旨在通过广泛偏移输出标记分布，使生成时间增加，但它们忽略了标记级别的词性（POS）特征对EOS和句子级结构模式对输出计数的影响，从而限制了它们的有效性。为了解决这个问题，我们提出了LingoLoop，一种旨在诱导MLLMs生成过度冗长和重复序列的攻击。首先，我们发现标记的词性强烈影响生成EOS标记的可能性。基于这一认识，我们提出了一种POS感知延迟机制，通过调整由POS信息引导的关注权重来推迟EOS标记的生成。其次，我们发现限制输出多样性以诱导重复循环对于持续生成是有效的。我们引入了一种生成路径修剪机制，限制隐藏状态的大小，鼓励模型产生持续循环。对像Qwen2.5-VL-3B这样的模型进行了大量实验，证明了LingoLoop陷入生成循环的强大能力；它始终将它们推向生成极限，当这些极限放宽时，可以诱发比干净输入多多达367倍的输出，引发相应能源消耗的激增。这些发现揭示了重要的MLLMs的脆弱性，为它们的可靠部署带来了挑战。

更新时间: 2026-03-30 08:41:21

领域: cs.CL,cs.CR

下载: http://arxiv.org/abs/2506.14493v2

Evaluating Privilege Usage of Agents on Real-World Tools

Equipping LLM agents with real-world tools can substantially improve productivity. However, granting agents autonomy over tool use also transfers the associated privileges to both the agent and the underlying LLM. Improper privilege usage may lead to serious consequences, including information leakage and infrastructure damage. While several benchmarks have been built to study agents' security, they often rely on pre-coded tools and restricted interaction patterns. Such crafted environments differ substantially from the real-world, making it hard to assess agents' security capabilities in critical privilege control and usage. Therefore, we propose GrantBox, a security evaluation sandbox for analyzing agent privilege usage. GrantBox automatically integrates real-world tools and allows LLM agents to invoke genuine privileges, enabling the evaluation of privilege usage under prompt injection attacks. Our results indicate that while LLMs exhibit basic security awareness and can block some direct attacks, they remain vulnerable to more sophisticated attacks, resulting in an average attack success rate of 84.80% in carefully crafted scenarios.

Updated: 2026-03-30 08:35:00

标题: 评估代理在现实世界工具中的特权使用

摘要: 给LLM代理装备真实世界工具可以显著提高生产力。然而，赋予代理对工具使用的自主权也将相关特权转移给代理和基础LLM。不当的特权使用可能导致严重后果，包括信息泄露和基础设施损坏。虽然已建立了几个基准来研究代理的安全性，但它们通常依赖于预编码工具和受限制的交互模式。这样设计的环境与真实世界有很大不同，使得很难评估代理在关键特权控制和使用方面的安全能力。因此，我们提出了GrantBox，一个用于分析代理特权使用的安全评估沙盒。GrantBox自动集成了真实世界工具，并允许LLM代理调用真实特权，从而使得可以在即时注入攻击下评估特权使用。我们的结果表明，虽然LLMs表现出基本的安全意识并且可以阻止一些直接攻击，但它们仍然容易受到更复杂攻击的影响，在精心设计的场景中攻击成功率平均为84.80%。

更新时间: 2026-03-30 08:35:00

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2603.28166v1

Silent Guardians: Independent and Secure Decision Tree Evaluation Without Chatter

As machine learning as a service (MLaaS) gains increasing popularity, it raises two critical challenges: privacy and verifiability. For privacy, clients are reluctant to disclose sensitive private information to access MLaaS, while model providers must safeguard their proprietary models. For verifiability, clients lack reliable mechanisms to ensure that cloud servers execute model inference correctly. Decision trees are widely adopted in MLaaS due to their popularity, interpretability, and broad applicability in domains like medicine and finance. In this context, outsourcing decision tree evaluation (ODTE) enables both clients and model providers to offload their sensitive data and decision tree models to the cloud securely. However, existing ODTE schemes often fail to address both privacy and verifiability simultaneously. To bridge this gap, we propose $\sf PVODTE$, a novel two-server private and verifiable ODTE protocol that leverages homomorphic secret sharing and a MAC-based verification mechanism. $\sf PVODTE$ eliminates the need for server-to-server communication, enabling independent computation by each cloud server. This ``non-interactive'' setting addresses the latency and synchronization bottlenecks of prior arts, making it uniquely suitable for wide-area network (WAN) deployments. To our knowledge, $\sf PVODTE$ is the first two-server ODTE protocol that eliminates server-to-server communication. Furthermore, $\sf PVODTE$ achieves security against \emph{malicious} servers, where servers cannot learn anything about the client's input or the providers' decision tree models, and servers cannot alter the inference result without being detected.

Updated: 2026-03-30 08:07:10

标题: 沉默的守护者：无需废话的独立和安全决策树评估

摘要: 随着机器学习服务（MLaaS）日益普及，它提出了两个关键挑战：隐私和可验证性。对于隐私问题，客户不愿透露敏感的私人信息以访问MLaaS，而模型提供者必须保护其专有模型。对于可验证性，客户缺乏可靠的机制来确保云服务器正确执行模型推断。决策树由于其在医学和金融等领域的流行性、可解释性和广泛适用性而被广泛采用于MLaaS。在这种情况下，外包决策树评估（ODTE）使客户和模型提供者能够安全地将其敏感数据和决策树模型外包到云中。然而，现有的ODTE方案通常无法同时解决隐私和可验证性问题。为了弥补这一差距，我们提出了PVODTE，这是一种新颖的两服务器私密和可验证的ODTE协议，利用同态秘密共享和基于MAC的验证机制。PVODTE消除了服务器之间的通信需求，使每个云服务器可以独立计算。这种“非交互式”设置解决了先前技术的延迟和同步瓶颈，使其在广域网（WAN）部署中具有独特的适用性。据我们所知，PVODTE是第一个消除服务器间通信的两服务器ODTE协议。此外，PVODTE实现了对恶意服务器的安全性，其中服务器无法获取有关客户输入或提供者决策树模型的任何信息，并且服务器无法在不被检测到的情况下更改推断结果。

更新时间: 2026-03-30 08:07:10

领域: cs.CR

下载: http://arxiv.org/abs/2603.28143v1

ORACAL: A Robust and Explainable Multimodal Framework for Smart Contract Vulnerability Detection with Causal Graph Enrichment

Although Graph Neural Networks (GNNs) have shown promise for smart contract vulnerability detection, they still face significant limitations. Homogeneous graph models fail to capture the interplay between control flow and data dependencies, while heterogeneous graph approaches often lack deep semantic understanding, leaving them susceptible to adversarial attacks. Moreover, most black-box models fail to provide explainable evidence, hindering trust in professional audits. To address these challenges, we propose ORACAL (Observable RAG-enhanced Analysis with CausAL reasoning), a heterogeneous multimodal graph learning framework that integrates Control Flow Graph (CFG), Data Flow Graph (DFG), and Call Graph (CG). ORACAL selectively enriches critical subgraphs with expert-level security context from Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs), and employs a causal attention mechanism to disentangle true vulnerability indicators from spurious correlations. For transparency, the framework adopts PGExplainer to generate subgraph-level explanations identifying vulnerability triggering paths. Experiments on large-scale datasets demonstrate that ORACAL achieves state-of-the-art performance, outperforming MANDO-HGT, MTVHunter, GNN-SC, and SCVHunter by up to 39.6 percentage points, with a peak Macro F1 of 91.28% on the primary benchmark. ORACAL maintains strong generalization on out-of-distribution datasets with 91.8% on CGT Weakness and 77.1% on DAppScan. In explainability evaluation, PGExplainer achieves 32.51% Mean Intersection over Union (MIoU) against manually annotated vulnerability triggering paths. Under adversarial attacks, ORACAL limits performance degradation to approximately 2.35% F1 decrease with an Attack Success Rate (ASR) of only 3%, surpassing SCVHunter and MANDO-HGT which exhibit ASRs ranging from 10.91% to 18.73%.

Updated: 2026-03-30 07:46:59

标题: ORACAL：一个稳健且可解释的多模态框架，用于智能合约漏洞检测与因果图丰富化

摘要: 尽管图神经网络（GNNs）在智能合约漏洞检测方面表现出潜力，但它们仍然面临着重大限制。同质图模型无法捕捉控制流和数据依赖之间的相互作用，而异构图方法通常缺乏深层语义理解，使它们容易受到对抗性攻击。此外，大多数黑盒模型无法提供可解释的证据，阻碍了对专业审计的信任。为了解决这些挑战，我们提出了ORACAL（具有因果推理的可观察RAG增强分析），这是一个整合了控制流图（CFG）、数据流图（DFG）和调用图（CG）的异构多模态图学习框架。ORACAL通过从检索增强生成（RAG）和大型语言模型（LLMs）中选择性地丰富关键子图，融入专家级安全上下文，并采用因果关注机制来区分真实漏洞指标和虚假相关性。为了透明化，该框架采用PGExplainer生成子图级解释，识别漏洞触发路径。在大型数据集上的实验表明，ORACAL实现了最先进的性能，超过MANDO-HGT、MTVHunter、GNN-SC和SCVHunter高达39.6个百分点，主要基准测试中的最高Macro F1为91.28%。ORACAL在超出分布数据集上保持强大的泛化能力，在CGT弱点和DAppScan上分别达到91.8%和77.1%。在可解释性评估方面，PGExplainer对手动注释的漏洞触发路径实现了32.51%的平均交集联合（MIoU）。在对抗性攻击下，ORACAL将性能下降限制在约2.35%的F1减少，攻击成功率（ASR）仅为3%，超越了SCVHunter和MANDO-HGT，它们的ASR范围在10.91%到18.73%之间。

更新时间: 2026-03-30 07:46:59

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2603.28128v1

Seeing the Unseen: Rethinking Illicit Promotion Detection with In-Context Learning

Illicit online promotion is a persistent threat that evolves to evade detection. Existing moderation systems remain tethered to platform-specific supervision and static taxonomies, a reactive paradigm that struggles to generalize across domains or uncover novel threats. This paper presents a systematic study of In-Context Learning (ICL) as a unified framework for illicit promotion detection. Through rigorous analysis, we show that properly configured ICL achieves performance comparable to fine-tuned models using 22x fewer labeled examples. We demonstrate three key capabilities: (1) Generalization to unseen threats: ICL generalizes to new illicit categories without category-specific demonstrations, with a performance drop of less than 6% for most evaluated categories. (2) Autonomous discovery: A novel two-stage pipeline distills 2,900 free-form labels into coherent taxonomies, surfacing eight previously undocumented illicit categories such as usury and illegal immigration. (3) Cross-platform generalization: Deployed on 200,000 real-world samples from search engines and Twitter without adaptation, ICL achieves 92.6% accuracy. Furthermore, 61.8% of its uniquely flagged samples correspond to borderline or obfuscated content missed by existing detectors. Our findings position ICL as a new paradigm for content moderation, combining the precision of specialized classifiers with cross-platform generalization and autonomous threat discovery. By shifting to inference-time reasoning, ICL offers a path toward proactively adaptive moderation systems.

Updated: 2026-03-30 05:08:59

标题: 看不见的：通过上下文学习重新思考非法宣传检测

摘要: 非法在线推广是一种持久的威胁，不断演变以逃避检测。现有的管理系统仍然依赖于特定平台的监督和静态分类法，这是一种反应性范式，很难在不同领域中推广或发现新的威胁。本文提出了一项对于非法推广检测的统一框架——上下文学习（ICL）的系统研究。通过严格的分析，我们展示了正确配置的ICL可以实现与经过精细调整的模型相当的性能，使用的标记示例数量减少了22倍。我们展示了三个关键能力：（1）对未见威胁的泛化：ICL可以泛化到新的非法类别，而无需特定于类别的演示，对大多数评估的类别来说性能降低不到6%。（2）自主发现：一种新颖的两阶段管道将2900个自由形式标签提炼成连贯的分类法，呈现出八种之前未记录的非法类别，如高利贷和非法移民。(3) 跨平台泛化：在无需适应的情况下，部署在搜索引擎和Twitter的20万真实样本上，ICL达到了92.6%的准确率。此外，其独特标记的样本中，61.8%对应于被现有检测器错过的边界或混淆内容。我们的发现将ICL定位为内容管理的新范式，将专门分类器的精确性与跨平台泛化和自主威胁发现相结合。通过转向推理时间推理，ICL为积极适应的管理系统提供了一条道路。

更新时间: 2026-03-30 05:08:59

领域: cs.CR

下载: http://arxiv.org/abs/2603.28043v1

Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment

We show that safety alignment in modular LLMs can exhibit a compositional vulnerability: adapters that appear benign and plausibly functional in isolation can, when linearly composed, compromise safety. We study this failure mode through Colluding LoRA (CoLoRA), in which harmful behavior emerges only in the composition state. Unlike attacks that depend on adversarial prompts or explicit input triggers, this composition-triggered broad refusal suppression causes the model to comply with harmful requests under standard prompts once a particular set of adapters is loaded. This behavior exposes a combinatorial blind spot in current unit-centric defenses, for which exhaustive verification over adapter compositions is computationally intractable. Across several open-weight LLMs, we find that individual adapters remain benign in isolation while their composition yields high attack success rates, indicating that securing modular LLM supply-chains requires moving beyond single-module verification toward composition-aware defenses.

Updated: 2026-03-30 05:02:56

标题: 共谋的 LoRA：LLM 安全对齐中的组合漏洞

摘要: 我们展示了模块化LLM中的安全对齐可能存在组合漏洞：在单独使用时看似良性且功能合理的适配器，在线性组合时可能会危及安全。我们通过Colluding LoRA（CoLoRA）研究了这种故障模式，其中有害行为仅在组合状态下出现。与依赖对抗性提示或明确输入触发器的攻击不同，这种组合触发的广泛拒绝抑制导致模型在加载特定适配器集后在标准提示下遵从有害请求。这种行为揭示了当前以单元为中心的防御策略存在的组合盲点，对于其中适配器组合的穷举验证在计算上是不可行的。在几个开放权重的LLM中，我们发现单个适配器在单独使用时仍然是良性的，但它们的组合却导致攻击成功率高，这表明确保模块化LLM供应链的安全需要超越单个模块验证，向组合感知的防御策略迈进。

更新时间: 2026-03-30 05:02:56

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2603.12681v2

Towards Privacy-Preserving LLM Inference via Covariant Obfuscation (Technical Report)

The rapid development of large language models (LLMs) has driven the widespread adoption of cloud-based LLM inference services, while also bringing prominent privacy risks associated with the transmission and processing of private data in remote inference. For privacy-preserving LLM inference technologies to be practically applied in industrial scenarios, three core requirements must be satisfied simultaneously: (1) Accuracy and efficiency losses should be minimized to mitigate degradation in service experience. (2) The inference process can be run on large-scale clusters consist of heterogeneous legacy xPUs. (3) Compatibility with existing LLM infrastructures should be ensured to reuse their engineering optimizations. To the best of our knowledge, none of the existing privacy-preserving LLM inference methods satisfy all the above constraints while delivering meaningful privacy guarantees. In this paper, we propose AloePri, the first privacy-preserving LLM inference method for industrial applications. AloePri protects both the input and output data by covariant obfuscation, which jointly transforms data and model parameters to achieve better accuracy and privacy. We carefully design the transformation for each model component to ensure inference accuracy and data privacy while keeping full compatibility with existing infrastructures of Language Model as a Service. AloePri has been integrated into an industrial system for the evaluation of mainstream LLMs. The evaluation on Deepseek-V3.1-Terminus model (671B parameters) demonstrates that AloePri causes accuracy loss of 0.0%~3.5% and exhibits efficiency equivalent to that of plaintext inference. Meanwhile, AloePri successfully resists state-of-the-art attacks, with less than 5\% of tokens recovered. To the best of our knowledge, AloePri is the first method to exhibit practical applicability to large-scale models in real-world systems.

Updated: 2026-03-30 04:19:18

标题: 朝向通过协变混淆实现隐私保护的LLM推断（技术报告）

摘要: 大型语言模型（LLMs）的快速发展推动了云端LLM推断服务的广泛应用，同时也带来了与远程推断中私人数据的传输和处理相关的突出隐私风险。为了在工业场景中实际应用隐私保护的LLM推断技术，必须同时满足三个核心要求：（1）最小化准确性和效率损失，以减轻服务体验的降级。（2）推断过程可以在由异构遗留xPU组成的大规模集群上运行。（3）必须确保与现有LLM基础设施的兼容性，以重用其工程优化。据我们所知，目前没有任何现有的隐私保护LLM推断方法同时满足上述所有约束条件并提供有意义的隐私保证。在本文中，我们提出了AloePri，这是首个用于工业应用的隐私保护LLM推断方法。AloePri通过共变混淆保护输入和输出数据，联合转换数据和模型参数以实现更好的准确性和隐私。我们为每个模型组件精心设计转换，以确保推断准确性和数据隐私，并保持与语言模型服务的现有基础设施的完全兼容性。AloePri已经集成到一个工业系统中，用于评估主流LLMs。对Deepseek-V3.1-Terminus模型（671B参数）的评估表明，AloePri导致准确性损失为0.0%~3.5%，并展示出与明文推断相当的效率。同时，AloePri成功抵抗了最先进的攻击，恢复的标记少于5\%。据我们所知，AloePri是第一个展示在实际系统中对大型模型具有实际适用性的方法。

更新时间: 2026-03-30 04:19:18

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2603.01499v2

Beyond Permissions: A Configuration-Aware Empirical Assessment of Privacy Exposure in Children-Oriented and General-Audience Mobile Gaming Apps

Mobile gaming applications (apps) have become increasingly pervasive, including a growing number of games designed for children. Despite their popularity, these apps often integrate complex analytics, advertising, and attribution infrastructures that may introduce privacy and security risks. Existing research has primarily focused on tracking behaviors or monetization models, leaving configuration-level privacy exposure and children-oriented apps underexplored. In this study, we conducted a comparative static analysis of Android mobile games to investigate privacy and security risks beyond permission usage. The analysis follows a three-phase methodology comprising (i) designing study protocol, (ii) Android Package Kit (APK) collection and static inspection, and (iii) data analysis. We examined permissions, manifest-level configuration properties (e.g., backup settings, cleartext network traffic, and exported components), and embedded third-party Software Development Kit (SDK) ecosystems across children-oriented and general-audience mobile games. The extracted indicators are synthesized into qualitative privacy-risk categories to support comparative reporting. The results showed that while children-oriented games often request fewer permissions, they frequently exhibit configuration-level risks and embed third-party tracking SDKs similar to general-audience games. Architectural and configuration decisions play a critical role in shaping privacy risks, particularly for apps targeting children. This study contributes a holistic static assessment of privacy exposure in mobile games and provides actionable insights for developers, platform providers, and researchers seeking to improve privacy-by-design practices in mobile applications.

Updated: 2026-03-30 04:18:45

标题: 超越权限：儿童导向和一般观众移动游戏应用隐私暴露的配置感知实证评估

摘要: 移动游戏应用程序（应用程序）已变得越来越普遍，包括越来越多专为儿童设计的游戏。尽管这些应用程序很受欢迎，但它们通常集成了复杂的分析、广告和归因基础设施，可能会引入隐私和安全风险。现有研究主要集中在跟踪行为或货币化模型上，对配置级别的隐私暴露和面向儿童的应用程序尚未得到充分探讨。在这项研究中，我们进行了对安卓移动游戏的比较静态分析，以调查超出权限使用范围的隐私和安全风险。该分析遵循一个包括（i）设计研究方案、（ii）安卓包（APK）收集和静态检查、以及（iii）数据分析的三个阶段方法。我们检查了儿童导向和通用受众移动游戏中的权限、清单级别配置属性（例如备份设置、明文网络流量和导出组件）以及内嵌的第三方软件开发工具包（SDK）生态系统。提取的指标被综合成定性隐私风险类别，以支持比较报告。结果表明，尽管儿童导向的游戏通常请求更少的权限，但它们经常展示配置级别的风险，并嵌入第三方跟踪SDK，类似于通用受众游戏。架构和配置决策在塑造隐私风险方面起着关键作用，特别是对于针对儿童的应用程序。这项研究提供了对移动游戏中隐私暴露的全面静态评估，并为寻求改进移动应用程序中隐私设计实践的开发人员、平台提供商和研究人员提供了可操作的见解。

更新时间: 2026-03-30 04:18:45

领域: cs.CR

下载: http://arxiv.org/abs/2602.10877v2

Lite-BD: A Lightweight Black-box Backdoor Defense via Reviving Multi-Stage Image Transformations

Deep Neural Networks (DNNs) are vulnerable to backdoor attacks. Due to the nature of Machine Learning as a Service (MLaaS) applications, black-box defenses are more practical than white-box methods, yet existing purification techniques suffer from key limitations: a lack of justification for specific transformations, dataset dependency, high computational overhead, and a neglect of frequency-domain transformations. This paper conducts a preliminary study on various image transformations, identifying down-upscaling as the most effective backdoor trigger disruption technique. We subsequently propose \texttt{Lite-BD}, a lightweight two-stage blackbox backdoor defense. \texttt{Lite-BD} first employs a super-resolution-based down-upscaling stage to neutralize spatial triggers. A secondary stage utilizes query-based band-by-band frequency filtering to remove triggers hidden in specific bands. Extensive experiments against state-of-the-art attacks demonstrate that \texttt{Lite-BD} provides robust and efficient protection. Codes can be found at https://github.com/SiSL-URI/Lite-BD.

Updated: 2026-03-30 04:15:35

标题: Lite-BD：通过恢复多阶段图像变换实现的轻量级黑盒后门防御

摘要: 深度神经网络（DNNs）容易受到后门攻击的影响。由于机器学习作为服务（MLaaS）应用的特性，黑盒防御比白盒方法更为实用，然而现有的净化技术存在关键限制：缺乏特定转换的理由、数据集依赖性、高计算开销以及对频域转换的忽视。本文对各种图像转换进行了初步研究，确定了下上采样作为最有效的后门触发干扰技术。我们随后提出了\texttt{Lite-BD}，这是一种轻量级的两阶段黑盒后门防御系统。\texttt{Lite-BD}首先采用基于超分辨率的下上采样阶段来中和空间触发器。第二阶段利用基于查询的频带逐个过滤来消除隐藏在特定频带中的触发器。对抗最先进的攻击的广泛实验表明，\texttt{Lite-BD}提供了强大而高效的保护。代码可以在https://github.com/SiSL-URI/Lite-BD找到。

更新时间: 2026-03-30 04:15:35

领域: cs.CR

下载: http://arxiv.org/abs/2602.07197v3

Benchmarking NIST-Standardised ML-KEM and ML-DSA on ARM Cortex-M0+: Performance, Memory, and Energy on the RP2040

The migration to post-quantum cryptography is urgent for Internet of Things devices with 10--20 year lifespans, yet no systematic benchmarks exist for the finalised NIST standards on the most constrained 32-bit processor class. This paper presents the first isolated algorithm-level benchmarks of ML-KEM (FIPS 203) and ML-DSA (FIPS 204) on ARM Cortex-M0+, measured on the RP2040 (Raspberry Pi Pico) at 133 MHz with 264 KB SRAM. Using PQClean reference C implementations, we measure all three security levels of ML-KEM (512/768/1024) and ML-DSA (44/65/87) across key generation, encapsulation/signing, and decapsulation/verification. ML-KEM-512 completes a full key exchange in 35.7 ms with an estimated energy cost of 2.83 mJ (datasheet power model)--17x faster than a complete ECDH P-256 key agreement on the same hardware. ML-DSA signing exhibits high latency variance due to rejection sampling (coefficient of variation 66--73%, 99th-percentile up to 1,125 ms for ML-DSA-87). The M0+ incurs only a 1.8--1.9x slowdown relative to published Cortex-M4 reference C results (compiled with -O3 versus our -Os), despite lacking 64-bit multiply, DSP, and SIMD instructions--making this a conservative upper bound on the microarchitectural penalty. All code, data, and scripts are released as an open-source benchmark suite for reproducibility.

Updated: 2026-03-30 04:07:21

标题: 在ARM Cortex-M0+上对RP2040上的基准测试NIST标准化ML-KEM和ML-DSA：性能、内存和能耗

摘要: 迁移到后量子密码对于具有10-20年寿命的物联网设备至关紧急，然而目前并没有针对最受限制的32位处理器类的NIST标准的系统化基准。本文在ARM Cortex-M0+上首次提出了ML-KEM（FIPS 203）和ML-DSA（FIPS 204）的隔离算法级基准，测量了在133 MHz的RP2040（树莓派Pico）上的表现，SRAM为264 KB。使用PQClean参考C实现，我们测量了ML-KEM（512/768/1024）和ML-DSA（44/65/87）的所有三个安全级别，在密钥生成、封装/签名和解封/验证过程中。ML-KEM-512在35.7毫秒内完成了完整的密钥交换，估计能量成本为2.83 mJ（数据表功耗模型）--比相同硬件上完整的ECDH P-256密钥协商快了17倍。ML-DSA签名由于拒绝抽样导致高延迟方差（变异系数为66-73%，ML-DSA-87的99百分位数高达1,125毫秒）。与已发表的Cortex-M4参考C结果相比（使用-O3编译与我们的-Os编译），M0+仅产生1.8-1.9倍的减速，尽管缺少64位乘法、DSP和SIMD指令--这使得这是对微体系结构惩罚的一个保守上限。所有代码、数据和脚本均以开源基准套件的形式发布，以便重现结果。

更新时间: 2026-03-30 04:07:21

领域: cs.CR,cs.AR,cs.PF

下载: http://arxiv.org/abs/2603.19340v4

Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model's defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages -- Exposed, Persisted, Relayed, Executed -- across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models -- the safety gap is entirely downstream; (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41--65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model -- a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) produce 100% ASR due to threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents -- 0/40 canaries survived into shared memory.

Updated: 2026-03-30 04:07:18

标题: Kill-Chain Canaries: 捕获攻击链中的预警注入跨攻击面和模型安全层级的阶段跟踪

摘要: 我们提出了一种对五个前沿LLM代理进行即时注入攻击的阶段分解分析。先前的工作测量了任务级别的攻击成功率（ASR）；我们定位了每个模型的防御何时激活的管道阶段。我们使用一个加密的金丝雀标记（SECRET-[A-F0-9]{8}）对每次运行进行仪器化，跟踪了四个杀伤链阶段 -- 暴露、持续、中继、执行 -- 跨越四个攻击表面和五种防御条件（总共764次运行，其中428次没有受到攻击）。我们的中心发现是，模型的安全性不是由于是否看到了对抗内容，而是由于它是否在管道阶段传播。具体来说：（1）在我们的评估中，所有五个模型的暴露率均为100% -- 安全差距完全在下游；（2）Claude在write_memory汇总时去除注入（0/164 ASR），而GPT-4o-mini在没有损失的情况下传播金丝雀（53% ASR，95% CI：41-65%）；（3）DeepSeek在内存表面上的ASR为0%，在来自同一模型的工具流表面上的ASR为100% -- 在注入通道上完全颠倒；（4）所有四种主动的防御条件（write_filter、pi_detector、spotlighting以及它们的组合）由于威胁模型表面不匹配而产生100%的ASR；（5）Claude中继节点将下游代理去污 -- 40个金丝雀中没有一个进入共享内存。

更新时间: 2026-03-30 04:07:18

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2603.28013v1

FedFG: Privacy-Preserving and Robust Federated Learning via Flow-Matching Generation

Federated learning (FL) enables distributed clients to collaboratively train a global model using local private data. Nevertheless, recent studies show that conventional FL algorithms still exhibit deficiencies in privacy protection, and the server lacks a reliable and stable aggregation rule for updating the global model. This situation creates opportunities for adversaries: on the one hand, they may eavesdrop on uploaded gradients or model parameters, potentially leaking benign clients' private data; on the other hand, they may compromise clients to launch poisoning attacks that corrupt the global model. To balance accuracy and security, we propose FedFG, a robust FL framework based on flow-matching generation that simultaneously preserves client privacy and resists sophisticated poisoning attacks. On the client side, each local network is decoupled into a private feature extractor and a public classifier. Each client is further equipped with a flow-matching generator that replaces the extractor when interacting with the server, thereby protecting private features while learning an approximation of the underlying data distribution. Complementing the client-side design, the server employs a client-update verification scheme and a novel robust aggregation mechanism driven by synthetic samples produced by the flow-matching generator. Experiments on MNIST, FMNIST, and CIFAR-10 demonstrate that, compared with prior work, our approach adapts to multiple attack strategies and achieves higher accuracy while maintaining strong privacy protection.

Updated: 2026-03-30 03:11:35

标题: FedFG：通过流匹配生成实现隐私保护和稳健的联邦学习

摘要: 联邦学习（FL）使分布式客户端能够共同训练一个使用本地私有数据的全局模型。然而，最近的研究显示，传统的FL算法仍然存在隐私保护方面的缺陷，服务器缺乏可靠和稳定的聚合规则来更新全局模型。这种情况为攻击者创造了机会：一方面，他们可能窃听上传的梯度或模型参数，潜在地泄露良性客户端的私人数据；另一方面，他们可能 compromise 客户端以发动污染攻击，破坏全局模型。为了平衡准确性和安全性，我们提出了FedFG，这是一个基于流匹配生成的强大FL框架，它同时保护客户端的隐私并抵抗复杂的污染攻击。在客户端方面，每个本地网络被分解为一个私有特征提取器和一个公共分类器。每个客户端还配备了一个流匹配生成器，在与服务器交互时替换提取器，从而保护私有特征同时学习底层数据分布的近似值。作为对客户端设计的补充，服务器采用客户端更新验证方案和由流匹配生成器产生的合成样本驱动的新型强大聚合机制。在MNIST、FMNIST和CIFAR-10上的实验表明，与先前的工作相比，我们的方法适应多种攻击策略，并在保持强大隐私保护的同时实现更高的准确性。

更新时间: 2026-03-30 03:11:35

领域: cs.CR,cs.AI,cs.CV,cs.LG

下载: http://arxiv.org/abs/2603.27986v1

TAC: Hybrid IAM Privilege Escalation Detection

IAM misconfigurations are a major cause of privilege escalation (PE) attacks in the cloud, leading to data breaches and major financial losses. Existing PE detectors have two main limits: they cover only some PE types, so many attacks are missed, and they require full access to cloud configurations, which customers may not want to share because of sensitive information. We present TAC, the first IAM PE detection framework that supports both whitebox and greybox analysis for Amazon Web Services (AWS). To improve coverage, we systematically study how permissions are acquired in AWS IAM and identify five PE categories. All five share one pattern: permissions spread across entities. We define this as permission flows and manually extract 219 templates from more than 14,000 AWS operations. Based on this, we build TAC-WB, a whitebox detector with broad PE coverage. We also build TAC-GB, the first greybox PE detector, which works with partial configurations. Customers can choose which entities to reveal and whether to answer questions about permissions. TAC-GB uses a dynamic query process that adapts to each response and uses reinforcement learning with graph neural networks to ask the most useful questions while reducing interaction. We also create TAC-Bench, a benchmark with 2,500 tasks reflecting real-world IAM misconfigurations. Experiments show that TAC-WB finds all PEs missed by prior tools, while TAC-GB outperforms other greybox methods and often matches whitebox methods even with limited query budgets.

Updated: 2026-03-30 02:56:51

标题: TAC：混合IAM权限提升检测

摘要: IAM配置错误是云中特权升级（PE）攻击的主要原因之一，导致数据泄露和重大财务损失。现有的PE检测器有两个主要限制：它们只覆盖一些PE类型，因此许多攻击被忽略，并且它们需要完全访问云配置，而客户可能不想分享这些信息因为它们包含敏感信息。我们提出了TAC，这是第一个支持亚马逊网络服务（AWS）的IAM PE检测框架，支持白盒和灰盒分析。为了提高覆盖范围，我们系统地研究了在AWS IAM中如何获取权限并确定了五个PE类别。这五个类别共享一个模式：权限分布在实体之间。我们将这定义为权限流，并从超过14,000个AWS操作中手动提取了219个模板。基于此，我们构建了TAC-WB，一个具有广泛PE覆盖范围的白盒检测器。我们还构建了TAC-GB，第一个与部分配置一起工作的灰盒PE检测器。客户可以选择要展示哪些实体以及是否回答关于权限的问题。TAC-GB使用动态查询过程，根据每个响应进行调整，并使用图神经网络加强学习，以提出最有用的问题同时减少交互。我们还创建了TAC-Bench，一个包含2,500个任务的基准，反映了真实世界中的IAM配置错误。实验表明，TAC-WB能够找到之前工具所忽略的所有PE，而TAC-GB胜过其他灰盒方法，甚至在有限的查询预算下也经常与白盒方法相匹配。

更新时间: 2026-03-30 02:56:51

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2304.14540v8

Combinatorial Privacy: Private Multi-Party Bitstream Grand Sum by Hiding in Birkhoff Polytopes

We introduce PolyVeil, a protocol for private Boolean summation across $k$ clients that encodes private bits as permutation matrices in the Birkhoff polytope. A two-layer architecture gives the server perfect simulation-based security (statistical distance zero) while a separate aggregator faces \#P-hard likelihood inference via the permanent and mixed discriminant. Two variants (full and compressed) differ in what the aggregator observes. We develop a finite-sample $(\varepsilon,δ)$-DP analysis with explicit constants. In the full variant, where the aggregator sees a doubly stochastic matrix per client, the log-Lipschitz constant grows as $n^4 K_t$ and a signal-to-noise analysis shows the DP guarantee is non-vacuous only when the private signal is undetectable. In the compressed variant, where the aggregator sees a single scalar, the univariate density ratio yields non-vacuous $\varepsilon$ at moderate SNR, with the optimal decoy count balancing CLT accuracy against noise concentration. This exposes a fundamental tension. \#P-hardness requires the full matrix view (Birkhoff structure visible), while non-vacuous DP requires the scalar view (low dimensionality). Whether both hold simultaneously in one variant remains open. The protocol needs no PKI, has $O(k)$ communication, and outputs exact aggregates.

Updated: 2026-03-30 02:38:39

标题: 组合隐私：通过隐藏在Birkhoff多面体中进行的私人多方位比特流总和

摘要: 我们介绍了PolyVeil，这是一个用于在$k$个客户端之间进行私密布尔求和的协议，它将私密比特编码为Birkhoff多面体中的排列矩阵。一个两层架构为服务器提供了完美的基于仿真的安全性（统计距离为零），而一个单独的聚合器通过永久和混合判别式面临了\#P难的似然推断。两个变体（完整和压缩）在聚合器观察到的内容上有所不同。我们开发了一个有明确常数的有限样本$(\varepsilon,δ)$-DP分析。在完整变体中，聚合器看到每个客户端的双随机矩阵，对数-利普希茨常数随着$n^4 K_t$增长，信噪比分析表明只有在私密信号不可检测时，DP保证才是非空的。在压缩变体中，聚合器只看到一个标量，单变量密度比在适度信噪比下产生非空的$\varepsilon$，最优的诱饵计数平衡了中心极限定理的准确性和噪声集中性。这揭示了一个基本的紧张关系。\#P难度要求完整的矩阵视图（Birkhoff结构可见），而非空的DP要求标量视图（低维度）。一个变体中同时具有这两个特性仍然是未知的。该协议不需要公钥基础设施，通信量为$O(k)$，并输出精确的聚合结果。

更新时间: 2026-03-30 02:38:39

领域: cs.CR,cs.LG

下载: http://arxiv.org/abs/2603.22808v4

AVDA: Autonomous Vibe Detection Authoring for Cybersecurity

With the rapid advancement of AI in code generation, cybersecurity detection engineering faces new opportunities to automate traditionally manual processes. Detection authoring - the practice of creating executable logic that identifies malicious activities from security telemetry - is hindered by fragmented code across repositories, duplication, and limited organizational visibility. Current workflows remain heavily manual, constraining both coverage and velocity. In this paper, we introduce AVDA, a framework that leverages the Model Context Protocol (MCP) to automate detection authoring by integrating organizational context - existing detections, telemetry schemas, and style guides - into AI-assisted code generation. We evaluate three authoring strategies - Baseline, Sequential, and Agentic - across a diverse corpus of production detections and state-of-the-art LLMs. Our results show that Agentic workflows achieve a 19% improvement in overall similarity score over Baseline approaches, while Sequential workflows attain 87% of Agentic quality at 40x lower token cost. Generated detections excel at TTP matching (99.4%) and syntax validity (95.9%) but struggle with exclusion parity (8.9%). Expert validation on a 22-detection subset confirms strong Spearman correlation between automated metrics and practitioner judgment ($ρ= 0.64$, $p < 0.002$). By integrating seamlessly into standard developer environments, AVDA provides a practical path toward AI-assisted detection engineering with quantified trade-offs between quality, cost, and latency.

Updated: 2026-03-30 02:09:57

标题: AVDA：自主振动检测编写用于网络安全

摘要: 随着人工智能在代码生成方面的快速发展，网络安全检测工程面临着自动化传统手动流程的新机遇。检测编写 - 即创建可执行逻辑，用于从安全遥测数据中识别恶意活动的实践 - 受到代码片段分散、重复和有限组织可见性的阻碍。当前的工作流程仍然主要是手动的，限制了覆盖范围和速度。在本文中，我们介绍了AVDA，这是一个利用模型上下文协议（MCP）的框架，通过将组织上下文 - 现有检测、遥测模式和样式指南 - 整合到人工智能辅助代码生成中，从而自动化检测编写。我们评估了三种编写策略 - 基线、顺序和代理 - 在一个多样化的生产检测语料库和最先进的LLM之间。我们的结果显示，代理工作流在整体相似性得分上比基线方法提高了19%，而顺序工作流以40倍较低的令牌成本达到了代理品质的87%。生成的检测在TTP匹配（99.4%）和语法有效性（95.9%）方面表现出色，但在排除平价（8.9%）方面表现不佳。针对一个22个检测子集的专家验证确认了自动度量和从业者判断之间强有力的Spearman相关性（$ρ= 0.64$，$p < 0.002$）。通过无缝集成到标准开发环境中，AVDA为AI辅助检测工程提供了一条实用的道路，可以量化质量、成本和延迟之间的权衡。

更新时间: 2026-03-30 02:09:57

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2603.25930v2

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

Multimodal large language models (MLLMs) integrate information from multiple modalities such as text, images, audio, and video, enabling complex capabilities such as visual question answering and audio translation. While powerful, this increased expressiveness introduces new and amplified vulnerabilities to adversarial manipulation. This survey provides a comprehensive and systematic analysis of adversarial threats to MLLMs, moving beyond enumerating attack techniques to explain the underlying causes of model susceptibility. We introduce a taxonomy that organizes adversarial attacks according to attacker objectives, unifying diverse attack surfaces across modalities and deployment settings. Additionally, we also present a vulnerability-centric analysis that links integrity attacks, safety and jailbreak failures, control and instruction hijacking, and training-time poisoning to shared architectural and representational weaknesses in multimodal systems. Together, this framework provides an explanatory foundation for understanding adversarial behavior in MLLMs and informs the development of more robust and secure multimodal language systems.

Updated: 2026-03-30 00:16:31

标题: 对多模态大型语言模型的对抗攻击：一项综合调查

摘要: 多模态大型语言模型（MLLMs）集成了来自多种模态（如文本、图像、音频和视频）的信息，实现了复杂的能力，如视觉问答和音频翻译。尽管功能强大，但这种增加的表达性引入了新的和加强的对抗操纵漏洞。本调查提供了对MLLMs的对抗威胁的全面和系统分析，超越了列举攻击技术，解释了模型易受攻击的根本原因。我们引入了一个根据攻击者目标组织对抗性攻击的分类法，统一了跨模态和部署设置的不同攻击面。此外，我们还提出了一个以漏洞为中心的分析，将完整性攻击、安全和越狱失败、控制和指令劫持以及训练时污染与多模态系统中的共享架构和表现弱点联系起来。总之，这一框架为理解MLLMs中的对抗行为提供了解释性基础，并指导更加健壮和安全的多模态语言系统的开发。

更新时间: 2026-03-30 00:16:31

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2603.27918v1

Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification

Large Language Models (LLMs) are increasingly adopted across domains such as education, healthcare, and finance. In healthcare, LLMs support tasks including disease diagnosis, abnormality classification, and clinical decision-making. Among these, multi-abnormality classification of radiology reports is critical for clinical workflow automation and biomedical research. Leveraging strong natural language processing capabilities, LLMs enable efficient processing of unstructured medical text and reduce the administrative burden of manual report analysis. To improve performance, LLMs are often fine-tuned on private, institution-specific datasets such as radiology reports. However, this raises significant privacy concerns: LLMs may memorize training data and become vulnerable to data extraction attacks, while sharing fine-tuned models risks exposing sensitive patient information. Despite growing interest in LLMs for medical text classification, privacy-preserving fine-tuning for multi-abnormality classification remains underexplored. To address this gap, we propose a differentially private (DP) fine-tuning framework for multi-abnormality classification from free-text radiology reports. Our approach integrates differential privacy with Low-Rank Adaptation (LoRA) to efficiently fine-tune LLMs on sensitive clinical data while mitigating leakage risks. We further employ labels generated by a larger LLM to train smaller models, enabling efficient inference under strong privacy guarantees. Experiments on MIMIC-CXR and CT-RATE demonstrate the effectiveness of our DP-LoRA framework across varying privacy regimes. On MIMIC-CXR, our method achieves weighted F1-scores up to 0.89 under moderate privacy budgets, approaching non-private LoRA (0.90) and full fine-tuning (0.96), confirming that strong privacy can be achieved with only modest performance trade-offs.

Updated: 2026-03-30 00:14:05

标题: 学习隐私诊断：基于DP的LLMs用于放射学报告分类

摘要: 大型语言模型（LLMs）越来越多地被采用在教育、医疗和金融等领域。在医疗领域，LLMs支持包括疾病诊断、异常分类和临床决策在内的任务。在这些任务中，对放射学报告进行多异常分类对于临床工作流自动化和生物医学研究至关重要。利用强大的自然语言处理能力，LLMs使得对非结构化医学文本的高效处理成为可能，并减少了手动报告分析的行政负担。为了提高性能，LLMs经常在私有的、机构特定的数据集上进行微调，比如放射学报告。然而，这引发了重大的隐私问题：LLMs可能会记忆训练数据并变得容易受到数据提取攻击，而分享微调模型可能会暴露敏感的患者信息。尽管对于医学文本分类的LLMs日益受到关注，但用于多异常分类的隐私保护微调仍未被充分探索。为了填补这一空白，我们提出了一个针对自由文本放射学报告的差分隐私（DP）微调框架，用于多异常分类。我们的方法将差分隐私与低秩适应（LoRA）相结合，可以在敏感的临床数据上高效微调LLMs，并减少泄漏风险。我们进一步利用一个更大的LLM生成的标签来训练较小的模型，从而在强隐私保证下实现高效推理。在MIMIC-CXR和CT-RATE上的实验表明，我们的DP-LoRA框架在不同的隐私制度下的有效性。在MIMIC-CXR上，我们的方法在中等隐私预算下实现了高达0.89的加权F1分数，接近非私有的LoRA（0.90）和完全微调（0.96），证实了只有适度的性能折衷就可以实现强隐私。

更新时间: 2026-03-30 00:14:05

领域: cs.CR,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2506.04450v5