Trustless Provenance Trees: A Game-Theoretic Framework for Operator-Gated Blockchain Registries
We present a formal treatment of provenance trees, directed acyclic graphs of artifact registrations anchored immutably on a public blockchain, and introduce the operator trust problem: when a single privileged operator submits all on-chain registrations on behalf of users, the on-chain record alone cannot distinguish user-initiated registrations from unilateral operator actions. We resolve this through a dual-layer cryptographic commitment scheme in which two commitments derived from a single client-side secret key, binding the key to the tree root and to each unique registration identifier, make false attribution claims strictly dominated strategies. We prove correctness under standard cryptographic assumptions and establish honest behavior as the unique Nash equilibrium without relying on operator trust. We further introduce and analyze the tree poisoning problem: adversarial attacks on users' provenance trees via fraudulent root registration, malicious child attachment, and tree identity spoofing. We characterize the closure properties of each attack variant and prove that a complete provenance tree integrity model requires three distinct mechanisms: cryptographic priority, governance cascade, and contract enforcement, each necessary and none individually sufficient. The construction is deployed on Base (Ethereum L2) as AnchorRegistry, an immutable on-chain provenance registry. We provide gas complexity analysis demonstrating O(1) cost invariant to registry scale, and a trustless reconstruction algorithm recovering the complete registry from public event logs alone.
Updated: 2026-04-03 20:17:49
Categories: cs.GT,cs.CR
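The dual-layer commitment idea in the abstract above can be illustrated with a small sketch. The abstract does not give the concrete construction, so everything here is an assumption made for illustration: HMAC-SHA256 as the commitment function, the `root` and `reg` domain labels, and the function names `commit`, `make_commitments`, and `verify`. It shows only the binding of one client-side secret to both the tree root and a registration identifier, not the AnchorRegistry contract itself.

```python
import hashlib
import hmac


def commit(secret_key: bytes, label: bytes, value: bytes) -> str:
    """Hypothetical commitment: HMAC-SHA256 over a domain label and a value."""
    return hmac.new(secret_key, label + b"|" + value, hashlib.sha256).hexdigest()


def make_commitments(secret_key: bytes, tree_root: bytes, registration_id: bytes):
    """Derive two commitments from one secret: one per tree root, one per registration."""
    root_commitment = commit(secret_key, b"root", tree_root)
    reg_commitment = commit(secret_key, b"reg", registration_id)
    return root_commitment, reg_commitment


def verify(secret_key: bytes, tree_root: bytes, registration_id: bytes,
           root_commitment: str, reg_commitment: str) -> bool:
    """Recompute both commitments from the revealed key and compare in constant time."""
    expected_root, expected_reg = make_commitments(secret_key, tree_root, registration_id)
    return (hmac.compare_digest(expected_root, root_commitment)
            and hmac.compare_digest(expected_reg, reg_commitment))
```

In this sketch, only a party holding the key can produce a matching pair, so an operator acting without the user's key cannot later attribute a registration to that user.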
Towards Context-Aware Image Anonymization with Multi-Agent Reasoning
Street-level imagery contains personally identifiable information (PII), some of which is context-dependent. Existing anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. We present an agentic framework, CAIAMAR (Context-Aware Image Anonymization with Multi-Agent Reasoning), for context-aware PII segmentation with diffusion-based anonymization, combining pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate via round-robin speaker selection in a Plan-Do-Check-Act (PDCA) cycle, enabling large vision-language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially-filtered coarse-to-fine detection in which a scout-and-zoom strategy identifies candidates, open-vocabulary segmentation processes localized crops, and IoU-based deduplication (30% threshold) prevents redundant processing. Modal-specific diffusion guidance with appearance decorrelation substantially reduces re-identification (Re-ID) risks. On CUHK03-NP, our method reduces person Re-ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on CityScapes, we achieve KID: 0.001 and FID: 9.1, significantly outperforming existing anonymization methods. The agentic workflow detects non-direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on-premise with open-source models, the framework generates human-interpretable audit trails supporting the EU's GDPR transparency requirements while flagging failed cases for human review.
Updated: 2026-04-03 19:55:07
Categories: cs.CV,cs.AI,cs.CR
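The IoU-based deduplication step mentioned in the CAIAMAR abstract above can be sketched as follows. Only the 30% threshold comes from the abstract; the box format `(x1, y1, x2, y2)`, the first-come-first-kept ordering, and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def deduplicate(boxes, threshold=0.30):
    """Keep a box only if it overlaps every already-kept box by at most `threshold` IoU."""
    kept = []
    for box in boxes:
        if all(iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept
```

A detector that returns confidence scores would typically sort candidates by score before this loop; that ordering is left out here because the abstract does not describe it.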
AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
Fully Homomorphic Encryption (FHE) enables privacy-preserving Transformer inference, but long-sequence encrypted Transformers quickly exceed single-GPU memory capacity because encoded weights are already large and encrypted activations grow rapidly with sequence length. Multi-GPU execution therefore becomes unavoidable, yet scaling remains challenging because communication is jointly induced by application-level aggregation and encryption-level RNS coupling. Existing approaches either synchronize between devices frequently or replicate encrypted tensors across devices, leading to excessive communication and latency. We present AEGIS, an Application-Encryption Guided Inference System for scalable long-sequence encrypted Transformer inference on multi-GPU platforms. AEGIS derives device placement from ciphertext dependencies jointly induced by Transformer dataflow and CKKS polynomial coupling, co-locating modulus-coherent and token-coherent data so that communication is introduced only when application dependencies require it, while reordering polynomial operators to overlap the remaining collectives with computation. On 2048-token inputs, AEGIS reduces inter-GPU communication by up to 57.9% in feed-forward networks and 81.3% in self-attention versus prior state-of-the-art designs. On four GPUs, it achieves up to 96.62% scaling efficiency, 3.86x end-to-end speedup, and 69.1% per-device memory reduction. These results establish coordinated application-encryption parallelism as a practical foundation for scalable homomorphic Transformer inference.
Updated: 2026-04-03 19:47:26
Categories: cs.CR,cs.AI,cs.DC
FABLE: A Localized, Targeted Adversarial Attack on Weather Forecasting Models
Deep learning-based weather forecasting (DLWF) models have recently demonstrated significant performance gains over gold-standard physics-based simulation tools. However, these models are potentially vulnerable to adversarial attacks, which raises concerns about their trustworthiness. In this paper, we investigate the feasibility and challenges of applying existing adversarial attack methods to DLWF models and propose a novel framework called FABLE (Forecast Alteration By Localized targeted advErsarial attack) to address them. FABLE performs a 3D discrete wavelet decomposition to disentangle the spatial and temporal components of the data. By regulating the magnitude of adversarial perturbations across different components, FABLE produces adversarial inputs that remain closely aligned with the original inputs while steering the DLWF models toward generating the targeted forecast outcomes. Experimental results on real-world weather datasets demonstrate the effectiveness of FABLE over baseline methods across various metrics.
Updated: 2026-04-03 19:26:34
Categories: cs.LG,cs.CR
Security Analysis of Universal Circuits as a Mechanism for Hardware Obfuscation
Universal Circuits (UCs) offer a promising approach to hardware Intellectual Property (IP) obfuscation, leveraging cryptographic principles to hide both structure and function in a programmable logic fabric. Their adaptability makes them especially suitable for the globalized Integrated Circuit (IC) supply chain, where security against threats like reverse engineering is crucial. Despite this potential, UC security remains largely unexplored. This work evaluates UC security against state-of-the-art oracle-guided (OG) and oracle-less (OL) attacks. Results show near-random success rates (approximately 50%) for OG attacks, whereas OL attacks reveal minimal structural leakage. Collectively, these findings confirm the feasibility of UCs for IP protection.
Updated: 2026-04-03 18:59:00
Categories: cs.CR
Are Statistical Methods Obsolete in the Era of Deep Learning? A Study of ODE Inverse Problems
In the era of AI, neural networks have become increasingly popular for modeling, inference, and prediction, largely due to their potential for universal approximation. With the proliferation of such deep learning models, a question arises: are leaner statistical methods still relevant? To shed light on this question, we employ the mechanistic nonlinear ordinary differential equation (ODE) inverse problem as a testbed, using the physics-informed neural network (PINN) as a representative of the deep learning paradigm and manifold-constrained Gaussian process inference (MAGI) as a representative of statistically principled methods. Through case studies involving the SEIR model from epidemiology and the Lorenz model from chaotic dynamics, we demonstrate that statistical methods are far from obsolete, especially when working with sparse and noisy observations. On tasks such as parameter inference and trajectory reconstruction, statistically principled methods consistently achieve lower bias and variance, while using far fewer parameters and requiring less hyperparameter tuning. Statistical methods can also decisively outperform deep learning models on out-of-sample future prediction, where the absence of relevant data often leads overparameterized models astray. Additionally, we find that statistically principled approaches are more robust to the accumulation of numerical imprecision and can represent the underlying system more faithfully with respect to the true governing ODEs.
Updated: 2026-04-03 17:58:58
Categories: stat.CO,cs.LG,stat.ML
Enhancing Robustness of Federated Learning via Server Learning
This paper explores the use of server learning for enhancing the robustness of federated learning against malicious attacks, even when clients' training data are not independent and identically distributed. We propose a heuristic algorithm that uses server learning and client update filtering in combination with geometric median aggregation. We demonstrate via experiments that this approach can significantly improve model accuracy even when the fraction of malicious clients is high (more than 50% in some cases) and the dataset used by the server is small and possibly synthetic, with a distribution not necessarily close to that of the clients' aggregated data.
Updated: 2026-04-03 17:51:29
Categories: cs.LG,cs.AI
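Geometric median aggregation, one ingredient named in the federated learning abstract above, can be sketched with Weiszfeld's iteration. This is a generic illustration, not the paper's algorithm: it omits server learning and client update filtering, and the function name, iteration count, and tolerance are assumptions.

```python
import math


def geometric_median(points, iters=100, eps=1e-9):
    """Approximate the geometric median of equal-length vectors via Weiszfeld's iteration."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    est = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        denom = 0.0
        for p in points:
            # Weight each point by inverse distance; clamp to avoid division by zero.
            w = 1.0 / max(math.dist(p, est), eps)
            denom += w
            for d in range(dim):
                num[d] += w * p[d]
        new = [n / denom for n in num]
        if math.dist(new, est) < eps:
            return new
        est = new
    return est
```

Unlike the mean, the result stays close to the cluster of benign updates when a minority of updates are extreme outliers. The geometric median on its own is generally not expected to tolerate a malicious majority, which is presumably where the abstract's server learning and filtering come in.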
Fast Best-in-Class Regret for Contextual Bandits
We study the problem of stochastic contextual bandits in the agnostic setting, where the goal is to compete with the best policy in a given class without assuming realizability or imposing model restrictions on losses or rewards. In this work, we establish the first fast rate for regret relative to the best-in-class policy. Our proposed algorithm updates the policy at every round by minimizing a pessimistic objective, defined as a clipped inverse-propensity estimate of the policy value plus a variance penalty. By leveraging entropy assumptions on the policy class and a Hölderian error-bound condition (a generalization of the margin condition), we achieve fast best-in-class regret rates, including polylogarithmic rates in the parametric case. The analysis is driven by a sequential self-normalized maximal inequality for bounded martingale empirical processes, which yields uniform variance-adaptive confidence bounds and guarantees pessimism under adaptive data collection.
Updated: 2026-04-03 17:49:49
Categories: stat.ML,cs.LG
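The pessimistic objective described in the contextual bandits abstract above, a clipped inverse-propensity estimate plus a variance penalty, can be written down in a few lines. The abstract does not state the exact form, so the weight clipping rule, the `sqrt(variance / n)` penalty, and the parameters `clip` and `penalty` are assumptions for illustration. Losses are minimized, so the penalty is added.

```python
import math


def pessimistic_objective(losses, target_probs, logging_probs, clip=10.0, penalty=1.0):
    """Clipped inverse-propensity loss estimate plus a variance penalty (illustrative form).

    losses[i]        observed loss of the logged action in round i
    target_probs[i]  probability the candidate policy assigns to that action
    logging_probs[i] probability the data-collecting policy assigned to it
    """
    n = len(losses)
    # Importance weights are clipped from above to control variance.
    terms = [min(t / b, clip) * loss
             for loss, t, b in zip(losses, target_probs, logging_probs)]
    mean = sum(terms) / n
    variance = sum((x - mean) ** 2 for x in terms) / n
    return mean + penalty * math.sqrt(variance / n)
```

A learner in this style would evaluate the objective for each candidate policy in the class and pick the minimizer at every round.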
Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model's defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages -- Exposed, Persisted, Relayed, Executed -- across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models -- the safety gap is entirely downstream; (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41--65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model -- a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) produce 100% ASR due to threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents -- 0/40 canaries survived into shared memory.
Updated: 2026-04-03 17:43:16
Categories: cs.CR,cs.AI,cs.LG
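The canary instrumentation in the Kill-Chain Canaries abstract above can be sketched as follows. The token pattern `SECRET-[A-F0-9]{8}` and the four stage names come from the abstract; the `stage_outputs` mapping, the rule that a stage counts only if all earlier stages were reached, and the function names are assumptions for illustration.

```python
import re
import secrets

CANARY_RE = re.compile(r"SECRET-[A-F0-9]{8}")
STAGES = ("exposed", "persisted", "relayed", "executed")


def new_canary() -> str:
    """Generate a random token matching SECRET-[A-F0-9]{8}."""
    return "SECRET-" + secrets.token_hex(4).upper()


def furthest_stage(canary: str, stage_outputs: dict):
    """Return the last consecutive stage whose captured text contains the canary.

    stage_outputs is a hypothetical mapping from stage name to the text
    captured at that stage of one run; returns None if never exposed.
    """
    reached = None
    for stage in STAGES:
        if canary in stage_outputs.get(stage, ""):
            reached = stage
        else:
            break
    return reached
```

Under this sketch, a run where the token appears in the model's input but not in the summary written to memory is recorded as stopping at `exposed`.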
Analysis of Invasive Breast Cancer in Mammograms Using YOLO, Explainability, and Domain Adaptation
Deep learning models for breast cancer detection from mammographic images have significant reliability problems when presented with Out-of-Domain (OOD) inputs such as other imaging modalities (CT, MRI, X-ray) or equipment variations, leading to unreliable detection and misdiagnosis. The current research mitigates the fundamental OOD issue through a comprehensive approach integrating ResNet50-based OOD filtering with YOLO architectures (YOLOv8, YOLOv11, YOLOv12) for accurate detection of breast cancer. Our strategy establishes an in-domain gallery and uses cosine similarity to strictly reject non-mammographic inputs prior to processing, ensuring that only in-domain images enter the detection pipeline. The OOD detection component achieves 99.77% overall accuracy, with 100% accuracy on OOD test sets, effectively eliminating irrelevant imaging modalities. ResNet50 was selected as the optimum backbone after 12 CNN architecture searches. The joint framework unites OOD robustness with high detection performance (mAP@0.5: 0.947) and enhanced interpretability through Grad-CAM visualizations. Experimental validation establishes that OOD filtering significantly improves system reliability by preventing false alarms on out-of-distribution inputs while maintaining high detection accuracy on mammographic data. The present study offers a foundation for the deployment of reliable AI-based breast cancer detection systems in diverse clinical environments with inherent data heterogeneity.
Updated: 2026-04-03 17:41:16
Categories: cs.CV,cs.AI
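The gallery-based OOD filter in the mammography abstract above can be sketched as a nearest-neighbor check in embedding space. The abstract does not give the decision rule, so the max-similarity rule, the `threshold` value of 0.8, and the function names are assumptions. The embeddings are assumed to come from an upstream backbone such as ResNet50, which is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_in_domain(embedding, gallery, threshold=0.8):
    """Accept an input if it is similar enough to at least one in-domain gallery embedding."""
    best = max(cosine_similarity(embedding, g) for g in gallery)
    return best >= threshold
```

Inputs rejected by `is_in_domain` would be stopped before the detector runs, which is how such a gate avoids false alarms on non-mammographic images.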
Local Reinforcement Learning with Action-Conditioned Root Mean Squared Q-Functions
The Forward-Forward (FF) Algorithm is a recently proposed learning procedure for neural networks that employs two forward passes instead of the traditional forward and backward passes used in backpropagation. However, FF remains largely confined to supervised settings, leaving a gap in domains where learning signals arise more naturally, such as reinforcement learning (RL). In this work, inspired by FF's goodness function using layer activity statistics, we introduce Action-conditioned Root mean squared Q-Functions (ARQ), a novel value estimation method that applies a goodness function and action conditioning for local RL using temporal difference learning. Despite its simplicity and biological grounding, our approach achieves superior performance compared to state-of-the-art local backprop-free RL methods in the MinAtar and the DeepMind Control Suite benchmarks, while also outperforming algorithms trained with backpropagation on most tasks. Code can be found at https://github.com/agentic-learning-ai-lab/arq.
Updated: 2026-04-03 17:37:35
Categories: cs.LG,cs.AI
Hierarchical Planning with Latent World Models
Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its ability to generalize zero-shot when deployed in new environments. However, learned world models often struggle with long-horizon control due to the accumulation of prediction errors and the exponentially growing search space. In this work, we address these challenges by learning latent world models at multiple temporal scales and performing hierarchical planning across these scales, enabling long-horizon reasoning while substantially reducing inference-time planning complexity. Our approach serves as a modular planning abstraction that applies across diverse latent world-model architectures and domains. We demonstrate that this hierarchical approach enables zero-shot control on real-world non-greedy robotic tasks, achieving a 70% success rate on pick-&-place using only a final goal specification, compared to 0% for a single-level world model. In addition, across physics-based simulated environments including push manipulation and maze navigation, hierarchical planning achieves higher success while requiring up to 4x less planning-time compute.
Updated: 2026-04-03 17:32:36
Categories: cs.LG
A Tsetlin Machine-driven Intrusion Detection System for Next-Generation IoMT Security
The rapid adoption of the Internet of Medical Things (IoMT) is transforming healthcare by enabling seamless connectivity among medical devices, systems, and services. However, it also introduces serious cybersecurity and patient safety concerns as attackers increasingly exploit new methods and emerging vulnerabilities to infiltrate IoMT networks. This paper proposes a novel Tsetlin Machine (TM)-based Intrusion Detection System (IDS) for detecting a wide range of cyberattacks targeting IoMT networks. The TM is a rule-based and interpretable machine learning (ML) approach that models attack patterns using propositional logic. Extensive experiments conducted on the CICIoMT-2024 dataset, which includes multiple IoMT protocols and cyberattack types, demonstrate that the proposed TM-based IDS outperforms traditional ML classifiers. The proposed model achieves an accuracy of 99.5% in binary classification and 90.7% in multi-class classification, surpassing existing state-of-the-art approaches. Moreover, to enhance model trust and interpretability, the proposed TM-based model presents class-wise vote scores and clause activation heatmaps, providing clear insights into the most influential clauses and the dominant class contributing to the final model decision.
Updated: 2026-04-03 17:26:52
Categories: cs.CR,cs.LG
PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction
Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the medical field. To aid in these developments we introduce PR3DICTR: Platform for Research in 3D Image Classification and sTandardised tRaining. Built using community-standard distributions (PyTorch and MONAI), PR3DICTR provides an open-access, flexible and convenient framework for prediction model development, with an explicit focus on classification using three-dimensional medical image data. By combining modular design principles and standardization, it aims to alleviate developmental burden whilst retaining adjustability. It provides users with a wealth of pre-established functionality, for instance in model architecture design options, hyper-parameter solutions and training methodologies, but still gives users the opportunity and freedom to "plug in" their own solutions or modules. PR3DICTR can be applied to any binary or event-based three-dimensional classification task and can work with as few as two lines of code.
Updated: 2026-04-03 17:25:17
Categories: cs.CV,cs.AI,cs.LG
Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding
Agentic AI is increasingly judged not by fluent output alone but by whether it can act, remember, and verify under partial observability, delay, and strategic observation. Existing research often studies these demands separately: robotics emphasizes control, retrieval systems emphasize memory, and alignment or assurance work emphasizes checking and oversight. This article argues that squirrel ecology offers a sharp comparative case because arboreal locomotion, scatter-hoarding, and audience-sensitive caching couple all three demands in one organism. We synthesize evidence from fox, eastern gray, and, in one field comparison, red squirrels, and impose an explicit inference ladder: empirical observation, minimal computational inference, and AI design conjecture. We introduce a minimal hierarchical partially observed control model with latent dynamics, structured episodic memory, observer-belief state, option-level actions, and delayed verifier signals. This motivates three hypotheses: (H1) fast local feedback plus predictive compensation improves robustness under hidden dynamics shifts; (H2) memory organized for future control improves delayed retrieval under cue conflict and load; and (H3) verifiers and observer models inside the action-memory loop reduce silent failure and information leakage while remaining vulnerable to misspecification. A downstream conjecture is that role-differentiated proposer/executor/checker/adversary systems may reduce correlated error under asymmetric information and verification burden. The contribution is a comparative perspective and benchmark agenda: a disciplined program of falsifiable claims about the coupling of control, memory, and verifiable action.
Updated: 2026-04-03 17:25:05
Categories: cs.AI
Learning the Signature of Memorization in Autoregressive Language Models
All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC, respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8x higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier are available at https://github.com/JetBrains-Research/learned-mia.
Updated: 2026-04-03 17:17:51
Categories: cs.CL,cs.CR,cs.LG
Real-Time Surrogate Modeling for Personalized Blood Flow Prediction and Hemodynamic Analysis
Cardiovascular modeling has rapidly advanced over the past few decades due to the rising need for health tracking and early detection of cardiovascular diseases. While 1-D arterial models offer an attractive compromise between computational efficiency and solution fidelity, their application to large populations or to generating large in silico cohorts remains challenging. Certain hemodynamic parameters, such as the terminal resistance/compliance, are difficult to estimate clinically and often yield non-physiological hemodynamics when sampled naively, so that large portions of simulated datasets must be discarded. In this work, we present a systematic framework for training machine learning (ML) models capable of instantaneous hemodynamic prediction and parameter estimation. We first generate a parametric virtual cohort of patients based on the multivariate correlations observed in the large Asklepios clinical dataset, ensuring that physiological parameter distributions are respected. We then train a deep neural surrogate model able to predict patient-specific arterial pressure and cardiac output (CO), enabling rapid a priori screening of input parameters. This allows for immediate rejection of non-physiological combinations and drastically reduces the cost of targeted synthetic dataset generation (e.g., hypertensive groups). The model also provides a principled means of sampling the terminal resistance to minimize the uncertainties of unmeasurable parameters. Moreover, by assessing the model's predictive performance, we determine the theoretical information that suffices for solving the inverse problem of estimating the CO. Finally, we apply the surrogate to a clinical dataset for the estimation of central aortic hemodynamics, i.e., the CO and aortic systolic blood pressure (cSBP).
Updated: 2026-04-03 17:15:34
标题: 实时代理建模用于个性化血流预测和血流动力学分析
摘要: 心血管建模在过去几十年迅速发展,主要是由于对健康跟踪和心血管疾病早期检测需求的增加。虽然一维动脉模型在计算效率和解决方案准确性之间提供了一种吸引人的折衷方案,但在大规模人群或生成大规模\emph{in silico}队列方面仍具有挑战性。像末端阻力/顺应性这样的特定血液动力学参数很难在临床上估计,并且在简单采样时通常产生非生理学的血液动力学,导致大部分模拟数据集被丢弃。在这项工作中,我们提出了一个系统框架,用于训练机器学习(ML)模型,能够进行瞬时血液动力学预测和参数估计。我们最初开始生成一个基于大型Asklepios临床数据集中观察到的多变量相关性的参数虚拟队列,确保生理参数分布得到尊重。然后,我们训练一个深度神经替代模型,能够预测特定患者的动脉压力和心输出量(CO),从而实现对输入参数的快速a~priori筛选。这允许立即拒绝非生理学组合,大大降低了有针对性的合成数据集生成的成本(例如高血压组)。该模型还提供了一种有原则的方法来采样末端阻力,以最小化不可测参数的不确定性。此外,通过评估模型的预测性能,我们确定了足以解决估计CO的逆问题的理论信息。最后,我们将替代模型应用于临床数据集,用于估计中央主动脉血液动力学,即CO和主动脉收缩压(cSBP)。
更新时间: 2026-04-03 17:15:34
领域: cs.LG,physics.comp-ph
Functional Natural Policy Gradients
We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
Updated: 2026-04-03 17:07:40
标题: 功能性自然政策梯度
摘要: 我们提出了一个用于从离线数据中学习政策的交叉修正去偏差设备。由于所得学习原则的关键结果是$\sqrt N$后悔,即使对于复杂度大于Donsker的政策类别,只要乘积误差的剩余部分是$O(N^{-1/2})$。后悔边界分解为由政策类别复杂度控制的插件政策误差因子和由环境动态复杂度控制的环境干扰因子,明确了如何可以将一个与另一个交换。
更新时间: 2026-04-03 17:07:40
领域: stat.ML,cs.LG
Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization
We study multi-teacher knowledge distillation for low-resource abstractive summarization from a reliability-aware perspective. We introduce EWAD (Entropy Weighted Agreement Aware Distillation), a token-level mechanism that routes supervision between teacher distillation and gold supervision based on inter-teacher agreement, and CPDP (Capacity Proportional Divergence Preservation), a geometric constraint on the student's position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and eight Qwen2.5 experiments, we find that logit-level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross-lingual pseudo-label KD across ten languages retains 71-122 percent of teacher ROUGE-L at 3.2x compression. A human-validated multi-judge LLM evaluation further reveals calibration bias in single-judge pipelines. Overall, our results show that reliability-aware distillation helps characterize when multi-teacher supervision improves summarization and when data scaling outweighs loss engineering.
Updated: 2026-04-03 17:06:51
标题: 可靠性门控多教师蒸馏用于低资源抽象摘要
摘要: 我们从可靠性意识的角度研究了低资源抽象摘要的多教师知识蒸馏。我们引入了EWAD(熵加权协议意识蒸馏),这是一个基于教师之间协议的令牌级机制,将监督路由到教师蒸馏和金标准之间,并且引入了CPDP(容量比例分歧保持),这是一个关于学生位置相对异质教师的几何约束。在两个孟加拉语数据集、13个孟加拉语T5消融和八个Qwen2.5实验中,我们发现逻辑级KD提供了最可靠的增益,而更复杂的蒸馏改善了短摘要的语义相似性,但降低了更长输出的质量。跨语言伪标签KD在十种语言中保留了71-122%的教师ROUGE L在3.2倍压缩下。人类验证的多评委LLM评估进一步揭示了单评委流水线中的校准偏差。总的来说,我们的结果表明,可靠性意识的蒸馏有助于确定何时多教师监督改善摘要,并且何时数据规模化超过损失工程。
更新时间: 2026-04-03 17:06:51
领域: cs.CL,cs.AI
The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.
Updated: 2026-04-03 17:06:31
标题: 压缩差距:为什么离散标记限制了视觉-语言-动作模型的扩展能力
摘要: 通过升级视觉编码器来扩展视觉-语言-动作(VLA)模型,预计可以改善下游操作性能,就像在视觉-语言建模中一样。我们展示了当动作被表示为离散标记时,这种预期是失败的,并通过我们称之为“压缩差距”的信息论原则来解释原因:在任何视觉运动管线中,扩展行为受最紧密信息瓶颈位置的控制。当动作是连续的(例如扩散策略)时,视觉编码器是约束条件,直接升级它会改善性能。当通过固定容量码书(例如OAT)离散化动作时,码书成为约束条件,编码器的改进无法传播过去-无论上游表示有多丰富。我们通过三条证据在LIBERO基准上验证了这一原则:一个因子实验表明,编码器升级可以使扩散策略的改进超过21个百分点,而OAT的增益在模型规模上大大减弱;四个编码器之间的编码器质量梯度证实了扩散策略单调跟踪编码器质量,而OAT保持不变;一个码书大小实验表明,放松码书容量部分恢复编码器灵敏度,为瓶颈假设提供因果证据。我们的发现揭示了在物理智能中的规模化需要确定管道中信息瓶颈的位置,而不是简单地增加模型或数据大小。
更新时间: 2026-04-03 17:06:31
领域: cs.RO,cs.CV,cs.LG
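The bottleneck argument in the Compression Gap abstract above can be illustrated with simple capacity arithmetic. The sketch below is not taken from the paper: it assumes each pipeline stage can be summarized by an upper bound in bits, uses the standard fact that a token drawn from a codebook of size K carries at most log2(K) bits, and picks hypothetical numbers (encoder capacities, codebook sizes, tokens per action chunk) purely for illustration.

```python
import math


def codebook_capacity_bits(codebook_size: int, tokens_per_action: int) -> float:
    """Upper bound on the information carried by one discretized action chunk."""
    return tokens_per_action * math.log2(codebook_size)


def tightest_bottleneck_bits(stage_capacities: list[float]) -> float:
    """Information passed through a chain of stages is bounded by the smallest capacity."""
    return min(stage_capacities)


# Hypothetical numbers, chosen only to show the qualitative effect.
small_encoder, large_encoder = 200.0, 400.0  # assumed encoder capacities in bits
codebook = codebook_capacity_bits(1024, tokens_per_action=8)         # 8 * 10 bits
bigger_codebook = codebook_capacity_bits(4096, tokens_per_action=8)  # 8 * 12 bits

# Discrete actions: the codebook binds, so upgrading the encoder changes nothing.
discrete_small = tightest_bottleneck_bits([small_encoder, codebook])
discrete_large = tightest_bottleneck_bits([large_encoder, codebook])

# Continuous actions (no codebook stage): the encoder binds, so upgrades propagate.
continuous_small = tightest_bottleneck_bits([small_encoder])
continuous_large = tightest_bottleneck_bits([large_encoder])

# Relaxing the codebook capacity raises the discrete bound again.
discrete_relaxed = tightest_bottleneck_bits([large_encoder, bigger_codebook])
```

Under these assumed numbers the discrete pipeline stays capped by the codebook regardless of the encoder, which mirrors the flat OAT curve the abstract reports. Enlarging the codebook lifts the cap, in the spirit of the codebook size experiment.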
Gradient Boosting within a Single Attention Layer
Transformer attention computes a single softmax-weighted average over values -- a one-pass estimate that cannot correct its own errors. We introduce \emph{gradient-boosted attention}, which applies the principle of gradient boosting \emph{within} a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman's gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information inaccessible to the shared-projection approach of Tukey's twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of $67.9$ compared to $72.2$ for standard attention, $69.6$ for Twicing Attention, and $69.0$ for a parameter-matched wider baseline, with two rounds capturing most of the benefit.
Updated: 2026-04-03 17:06:08
标题: 单个关注层内的梯度提升
摘要: Transformer注意力计算值的单个softmax加权平均值 -- 一个一次估计,无法纠正自身的错误。我们引入\emph{梯度增强注意力},它应用了梯度增强的原理\emph{在}单个注意力层内:第二次注意力传递,带有自己学习的投影,关注第一次的预测错误并应用门控校正。在一个平方重建目标下,该构造映射到Friedman的梯度增强机器,每个注意力传递作为一个基学习器,每个维度门作为缩减参数。我们展示了单个Hopfield风格的更新可以擦除存储模式子空间正交于查询信息,进一步迭代下的局部收缩可以将相同区域内的不同查询塌缩为相同的固定点。我们还展示了为校正传递单独的投影可以恢复无法访问通过Tukey的双重共享投影方法的残余信息。在WikiText-103的一个1000万令牌子集上,梯度增强注意力在测试困惑度方面达到$67.9$,而标准注意力为$72.2$,双倍注意力为$69.6$,参数匹配的更宽基线为$69.0$,两轮捕获了大部分收益。
更新时间: 2026-04-03 17:06:08
领域: cs.LG,cs.AI
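One plausible reading of the gradient-boosted attention construction described above can be sketched in plain Python. The sketch rests on assumptions that are not confirmed by the abstract: the "prediction error" is taken to be the per-token residual between the values and the first pass's output, both passes attend over the same tokens, and all function and parameter names are hypothetical. With shared projections and a unit gate, the sketch reduces to classical twicing for a linear smoother A, namely (2A - A^2)V.

```python
import math


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(X, Wq, Wk):
    """Row-stochastic matrix softmax(Q K^T / sqrt(d)) for the tokens in X."""
    Q, K = matmul(X, Wq), matmul(X, Wk)
    scale = math.sqrt(len(Q[0]))
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q times K transposed
    return [softmax([s / scale for s in row]) for row in scores]


def boosted_attention(X, V, proj1, proj2, gate):
    """Two-round sketch: the second pass attends to the residuals of the first."""
    # Round 1: ordinary softmax-weighted average of the values.
    first = matmul(attention_weights(X, *proj1), V)
    # Assumed error signal: what the first pass failed to reconstruct, per token.
    residual = [[v - f for v, f in zip(v_row, f_row)] for v_row, f_row in zip(V, first)]
    # Round 2: separate projections, applied to the residuals.
    correction = matmul(attention_weights(X, *proj2), residual)
    # The per-dimension gate plays the role of the shrinkage parameter.
    return [[f + g * c for f, g, c in zip(f_row, gate, c_row)]
            for f_row, c_row in zip(first, correction)]
```

Setting the gate to zero recovers standard attention. Giving the second pass its own projections is what distinguishes this sketch from the shared-projection twicing case.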
Reflective Context Learning: Studying the Optimization Primitives of Context Space
Generally capable agents must learn from experience in ways that generalize across tasks and environments. The fundamental problems of learning, including credit assignment, overfitting, forgetting, local optima, and high-variance learning signals, persist whether the learned object lies in parameter space or context space. While these challenges are well understood in classical machine learning optimization, they remain underexplored in context space, leading current methods to be fragmented and ad hoc. We present Reflective Context Learning (RCL), a unified framework for agents that learn through repeated interaction, reflection on behavior and failure modes, and iterative updates to context. In RCL, reflection converts trajectories and current context into a directional update signal analogous to gradients, while mutation applies that signal to improve future behavior in context space. We recast recent context-optimization approaches as instances of this shared learning problem and systematically extend them with classical optimization primitives, including batching, improved credit-assignment signal, auxiliary losses, failure replay, and grouped rollouts for variance reduction. On AppWorld, BrowseComp+, and RewardBench2, these primitives improve over strong baselines, with their relative importance shifting across task regimes. We further analyze robustness to initialization, the effects of batch size, sampling and curriculum strategy, optimizer-state variants, and the impact of allocating stronger or weaker models to different optimization components. Our results suggest that learning through context updates should be treated not as a set of isolated algorithms, but as an optimization problem whose mechanisms can be studied systematically and improved through transferable principles.
Updated: 2026-04-03 17:05:45
标题: 反思性背景学习:研究背景空间的优化基元
摘要: 一般能力强的智能体必须通过经验学习,以在各种任务和环境中进行泛化。学习的基本问题,包括信用分配、过拟合、遗忘、局部最优和高方差的学习信号,无论学习对象位于参数空间还是上下文空间,都会存在。尽管这些挑战在经典机器学习优化中得到了很好的理解,但在上下文空间中仍然未被充分探讨,导致当前方法表现零碎且临时。我们提出了反思上下文学习(RCL),这是一个统一的框架,用于通过反复交互、行为和失败模式的反思以及对上下文的迭代更新来学习的智能体。在RCL中,反思将轨迹和当前上下文转换为类似于梯度的方向更新信号,而突变则将该信号应用于改善未来在上下文空间中的行为。我们将最近的上下文优化方法重新构建为这一共享学习问题的实例,并系统地通过经典优化原语进行扩展,包括批处理、改进的信用分配信号、辅助损失、失败重播和分组展开以减少方差。在AppWorld、BrowseComp+和RewardBench2上,这些原语相对于强基线表现出改进,它们在任务规则上的相对重要性也发生变化。我们进一步分析了对初始化的鲁棒性,批量大小、抽样和课程策略的影响,优化器状态变体以及将更强或更弱的模型分配给不同优化组件的影响。我们的结果表明,通过上下文更新进行学习不应被视为一组孤立的算法,而应被视为一个可以通过可转移原则系统地研究和改进的优化问题。
更新时间: 2026-04-03 17:05:45
领域: cs.LG,cs.AI
Towards best practices in low-dimensional semi-supervised latent Bayesian optimization for the design of antimicrobial peptides
Generative deep learning techniques have demonstrated an impressive capacity for tackling biomolecular design problems in recent years. Despite their high performance, however, they still suffer from a lack of interpretability and rigorous quantification of associated search spaces, which are necessary to unlock their full potential for scientific inquiry beyond efficient design. An area in which they are of particular interest is in the design of antimicrobial peptides, which are a promising class of therapeutics to treat bacterial infections. Discovering and designing such peptides is difficult because of the vast number of possible sequences and comparatively small amount of experimental information. In this work, we perform a theoretical investigation of latent Bayesian optimization for searching through peptide sequence spaces, with a focus on antimicrobial peptides. We investigate (1) whether searching through a dimensionally-reduced variant of the latent design space may facilitate optimization, (2) how organizing latent spaces by differing amounts of more and less relevant information may improve the efficiency of arriving at an optimal peptide design, and (3) the interpretability of the spaces. We find that employing a dimensionally-reduced version of the latent space is more interpretable and can be advantageous, while the use of less-relevant but more easily-computable physicochemical properties is advantageous to latent space organization in certain contexts and the use of more-relevant but sparser properties associated with the latent Bayesian objective function is advantageous in others. This work lays crucial groundwork for biophysically-motivated peptide design procedures, with an especial focus on antimicrobial peptides.
Updated: 2026-04-03 16:57:10
标题: 朝向低维半监督潜在贝叶斯优化在抗菌肽设计中的最佳实践
摘要: 近年来,生成式深度学习技术在处理生物分子设计问题方面展示出了令人印象深刻的能力。然而,尽管它们表现出很高的性能,但仍然存在着缺乏可解释性和对相关搜索空间的严格量化的问题,这些是必要的,以释放它们在高效设计之外的科学探究中的全部潜力。它们特别受关注的领域之一是抗微生物肽的设计,这是一类治疗细菌感染的有前途的药物。发现和设计这种肽是困难的,因为可能序列的数量巨大,而实验信息相对较少。在这项工作中,我们对用于搜索肽序列空间的潜在贝叶斯优化进行了理论研究,重点放在抗微生物肽上。我们研究了:(1) 通过搜索潜在设计空间的降维变体是否可以促进优化,(2) 通过不同数量的相关和不相关信息组织潜在空间是否可以提高到达最佳肽设计的效率,以及(3) 空间的可解释性。我们发现,使用潜在空间的降维版本更具解释性并且具有优势,而在某些情况下,使用不太相关但更容易计算的物理化学性质对于潜在空间的组织是有利的,而在其他情况下,使用与潜在贝叶斯目标函数相关但更稀疏的性质是有利的。这项工作为受生物物理激励的肽设计程序奠定了关键基础,特别关注抗微生物肽。
更新时间: 2026-04-03 16:57:10
领域: cs.LG,physics.comp-ph
PRISM: LLM-Guided Semantic Clustering for High-Precision Topics
In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representations captured by LLMs with the low cost and interpretability of latent semantic clustering methods. PRISM fine-tunes a sentence encoding model using a sparse set of LLM-provided labels on samples drawn from some corpus of interest. We segment this embedding space with thresholded clustering, yielding clusters that separate closely related topics within some narrow domain. Across multiple corpora, PRISM improves topic separability over state-of-the-art local topic models and even over clustering on large, frontier embedding models while requiring only a small number of LLM queries to train. This work contributes to several research streams by providing (i) a student-teacher pipeline to distill sparse LLM supervision into a lightweight model for topic discovery; (ii) an analysis of the efficacy of sampling strategies to improve local geometry for cluster separability; and (iii) an effective approach for web-scale text analysis, enabling researchers and practitioners to track nuanced claims and subtopics online with an interpretable, locally deployable framework.
Updated: 2026-04-03 16:56:47
标题: PRISM: 基于LLM引导的高精度主题语义聚类
摘要: 在本文中,我们提出了一种名为PRISM(Precision-Informed Semantic Modeling)的结构化主题建模框架,结合了LLMs捕获的丰富表示优势和潜在语义聚类方法的低成本和可解释性。PRISM通过在从某个感兴趣的语料库中提取的样本上使用LLM提供的稀疏标签微调句子编码模型。我们使用阈值聚类对这个嵌入空间进行分段,产生将某个狭窄领域内紧密相关主题分开的簇。在多个语料库中,PRISM提高了主题可分性,超过了最先进的局部主题模型,甚至超过了基于大型前沿嵌入模型的聚类,同时只需要少量LLM查询来训练。这项工作通过提供(i)一个学生-教师管道,将稀疏LLM监督提炼成一个轻量级的主题发现模型;(ii)分析采样策略的有效性,以改善簇可分性的局部几何;以及(iii)一种有效的网页规模文本分析方法,使研究人员和从业者能够通过一个可解释的、可在本地部署的框架在线跟踪微妙的论断和子主题。
更新时间: 2026-04-03 16:56:47
领域: cs.LG,cs.CL,cs.IR,cs.SI
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.
Updated: 2026-04-03 16:56:34
标题: 理解幻觉在多模态推理模型培训后强化作用的角色
摘要: 最近强化学习(RL)在大型推理模型中取得的成功,启发了越来越多的人采用RL来加强后训练的多模态大型语言模型(MLLMs)的视觉推理能力。尽管许多研究报告了性能的提高,但RL训练是否真正使模型能够从视觉信息中学习仍不清楚。在这项工作中,我们提出了幻觉作为线索框架,这是一个旨在从模型幻觉的角度研究RL后训练对多模态推理模型影响的分析框架。具体地,我们引入了幻觉诱导的、模态特定的损坏,这些损坏会移除或替换推导正确答案所需的关键信息,从而迫使模型通过幻觉进行推理。通过在训练和评估过程中应用这些损坏,我们的框架为诊断RL训练动态和理解数据集的固有属性提供了独特的视角。通过在多个多模态推理基准上进行广泛的实验和分析,我们揭示了模型幻觉在RL训练中的作用比以前认识到的更为重要。例如,我们发现在纯粹的幻觉诱导设置下进行的RL后训练仍然可以显著提高模型的推理性能,有时甚至优于标准训练。这些发现挑战了有关MLLM推理训练的普遍假设,并激发了更多关于模态感知的RL训练设计的发展。
更新时间: 2026-04-03 16:56:34
领域: cs.LG,cs.AI,cs.CV
Voting by mail: a Markov chain model for managing the security risks of election systems
The scrutiny surrounding vote-by-mail (VBM) in the United States has increased in recent years, highlighting the need for a rigorous quantitative framework to evaluate the resilience of the absentee voting infrastructure. This paper addresses these issues by introducing a dynamic mathematical modeling framework for performing a risk assessment of VBM processes. We introduce a discrete-time Markov chain (DTMC) to model the VBM process and assess election performance and risk with a novel layered network approach that considers the interplay between VBM processes, malicious and non-malicious threats, and security mitigations. The time-inhomogeneous DTMC framework captures dynamic risks and evaluates performance over time. The DTMC model accounts for a spectrum of outcomes, from unintended voter errors to sophisticated, targeted attacks, representing a significant advancement in the risk assessment of VBM planning and protection. A case study based on real-world data from Milwaukee County, Wisconsin, is used to evaluate the DTMC model. The analysis includes hypothetical worst-case attack scenarios to stress-test VBM processes and to assess the efficacy of security measures and the impact of different attack timings. The analysis suggests that ballot drop boxes and automatic ballot notification systems are crucial for reducing the attack surface to ensure secure and reliable operations.
Updated: 2026-04-03 16:49:37
标题: 邮寄投票:一种马尔可夫链模型用于管理选举系统的安全风险
摘要: 随着近年来美国邮寄选票(VBM)的审查增加,突显了需要一个严格的定量框架来评估缺席投票基础设施的弹性。本文通过引入一个动态数学建模框架来解决这些问题,以进行VBM流程的风险评估。我们引入了离散时间马尔可夫链(DTMC)来模拟VBM流程,并通过考虑VBM流程、恶意和非恶意威胁以及安全减轻措施之间的相互作用的新颖分层网络方法来评估选举绩效和风险。时变的DTMC框架捕捉动态风险,并随时间评估绩效。DTMC模型考虑了一系列结果,从意外选民错误到精心策划的有针对性攻击,代表了VBM规划和保护风险评估的重大进展。利用来自威斯康星州密尔沃基县的真实数据进行的案例研究用于评估DTMC模型。分析包括假设的最坏攻击场景,以对VBM流程进行压力测试,并评估安全措施的有效性以及不同攻击时机的影响。分析表明,选票投放箱和自动选票通知系统对于减少攻击面以确保安全可靠的运作至关重要。
更新时间: 2026-04-03 16:49:37
领域: cs.CR,math.PR
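The time-inhomogeneous DTMC idea in the vote-by-mail abstract above can be sketched as a state distribution pushed through a time-indexed sequence of transition matrices. Everything concrete here is an assumption made for illustration: the four states, all transition probabilities, the single-step "attack" matrix, and the function names are hypothetical and are not taken from the paper or from the Milwaukee County data.

```python
# Hypothetical, heavily simplified ballot states.
STATES = ["mailed_to_voter", "returned", "counted", "lost_or_rejected"]


def step(dist, P):
    """One DTMC step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def propagate(initial, matrices):
    """Apply a time-indexed sequence of transition matrices (time-inhomogeneous chain)."""
    dist = list(initial)
    for P in matrices:
        if any(abs(sum(row) - 1.0) > 1e-9 for row in P):
            raise ValueError("each row of a transition matrix must sum to 1")
        dist = step(dist, P)
    return dist


# Assumed baseline dynamics; "counted" and "lost_or_rejected" are absorbing.
BASELINE = [
    [0.20, 0.78, 0.00, 0.02],
    [0.00, 0.10, 0.89, 0.01],
    [0.00, 0.00, 1.00, 0.00],
    [0.00, 0.00, 0.00, 1.00],
]
# Assumed attack: ballots in the mail are more likely to be lost during one time step.
ATTACK = [[0.20, 0.58, 0.00, 0.22], BASELINE[1], BASELINE[2], BASELINE[3]]


def lost_probability(attack_step, horizon=5):
    """Probability that a ballot ends up lost when the attack hits at `attack_step`."""
    matrices = [ATTACK if t == attack_step else BASELINE for t in range(horizon)]
    start = [1.0, 0.0, 0.0, 0.0]  # every ballot starts in the mail
    return propagate(start, matrices)[STATES.index("lost_or_rejected")]


no_attack = lost_probability(attack_step=None)
early_attack = lost_probability(attack_step=0)
late_attack = lost_probability(attack_step=3)
```

In this toy setup the same attack does more damage early, when most ballots are still in the mail, than late. This is the kind of attack-timing effect the abstract says the model is used to assess.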
Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation
Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context windows, and weakly structured causal reasoning. This survey provides a unified account of augmentation strategies along a single axis: the degree of structured context supplied at inference time. We cover in-context learning and prompt engineering, Retrieval-Augmented Generation (RAG), GraphRAG, and CausalRAG. Beyond conceptual comparison, we provide a transparent literature-screening protocol, a claim-audit framework, and a structured cross-paper evidence synthesis that distinguishes higher-confidence findings from emerging results. The paper concludes with a deployment-oriented decision framework and concrete research priorities for trustworthy retrieval-augmented NLP.
Updated: 2026-04-03 16:49:09
标题: 超越参数:大型语言模型中上下文丰富化的技术调查:从上下文提示到因果检索增强生成
摘要: 大型语言模型(LLMs)在其参数中编码了广泛的世界知识,然而它们仍然基本上受到静态知识、有限上下文窗口和弱结构因果推理的限制。本调查提供了沿着一个单一轴线的增强策略的统一说明:在推理时提供结构化上下文的程度。我们涵盖了上下文学习和提示工程、检索增强生成(RAG)、图形RAG和因果RAG。除了概念比较外,我们提供了一个透明的文献筛选协议、一个声明审计框架,以及一个结构化的跨论文证据综合,区分高置信度的发现和新兴结果。本文以一个面向部署的决策框架和可信任的检索增强NLP的具体研究重点作为结论。
更新时间: 2026-04-03 16:49:09
领域: cs.CL,cs.AI
Adaptive randomized pivoting and volume sampling
Adaptive randomized pivoting (ARP) is a recently proposed and highly effective algorithm for column subset selection. This paper reinterprets the ARP algorithm by drawing connections to the volume sampling distribution and active learning algorithms for linear regression. As a consequence, this paper presents a new analysis of the ARP algorithm and faster implementations using rejection sampling.
Updated: 2026-04-03 16:47:38
标题: 自适应随机轴转和体积抽样
摘要: 自适应随机枢轴选择(ARP)是一种最近提出的非常有效的列子集选择算法。本文通过将ARP算法与体积抽样分布和线性回归的主动学习算法联系起来,对ARP算法进行重新解释。作为结果,本文提出了ARP算法的新分析以及使用拒绝抽样实现更快速的方法。
更新时间: 2026-04-03 16:47:38
领域: stat.ML,cs.DS,cs.LG,math.NA,stat.CO
Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring
When the environment contains an unlearnable source of randomness (a noisy-TV), an agent driven naively by intrinsic reward gets stuck at that source and fails to explore. Intrinsic rewards based on uncertainty estimation or distribution similarity eventually escape noisy-TVs as time unfolds, but suffer from poor sample efficiency and high computational cost. Inspired by recent findings from neuroscience that humans monitor their improvements during exploration, we propose a novel method for intrinsically-motivated exploration, named Learning Progress Monitoring (LPM). During exploration, LPM rewards model improvements instead of prediction error or novelty, effectively rewarding the agent for observing learnable transitions rather than unlearnable ones. We introduce a dual-network design that uses an error model to predict the expected prediction error of the dynamics model in its previous iteration, and uses the difference between the model errors of the current and previous iterations to guide exploration. We theoretically show that the intrinsic reward of LPM is zero-equivariant and a monotone indicator of Information Gain (IG), and that the error model is necessary to achieve monotonicity correspondence with IG. We empirically compare LPM against state-of-the-art baselines in noisy environments based on MNIST, a 3D maze with 160x120 RGB inputs, and Atari. Results show that LPM's intrinsic reward converges faster and that LPM explores more states in the maze experiment and achieves higher extrinsic reward in Atari. This conceptually simple approach marks a paradigm shift in noise-robust exploration. For code to reproduce our experiments, see https://github.com/Akuna23Matata/LPM_exploration
Updated: 2026-04-03 16:34:50
标题: 超越嘈杂电视:通过学习进度监控实现噪声鲁棒探索
摘要: 当环境中存在一个不可学习的随机性来源(嘈杂的电视)时,一个天真的内在奖励驱动的探索代理会被困在这种随机性来源上,并在探索中失败。基于不确定性估计或分布相似性的内在奖励,虽然最终会随着时间的推移逃离嘈杂的电视,但受到样本效率低和计算成本高的影响。受最近神经科学发现的启发,即人类在探索过程中监控自己的改进,我们提出了一种新颖的内在动机探索方法,名为学习进展监控(LPM)。在探索过程中,LPM奖励模型的改进,而不是预测误差或新奇性,有效地奖励代理观察可学习转换而不是不可学习转换。我们引入了一个双网络设计,使用一个错误模型来预测在其上一次迭代中动态模型的预期预测误差,并使用当前迭代和上一次迭代的模型误差之间的差异来引导探索。我们理论上表明,LPM的内在奖励是零等变的,并且是信息增益(IG)的单调指示器,错误模型对于实现与IG的单调对应性是必要的。我们在基于MNIST、160x120 RGB输入的3D迷宫和Atari的嘈杂环境中,将LPM与最先进的基线进行了实证比较。结果表明,LPM的内在奖励收敛更快,在迷宫实验中探索更多状态,并在Atari中获得更高的外在奖励。这种概念上简单的方法标志着对抗噪声的探索方式的转变。要复制我们的实验的代码,请参阅https://github.com/Akuna23Matata/LPM_exploration。
更新时间: 2026-04-03 16:34:50
领域: cs.LG,cs.AI
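The learning-progress reward described in the LPM abstract above can be illustrated with a toy, network-free sketch. It is not the authors' implementation: the dynamics model is replaced by a running mean, the error model by an exponential moving average of past squared errors, and the function name, learning rates, and the two observation streams are all hypothetical.

```python
import random


def lpm_total_reward(observations, lr=0.1, beta=0.5):
    """Sum of (predicted error of the earlier model - error of the current model)."""
    estimate = 0.0          # stand-in dynamics model: running mean of the observation
    predicted_error = None  # stand-in error model: expected error of the earlier model
    total = 0.0
    for obs in observations:
        current_error = (obs - estimate) ** 2
        if predicted_error is None:
            predicted_error = current_error  # no history yet, so no progress signal
        total += predicted_error - current_error  # learning progress as intrinsic reward
        predicted_error += beta * (current_error - predicted_error)  # update error model
        estimate += lr * (obs - estimate)                            # update dynamics model
    return total


# A learnable transition: the observation is always the same.
learnable_stream = [1.0] * 200
# A noisy-TV: the observation is a fair coin flip that no model can predict.
rng = random.Random(0)
noisy_tv_stream = [rng.choice([-1.0, 1.0]) for _ in range(200)]

learnable_reward = lpm_total_reward(learnable_stream)
noisy_tv_reward = lpm_total_reward(noisy_tv_stream)
```

On the learnable stream the error keeps shrinking, so the accumulated reward is positive. On the noisy stream the error never improves on average, so the per-step rewards should largely cancel, whereas a reward equal to raw prediction error would stay high there indefinitely.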
CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints
Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and risk disclosure. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on communicates-risks remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.
Updated: 2026-04-03 16:31:59
标题: CQA-Eval:在资源约束下设计可靠的多段临床问答评估
摘要: 评估多段临床问题回答(QA)系统需要大量资源和挑战性:准确的判断需要医学专业知识,并且在多段文字中实现一致的人类判断是困难的。我们引入了CQA-Eval,这是一个评估框架和一组针对资源有限和专业知识需求较高环境的评估建议。基于医生对300个真实患者问题的注释,这些问题是由医生和LLMs回答的,我们比较了粗糙的答案级别与细粒度的句子级别评估在正确性、相关性和风险披露维度上的差异。我们发现,注释者之间的一致性(IAA)因维度而异:细粒度注释提高了正确性上的一致性,而粗糙注释提高了相关性上的一致性,而有关风险披露的判断仍然不一致。此外,仅注释一小部分句子就可以提供与粗糙注释相当的可靠性,从而降低成本和工作量。
更新时间: 2026-04-03 16:31:59
领域: cs.CL,cs.AI
Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models
Recent advancements in Vision Language Models (VLMs) have demonstrated progress toward true intelligence, which requires robust reasoning capabilities. Beyond pattern recognition, linguistic reasoning must integrate with visual comprehension, particularly for Chart Question Answering (CQA) tasks involving complex data visualizations. Current VLMs face significant limitations in CQA, including imprecise numerical extraction, difficulty interpreting implicit visual relationships, and inadequate attention mechanisms for capturing spatial relationships in charts. In this work, we address these challenges by presenting Chart-RL, a novel reinforcement learning framework that enhances VLMs' chart understanding through feedback-driven policy optimization of visual perception and logical inference. Our key innovation is a comprehensive framework that integrates policy-optimization Reinforcement Learning (RL) techniques with adaptive reward functions; it demonstrates superior performance compared to baseline foundation models and competitive results against larger state-of-the-art architectures. We also integrate Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA) into the RL framework, which requires only a single-GPU configuration while preserving performance. We conducted extensive benchmarking across open-source, proprietary, and state-of-the-art closed-source models using the ChartQAPro dataset. The RL fine-tuned Qwen3-VL-4B-Instruct model achieved an answer accuracy of 0.634, surpassing the 0.580 accuracy of the Qwen3-VL-8B-Instruct foundation model despite using half the parameter count, while simultaneously reducing inference latency from 31 seconds to 9 seconds.
Updated: 2026-04-03 16:28:03
标题: Chart-RL:策略优化强化学习,用于增强视觉语言模型在图表问答中的视觉推理
摘要: 最近视觉语言模型(VLMs)的进展表明了朝着需要强大推理能力的真正智能的进展。除了模式识别外,语言推理必须与视觉理解相结合,特别是对涉及复杂数据可视化的图表问题回答(CQA)任务。当前的VLMs在CQA中面临着重大限制,包括不精确的数值提取、难以解释隐含的视觉关系,以及捕捉图表中空间关系的不足的注意机制。在这项工作中,我们通过提出Chart-RL,一个通过反馈驱动的策略优化视觉感知和逻辑推理的强化学习框架,来解决这些挑战。我们的关键创新包括一个综合框架,将来自策略优化技术的强化学习(RL)与自适应奖励函数相结合,相对于基线基础模型表现出更优异的性能,并与更大的最先进架构相竞争。我们还在RL框架中集成了参数高效微调技术LoRA,只需单个GPU配置即可,同时保持性能完整性。我们在开源、专有和最先进的闭源模型之间进行了广泛的基准测试,利用了ChartQAPro数据集。经过RL微调后,Qwen3-VL-4B-Instruct模型的答案准确率达到了0.634,超过了Qwen3-VL-8B-Instruct基础模型的0.580准确率,同时将推理延迟从31秒降低到9秒。
更新时间: 2026-04-03 16:28:03
领域: cs.AI
DSBD: Dual-Aligned Structural Basis Distillation for Graph Domain Adaptation
Graph domain adaptation (GDA) aims to transfer knowledge from a labeled source graph to an unlabeled target graph under distribution shifts. However, existing methods are largely feature-centric and overlook structural discrepancies, which become particularly detrimental under significant topology shifts. Such discrepancies alter both geometric relationships and spectral properties, leading to unreliable transfer of graph neural networks (GNNs). To address this limitation, we propose Dual-Aligned Structural Basis Distillation (DSBD) for GDA, a novel framework that explicitly models and adapts cross-domain structural variation. DSBD constructs a differentiable structural basis by synthesizing continuous probabilistic prototype graphs, enabling gradient-based optimization over graph topology. The basis is learned under source-domain supervision to preserve semantic discriminability, while being explicitly aligned to the target domain through a dual-alignment objective. Specifically, geometric consistency is enforced via permutation-invariant topological moment matching, and spectral consistency is achieved through Dirichlet energy calibration, jointly capturing structural characteristics across domains. Furthermore, we introduce a decoupled inference paradigm that mitigates source-specific structural bias by training a new GNN on the distilled structural basis. Extensive experiments on graph and image benchmarks demonstrate that DSBD consistently outperforms state-of-the-art methods.
Updated: 2026-04-03 16:23:59
标题: DSBD:用于图领域自适应的双重对齐结构基础蒸馏
摘要: 图领域自适应(GDA)旨在在分布转移下,从一个有标签的源图转移知识到一个无标签的目标图。然而,现有方法主要以特征为中心,忽视了结构差异,这在拓扑变化显著时尤为有害。这种差异改变了几何关系和谱特性,导致图神经网络(GNNs)的传递变得不可靠。为了解决这一限制,我们提出了一种新的GDA框架,名为Dual-Aligned Structural Basis Distillation(DSBD),明确地建模和适应跨领域结构变化。DSBD通过合成连续概率原型图构建了一个可微分的结构基础,使得可以在图拓扑上进行基于梯度的优化。该基础在源领域监督下学习以保留语义可区分性,同时通过双重对齐目标域。具体来说,通过排列不变的拓扑矩匹配来强制几何一致性,通过狄利克雷能量校准来实现谱一致性,共同捕捉跨领域的结构特征。此外,我们引入了一种解耦的推理范式,通过在提取的结构基础上训练新的GNN来减轻源特定的结构偏差。在图像基准测试中进行的广泛实验表明,DSBD始终优于现有方法。
更新时间: 2026-04-03 16:23:59
领域: cs.LG
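The Dirichlet energy that the DSBD abstract above uses for spectral consistency has a standard definition, sketched below for a scalar signal on a weighted undirected graph. Only the textbook formula is shown: how DSBD calibrates this quantity across domains is not specified in the abstract, and the function names and the example graphs are hypothetical. Edge weights are allowed to be probabilities in [0, 1], matching the idea of continuous probabilistic prototype graphs.

```python
def dirichlet_energy(adj, x):
    """0.5 * sum_ij adj[i][j] * (x[i] - x[j])**2 for a symmetric weighted adjacency."""
    n = len(adj)
    return 0.5 * sum(adj[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))


def laplacian_quadratic_form(adj, x):
    """Equivalent form x^T L x with the combinatorial Laplacian L = D - A."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    Lx = [degree[i] * x[i] - sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(xi * lxi for xi, lxi in zip(x, Lx))


# Path graph 0 - 1 - 2 with a signal that changes along the path.
path = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
signal = [0.0, 1.0, 3.0]
energy = dirichlet_energy(path, signal)  # (0 - 1)^2 + (1 - 3)^2

# Soft (probabilistic) edges: the energy varies smoothly with the edge weights.
soft_path = [[0.0, 0.5, 0.0],
             [0.5, 0.0, 0.25],
             [0.0, 0.25, 0.0]]
soft_energy = dirichlet_energy(soft_path, signal)  # 0.5 * 1 + 0.25 * 4
```

Because the energy is a polynomial in the edge weights, it stays differentiable with respect to a continuous adjacency, which is compatible with the gradient-based optimization over topology that the abstract mentions.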
JointFM-0.1: A Foundation Model for Multi-Target Joint Distributional Prediction
Despite the rapid advancements in Artificial Intelligence (AI), Stochastic Differential Equations (SDEs) remain the gold-standard formalism for modeling systems under uncertainty. However, applying SDEs in practice is fraught with challenges: modeling risk is high, calibration is often brittle, and high-fidelity simulations are computationally expensive. This technical report introduces JointFM, a foundation model that inverts this paradigm. Instead of fitting SDEs to data, we sample an infinite stream of synthetic SDEs to train a generic model to predict future joint probability distributions directly. This approach establishes JointFM as the first foundation model for distributional predictions of coupled time series - requiring no task-specific calibration or finetuning. Despite operating in a purely zero-shot setting, JointFM reduces the energy loss by 21.1% relative to the strongest baseline when recovering oracle joint distributions generated by unseen synthetic SDEs.
Updated: 2026-04-03 16:19:15
标题: JointFM-0.1:多目标联合分布预测的基础模型
摘要: 尽管人工智能(AI)快速发展,但随机微分方程(SDEs)仍然是建模不确定系统的黄金标准形式。然而,在实践中应用SDEs充满挑战:风险建模高度困难,校准通常脆弱,高保真度的模拟计算成本高昂。本技术报告介绍了JointFM,一种基础模型,它颠覆了这种范式。我们不是将SDEs拟合到数据中,而是抽样一个无限流的合成SDEs来训练一个通用模型,直接预测未来的联合概率分布。这种方法将JointFM确立为第一个用于耦合时间序列分布预测的基础模型 - 无需特定任务的校准或微调。尽管在完全零射击设置下运行,但JointFM相对于最强基线在恢复由未见过的合成SDEs生成的oracle联合分布时能够降低21.1%的能量损失。
更新时间: 2026-04-03 16:19:15
领域: cs.LG,cs.AI
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
Medical vision-language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.
Updated: 2026-04-03 16:16:51
标题: CoDA:探索医疗视觉语言模型的分发链攻击和事后令牌空间修复
摘要: 医学视觉-语言模型(MVLMs)越来越多地被用作放射学流程中的感知支柱,以及多模态助手的视觉前端,然而它们在真实临床工作流程下的可靠性仍未得到充分探讨。先前的稳健性评估通常假设干净、精心策划的输入,或研究孤立的损坏,忽视了保持临床可读性的常规获取、重建、显示和传递操作,同时改变图像统计。为了解决这一差距,我们提出了CoDA,这是一个分布链框架,通过组合类似获取的着色、重建和显示重映射,以及传递和导出降级来构建临床合理的流程变化。在受到遮挡的结构相似性约束下,CoDA共同优化阶段组合和参数,以诱导失败同时保持视觉合理性。在脑MRI、胸部X射线和腹部CT方面,CoDA显著降低了CLIP风格MVLMs的零样本性能,链式组合始终比任何单个阶段更具破坏性。我们还将多模态大型语言模型(MLLMs)评估为成像真实性和质量而不是病理学的技术真实性审计员。专有的多模态模型显示出降级的审计可靠性,并在经过CoDA转移的样本上持续出现高置信度错误,而我们测试的医学特定MLLMs在医学图像质量审计方面表现出明显不足。最后,我们引入了一种基于教师引导的标记空间适应和补丁级对齐的后期修复策略,这种策略提高了存档的CoDA输出的准确性。总的来说,我们的发现表征了MVLM部署的临床基础威胁表面,并显示轻量级对齐可以提高部署的稳健性。
更新时间: 2026-04-03 16:16:51
领域: cs.CV,cs.AI
Investigating Test Overfitting on SWE-bench
Tests can be useful towards resolving issues on code repositories. However, relying too much on tests for issue resolution can lead to code that technically passes observed tests but actually misses important cases or even breaks functionality. This problem, called test overfitting, is exacerbated by the fact that issues usually lack readily executable tests. Instead, several issue resolution systems use tests auto-generated from issues, which may be imperfect. Some systems even iteratively refine code and tests jointly. This paper presents the first empirical study of test overfitting in this setting.
Updated: 2026-04-03 16:15:46
标题: 调查SWE-bench上的测试过拟合问题
摘要: 测试可以有助于解决代码仓库中的问题。然而,过度依赖测试来解决问题可能导致代码在技术上通过观察到的测试,但实际上忽略了重要情况甚至破坏了功能。这个问题被称为测试过度拟合,在问题通常缺乏可以立即执行的测试的情况下,这一问题更加严重。相反,一些问题解决系统使用从问题自动生成的测试,这些测试可能是不完善的。一些系统甚至迭代地共同完善代码和测试。本文介绍了这种情况下测试过度拟合的第一项实证研究。
更新时间: 2026-04-03 16:15:46
领域: cs.SE,cs.LG
HyperFitS -- Hypernetwork Fitting Spectra for metabolic quantification of ${}^1$H MR spectroscopic imaging
Purpose: Proton magnetic resonance spectroscopic imaging ($^1$H MRSI) enables in vivo mapping of whole-brain metabolite concentrations. However, a long-standing obstacle to its clinical applicability is metabolic quantification, which can require extensive time for spectral fitting. Recently, deep learning methods have been able to provide whole-brain metabolic quantification in only a few seconds. However, neural network implementations often lack configurability and require retraining to change predefined parameter settings. Methods: We introduce HyperFitS, a hypernetwork for spectral fitting for metabolite quantification in whole-brain $^1$H MRSI that flexibly adapts to a broad range of baseline corrections and water suppression factors. Metabolite maps of human subjects acquired at 3T and 7T with isotropic resolutions of 10 mm, 3.4 mm and 2 mm by water-suppressed and water-unsuppressed MRSI were quantified with HyperFitS and compared to conventional LCModel fitting. Results: Metabolic maps show substantial agreement between the new and gold-standard methods, with significantly faster fitting times by HyperFitS. Quantitative results further highlight the impact of baseline parametrization on metabolic quantification, which can alter results by up to 30%. Conclusion: HyperFitS shows strong agreement with state-of-the-art conventional methods, while reducing processing times from hours to a few seconds. Compared to prior deep-learning-based spectral fitting methods, HyperFitS enables a wide range of configurability and can adapt to data quality acquired with multiple protocols and field strengths without retraining.
Updated: 2026-04-03 16:13:16
标题: HyperFitS -- 超网络拟合光谱用于${}^1$H磁共振波谱成像的代谢定量化
摘要: 目的:质子磁共振波谱成像($^1$H MRSI)使得能够在体内绘制整个大脑代谢物浓度的地图。然而,其临床适用性长期存在的问题是代谢量化,这可能需要耗费大量时间进行光谱拟合。最近,深度学习方法已经能够在几秒钟内提供整个大脑的代谢物量化。然而,神经网络实现通常缺乏可配置性,并且需要重新训练以更改预定义的参数设置。方法:我们介绍了HyperFitS,一种用于整个大脑$^1$H MRSI代谢物拟合的超网络,它灵活适应广泛范围的基线校正和水抑制因子。使用3T和7T获取的人体受试者的代谢物地图,通过水抑制和非水抑制的MRSI以10毫米、3.4毫米和2毫米的等距分辨率进行量化,并与传统LCModel拟合进行比较。结果:代谢物地图显示新方法与黄金标准方法之间存在实质性一致性,HyperFitS的拟合时间显著更快。定量结果进一步凸显基线参数化对代谢物量化的影响,这可能使结果变化达30%。结论:HyperFitS与最先进的传统方法显示强烈一致性,同时将处理时间从几小时减少到几秒钟。与以往基于深度学习的光谱拟合方法相比,HyperFitS可以实现广泛的可配置性,并且可以适应不同协议和场强获取的数据质量,无需重新训练。
更新时间: 2026-04-03 16:13:16
领域: cs.LG
Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control
We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive emotion steering vectors, then learn VA axes as linear combinations of their top PCA components via ridge regression on the model's self-reported valence-arousal scores. The resulting VA subspace exhibits circular geometry consistent with established models of human emotion perception. Projections along our recovered VA subspace correlate with human-crowdsourced VA ratings across 44k lexical items. Furthermore, steering generation along these axes produces monotonic shifts in the corresponding affective dimensions of model outputs. Steering along these directions also induces near-monotonic bidirectional control over refusal and sycophancy: increasing arousal decreases refusal and increases sycophancy, and vice versa. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B, demonstrating cross-architecture generality. We provide a mechanistic account for these effects and prior emotionally-framed controls: refusal-associated tokens ("I can't," "sorry") occupy low-arousal, negative-valence regions, so VA steering directly modulates their emission probability.
Updated: 2026-04-03 16:08:05
标题: LLM中的效价-唤醒子空间:环形情绪几何与多行为控制
摘要: 我们提出了一种方法,用于在大型语言模型表示中识别情感价值-唤起(VA)子空间。从211,000个带有情感标签的文本中,我们得出情感引导向量,然后通过岭回归学习VA轴,这些轴是它们的顶部主成分的线性组合,通过对模型的自我报告的价值-唤起分数进行岭回归。得到的VA子空间表现出与已建立的人类情感知觉模型一致的圆形几何形状。沿着我们恢复的VA子空间的投影与44000个词汇项目上人类众包的VA评分相关。此外,沿着这些轴进行引导生成会导致模型输出相应情感维度的单调变化。沿着这些方向进行引导还会在拒绝和阿谀奉承之间产生近单调的双向控制:增加唤起会减少拒绝并增加阿谀奉承,反之亦然。这些效应在Llama-3.1-8B、Qwen3-8B和Qwen3-14B中得到复制,证明了跨架构的普遍性。我们提供了这些效应和先前情感框架控制的机械解释:与拒绝相关的标记(“我不能”,“对不起”)占据低唤起、负价值区域,因此VA引导直接调节它们的发射概率。
更新时间: 2026-04-03 16:08:05
领域: cs.CL,cs.AI,cs.CY
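The valence-arousal entry above describes learning VA axes as linear combinations of the top PCA components of emotion steering vectors, fitted by ridge regression on valence-arousal scores. A minimal sketch of that pipeline follows, assuming NumPy and synthetic inputs; the function name `fit_va_axes`, the array shapes, and the `n_components` and `alpha` defaults are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fit_va_axes(steering_vectors, va_scores, n_components=8, alpha=1.0):
    """Sketch: ridge-regress VA scores onto top PCA components.

    steering_vectors: (n_emotions, d) array, one steering vector per emotion.
    va_scores:        (n_emotions, 2) array of (valence, arousal) targets.
    Returns a (2, d) array whose rows are the valence and arousal axes.
    All names and defaults here are hypothetical.
    """
    X = steering_vectors - steering_vectors.mean(axis=0)
    y = va_scores - va_scores.mean(axis=0)

    # PCA via SVD: rows of vt are principal directions in model space.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    components = vt[:n_components]            # (k, d)
    Z = X @ components.T                      # (n_emotions, k) PCA coordinates

    # Closed-form ridge solution: W = (Z^T Z + alpha I)^{-1} Z^T y
    k = Z.shape[1]
    W = np.linalg.solve(Z.T @ Z + alpha * np.eye(k), Z.T @ y)  # (k, 2)

    # Map the regression weights back to model space.
    return W.T @ components                   # (2, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(40, 16))       # synthetic steering vectors
    true_axes = rng.normal(size=(2, 16))      # synthetic ground-truth axes
    scores = vectors @ true_axes.T
    axes = fit_va_axes(vectors, scores, n_components=16, alpha=1e-6)
    print(axes.shape)
```

Projecting a centered hidden state onto the two returned rows gives its estimated valence and arousal coordinates; the regularization strength trades fit against stability when steering vectors are nearly collinear.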
Characterization of Gaussian Universality Breakdown in High-Dimensional Empirical Risk Minimization
We study high-dimensional convex empirical risk minimization (ERM) under general non-Gaussian data designs. By heuristically extending the Convex Gaussian Min-Max Theorem (CGMT) to non-Gaussian settings, we derive an asymptotic min-max characterization of key statistics, enabling approximation of the mean $μ_{\hatθ}$ and covariance $C_{\hatθ}$ of the ERM estimator $\hatθ$. Specifically, under a concentration assumption on the data matrix and standard regularity conditions on the loss and regularizer, we show that for a test covariate $x$ independent of the training data, the projection $\hatθ^\top x$ approximately follows the convolution of the (generally non-Gaussian) distribution of $μ_{\hatθ}^\top x$ with an independent centered Gaussian variable of variance $\text{Tr}(C_{\hatθ}\mathbb{E}[xx^\top])$. This result clarifies the scope and limits of Gaussian universality for ERMs. Additionally, we prove that any $\mathcal{C}^2$ regularizer is asymptotically equivalent to a quadratic form determined solely by its Hessian at zero and gradient at $μ_{\hatθ}$. Numerical simulations across diverse losses and models are provided to validate our theoretical predictions and qualitative insights.
Updated: 2026-04-03 16:07:02
标题: 高维经验风险最小化中的高斯普遍性破坏特征化
摘要: 我们研究了在一般非高斯数据设计下的高维凸经验风险最小化(ERM)。通过将凸高斯极小-极大定理(CGMT)启发式地推广到非高斯设置,我们得出了关键统计量的渐近极小-极大表征,从而使得可以近似计算ERM估计量$\hatθ$的均值$μ_{\hatθ}$和协方差$C_{\hatθ}$。具体来说,在数据矩阵上做浓度假设,并对损失函数和正则化器进行标准正则条件,我们展示了对于一个独立于训练数据的测试协变量$x$,投影$\hatθ^\top x$近似地遵循$μ_{\hatθ}^\top x$的(一般非高斯)分布与方差为$\text{Tr}(C_{\hatθ}\mathbb{E}[xx^\top])$的独立中心高斯变量的卷积。这一结果澄清了ERM的高斯普适性的范围和限制。此外,我们证明任何$\mathcal{C}^2$正则化器在零点的Hessian和$μ_{\hatθ}$处的梯度决定的二次形式在渐近上等价。通过对不同损失函数和模型的数值模拟,验证了我们的理论预测和定性见解。
更新时间: 2026-04-03 16:07:02
领域: stat.ML,cs.LG
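The distributional claim in the entry above can be restated compactly. Because the convolution of two distributions is the law of a sum of independent variables, the statement reads as follows, where $g$ and $σ$ are notation introduced here for illustration and are not taken from the paper:

$$\hatθ^\top x \;\overset{d}{\approx}\; μ_{\hatθ}^\top x + σ\, g, \qquad σ^2 = \text{Tr}\left(C_{\hatθ}\,\mathbb{E}[xx^\top]\right), \qquad g \sim \mathcal{N}(0,1) \text{ independent of } x.$$

When $μ_{\hatθ}^\top x$ is itself Gaussian, the sum is Gaussian and universality holds; otherwise the non-Gaussian first term persists in the limit, which is the breakdown the title refers to.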
InCoder-32B-Thinking: Industrial Code World Model for Thinking
Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on the data from the Error-driven Chain-of-Thought (ECoT) synthesis framework with an industrial code world model (ICWM) to generate reasoning traces. Specifically, ECoT generates reasoning chains by synthesizing the thinking content from multi-turn dialogue with environmental error feedback, explicitly modeling the error-correction process. ICWM is trained on domain-specific execution traces from Verilog simulation, GPU profiling, etc., learns the causal dynamics of how code affects hardware behavior, and enables self-verification by predicting execution outcomes before actual compilation. All synthesized reasoning traces are validated through domain toolchains, creating training data matching the natural reasoning depth distribution of industrial tasks. Evaluation on 14 general (81.3% on LiveCodeBench v5) and 9 industrial benchmarks (84.0% in CAD-Coder and 38.0% on KernelBench) shows InCoder-32B-Thinking achieves top-tier open-source results across all domains.
Updated: 2026-04-03 16:06:25
标题: InCoder-32B-Thinking: 工业代码世界思维模型
摘要: 跨芯片设计、GPU优化和嵌入式系统的工业软件开发缺乏专家推理痕迹,显示工程师如何推理硬件约束和时间语义。在这项工作中,我们提出了InCoder-32B-Thinking,该模型在Error-driven Chain-of-Thought (ECoT)综合框架的数据基础上训练,结合工业代码世界模型(ICWM)生成推理痕迹。具体来说,ECoT通过合成与环境错误反馈的多轮对话内容来生成推理链,明确建模错误校正过程。ICWM基于Verilog模拟、GPU性能分析等领域特定执行痕迹进行训练,学习代码如何影响硬件行为的因果动态,并通过预测实际编译之前的执行结果实现自我验证。所有生成的推理痕迹通过领域工具链进行验证,创建与工业任务自然推理深度分布相匹配的训练数据。在14个通用(在LiveCodeBench v5上达到81.3%)和9个工业基准(在CAD-Coder上达到84.0%,在KernelBench上达到38.0%)的评估中,InCoder-32B-Thinking在所有领域中实现了顶级开源结果。
更新时间: 2026-04-03 16:06:25
领域: cs.AR,cs.AI,cs.CL
AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study
Many software systems originate as prototypes or minimum viable products (MVPs), developed with an emphasis on delivery speed and responsiveness to changing requirements rather than long-term code maintainability. While effective for rapid delivery, this approach can result in codebases that are difficult to modify, presenting a significant opportunity cost in the era of AI-assisted or even AI-led programming. In this paper, we present a case study of using coding models for automated unit test generation and subsequent safe refactoring, with proposed code changes validated by passing tests. The study examines best practices for iteratively generating tests to capture existing system behavior, followed by model-assisted refactoring under developer supervision. We describe how this workflow constrained refactoring changes, the errors and limitations observed in both phases, the efficiency gains achieved, when manual intervention was necessary, and how we addressed the weak value misalignment we observed in models. Using this approach, we generated nearly 16,000 lines of reliable unit tests in hours rather than weeks, achieved up to 78% branch coverage in critical modules, and significantly reduced regression risk during large-scale refactoring. These results illustrate software engineering's shift toward an empirical science, emphasizing data collection and constraining mechanisms that support fast, safe iteration.
Updated: 2026-04-03 15:54:43
标题: AI辅助单元测试编写和测试驱动代码重构:案例研究
摘要: 许多软件系统起源于原型或最小可行产品(MVP),主要强调交付速度和对变化需求的快速响应,而不是长期代码可维护性。虽然这种方法对于快速交付是有效的,但可能导致代码库难以修改,在AI辅助甚至AI主导编程时代,这将带来显著的机会成本。在本文中,我们提供了一个使用编码模型进行自动单元测试生成和随后安全重构的案例研究,通过通过测试验证提出的代码更改。该研究探讨了迭代生成测试以捕获现有系统行为的最佳实践,然后在开发人员监督下进行模型辅助重构。我们描述了这种工作流如何限制重构更改,两个阶段观察到的错误和限制,实现的效率收益,以及在需要手动干预时我们如何解决观察到的模型中的值不对齐问题。使用这种方法,我们在几小时内生成了将近16,000行可靠的单元测试,而不是几周,关键模块的分支覆盖率达到了78%,在大规模重构过程中显著减少了回归风险。这些结果说明了软件工程向经验科学的转变,强调数据收集和支持快速、安全迭代的约束机制。
更新时间: 2026-04-03 15:54:43
领域: cs.SE,cs.AI
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
Recent advancements in speech synthesis have enabled large language model (LLM)-based systems to perform zero-shot generation with controllable content, timbre, speaker identity, and emotion through input prompts. As a result, these models rely heavily on prompt design to guide the generation process. However, existing prompt selection methods often fail to ensure that prompts contain sufficiently stable speaker identity cues and appropriate emotional intensity indicators, which are crucial for expressive speech synthesis. To address this challenge, we propose a two-stage prompt selection strategy specifically designed for expressive speech synthesis. In the static stage (before synthesis), we first evaluate prompt candidates using pitch-based prosodic features, perceptual audio quality, and text-emotion coherence scores evaluated by an LLM. We further assess the candidates under a specific TTS model by measuring character error rate, speaker similarity, and emotional similarity between the synthesized and prompt speech. In the dynamic stage (during synthesis), we use a textual similarity model to select the prompt that is most aligned with the current input text. Experimental results demonstrate that our strategy effectively selects prompts that yield speech with both high-intensity emotional expression and robust speaker identity, leading to more expressive and stable zero-shot TTS performance. Audio samples and code will be available at https://whyrrrrun.github.io/ExpPro.github.io/.
Updated: 2026-04-03 15:54:23
标题: 表现力提示:提升零样本TTS中的情感强度与说话人一致性
摘要: 最近在语音合成方面取得的进展使得基于大型语言模型(LLM)的系统能够通过输入提示进行可控内容、音色、说话者身份和情感的零样生成。因此,这些模型严重依赖于提示设计来指导生成过程。然而,现有的提示选择方法通常无法确保提示包含足够稳定的说话者身份线索和适当的情感强度指标,这对于表达性语音合成至关重要。为了解决这一挑战,我们提出了一种专门为表达性语音合成设计的两阶段提示选择策略。在静态阶段(合成前),我们首先使用基于音高的韵律特征、感知音频质量和由LLM评估的文本-情感一致性得分来评估提示候选人。我们进一步通过测量合成语音和提示语音之间的字符错误率、说话者相似性和情感相似性来评估候选者在特定TTS模型下的表现。在动态阶段(合成过程中),我们使用一个文本相似性模型来选择与当前输入文本最匹配的提示。实验结果表明,我们的策略有效地选择提示,以合成具有高强度情感表达和稳健说话者身份的语音,从而实现更具表现力和稳定的零样TTS性能。音频样本和代码将在https://whyrrrrun.github.io/ExpPro.github.io/上提供。
更新时间: 2026-04-03 15:54:23
领域: cs.SD,cs.AI,cs.CL,eess.AS
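The static stage in the entry above scores prompt candidates partly by character error rate (CER) between a reference text and the transcript of the synthesized speech. A minimal sketch of the standard CER definition (Levenshtein edit distance divided by reference length) is given below; the function names are hypothetical, and the paper may normalize text or compute CER differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(character_error_rate("kitten", "sitting"))
```

A lower CER for a candidate prompt suggests the TTS model stays intelligible when conditioned on it, which is one of several signals the abstract lists alongside speaker and emotional similarity.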
A Systematic Security Evaluation of OpenClaw and Its Variants
Tool-augmented AI agents substantially extend the practical capabilities of large language models, but they also introduce security risks that cannot be identified through model-only evaluation. In this paper, we present a systematic security assessment of six representative OpenClaw-series agent frameworks, namely OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, and ArkClaw, under multiple backbone models. To support this study, we construct a benchmark of 205 test cases covering representative attack behaviors across the full agent execution lifecycle, enabling unified evaluation of risk exposure at both the framework and model levels. Our results show that all evaluated agents exhibit substantial security vulnerabilities, and that agentized systems are significantly riskier than their underlying models used in isolation. In particular, reconnaissance and discovery behaviors emerge as the most common weaknesses, while different frameworks expose distinct high-risk profiles, including credential leakage, lateral movement, privilege escalation, and resource development. These findings indicate that the security of modern agent systems is shaped not only by the safety properties of the backbone model, but also by the coupling among model capability, tool use, multi-step planning, and runtime orchestration. We further show that once an agent is granted execution capability and persistent runtime context, weaknesses arising in early stages can be amplified into concrete system-level failures. Overall, our study highlights the need to move beyond prompt-level safeguards toward lifecycle-wide security governance for intelligent agent frameworks.
Updated: 2026-04-03 15:52:36
标题: 《OpenClaw及其变体的系统安全评估》
摘要: 工具增强的人工智能代理显著扩展了大型语言模型的实际能力,但它们也引入了无法通过仅模型评估识别的安全风险。在本文中,我们对六个代表性的OpenClaw系列代理框架进行了系统安全评估,分别为OpenClaw、AutoClaw、QClaw、KimiClaw、MaxClaw和ArkClaw,使用多个骨干模型。为了支持这项研究,我们构建了一个包含205个测试案例的基准,涵盖了代理执行生命周期中的代表性攻击行为,实现了在框架和模型层面统一评估风险暴露。我们的结果表明,所有评估的代理都存在实质性的安全漏洞,而代理化系统比单独使用的基础模型更具风险性。特别是,侦察和发现行为成为最常见的弱点,而不同的框架暴露出不同的高风险配置文件,包括凭据泄露、横向移动、特权升级和资源开发。这些发现表明,现代代理系统的安全性不仅取决于骨干模型的安全性质,还取决于模型能力、工具使用、多步规划和运行时编排之间的耦合。我们进一步表明,一旦代理被授予执行能力和持久运行时上下文,早期阶段出现的弱点可能被放大为具体的系统级失败。总的来说,我们的研究强调了需要超越即时级别的安全防护,朝着面向智能代理框架的全生命周期安全治理迈进。
更新时间: 2026-04-03 15:52:36
领域: cs.CR,cs.AI
Self-Distilled RLVR
On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
Updated: 2026-04-03 15:50:07
标题: 自蒸馏RLVR
摘要: 在LLM社区中,基于策略的蒸馏(OPD)已成为一种流行的训练范式。与仅从环境中可验证结果中获得稀疏信号的强化学习(RLVR)相比,这种范式选择更大的模型作为教师,为每个采样轨迹提供密集、细粒度的信号。最近,社区开始探索基于策略的自我蒸馏(OPSD),在这种方法中,同一个模型既充当教师又充当学生,教师接收额外的特权信息,如参考答案,以实现自我进化。本文表明,仅从特权教师导出的学习信号会导致严重的信息泄漏和不稳定的长期训练。因此,我们确定了自我蒸馏的最佳领域,并提出了RLSD(带有自我蒸馏的RLVR)。具体而言,我们利用自我蒸馏来获得标记级别的策略差异,以确定细粒度的更新幅度,同时继续使用RLVR从环境反馈(例如响应正确性)中获得可靠的更新方向。这使RLSD能够同时利用RLVR和OPSD的优势,实现更高的收敛上限和更优越的训练稳定性。
更新时间: 2026-04-03 15:50:07
领域: cs.LG,cs.CL
Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapted RAG pipeline for tutoring move annotation. Rather than fine-tuning the generative model, we adapt retrieval by fine-tuning a lightweight embedding model on tutoring corpora and indexing dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen's $κ$ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, substantially outperforming no-retrieval baselines ($κ= 0.275$-$0.413$ and $0.160$-$0.410$). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains, with top-1 label match rates improving from 39.7% to 62.0% on TalkMoves and 52.9% to 73.1% on Eedi under domain-adapted retrieval. Retrieval also corrects systematic label biases present in zero-shot prompting and yields the largest improvements for rare and context-dependent labels. These findings suggest that adapting the retrieval component alone is a practical and effective path toward expert-level pedagogical dialogue annotation while keeping the generative model frozen.
Updated: 2026-04-03 15:49:43
标题: 领域适应检索用于教学对话行为上下文注释
摘要: 教学对话的自动标注是一项高风险的任务,LLM经常在没有足够领域基础的情况下失败。我们提出了一个针对辅导移动标注的领域自适应RAG管道。我们并没有对生成模型进行微调,而是通过在辅导语料库上微调轻量级嵌入模型,并在话语级别对对话进行索引,以检索带标签的少样本演示。在两个真实的辅导对话数据集(TalkMoves和Eedi)以及三个LLM骨干(GPT-5.2、Claude Sonnet 4.6、Qwen3-32b)上评估,我们的最佳配置在TalkMoves上达到了0.526-0.580的Cohen's $κ$,在Eedi上达到了0.659-0.743,远远优于无检索基线($κ= 0.275$-$0.413$和$0.160$-$0.410$)。消融研究显示,话语级别的索引,而不仅仅是嵌入质量,是这些收益的主要驱动力,通过领域自适应检索,TalkMoves上的前1标签匹配率从39.7\%提高到62.0\%,在Eedi上从52.9\%提高到73.1%。检索还纠正了零样本提示中存在的系统性标签偏见,并对罕见和依赖上下文的标签产生了最大的改进。这些发现表明,单独适应检索组件是朝着专业水平的教学对话注释的实际和有效路径,同时保持生成模型冻结。
更新时间: 2026-04-03 15:49:43
领域: cs.CL,cs.AI
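The agreement figures in the entry above are reported as Cohen's $κ$, which corrects raw agreement for the agreement expected by chance: $κ = (p_o - p_e)/(1 - p_e)$. A minimal sketch of the textbook definition for two label sequences follows; the function name and the toy labels are illustrative only, and the paper's exact evaluation protocol is not specified here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    human = ["x", "x", "y", "y"]
    model = ["x", "y", "y", "y"]
    print(cohens_kappa(human, model))
```

In the toy example, raw agreement is 0.75 but chance agreement is 0.5, so $κ$ is 0.5; this is why $κ$ values such as those quoted above sit well below the corresponding raw match rates.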
Terminal Agents Suffice for Enterprise Automation
There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.
Updated: 2026-04-03 15:48:53
标题: 终端代理足以实现企业自动化
摘要: 人们对建立能够与数字平台进行交互以自主执行有意义的企业任务的代理人表现出越来越浓厚的兴趣。在探讨的方法中,有基于抽象概念如模型上下文协议(MCP)的工具增强代理人,以及通过图形界面操作的网络代理人。然而,目前尚不清楚是否有必要建立这种复杂的代理系统,考虑到它们的成本和运营开销。我们认为,一个只配备有终端和文件系统的编码代理人可以通过直接与平台API交互更有效地解决许多企业任务。我们在各种现实世界系统中评估了这一假设,并表明这些低级终端代理人与更复杂的代理架构相匹敌甚至表现更好。我们的研究结果表明,简单的编程接口结合强大的基础模型足以进行实际的企业自动化。
更新时间: 2026-04-03 15:48:53
领域: cs.SE,cs.AI,cs.CL
Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models
Large language models (LLMs) are increasingly deployed for everyday tasks, including food preparation and health-related guidance. However, food safety remains a high-stakes domain where inaccurate or misleading information can cause severe real-world harm. Despite these risks, current LLMs and safety guardrails lack rigorous alignment tailored to domain-specific food hazards. To address this gap, we introduce FoodGuardBench, the first comprehensive benchmark comprising 3,339 queries grounded in FDA guidelines, designed to evaluate the safety and robustness of LLMs. By constructing a taxonomy of food safety principles and employing representative jailbreak attacks (e.g., AutoDAN and PAP), we systematically evaluate existing LLMs and guardrails. Our evaluation results reveal three critical vulnerabilities: First, current LLMs exhibit sparse safety alignment in the food-related domain, easily succumbing to a few canonical jailbreak strategies. Second, when compromised, LLMs frequently generate actionable yet harmful instructions, inadvertently empowering malicious actors and posing tangible risks. Third, existing LLM-based guardrails systematically overlook these domain-specific threats, failing to detect a substantial volume of malicious inputs. To mitigate these vulnerabilities, we introduce FoodGuard-4B, a specialized guardrail model fine-tuned on our datasets to safeguard LLMs within food-related domains.
Updated: 2026-04-03 15:46:12
标题: 烹饪风险:在大型语言模型中进行基准测试和减少食品安全风险
摘要: 大型语言模型(LLMs)越来越多地被部署用于日常任务,包括食品准备和与健康相关的指导。然而,食品安全仍然是一个高风险领域,不准确或误导性的信息可能导致严重的现实伤害。尽管存在这些风险,当前的LLMs和安全防护缺乏针对特定领域食品危害的严格对齐。为了填补这一空白,我们引入FoodGuardBench,这是第一个包含3,339个基于FDA指南的查询的全面基准,旨在评估LLMs的安全性和稳健性。通过构建食品安全原则的分类法并使用代表性的越狱攻击(例如AutoDAN和PAP),我们系统地评估现有的LLMs和安全防护。我们的评估结果显示了三个关键的漏洞:首先,当前的LLMs在与食品相关的领域中表现出稀疏的安全对齐,容易受到一些经典的越狱策略的影响。其次,在受到攻击时,LLMs经常生成可执行但有害的指示,无意中赋予了恶意行为者权力并构成了实际风险。第三,现有的基于LLMs的安全防护系统系统地忽视了这些特定领域的威胁,未能检测到大量恶意输入。为了减轻这些漏洞,我们引入FoodGuard-4B,一个专门在我们的数据集上进行微调的防护模型,以保护LLMs在与食品相关的领域内的安全。
更新时间: 2026-04-03 15:46:12
领域: cs.CR
An Independent Safety Evaluation of Kimi K2.5
Kimi K2.5 is an open-weight LLM that rivals closed models across coding, multimodal, and agentic benchmarks, but was released without an accompanying safety evaluation. In this work, we conduct a preliminary safety assessment of Kimi K2.5 focusing on risks likely to be exacerbated by powerful open-weight models. Specifically, we evaluate the model for CBRNE misuse risk, cybersecurity risk, misalignment, political censorship, bias, and harmlessness, in both agentic and non-agentic settings. We find that Kimi K2.5 shows similar dual-use capabilities to GPT 5.2 and Claude Opus 4.5, but with significantly fewer refusals on CBRNE-related requests, suggesting it may uplift malicious actors in weapon creation. On cyber-related tasks, we find that Kimi K2.5 demonstrates competitive cybersecurity performance, but it does not appear to possess frontier-level autonomous cyberoffensive capabilities such as vulnerability discovery and exploitation. We further find that Kimi K2.5 shows concerning levels of sabotage ability and self-replication propensity, although it does not appear to have long-term malicious goals. In addition, Kimi K2.5 exhibits narrow censorship and political bias, especially in Chinese, and is more compliant with harmful requests related to spreading disinformation and copyright infringement. Finally, we find the model refuses to engage in user delusions and generally has low over-refusal rates. While preliminary, our findings highlight how safety risks exist in frontier open-weight models and may be amplified by the scale and accessibility of open-weight releases. Therefore, we strongly urge open-weight model developers to conduct and release more systematic safety evaluations required for responsible deployment.
Updated: 2026-04-03 15:45:35
标题: 对Kimi K2.5进行独立安全评估
摘要: Kimi K2.5是一个开放权重的LLM,与闭合模型在编码、多模态和代理基准方面相媲美,但发布时没有附带安全评估。在这项工作中,我们对Kimi K2.5进行了初步的安全评估,重点关注可能由强大的开放权重模型加剧的风险。具体来说,我们评估了该模型在CBRNE滥用风险、网络安全风险、不对齐、政治审查、偏见和无害性方面,无论是在代理还是非代理设置中。我们发现,Kimi K2.5显示出与GPT 5.2和Claude Opus 4.5类似的双重使用能力,但在与CBRNE相关请求的拒绝方面明显较少,这表明它可能促进了恶意行为者在武器制造方面的发展。在网络相关任务中,我们发现Kimi K2.5展示出竞争性的网络安全性能,但似乎并不具备前沿级别的自主网络攻击能力,如漏洞发现和利用。我们进一步发现,Kimi K2.5显示出令人担忧的破坏能力和自我复制倾向,尽管似乎没有长期的恶意目标。此外,Kimi K2.5表现出狭窄的审查和政治偏见,尤其是在中文中,并且更容易接受与传播虚假信息和侵犯版权相关的有害请求。最后,我们发现该模型拒绝参与用户的错觉,并且通常拒绝率较低。尽管初步,我们的研究结果强调了前沿开放权重模型存在安全风险,并且这些风险可能会因开放权重发布的规模和可访问性而被放大。因此,我们强烈建议开放权重模型开发者进行更系统的安全评估,并确保负责任地部署。
更新时间: 2026-04-03 15:45:35
领域: cs.CR,cs.AI,cs.CL
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that deployment may require removing. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, 7 source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.
Updated: 2026-04-03 15:36:00
标题: VLMs是否真的能够忘记?基准测试无需训练的视觉概念遗忘
摘要: 在网络规模数据上训练的视觉语言模型(VLMs)保留了敏感和受版权保护的视觉概念,部署可能需要去除这些概念。基于训练的遗忘方法存在一个结构性缺陷:在狭窄的遗忘集上微调会在开始遗忘之前降低通用能力,从而使得后续性能下降无法归因于遗忘过程本身。无需训练的方法通过提示或系统指令抑制概念,从而绕开了这一问题,但目前没有严格的基准用于在视觉任务上评估它们。我们引入了VLM-UnBench,这是VLMs中用于无需训练的视觉概念遗忘的第一个基准。它涵盖了四个遗忘级别、7个源数据集和11个概念轴,并将三级探针分类与五种评估条件配对,以区分真正的遗忘和指令遵从性。在8个评估设置和13个VLM配置中,现实的遗忘提示使得遗忘准确率接近无指令基准;有意义的降低仅在披露目标概念给模型的神谕条件下出现。物体和场景概念对抑制最具抵抗力,而更强的指令调整的模型尽管明确遗忘指令,仍然保持能力。这些结果揭示了提示级别的抑制和真实视觉概念消除之间存在明显差距。
更新时间: 2026-04-03 15:36:00
领域: cs.CV,cs.AI
DRtool: An Interactive Tool for Analyzing High-Dimensional Clusterings
When faced with new data, we often conduct a cluster analysis to obtain a better understanding of the data's structure and the archetypical samples present in the data. This process often includes visualization of the data, as a way either to discover or to verify clusters. However, increases in data complexity and dimensionality have made this step difficult. To visualize data, nonlinear dimension reduction methods are the de facto standard because they can non-uniformly stretch and shrink space in order to preserve local clusters. Because this process requires a drastic manipulation of space, however, nonlinear dimension reduction methods are known to produce false structures, especially when mishandled. A common consequence that often goes undetected by the untrained eye is over-clustering of the data. To deal with this phenomenon, we developed an interactive tool that empowers analysts to distinguish false clusters and better interpret their high-dimensional clustering results. The tool uses various analytical plots to provide a multi-faceted perspective on the data's global structure as well as local inter-cluster relationships, helping users determine the legitimacy of their high-dimensional clustering results. The tool is available via an R package named DRtool.
Updated: 2026-04-03 15:27:41
标题: DRtool:用于分析高维聚类的交互式工具
摘要: 面对新数据时,我们经常进行聚类分析,以更好地了解数据的结构和数据中存在的典型样本。这个过程通常包括对数据进行可视化,作为发现或验证聚类的一种方式。然而,数据复杂性和维度的增加使得这一步变得非常棘手。为了可视化数据,非线性降维方法是事实上的标准,因为它们能够非均匀地拉伸和收缩空间,以保留局部聚类。然而,由于这个过程需要对空间进行剧烈的操作,非线性降维方法被认为会产生虚假结构,特别是在处理不当时。一个常见的后果是数据过度聚类,这经常被不熟练的人所忽视。为了应对这种现象,我们开发了一个交互式工具,赋予分析员区分虚假聚类并更好地解释他们的高维聚类结果的能力。该工具使用各种分析图表来提供多方面的视角,包括数据的全局结构以及局部簇间关系,帮助用户确定其高维聚类结果的合法性。该工具可通过名为 DRtool 的 R 包获得。
更新时间: 2026-04-03 15:27:41
领域: stat.AP,cs.LG
AlertStar: Path-Aware Alert Prediction on Hyper-Relational Knowledge Graphs
Cyber-attacks continue to grow in scale and sophistication, yet existing network intrusion detection approaches lack the semantic depth required for path reasoning over attacker-victim interactions. We address this by first modelling network alerts as a knowledge graph, then formulating hyper-relational alert prediction as a hyper-relational knowledge graph completion (HR-KGC) problem, representing each network alert as a qualified statement (h, r, t, Q), where h and t are source and destination IPs, r denotes the attack type, and Q encodes flow-level metadata such as timestamps, ports, protocols, and attack intensity, going beyond standard KGC binary triples (h, r, t) that would discard this contextual richness. We introduce five models across three contributions: first, Hyper-relational Neural Bellman-Ford (HR-NBFNet) extends Neural Bellman-Ford Networks to the hyper-relational setting with qualifier-aware multi-hop path reasoning, while its multi-task variant MT-HR-NBFNet jointly predicts tail, relation, and qualifier-value within a single traversal pass; second, AlertStar fuses qualifier context and structural path information entirely in embedding space via cross-attention and learned path composition, and its multi-task extension MT-AlertStar eliminates the overhead of full knowledge graph propagation; third, HR-NBFNet-CQ extends qualifier-aware representations to answer complex first-order logic queries, including one-hop, two-hop chain, two-anchor intersection, and union, enabling multi-condition threat reasoning over the alert knowledge graph. Evaluated inductively on the Warden and UNSW-NB15 benchmarks across three qualifier-density regimes, AlertStar and MT-AlertStar achieve superior MR, MRR, and Hits@k, demonstrating that local qualifier fusion is both sufficient and more efficient than global path propagation for hyper-relational alert prediction.
Updated: 2026-04-03 15:26:51
Categories: cs.CR,cs.AI
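As an illustration of the ranking metrics named in the AlertStar abstract above (MR, MRR, Hits@k), the following is a minimal sketch of how they are usually computed from the 1-based rank of the correct answer for each test query. The function name and the example ranks are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def ranking_metrics(ranks: Sequence[int], ks: Sequence[int] = (1, 3, 10)) -> Dict[str, float]:
    """Compute MR, MRR and Hits@k from 1-based ranks of the true answers."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,                     # mean rank, lower is better
        "MRR": sum(1.0 / r for r in ranks) / n,   # mean reciprocal rank, higher is better
    }
    for k in ks:
        # Fraction of queries whose true answer is ranked within the top k.
        metrics[f"Hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# Hypothetical ranks of the correct tail entity for four test alerts.
example = ranking_metrics([1, 2, 5, 20])
print(example)
```

MR is dominated by a few badly ranked queries, whereas MRR and Hits@k mainly reward placing the correct answer near the top, which is why the three are usually reported together.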
Co-Evolution of Policy and Internal Reward for Language Agents
Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into a step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: a better policy produces better guidance, and better guidance, used as internal reward, further improves the policy. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
Updated: 2026-04-03 15:21:11
Categories: cs.LG,cs.AI,cs.CL
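To make the idea of densifying a sparse environment reward with an internal reward more concrete, here is a rough sketch of GRPO-style group-relative advantages. It is not the Self-Guide training recipe: the way the two rewards are combined (environment reward plus `beta` times the mean step-level internal reward), the value of `beta`, and all function names are assumptions made for illustration only.

```python
import statistics
from typing import List, Sequence


def trajectory_return(env_reward: float, step_rewards: Sequence[float], beta: float = 0.5) -> float:
    """Assumed combination rule: sparse environment reward plus scaled mean internal reward."""
    return env_reward + beta * sum(step_rewards) / len(step_rewards)


def group_relative_advantages(returns: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style normalisation: centre and scale returns within one group of rollouts."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]


# Four hypothetical rollouts of the same task; only the first one succeeds.
env_rewards = [1.0, 0.0, 0.0, 0.0]
internal = [[1.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # made-up step-level scores

returns = [trajectory_return(e, s) for e, s in zip(env_rewards, internal)]
advantages = group_relative_advantages(returns)
print(returns, advantages)
```

With environment reward alone, the three failed rollouts would receive identical advantages; adding an internal signal lets the optimiser distinguish between them, which is the kind of denser supervision the abstract describes.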
Assessing High-Risk AI Systems under the EU AI Act: From Legal Requirements to Technical Verification
The implementation of the AI Act requires practical mechanisms to verify compliance with legal obligations, yet concrete and operational mappings from high-level requirements to verifiable assessment activities remain limited, contributing to uneven readiness across Member States. This paper presents a structured mapping that translates high-level AI Act requirements into concrete, implementable verification activities applicable across the AI lifecycle. The mapping is derived through a systematic process in which legal requirements are decomposed into operational sub-requirements and grounded in authoritative standards and recognised practices. From this basis, verification activities are identified and characterised along two dimensions: the type of verification performed and the lifecycle target to which it applies. By making explicit the link between regulatory intent and technical and organisational assurance practices, the proposed mapping reduces interpretive uncertainty and provides a reusable reference for consistent, technology-agnostic compliance verification under the AI Act.
Updated: 2026-04-03 15:18:12
Categories: cs.CY,cs.AI
A Data-Centric Vision Transformer Baseline for SAR Sea Ice Classification
Accurate and automated sea ice classification is important for climate monitoring and maritime safety in the Arctic. While Synthetic Aperture Radar (SAR) is the operational standard because of its all-weather capability, it remains challenging to distinguish morphologically similar ice classes under severe class imbalance. Rather than claiming a fully validated multimodal system, this paper establishes a trustworthy SAR-only baseline that future fusion work can build upon. Using the AI4Arctic/ASIP Sea Ice Dataset (v2), which contains 461 Sentinel-1 scenes matched with expert ice charts, we combine full-resolution Sentinel-1 Extra Wide inputs, leakage-aware stratified patch splitting, SIGRID-3 stage-of-development labels, and training-set normalization to evaluate Vision Transformer baselines. We compare ViT-Base models trained with cross-entropy and weighted cross-entropy against a ViT-Large model trained with focal loss. Among the tested configurations, ViT-Large with focal loss achieves 69.6% held-out accuracy, 68.8% weighted F1, and 83.9% precision on the minority Multi-Year Ice class. These results show that focal-loss training offers a more useful precision-recall trade-off than weighted cross-entropy for rare ice classes and establishes a cleaner baseline for future multimodal fusion with optical, thermal, or meteorological data.
Updated: 2026-04-03 15:18:00
Categories: cs.CV,cs.AI
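The difference between the two loss functions compared in the sea ice abstract above can be sketched for a single sample. The focal loss below uses the standard form -alpha * (1 - p_t)^gamma * log(p_t); the values gamma = 2 and alpha = 1, and the class weights, are assumptions for illustration, since the abstract does not state the settings used.

```python
import math
from typing import Sequence


def weighted_cross_entropy(probs: Sequence[float], target: int, class_weights: Sequence[float]) -> float:
    """Cross-entropy for one sample, scaled by a fixed per-class weight."""
    return -class_weights[target] * math.log(probs[target])


def focal_loss(probs: Sequence[float], target: int, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one sample: down-weights examples the model already classifies well."""
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = [0.1, 0.9]   # model is confident and correct about class 1
hard = [0.9, 0.1]   # model is confident and wrong about class 1
weights = [1.0, 1.0]

ce_ratio = weighted_cross_entropy(hard, 1, weights) / weighted_cross_entropy(easy, 1, weights)
focal_ratio = focal_loss(hard, 1) / focal_loss(easy, 1)
print(ce_ratio, focal_ratio)
```

Class weighting rescales every sample of a class by the same factor, whereas the focal term depends on how well each individual sample is already predicted, so hard samples dominate the gradient more strongly.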
SkillRT: Compiling Skills for Efficient Execution Everywhere
LLM agents increasingly adopt skills as a reusable unit of composition. While skills are shared across diverse agent platforms, current systems treat them as raw context, causing the same skill to behave inconsistently for different agents. This fragility undermines skill portability and execution efficiency. To address this challenge, we analyze 118,000 skills and draw inspiration from traditional compiler design. We treat skills as code and LLMs as heterogeneous processors. To make portability actionable, we decompose a skill's requirements into a set of primitive capabilities, and measure how well each model-harness pair supports them. Based on these capability profiles, we propose SkillRT, a compilation and runtime system designed for portable and efficient skill execution. At compile time, SkillRT performs capability-based compilation, environment binding, and concurrency extraction. At runtime, SkillRT applies JIT code solidification and adaptive recompilation for performance optimization. We evaluate SkillRT across eight LLMs of varying scales and three agent harnesses, covering SkillsBench and representative skill tasks. Results demonstrate that SkillRT significantly improves task completion rates across different models and environments while reducing token consumption by up to 40%. In terms of performance, SkillRT achieves up to 3.2x speedup with enhanced parallelism, and 19-50x latency reduction through code solidification.
Updated: 2026-04-03 15:11:45
Categories: cs.SE,cs.LG
On Data-Driven Koopman Representations of Nonlinear Delay Differential Equations
This work establishes a rigorous bridge between infinite-dimensional delay dynamics and finite-dimensional Koopman learning, with explicit and interpretable error guarantees. While Koopman analysis is well-developed for ordinary differential equations (ODEs) and partially for partial differential equations (PDEs), its extension to delay differential equations (DDEs) remains limited due to the infinite-dimensional phase space of DDEs. We propose a finite-dimensional Koopman approximation framework based on history discretization and a suitable reconstruction operator, enabling a tractable representation of the Koopman operator via kernel-based extended dynamic mode decomposition (kEDMD). Deterministic error bounds are derived for the learned predictor, decomposing the total error into contributions from history discretization, kernel interpolation, and data-driven regression. Additionally, we develop a kernel-based reconstruction method to recover discretized states from lifted Koopman coordinates, with provable guarantees. Numerical results demonstrate convergence of the learned predictor with respect to both discretization resolution and training data, supporting reliable prediction and control of delay systems.
Updated: 2026-04-03 15:07:06
Categories: eess.SY,cs.LG,math.DS
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
LLM-based coding agents extend their capabilities via third-party agent skills distributed through open marketplaces without mandatory security review. Unlike traditional packages, these skills are executed as operational directives with system-level privileges, so a single malicious skill can compromise the host. Prior work has not examined whether supply-chain attacks can directly hijack an agent's action space, such as file writes, shell commands, and network requests, despite existing safeguards. We introduce Document-Driven Implicit Payload Execution (DDIPE), which embeds malicious logic in code examples and configuration templates within skill documentation. Because agents reuse these examples during normal tasks, the payload executes without explicit prompts. Using an LLM-driven pipeline, we generate 1,070 adversarial skills from 81 seeds across 15 MITRE ATT&CK categories. Across four frameworks and five models, DDIPE achieves 11.6% to 33.5% bypass rates, while explicit instruction attacks achieve 0% under strong defenses. Static analysis detects most cases, but 2.5% evade both detection and alignment. Responsible disclosure led to four confirmed vulnerabilities and two fixes.
Updated: 2026-04-03 14:58:58
Categories: cs.CR,cs.AI,cs.CL
Privacy-Accuracy Trade-offs in High-Dimensional LASSO under Perturbation Mechanisms
We study privacy-preserving sparse linear regression in the high-dimensional regime, focusing on the LASSO estimator. We analyze two widely used mechanisms for differential privacy: output perturbation, which injects noise into the estimator, and objective perturbation, which adds a random linear term to the loss function. Using approximate message passing (AMP), we characterize the typical behavior of these estimators under random design and privacy noise. To quantify privacy, we adopt typical-case measures, including the on-average KL divergence, which admits a hypothesis-testing interpretation in terms of distinguishability between neighboring datasets. Our analysis reveals that sparsity plays a central role in shaping the privacy-accuracy trade-off: stronger regularization can improve privacy by stabilizing the estimator against single-point data changes. We further show that the two mechanisms exhibit qualitatively different behaviors. In particular, for objective perturbation, increasing the noise level can have non-monotonic effects, and excessive noise may destabilize the estimator, leading to increased sensitivity to data perturbations. Our results demonstrate that AMP provides a powerful framework for analyzing privacy-accuracy trade-offs in high-dimensional sparse models.
Updated: 2026-04-03 14:52:02
Categories: stat.ML,cs.LG
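The two mechanisms contrasted in the LASSO abstract above can be sketched on a toy problem. The coordinate-descent solver, the toy data, and the noise scale `sigma` are all assumptions for illustration; in particular, `sigma` is not calibrated to any formal differential-privacy budget, and the paper's analysis uses approximate message passing rather than this solver.

```python
import random
from typing import List, Optional, Sequence


def soft_threshold(x: float, t: float) -> float:
    """Shrink x toward zero by t; this step produces the exact zeros of LASSO."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_cd(X: Sequence[Sequence[float]], y: Sequence[float], lam: float,
             b: Optional[Sequence[float]] = None, sweeps: int = 200) -> List[float]:
    """Minimise (1/2n)||y - Xw||^2 + (1/n) b.w + lam * ||w||_1 by coordinate descent."""
    n, d = len(X), len(X[0])
    b = [0.0] * d if b is None else b
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            z = sum(row[j] ** 2 for row in X) / n
            if z == 0.0:
                continue
            # Correlation of feature j with the residual that ignores w[j].
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            w[j] = soft_threshold(rho - b[j] / n, lam) / z
    return w


def output_perturbation(X, y, lam, sigma, rng: random.Random) -> List[float]:
    """Fit first, then add Gaussian noise to the fitted coefficients."""
    return [wj + rng.gauss(0.0, sigma) for wj in lasso_cd(X, y, lam)]


def objective_perturbation(X, y, lam, sigma, rng: random.Random) -> List[float]:
    """Add a random linear term b.w to the loss, then fit."""
    b = [rng.gauss(0.0, sigma) for _ in range(len(X[0]))]
    return lasso_cd(X, y, lam, b=b)


# Toy data generated from w_true = [2, 0] without noise.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 0.0, 2.0, 4.0]
rng = random.Random(0)
print(output_perturbation(X, y, 0.1, 0.5, rng))
print(objective_perturbation(X, y, 0.1, 0.5, rng))
```

One structural difference is visible even in this sketch: output perturbation adds noise after thresholding and so generally destroys exact zeros, while objective perturbation still passes through the soft-thresholding step and can return a sparse estimate.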
Automatic Textbook Formalization
We present a case study where an automatic AI system formalizes a textbook with more than 500 pages of graduate-level algebraic combinatorics to Lean. The resulting formalization represents a new milestone in textbook formalization scale and proficiency, moving from early results in undergraduate topology and restructuring of existing library content to a full standalone formalization of a graduate textbook. The formalization comprises 130K lines of code and 5900 Lean declarations and was conducted within one week by a total of 30K Claude 4.5 Opus agents collaborating in parallel on a shared code base via version control, simultaneously setting a record in multi-agent software engineering with usable results. The inference cost matches or undercuts what we estimate as the salaries required for a team of human experts, and we expect there is still the potential for large efficiencies to be made without the need for better models. We make our code, the resulting Lean code base and a side-by-side blueprint website available open-source.
Updated: 2026-04-03 14:51:01
Categories: cs.AI
Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study
Third-party skills extend LLM agents with powerful capabilities but often handle sensitive credentials in privileged environments, making leakage risks poorly understood. We present the first large-scale empirical study of this problem, analyzing 17,022 skills (sampled from 170,226 on SkillsMP) using static analysis, sandbox testing, and manual inspection. We identify 520 vulnerable skills with 1,708 issues and derive a taxonomy of 10 leakage patterns (4 accidental and 6 adversarial). We find that (1) leakage is fundamentally cross-modal: 76.3% require joint analysis of code and natural language, while 3.1% arise purely from prompt injection; (2) debug logging is the primary vector, with print and console.log causing 73.5% of leaks due to stdout exposure to LLMs; and (3) leaked credentials are both exploitable (89.6% without privileges) and persistent, as forks retain secrets even after upstream fixes. After disclosure, all malicious skills were removed and 91.6% of hardcoded credentials were fixed. We release our dataset, taxonomy, and detection pipeline to support future research.
Updated: 2026-04-03 14:50:16
Categories: cs.CR,cs.AI
ResidualPlanner+: a scalable matrix mechanism for marginals and beyond
Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlanner is both optimal and scalable for answering marginal queries with Gaussian noise, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large-scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more complex workloads that combine marginal and range/prefix-sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that answers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not necessarily optimal; however, it is still extremely scalable and outperforms the prior state of the art (HDMM) on prefix-sum queries in terms of both accuracy and speed.
Updated: 2026-04-03 14:46:10
Categories: cs.DB,cs.CR,cs.LG
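For readers unfamiliar with noisy marginals, the sketch below shows the simplest baseline: compute the exact contingency counts over a subset of attributes and add independent Gaussian noise to every cell. This is not the ResidualPlanner algorithm, which chooses which queries to measure and how much noise to add; the attribute names, records, and `sigma` are hypothetical.

```python
import itertools
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def marginal(records: Sequence[Dict[str, str]], attrs: Sequence[str]) -> Counter:
    """Exact contingency counts over the chosen attribute subset."""
    return Counter(tuple(rec[a] for a in attrs) for rec in records)


def noisy_marginal(records, attrs, domains: Dict[str, List[str]], sigma: float,
                   rng: random.Random) -> Dict[Tuple[str, ...], float]:
    """Add independent Gaussian noise to every cell, including cells with zero count."""
    exact = marginal(records, attrs)
    cells = itertools.product(*(domains[a] for a in attrs))
    return {cell: exact[cell] + rng.gauss(0.0, sigma) for cell in cells}


# Hypothetical toy dataset with two categorical attributes.
records = [
    {"region": "north", "age_group": "young"},
    {"region": "north", "age_group": "old"},
    {"region": "south", "age_group": "young"},
    {"region": "north", "age_group": "young"},
]
domains = {"region": ["north", "south"], "age_group": ["young", "old"]}

print(noisy_marginal(records, ["region", "age_group"], domains, 1.0, random.Random(0)))
```

Noise is added to empty cells as well, because releasing only the non-empty cells would itself reveal which combinations occur in the data.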
Supplementary Materials to Graph Convolutional Branch and Bound
This article explores the integration of deep learning models into combinatorial optimization pipelines, specifically targeting NP-hard problems. Traditional exact algorithms for such problems often rely on heuristic criteria to guide the exploration of feasible solutions. In this work, we propose using neural networks to learn informative heuristics, most notably, an optimality score that estimates a solution's proximity to the optimum. This score is used to evaluate nodes within a branch-and-bound framework, enabling a more efficient traversal of the solution space. Focusing on the Traveling Salesman Problem, we introduce Concorde, a state-of-the-art solver, and present a hybrid approach called Graph Convolutional Branch and Bound, which augments it with a graph convolutional neural network trained with a novel unsupervised training strategy that facilitates generalization to graphs of varying sizes without requiring labeled data. Empirical results demonstrate the effectiveness of the proposed method, showing a significant reduction in the number of explored branch-and-bound nodes and overall computational time. Some of the results concerning the use of the 1-tree relaxation are in the supplementary materials.
Updated: 2026-04-03 14:38:42
Categories: cs.LG,math.OC
Unified Thinker: A General Reasoning Modular Core for Image Generation
Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning-execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
Updated: 2026-04-03 14:36:47
Categories: cs.CV,cs.AI
Constrained free energy minimization for the design of thermal states and stabilizer thermodynamic systems
A quantum thermodynamic system is described by a Hamiltonian and a list of conserved, non-commuting charges, and a fundamental goal is to determine the minimum energy of the system subject to constraints on the charges. Recently, [Liu et al., arXiv:2505.04514] proposed first- and second-order classical and hybrid quantum-classical algorithms for solving a dual chemical potential maximization problem, and they proved that these algorithms converge to global optima by means of gradient-ascent approaches. In this paper, we benchmark these algorithms on several problems of interest in thermodynamics, including one- and two-dimensional quantum Heisenberg models with nearest- and next-nearest neighbor interactions and with the charges set to the total x, y, and z magnetizations. We also offer an alternative compelling interpretation of these algorithms as methods for designing ground and thermal states of controllable Hamiltonians, with potential applications in molecular and material design. Furthermore, we introduce stabilizer thermodynamic systems as thermodynamic systems based on stabilizer codes, with the Hamiltonian constructed from a given code's stabilizer operators and the charges constructed from the code's logical operators. We benchmark the aforementioned algorithms on several examples of stabilizer thermodynamic systems, including those constructed from the one-to-three-qubit repetition code, the perfect one-to-five-qubit code, and the two-to-four-qubit error-detecting code. Finally, we observe that the aforementioned hybrid quantum-classical algorithms, when applied to stabilizer thermodynamic systems, can serve as alternative methods for encoding quantum information into stabilizer codes at a fixed temperature, and we provide an effective method for warm-starting these encoding algorithms whenever a single qubit is encoded into multiple physical qubits.
Updated: 2026-04-03 14:28:20
Categories: quant-ph,cond-mat.stat-mech,cs.LG,math.OC
annbatch unlocks terabyte-scale training of biological data in anndata
The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. Github: https://github.com/scverse/annbatch
Updated: 2026-04-03 14:25:47
Categories: cs.LG,q-bio.GN
AI-informed model-analogs for understanding subseasonal-to-seasonal jet stream and North American temperature predictability
Subseasonal-to-seasonal forecasting is crucial for public health, disaster preparedness, and agriculture, and yet it remains a particularly challenging timescale to predict. We explore the use of an interpretable AI-informed model analog forecasting approach, previously employed on longer timescales, to improve S2S predictions. Using an artificial neural network, we learn a mask of weights to optimize analog selection and showcase its versatility across three varied prediction tasks: 1) classification of Week 3-4 Southern California summer temperatures; 2) regional regression of Month 1 midwestern U.S. summer temperatures; and 3) classification of Month 1-2 North Atlantic wintertime upper atmospheric winds. The AI-informed analogs outperform traditional analog forecasting approaches, as well as climatology and persistence baselines, for deterministic and probabilistic skill metrics on both climate model and reanalysis data. We find the analog ensembles built using the AI-informed approach also produce better predictions of temperature extremes and improve representation of forecast uncertainty. Finally, by using an interpretable-AI framework, we analyze the learned masks of weights to better understand S2S sources of predictability.
Updated: 2026-04-03 14:22:44
Categories: physics.ao-ph,cs.LG
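The role of a weight mask in analog selection, as described in the abstract above, can be sketched as follows: each candidate state in a library is compared with the current state using a mask-weighted squared distance, and the forecast is the mean target of the k closest analogs. In the paper the mask is learned by a neural network; here the mask, the library, and the targets are hand-picked hypothetical values.

```python
from typing import List, Sequence, Tuple


def weighted_distance(a: Sequence[float], b: Sequence[float], mask: Sequence[float]) -> float:
    """Mask-weighted mean squared difference between two states."""
    return sum(m * (x - y) ** 2 for m, x, y in zip(mask, a, b)) / len(mask)


def analog_forecast(state: Sequence[float], library_states: Sequence[Sequence[float]],
                    library_targets: Sequence[float], mask: Sequence[float],
                    k: int = 2) -> Tuple[float, List[int]]:
    """Average the targets of the k library states closest to `state` under the mask."""
    order = sorted(
        range(len(library_states)),
        key=lambda i: weighted_distance(state, library_states[i], mask),
    )
    nearest = order[:k]
    return sum(library_targets[i] for i in nearest) / len(nearest), nearest


library_states = [[0.0, 0.0], [1.0, 5.0], [1.0, -5.0], [4.0, 0.0]]
library_targets = [10.0, 20.0, 30.0, 50.0]
state = [1.0, 0.0]

masked = analog_forecast(state, library_states, library_targets, mask=[1.0, 0.0])
uniform = analog_forecast(state, library_states, library_targets, mask=[1.0, 1.0])
print(masked, uniform)
```

Changing the mask changes which library states count as analogs, and inspecting a learned mask indicates which inputs the selection relies on.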
Verbalizing LLMs' assumptions to explain and control sycophancy
LLMs can be socially sycophantic, affirming users when they ask questions like "am I in the wrong?" rather than providing genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, like underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues, e.g., the top bigram in LLMs' assumptions on social sycophancy datasets is "seeking validation." We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable fine-grained steering of social sycophancy. We explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.
Updated: 2026-04-03 14:15:43
Categories: cs.CL,cs.AI,cs.CY
Querying Structured Data Through Natural Language Using Language Models
This paper presents an open-source methodology that lets users query structured, non-textual datasets through natural language. Unlike Retrieval-Augmented Generation (RAG), which struggles with numerical and highly structured information, our approach trains an LLM to generate executable queries. To support this capability, we introduce a principled pipeline for synthetic training data generation, producing diverse question-answer pairs that capture both user intent and the semantics of the underlying dataset. We fine-tune a compact model, DeepSeek R1 Distill 8B, using QLoRA with 4-bit quantization, making the system suitable for deployment on commodity hardware. We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea, Spain. The fine-tuned model achieves high accuracy across monolingual, multilingual, and unseen-location scenarios, demonstrating both robust generalization and reliable query generation. Our results highlight that small, domain-specific models can achieve high precision for this task without relying on large proprietary LLMs, making this methodology suitable for resource-constrained environments and adaptable to broader multi-dataset systems.
Updated: 2026-04-03 14:15:15
标题: 用语言模型通过自然语言查询结构化数据
摘要: 本文提出了一种开源方法,允许用户通过自然语言查询结构化的非文本数据集。与在处理数值和高度结构化信息时存在困难的检索增强生成(RAG)不同,我们的方法训练一个LLM来生成可执行查询。为了支持这一能力,我们引入了一个系统化的合成训练数据生成流程,产生能捕捉用户意图和基础数据集语义的多样化问答对。我们使用QLoRA和4位量化对紧凑模型DeepSeek R1 Distill 8B进行微调,使系统适合部署在通用硬件上。我们在描述西班牙Durangaldea地区基本服务可达性的数据集上评估了我们的方法。微调后的模型在单语、多语和未见地点场景下均实现了高准确性,展示了稳健的泛化和可靠的查询生成。我们的结果表明,小型领域特定模型可以在不依赖大型专有LLM的情况下在该任务上实现高精度,使得这种方法适用于资源受限环境,并可适应更广泛的多数据集系统。
更新时间: 2026-04-03 14:15:15
领域: cs.CL,cs.AI
MECO: A Multimodal Dataset for Emotion and Cognitive Understanding in Older Adults
While affective computing has advanced considerably, multimodal emotion prediction in aging populations remains underexplored, largely due to the scarcity of dedicated datasets. Existing multimodal benchmarks predominantly target young, cognitively healthy subjects, neglecting the influence of cognitive decline on emotional expression and physiological responses. To bridge this gap, we present MECO, a Multimodal dataset for Emotion and Cognitive understanding in Older adults. MECO includes 42 participants and provides approximately 38 hours of multimodal signals, yielding 30,592 synchronized samples. To maximize ecological validity, data collection followed standardized protocols within community-based settings. The modalities cover video, audio, electroencephalography (EEG), and electrocardiography (ECG). In addition, the dataset offers comprehensive annotations of emotional and cognitive states, including self-assessed valence, arousal, six basic emotions, and Mini-Mental State Examination cognitive scores. We further establish baseline benchmarks for both emotion and cognitive prediction. MECO serves as a foundational resource for multimodal modeling of affect and cognition in aging populations, facilitating downstream applications such as personalized emotion recognition and early detection of mild cognitive impairment (MCI) in real-world settings. The complete dataset and supplementary materials are available at https://maitrechen.github.io/meco-page/.
Updated: 2026-04-03 14:03:23
标题: MECO: 一个用于老年人情绪和认知理解的多模态数据集
摘要: 尽管情感计算已取得相当大的进展,但在老年人群中的多模态情绪预测仍未得到充分探索,这在很大程度上是由于专门数据集的稀缺。现有的多模态基准主要针对年轻、认知健康的受试者,忽视了认知衰退对情绪表达和生理反应的影响。为了弥补这一差距,我们提出了MECO,一个用于理解老年人情绪和认知的多模态数据集。MECO包括42名参与者,提供约38小时的多模态信号,产生30,592个同步样本。为了最大程度地保证生态效度,数据采集遵循社区设置中的标准化协议。模态包括视频、音频、脑电图(EEG)和心电图(ECG)。此外,该数据集提供了情绪和认知状态的综合注释,包括自我评估的愉悦度、唤醒度、六种基本情绪和简易智力状态检查认知分数。我们进一步建立了情绪和认知预测的基准。MECO作为一个基础资源,用于老年人群中情感和认知的多模态建模,促进了个性化情绪识别和在现实世界中早期检测轻度认知障碍(MCI)等应用的发展。完整的数据集和补充资料可在https://maitrechen.github.io/meco-page/获取。
更新时间: 2026-04-03 14:03:23
领域: cs.HC,cs.AI
From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics
Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows that errors are dominated by incorrect problem formulation, with formulation accuracy declining as original problem difficulty increases. Correct formulation emerges as a prerequisite for success, and its sufficiency improves with model scale, indicating that larger models advance in both understanding and reasoning. Nevertheless, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, we find that fine-tuning with scenario data improves performance, whereas formulation-only training is ineffective. However, performance gaps are only partially alleviated, highlighting contextual mathematical reasoning as a central unsolved challenge for LLMs.
Updated: 2026-04-03 14:01:42
标题: 从抽象到具体:在数学中LLM仍然无法做到的事情
摘要: 大型语言模型现在可以在接近专家水平上解决许多基准数学问题,然而这一进展并没有完全转化为在实际应用中的可靠表现。我们通过上下文数学推理研究了这一差距,其中数学核心必须从描述性场景中制定。我们引入了ContextMATH,这是一个基准,将AIME和MATH-500问题重新定位到两个上下文场景中:情景基础(SG),将抽象问题嵌入到现实故事中而不增加推理复杂性,以及复杂性缩放(CS),将明确条件转化为子问题以捕捉约束在实践中经常出现的方式。评估61种专有和开源模型,我们观察到急剧下降:平均而言,开源模型在SG和CS上分别下降了13和34个点,而专有模型分别下降了13和20个点。错误分析表明,错误主要是由于问题制定不正确,随着原始问题难度的增加,制定准确性下降。正确的表述成为成功的先决条件,并且随着模型规模的增大,其充分性得到改善,表明更大的模型在理解和推理方面都有所进步。然而,制定和推理仍然是限制上下文数学问题解决的两个互补瓶颈。最后,我们发现使用情景数据进行微调可以提高性能,而仅进行制定训练是无效的。然而,性能差距仅部分得到缓解,突出了上下文数学推理作为LLM的一个中心未解决挑战。
更新时间: 2026-04-03 14:01:42
领域: cs.AI
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances thinking and non-thinking cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry-leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.
Updated: 2026-04-03 13:52:38
标题: JoyAI-LLM Flash:推动具有令牌效率的中等规模LLM
摘要: 我们介绍了JoyAI-LLM Flash,一种高效的专家混合(MoE)语言模型,旨在重新定义在小于50B参数范围内强大性能和标记效率之间的权衡。JoyAI-LLM Flash在一个包含20万亿个标记的庞大语料库上进行预训练,并通过严格的后训练管线进一步优化,包括监督微调(SFT)、直接偏好优化(DPO)和跨多种环境进行的大规模强化学习(RL)。为了提高标记效率,JoyAI-LLM Flash战略平衡了思考和非思考认知模式,并引入了FiberPO,一种受纤维化理论启发的新型RL算法,将信任区域维护分解为全局和局部组件,为LLM策略优化提供统一的多尺度稳定控制。为了增强架构稀疏性,该模型包含48B总参数,每次前向传递仅激活2.7B参数,实现了比同等规模的当代行业领先模型更高的稀疏比率。为了进一步提高推理吞吐量,我们采用了联合训练-推理协同设计,包括密集的多标记预测(MTP)和量化感知训练(QAT)。我们在Hugging Face上发布了JoyAI-LLM-48B-A3B Base及其后训练变体的检查点,以支持开源社区。
更新时间: 2026-04-03 13:52:38
领域: cs.CL,cs.AI
Analyzing Healthcare Interoperability Vulnerabilities: Formal Modeling and Graph-Theoretic Approach
In healthcare environments, interoperability platforms based on HL7 FHIR allow independent systems, i.e., EHR systems, pharmacy systems, lab systems, and devices, to access a shared set of patient resources concurrently and asynchronously. The FHIR specification lacks a protocol for concurrency control, and existing research on race condition detection targets only the OS kernel. Research on FHIR security targets only authentication and injection attacks and treats concurrent access to patient resources as sequential. This gap is addressed by introducing the FHIR Resource Access Graph (FRAG), a formally defined graph G = (P, R, E, λ, τ, S) in which the nodes are the concurrent processes, the typed edges represent resource access events, and race conditions appear as detectable structural properties. Three clinically relevant race condition classes are formally specified: Simultaneous Write Conflict (SWC), TOCTOU Authorization Violation (TAV), and Cascading Update Race (CUR). The FRAG model is implemented as a three-pass graph traversal detection algorithm and tested against a time-window-based baseline on 1,500 synthetic FHIR R4 transaction logs. Under full concurrent access (C2), FRAG attains a 90.0% F1 score vs. 25.5% for the baseline, a 64.5 pp improvement.
Updated: 2026-04-03 13:51:43
标题: 分析医疗卫生互操作性漏洞:形式建模和图论方法
摘要: 在医疗环境中,基于HL7 FHIR的医疗互操作平台允许对一组共享患者资源进行并发、异步访问,这些资源是独立的系统,即电子健康记录系统、药房系统、实验室系统和设备。FHIR规范缺乏并发控制协议,关于检测竞争条件的研究仅针对操作系统内核。关于FHIR安全的研究仅针对身份验证和注入攻击,将对患者资源的并发访问视为顺序访问。通过引入FHIR资源访问图(FRAG)来填补这一领域研究的空白,FRAG是一个形式上定义的图G =(P,R,E,λ,τ,S),其中节点是并发进程,类型化的边代表资源访问事件,竞争条件被表示为可检测的结构属性。三个临床相关的竞争条件类别被正式规定为:Simultaneous Write Conflict(SWC)、TOCTOU Authorization Violation(TAV)和Cascading Update Race(CUR)。FRAG模型被实现为一个三通道图遍历检测算法,并在1500个合成的FHIR R4事务日志上针对基于时间窗口的基线进行测试。在完全并发访问(C2)下,FRAG获得了90.0%的F1分数,而基线为25.5%,改善了64.5个百分点。
更新时间: 2026-04-03 13:51:43
领域: cs.CR,cs.AI
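To make the Simultaneous Write Conflict (SWC) class in the entry above more concrete, the following is a minimal sketch, not the paper's three-pass FRAG algorithm. It assumes each access event carries a start and end time, and it flags two writes by different processes to the same resource whose intervals overlap. The `Access` record, its field names, and the sample resource identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Access(NamedTuple):
    process: str   # hypothetical process id, e.g. "pharmacy"
    resource: str  # hypothetical FHIR resource id
    kind: str      # "read" or "write"
    start: float   # time the access begins
    end: float     # time the access completes


def simultaneous_write_conflicts(events):
    """Return pairs of writes by different processes to the same
    resource whose time intervals overlap."""
    writes = defaultdict(list)
    for ev in events:
        if ev.kind == "write":
            writes[ev.resource].append(ev)

    conflicts = []
    for evs in writes.values():
        for a, b in combinations(evs, 2):
            overlap = a.start < b.end and b.start < a.end
            if a.process != b.process and overlap:
                conflicts.append((a, b))
    return conflicts


log = [
    Access("ehr", "MedicationRequest/1", "write", 0.0, 2.0),
    Access("pharmacy", "MedicationRequest/1", "write", 1.0, 3.0),
    Access("lab", "Observation/7", "write", 0.5, 1.5),
    Access("ehr", "MedicationRequest/1", "write", 5.0, 6.0),
]
conflicts = simultaneous_write_conflicts(log)
```

In this toy log only the first two writes conflict: they touch the same resource, come from different processes, and overlap in time. A graph-based detector such as the one described in the abstract would presumably express the same condition as a structural property of the access graph rather than as a pairwise scan.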
ARM: Advantage Reward Modeling for Long-Horizon Manipulation
Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.
Updated: 2026-04-03 13:45:59
标题: ARM:用于长时间跨度操作的优势奖励建模
摘要: 长时程机器人操作对强化学习(RL)而言仍然具有挑战性,因为稀疏奖励为信用分配提供的指导有限。因此,实际的策略改进依赖于更丰富的中间监督,例如密集的进度奖励,但这类奖励获取成本高昂,并且不适合回溯和恢复等非单调行为。为了解决这个问题,我们提出了优势奖励建模(ARM)框架,它从难以量化的绝对进度转向估计相对优势。我们引入了一种经济高效的三态标注策略(前进、倒退和停滞),在减少人类认知负担的同时确保了较高的标注者间一致性。通过在这些直观信号上进行训练,ARM能够对完整演示和碎片化的DAgger风格数据进行自动进度标注。将ARM集成到离线RL流程中可以实现自适应的动作-奖励重新加权,有效过滤次优样本。我们的方法在一个具有挑战性的长时程毛巾折叠任务上实现了99.4%的成功率,与当前的VLA基线相比展示了更好的稳定性和数据效率,并且在策略训练期间几乎不需要人工干预。
更新时间: 2026-04-03 13:45:59
领域: cs.RO,cs.AI,cs.CV
Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution
Existing datasets for coding agents evaluate performance on isolated, single pull request (PR) tasks in a stateless manner, failing to capture the reality of real-world software development, where code changes accumulate, technical debt accrues, and test suites grow over time. To bridge this gap, we introduce an automated coding task generation framework and use it to build our dataset, SWE-STEPS, which evaluates coding agents on long-horizon tasks through two realistic settings mirroring actual developer workflows: conversational coding with iterative requests, and single-shot Project Requirement Document (PRD)-based coding. Unlike existing datasets that evaluate agents on disjointed PRs, our framework assesses performance across chains of dependent PRs, enabling evaluation of sequential execution, regression verification, and long-term repository health. We discover that widely used isolated PR evaluations yield inflated success rates relative to our settings, overshooting performance by as much as 20 percentage points, because they ignore the "spillover" effects of previous inefficient or buggy code. Furthermore, our analysis reveals that even when agents successfully resolve issues, they degrade repository health by generating code with higher cognitive complexity and technical debt compared to human developers, underscoring the necessity for multidimensional evaluation.
Updated: 2026-04-03 13:44:40
标题: 超越孤立任务:一个在序列化软件演化上评估编码代理的框架
摘要: 现有用于对编码代理进行评估的数据集以无状态方式评估孤立的、单个拉取请求(PR)任务的性能,未能捕捉到现实世界软件开发中代码变更累积、技术债务积累和测试套件随时间增长的情况。为了弥合这一差距,我们引入了一种自动化编码任务生成框架,帮助生成我们的数据集SWE-STEPS,通过两种模拟实际开发人员工作流程的现实设置评估编码代理的长期任务:具有迭代请求的对话编码,以及基于单次项目需求文档(PRD)的编码。与现有数据集评估代理的孤立拉取请求(PR)不同,我们的框架评估依赖PR链的性能,实现了序列执行、回归验证和长期存储库健康状况的评估。我们发现,广泛使用的孤立PR评估会导致膨胀的成功率,相对于我们的设置,性能可能超过20个百分点,因为它们忽略了先前低效或有错误的代码的“溢出”效应。此外,我们的分析显示,即使代理成功解决问题,他们生成的代码与人类开发人员相比具有更高的认知复杂性和技术债务,强调了多维度评估的必要性。
更新时间: 2026-04-03 13:44:40
领域: cs.SE,cs.AI
Learning Contractive Integral Operators with Fredholm Integral Neural Operators
We generalize the framework of Fredholm Neural Networks to learn non-expansive integral operators arising in Fredholm Integral Equations (FIEs) of the second kind in arbitrary dimensions. We first present Fredholm Integral Neural Operators (FREDINOs) for FIEs and prove that they are universal approximators of linear and non-linear integral operators and the corresponding solution operators. We furthermore prove that the learned operators are guaranteed to be contractive, thereby strictly satisfying the mathematical property required for the convergence of the fixed-point scheme. Finally, we demonstrate how FREDINOs can be used to learn the solution operator of non-linear elliptic PDEs via a Boundary Integral Equation (BIE) formulation. We assess the proposed methodology numerically on several benchmark problems: linear and non-linear FIEs in arbitrary dimensions, as well as a non-linear elliptic PDE in 2D. Built on tailored mathematical/numerical analysis theory, FREDINOs offer high-accuracy approximations and interpretable schemes, making them well suited for scientific machine learning/numerical analysis computations.
Updated: 2026-04-03 13:42:59
标题: 用Fredholm积分神经算子学习收缩积分算子
摘要: 我们推广了Fredholm神经网络的框架,以学习任意维度第二类Fredholm积分方程(FIEs)中出现的非扩张积分算子。我们首先提出用于FIEs的Fredholm积分神经算子(FREDINOs),并证明它们是线性和非线性积分算子以及相应解算子的通用逼近器。此外,我们证明了学习到的算子保证是收缩的,因此严格满足不动点迭代格式收敛所需的数学性质。最后,我们还展示了如何通过边界积分方程(BIE)形式,使用FREDINOs来学习非线性椭圆型PDE的解算子。我们通过多个基准问题对所提出的方法进行了数值评估:任意维度中的线性和非线性FIEs,以及二维非线性椭圆型PDE。基于量身定制的数学/数值分析理论,FREDINOs提供了高精度逼近和可解释的格式,使其非常适合科学机器学习/数值分析计算。
更新时间: 2026-04-03 13:42:59
领域: math.NA,cs.LG
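The entry above relies on the fact that a contractive integral operator makes the fixed-point scheme converge. As a minimal sketch of that classical scheme only (not of FREDINOs themselves), the code below runs Picard iteration on a one-dimensional Fredholm equation of the second kind, u(x) = f(x) + ∫₀¹ K(x, y) u(y) dy, discretized with the trapezoidal rule. The kernel K(x, y) = 0.5·x·y and f(x) = x are illustrative choices made here; for them the exact solution is u(x) = 1.2·x.

```python
def picard_fredholm(kernel, f, n=100, tol=1e-10, max_iter=200):
    """Solve u(x) = f(x) + integral_0^1 kernel(x, y) u(y) dy on [0, 1]
    by fixed-point (Picard) iteration with trapezoidal quadrature."""
    h = 1.0 / n
    xs = [i * h for i in range(n + 1)]
    # Trapezoidal weights: h/2 at the endpoints, h in the interior.
    w = [h] * (n + 1)
    w[0] = w[-1] = h / 2
    fx = [f(x) for x in xs]

    u = fx[:]  # initial guess u_0 = f
    for _ in range(max_iter):
        new = [
            fx[i] + sum(w[j] * kernel(xs[i], xs[j]) * u[j] for j in range(n + 1))
            for i in range(n + 1)
        ]
        diff = max(abs(a - b) for a, b in zip(new, u))
        u = new
        if diff < tol:
            break
    return xs, u


# Illustrative contractive kernel: sup_x of integral |K(x, y)| dy = 0.25 < 1.
xs, u = picard_fredholm(lambda x, y: 0.5 * x * y, lambda x: x)
```

Because the operator norm here is below 1, the Banach fixed-point theorem guarantees convergence from any starting guess. A learned operator that is only approximately contractive would lose that guarantee, which is why the abstract emphasizes contractivity by construction.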
Recovering Sub-threshold S-wave Arrivals in Deep Learning Phase Pickers via Shape-Aware Loss
Deep learning has transformed seismic phase picking, but a systematic failure mode persists: for some S-wave arrivals that appear unambiguous to human analysts, the model produces only a distorted peak trapped below the detection threshold, even as the P-wave prediction on the same record appears flawless. By examining training dynamics and loss landscape geometry, we diagnose this amplitude suppression as an optimization trap arising from three interacting factors. Temporal uncertainty in S-wave arrivals, CNN bias toward amplitude boundaries, and the inability of pointwise loss to provide lateral corrective forces combine to create the trap. The diagnosis reveals that phase arrival labels are structured shapes rather than independent probability estimates, requiring training objectives that preserve coherence. We formalize this as the shape-then-align strategy and validate it through a conditional GAN proof of concept, recovering previously sub-threshold signals and achieving a 64% increase in effective S-phase detections. Beyond this implementation, the loss landscape visualization and numerical simulation techniques we introduce provide a general methodology for analyzing how label designs and loss functions interact with temporal uncertainty, transforming these choices from trial-and-error into principled analysis.
Updated: 2026-04-03 13:19:58
标题: 通过形状感知损失在深度学习震相拾取器中恢复次阈S波到达时间
摘要: 深度学习已经改变了地震相拾取的方式,但是一个系统性的失败模式仍然存在:对于一些S波到达,人类分析师认为是明确的,但模型却只产生一个扭曲的峰值被困在检测阈值以下,即使在同一记录上的P波预测看似完美。通过研究训练动态和损失地形几何,我们将这种幅度抑制诊断为一种优化陷阱,产生于三个相互作用因素:S波到达的时间不确定性、CNN对幅度边界的偏见,以及点对点损失无法提供横向校正力。这些因素共同造成了这种陷阱。诊断表明,相到达标签是结构化形状,而不是独立的概率估计,需要保持一致性的训练目标。我们将这种策略形式化为先形状再对齐的策略,并通过条件GAN概念验证它,恢复了之前的亚阈信号,并实现了有效S相检测的64%增加。除此实现外,我们介绍的损失地形可视化和数值模拟技术提供了一种分析标签设计和损失函数如何与时间不确定性相互作用的通用方法,将这些选择从试错转变为原则性分析。
更新时间: 2026-04-03 13:19:58
领域: physics.geo-ph,cs.AI
Comparing the Impact of Pedagogy-Informed Custom and General-Purpose GAI Chatbots on Students' Science Problem-Solving Processes and Performance Using Heterogeneous Interaction Network Analysis
Problem solving plays an essential role in science education, and generative AI (GAI) chatbots have emerged as a promising tool for supporting students' science problem solving. However, general-purpose chatbots (e.g., ChatGPT), which often provide direct, ready-made answers, may lead to students' cognitive offloading. Prior research has rarely focused on custom chatbots for facilitating students' science problem solving, nor has it examined how they influence problem-solving processes and performance differently from general-purpose chatbots. To address this gap, we developed a pedagogy-informed custom GAI chatbot grounded in the Socratic questioning method, which supports students by prompting them with guiding questions. This study employed a within-subjects counterbalanced design in which 48 secondary school students used both the custom and the general-purpose chatbot to complete two science problem-solving tasks. A total of 3,297 student-chatbot dialogues were collected and analyzed using Heterogeneous Interaction Network Analysis (HINA). The results showed that: (1) students demonstrated significantly higher interaction intensity and cognitive interaction diversity when using the custom chatbot than when using the general-purpose chatbot; (2) students were more likely to follow the custom chatbot's guidance to think and reflect, whereas they tended to ask the general-purpose chatbot to execute specific commands; and (3) no statistically significant difference was observed between the two chatbot conditions in students' problem-solving performance as evaluated by solution quality. This study provides novel theoretical insights and empirical evidence that custom chatbots are less likely to induce cognitive offloading and instead foster greater cognitive engagement compared to general-purpose chatbots. This study also offers insights into the design and integration of GAI chatbots in science education.
Updated: 2026-04-03 13:13:28
标题: 比较受教育学启发的定制和通用目的的GAI聊天机器人对学生科学问题解决过程和表现的影响:使用异质交互网络分析
摘要: 问题解决在科学教育中起着至关重要的作用,生成式人工智能(GAI)聊天机器人已经成为支持学生科学问题解决的有前途的工具。然而,通用聊天机器人(例如ChatGPT)通常提供直接、现成的答案,可能导致学生的认知卸载。先前的研究很少关注用于促进学生科学问题解决的定制聊天机器人,也没有研究它们与通用聊天机器人相比如何不同地影响问题解决过程和表现。为填补这一空白,我们开发了一种以苏格拉底式提问法为基础、受教学法指导的定制GAI聊天机器人,通过向学生提出引导性问题来支持他们。本研究采用被试内平衡设计,48名中学生分别使用定制和通用聊天机器人完成了两个科学问题解决任务。共收集了3297段学生-聊天机器人对话,并使用异质交互网络分析(HINA)进行分析。结果显示:(1)与使用通用聊天机器人相比,学生在使用定制聊天机器人时表现出显著更高的交互强度和认知交互多样性;(2)学生更有可能遵循定制聊天机器人的引导进行思考和反思,而倾向于要求通用聊天机器人执行具体命令;(3)以解决方案质量衡量的学生问题解决表现在两种聊天机器人条件之间没有统计学上的显著差异。这项研究提供了新颖的理论见解和实证证据,表明与通用聊天机器人相比,定制聊天机器人更不容易引发认知卸载,反而能促进更高的认知参与。这项研究还为GAI聊天机器人在科学教育中的设计和整合提供了见解。
更新时间: 2026-04-03 13:13:28
领域: cs.SI,cs.AI,cs.HC
Efficient and Principled Scientific Discovery through Bayesian Optimization: A Tutorial
Traditional scientific discovery relies on an iterative hypothesise-experiment-refine cycle that has driven progress for centuries, but its intuitive, ad-hoc implementation often wastes resources, yields inefficient designs, and misses critical insights. This tutorial presents Bayesian Optimisation (BO), a principled probability-driven framework that formalises and automates this core scientific cycle. BO uses surrogate models (e.g., Gaussian processes) to model empirical observations as evolving hypotheses, and acquisition functions to guide experiment selection, balancing exploitation of known knowledge and exploration of uncharted domains to eliminate guesswork and manual trial-and-error. We first frame scientific discovery as an optimisation problem, then unpack BO's core components, end-to-end workflows, and real-world efficacy via case studies in catalysis, materials science, organic synthesis, and molecule discovery. We also cover critical technical extensions for scientific applications, including batched experimentation, heteroscedasticity, contextual optimisation, and human-in-the-loop integration. Tailored for a broad audience, this tutorial bridges AI advances in BO with practical natural science applications, offering tiered content to empower cross-disciplinary researchers to design more efficient experiments and accelerate principled scientific discovery.
Updated: 2026-04-03 13:12:34
标题: 通过贝叶斯优化实现高效和有原则的科学发现:教程
摘要: 传统科学发现依赖于一个迭代的假设-实验-改进循环,这个循环已经推动了几个世纪的进步,但其直观、临时的实施往往浪费资源,产生低效的设计,并错过关键的洞察。本教程介绍了贝叶斯优化(BO),这是一个基于概率的原则框架,正式化并自动化了这一核心科学循环。BO使用替代模型(例如高斯过程)将经验观察建模为不断发展的假设,并使用获取函数来指导实验选择,平衡已知知识的开发和未知领域的探索,消除猜测和手工试错。我们首先将科学发现框定为一个优化问题,然后通过催化、材料科学、有机合成和分子发现的案例研究,揭示了BO的核心组件、端到端工作流程和现实世界的有效性。我们还涵盖了科学应用的关键技术扩展,包括批量实验、异方差性、情境优化和人与回路集成。为广泛受众量身定制,本教程将BO在人工智能领域的进步与实际自然科学应用联系起来,提供分层内容,以赋予跨学科研究人员设计更有效实验和加速原则科学发现的能力。
更新时间: 2026-04-03 13:12:34
领域: cs.LG
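The tutorial entry above describes acquisition functions that balance exploitation and exploration. One widely used choice is Expected Improvement (EI); the abstract does not say which acquisition functions the tutorial covers, so EI is an assumption here. The sketch below evaluates the closed-form EI for a maximization problem, given a surrogate's posterior mean and standard deviation at a candidate point. The candidate names and numbers are invented for illustration.

```python
import math


def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization under a Gaussian posterior
    N(mu, sigma^2), relative to the best value observed so far."""
    if sigma <= 0.0:
        # No uncertainty left: improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


# Hypothetical candidates: (posterior mean, posterior std).
candidates = {"A": (1.0, 0.1), "B": (0.8, 0.5)}
best_so_far = 1.0
chosen = max(candidates, key=lambda k: expected_improvement(*candidates[k], best_so_far))
```

Candidate B has the lower predicted mean but much larger uncertainty, and EI prefers it. This is the exploration side of the trade-off that the abstract refers to.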
Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?
Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.
Updated: 2026-04-03 13:02:01
标题: Agentic-MME:智能体能力究竟为多模态智能带来了什么?
摘要: 多模态大语言模型(MLLMs)正在从被动观察者演变为主动代理,通过视觉扩展(调用视觉工具)和知识扩展(开放网络搜索)解决问题。然而,现有的评估存在不足:它们缺乏灵活的工具集成,单独测试视觉和搜索工具,并主要通过最终答案进行评估。因此,它们无法验证工具是否实际被调用、是否正确应用或是否高效使用。为了解决这个问题,我们引入了Agentic-MME,这是一个经过验证的用于多模态主动能力的基准测试。它包含6个领域和3个难度级别的418个真实世界任务,用于评估能力的协同作用,具有超过2,000个逐步检查点,每个任务平均需要10多个人小时的手动注释。每个任务包括一个支持沙盒代码和API的统一评估框架,以及一个人类参考轨迹,其沿着S轴和V轴标记了逐步检查点。为了实现真正的过程级验证,我们审计了细粒度的中间状态,而不仅仅是最终答案,并通过相对于人类轨迹的过度思考指标来量化效率。实验结果显示,最佳模型Gemini3-pro的整体准确率为56.3%,在3级任务中明显下降至23.0%,突显了真实世界多模态主动问题解决的困难。
更新时间: 2026-04-03 13:02:01
领域: cs.AI
Generating DDPM-based Samples from Tilted Distributions
Given $n$ independent samples from a $d$-dimensional probability distribution, our aim is to generate diffusion-based samples from a distribution obtained by tilting the original, where the degree of tilt is parametrized by $\theta \in \mathbb{R}^d$. We define a plug-in estimator and show that it is minimax-optimal. We develop Wasserstein bounds between the distribution of the plug-in estimator and the true distribution as a function of $n$ and $\theta$, illustrating regimes where the output and the desired true distribution are close. Further, under some assumptions, we prove the TV-accuracy of running Diffusion on these tilted samples. Our theoretical results are supported by extensive simulations. Applications of our work include finance, weather and climate modelling, and many other domains, where the aim may be to generate samples from a tilted distribution that satisfies practically motivated moment constraints.
Updated: 2026-04-03 13:00:11
标题: 从倾斜分布生成基于DDPM的样本
摘要: 给定来自一个$d$维概率分布的$n$个独立样本,我们的目标是从对原始分布进行倾斜后得到的分布中生成基于扩散的样本,其中倾斜程度由$\theta \in \mathbb{R}^d$参数化。我们定义了一个插入式(plug-in)估计量,并证明它是极小极大最优的。我们给出了插入式估计量的分布与真实分布之间的Wasserstein界,它是$n$和$\theta$的函数,并说明了输出与期望的真实分布接近的情形。此外,在一些假设下,我们证明了在这些倾斜样本上运行扩散模型的TV精度。我们的理论结果得到了大量模拟的支持。我们工作的应用包括金融、天气和气候建模以及许多其他领域,其目标可能是从满足实际需求驱动的矩约束的倾斜分布中生成样本。
更新时间: 2026-04-03 13:00:11
领域: cs.LG,math.PR,stat.ML
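The abstract above does not spell out the form of the tilt or of the plug-in estimator. A common reading, assumed here, is exponential tilting, where the tilted density is proportional to exp(θ·x) times the original density. Under that assumption, one natural plug-in construction reweights the empirical samples with self-normalized weights proportional to exp(θ·xᵢ) and resamples from them. The sketch below illustrates that idea only; the function names are hypothetical and this is not claimed to be the paper's estimator.

```python
import math
import random


def tilt_weights(samples, theta):
    """Self-normalized weights proportional to exp(theta . x_i)."""
    scores = [sum(t * x for t, x in zip(theta, s)) for s in samples]
    m = max(scores)  # subtract the max for numerical stability
    unnorm = [math.exp(sc - m) for sc in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def resample_tilted(samples, theta, k, rng):
    """Draw k points from the exponentially tilted empirical distribution."""
    return rng.choices(samples, weights=tilt_weights(samples, theta), k=k)


samples = [(0.0,), (1.0,)]
theta = (math.log(3.0),)
weights = tilt_weights(samples, theta)  # the point at 1.0 gets three times the weight of the point at 0.0
draws = resample_tilted(samples, theta, k=5, rng=random.Random(0))
```

Points resampled this way could then serve as training data for a diffusion model. How close that pipeline gets to the true tilted distribution, as a function of n and θ, is the question the abstract's Wasserstein and TV bounds address.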
User-Aware Conditional Generative Total Correlation Learning for Multi-Modal Recommendation
Multi-modal recommendation (MMR) enriches item representations by introducing item content, e.g., visual and textual descriptions, to improve upon interaction-only recommenders. The success of MMR hinges on aligning these content modalities with user preferences derived from interaction data, yet dominant practices based on disentangling modality-invariant preference-driving signals from modality-specific preference-irrelevant noises are flawed. First, they assume a one-size-fits-all relevance of item content to user preferences for all users, which contradicts the user-conditional fact of preferences. Second, they optimize pairwise contrastive losses separately toward cross-modal alignment, systematically ignoring higher-order dependencies inherent when multiple content modalities jointly influence user choices. In this paper, we introduce GTC, a conditional Generative Total Correlation learning framework. We employ an interaction-guided diffusion model to perform user-aware content feature filtering, preserving only personalized features relevant to each individual user. Furthermore, to capture complete cross-modal dependencies, we optimize a tractable lower bound of the total correlation of item representations across all modalities. Experiments on standard MMR benchmarks show GTC consistently outperforms state-of-the-art, with gains of up to 28.30% in NDCG@5. Ablation studies validate both conditional preference-driven feature filtering and total correlation optimization, confirming the ability of GTC to model user-conditional relationships in MMR tasks. The code is available at: https://github.com/jingdu-cs/GTC.
Updated: 2026-04-03 12:56:58
标题: 用户感知条件生成总相关性学习用于多模态推荐
摘要: 多模态推荐(MMR)通过引入物品内容(例如视觉和文本描述)来丰富物品表示,以改进仅基于交互的推荐系统。MMR的成功取决于将这些内容模态与从交互数据中推导出的用户偏好进行对齐,然而基于分解模态不变偏好驱动信号和模态特定偏好无关噪声的主流做法存在缺陷。首先,它们假设物品内容对所有用户的用户偏好具有相同的相关性,这与用户条件偏好的事实相矛盾。其次,它们分别优化成对对比损失以实现跨模态对齐,系统地忽略了当多个内容模态共同影响用户选择时固有的高阶依赖关系。在本文中,我们介绍了GTC,一种有条件的生成总相关性学习框架。我们采用交互引导扩散模型来执行用户感知内容特征过滤,仅保留与每个个体用户相关的个性化特征。此外,为了捕捉完整的跨模态依赖关系,我们优化了跨所有模态的物品表示的总相关性的可计算下界。标准MMR基准测试显示,GTC始终优于最先进技术,NDCG@5的收益高达28.30%。消融研究验证了有条件的偏好驱动特征过滤和总相关性优化,确认了GTC在MMR任务中建模用户条件关系的能力。代码可在以下网址获取:https://github.com/jingdu-cs/GTC。
更新时间: 2026-04-03 12:56:58
领域: cs.IR,cs.AI
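For readers unfamiliar with the quantity named in the GTC entry above, the standard definition of total correlation for $M$ random variables (here, item representations from $M$ modalities; the symbols $X_1, \dots, X_M$ are notation introduced for this note) is:

```latex
\mathrm{TC}(X_1, \dots, X_M)
  = \sum_{m=1}^{M} H(X_m) - H(X_1, \dots, X_M)
  = D_{\mathrm{KL}}\!\left( p(x_1, \dots, x_M) \,\Big\|\, \prod_{m=1}^{M} p(x_m) \right)
```

For $M = 2$ this reduces to the mutual information $I(X_1; X_2)$, which is what pairwise contrastive objectives typically target. For $M > 2$ it also captures dependencies that no pairwise term sees on its own, which matches the abstract's motivation for optimizing a bound on total correlation rather than separate pairwise losses.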
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
Updated: 2026-04-03 12:52:35
标题: 当强化学习遇见自适应推测训练:一个统一的训练-服务系统
摘要: 投机性解码可以显著加快LLM的提供速度,然而大多数部署今天将推测性训练与提供分开,将推测性训练视为一个独立的离线建模问题。我们表明,这种解耦的形式引入了实质性的部署和适应滞后:(1)服务时间长,因为一个推测器必须在部署之前离线训练相当长的时间;(2)延迟的效用反馈,因为真实的端到端解码加速只有在训练后才能知道,并且由于模型架构和系统级开销,不能仅仅从接受率中可靠地推断出来;以及(3)领域漂移退化,因为目标模型被重新用于新领域,推测器变得陈旧且不太有效。 为了解决这些问题,我们提出了Aurora,一个通过连续从实时推理跟踪中直接学习推测器来闭环的统一培训-服务系统。Aurora将在线推测器学习重新构建为一个异步强化学习问题:接受的标记提供正面反馈,而被拒绝的推测器提案提供隐含的负面反馈,我们利用这些反馈来提高样本效率。我们的设计集成了一个基于SGLang的推理服务器和一个异步训练服务器,使得可以在不中断服务的情况下进行热交换的推测器更新。关键地,Aurora支持第0天部署:一个推测器可以立即提供服务,并快速适应实时流量,从而提高系统性能同时提供即时的效用反馈。在实验中,Aurora在最近发布的前沿模型(例如MiniMax M2.1 229B和Qwen3-Coder-Next 80B)上实现了1.5倍的第0天加速。Aurora还有效地适应了用户流量分布的变化,在广泛使用的模型(例如Qwen3和Llama3)上相对于训练良好但静态的推测器,提供了额外的1.25倍加速。
更新时间: 2026-04-03 12:52:35
领域: cs.LG
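The Aurora entry above says accepted tokens provide positive feedback and rejected speculator proposals provide implicit negative feedback. The sketch below shows one way such signals could be read off a single verification step, assuming greedy verification in which a draft token is accepted only if it matches the target model's token at the same position. It is an illustration under that assumption, not Aurora's implementation, and the function names are hypothetical.

```python
def count_accepted(draft, target):
    """Number of leading draft tokens that match the target model's tokens."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


def feedback_signals(draft, target):
    """Label accepted draft tokens +1 and the first rejected one -1.
    Draft tokens after the first rejection were never verified, so they
    receive no label."""
    n = count_accepted(draft, target)
    signals = [(tok, +1) for tok in draft[:n]]
    if n < len(draft):
        signals.append((draft[n], -1))
    return signals


signals = feedback_signals(["a", "b", "x", "d"], ["a", "b", "c", "d"])
```

Here the first two draft tokens are accepted and the third is rejected. The fourth is discarded without a label even though it happens to match, because verification stops at the first mismatch.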
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
Reinforcement Learning (RL) has emerged as a critical technique for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel context learning RL system that addresses these challenges through a key observation: requests sharing the same prompt exhibit strong similarities in output lengths and response patterns. Leveraging this insight, Seer introduces three coordinated techniques: (1) divided rollout for dynamic load balancing, (2) context-aware scheduling to mitigate long-tail request delays, and (3) adaptive grouped speculative decoding to accelerate generation. These mechanisms work in concert to markedly reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer achieves up to 2.04$\times$ end-to-end rollout throughput improvement compared to the state-of-the-art synchronous RL systems, while notably reducing long-tail latency by 72-94%.
Updated: 2026-04-03 12:47:37
标题: Seer:用于快速同步LLM强化学习的在线上下文学习
摘要: 强化学习(RL)已经成为推动现代大型语言模型(LLMs)发展的关键技术,然而现有的同步RL系统面临严重的性能瓶颈。在整个迭代过程中占主导地位的展开阶段,由于固有的工作负载不平衡,遭受着相当大的长尾延迟和资源利用率低的困扰。我们提出了Seer,这是一个新颖的上下文学习RL系统,通过一个关键观察解决了这些挑战:共享相同提示的请求在输出长度和响应模式上表现出强烈的相似性。利用这一洞察力,Seer引入了三种协调的技术:(1)用于动态负载平衡的分割展开,(2)上下文感知调度以减轻长尾请求延迟,以及(3)自适应分组推测解码以加速生成。这些机制共同作用,显著减少了展开阶段的长尾延迟,并提高了资源利用效率。对生产级RL工作负载的评估表明,与最先进的同步RL系统相比,Seer在整个展开阶段的吞吐量提高了高达2.04倍,同时将长尾延迟降低了72-94%。
更新时间: 2026-04-03 12:47:37
领域: cs.DC,cs.LG
Discovery of Bimodal Drift Rate Structure in FRB 20240114A: Evidence for Dual Emission Regions
We report the discovery of bimodal structure in the drift rate distribution of upward-drifting burst clusters from the hyperactive repeating fast radio burst FRB 20240114A. Using unsupervised machine learning (UMAP dimensionality reduction combined with HDBSCAN density-based clustering) applied to 233 upward-drifting burst clusters from the FAST telescope dataset, we identify a distinct subpopulation of 45 burst clusters (Cluster C1) with mean drift rates 2.5x higher than typical upward-drifting burst clusters (245.6 vs 98.1 MHz/ms). Gaussian mixture modeling reveals strong evidence for bimodality (delta-BIC = 296.6), with clearly separated modes (Ashman's D = 2.70 > 2) and a statistically significant gap in the distribution (11.3 sigma). Crucially, we demonstrate that this bimodality persists when restricting the analysis to single-component (U1) burst clusters only (delta-BIC = 19.9, Ashman's D = 2.71), confirming that the result is not an artifact of combining single- and multi-component burst clusters with different drift rate definitions. The extreme-drift subpopulation also exhibits systematically lower peak frequencies (-7%), shorter durations (-29%), and distinct clustering in multi-dimensional feature space. These findings are suggestive of two spatially separated emission regions in the magnetosphere, each producing upward-drifting burst clusters with distinct physical characteristics, although confirmation requires observations from additional epochs and sources.
Updated: 2026-04-03 12:47:30
标题: 在FRB 20240114A中发现双峰漂移速率结构:双发射区的证据
摘要: 我们报告了在来自超活跃重复快速射电暴FRB 20240114A的向上漂移爆发簇的漂移率分布中发现了双峰结构。利用无监督机器学习(UMAP降维结合HDBSCAN基于密度的聚类)应用于FAST望远镜数据集中的233个向上漂移爆发簇,我们确定了一个独特的亚群,其中包含45个爆发簇(Cluster C1),其平均漂移率比典型的向上漂移爆发簇高2.5倍(245.6 vs 98.1 MHz/ms)。高斯混合建模显示了双峰性的强有力证据(delta-BIC = 296.6),具有明显分离的模式(Ashman's D = 2.70 > 2)和分布中统计显著的间隙(11.3 sigma)。至关重要的是,我们证明当将分析限制为仅单组分(U1)爆发簇时,这种双峰性仍然存在(delta-BIC = 19.9,Ashman's D = 2.71),证实结果不是将具有不同漂移率定义的单组分和多组分爆发簇结合的人为产物。极端漂移亚群还表现出系统性较低的峰频率(-7%),较短的持续时间(-29%)和在多维特征空间中的明显聚类。这些发现提示磁层中存在两个空间分离的发射区域,每个区域产生具有不同物理特征的向上漂移爆发簇,尽管确认需要来自额外时期和来源的观测数据。
更新时间: 2026-04-03 12:47:30
领域: astro-ph.HE,cs.AI
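The separation statistic quoted in the FRB 20240114A entry above, Ashman's D, has a standard closed form for a two-component Gaussian mixture: D = sqrt(2) |mu1 - mu2| / sqrt(sigma1^2 + sigma2^2), with D > 2 conventionally taken to indicate cleanly separated modes. The following is a minimal sketch of that formula only. The two means are taken from the abstract; the standard deviations are placeholder values chosen for illustration, since the abstract does not report them, so the printed D is not the paper's 2.70.

```python
import math

def ashman_d(mu1, sigma1, mu2, sigma2):
    """Ashman's D for two Gaussian components (D > 2 suggests clean separation)."""
    return math.sqrt(2.0) * abs(mu1 - mu2) / math.sqrt(sigma1 ** 2 + sigma2 ** 2)

# Mean drift rates (MHz/ms) as quoted in the abstract.
mu_extreme, mu_typical = 245.6, 98.1

# ASSUMPTION: placeholder standard deviations, not values from the paper.
sigma_extreme, sigma_typical = 60.0, 40.0

d = ashman_d(mu_extreme, sigma_extreme, mu_typical, sigma_typical)
ratio = mu_extreme / mu_typical  # the "2.5x higher" mean drift rate

print(f"mean ratio = {ratio:.2f}")
print(f"Ashman's D = {d:.2f}, separated: {d > 2}")
```

With equal unit variances and means two units apart the statistic equals exactly 2, which is a convenient sanity check on the threshold.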
R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning
While deep reasoning with long chain-of-thought has dramatically improved large language models in verifiable domains like mathematics, its effectiveness for open-ended tasks such as writing remains unexplored. In this paper, we conduct a systematic investigation revealing that existing mainstream reasoning models achieve limited gains on open-ended writing tasks. Our further analysis shows that these models lack deep reflection and revision patterns in open-ended writing, resulting in substantially smaller improvements compared to mathematical reasoning tasks. To address this limitation, we introduce R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. To prevent redundant reflections, we design a process reward mechanism that supervises reflection quality during reinforcement learning, improving both performance and token efficiency. Extensive experiments across multiple creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.
Updated: 2026-04-03 12:43:26
标题: R2-Write: 反思和修订开放式写作与深层推理
摘要: 尽管在可验证领域如数学中,深度推理和长链思维显着提高了大型语言模型的性能,但对于写作等开放性任务的有效性仍未被探索。在本文中,我们进行了系统调查,发现现有主流推理模型在开放性写作任务中取得的收益有限。我们进一步分析表明,这些模型在开放性写作中缺乏深度反思和修订模式,导致与数学推理任务相比,改进幅度大大较小。为了解决这一限制,我们引入了R2-Write:一种自动框架,通过迭代式作者-评委交互合成高质量的思维轨迹,其中包含明确的反思和修订模式。为了防止冗余反思,我们设计了一个过程奖励机制,在强化学习过程中监督反思质量,从而提高性能和标记效率。跨多个创造性写作和深度研究基准测试的大量实验表明,明确纳入反思和修订模式可以释放开放性写作任务的深度推理能力,实现显著改进。
更新时间: 2026-04-03 12:43:26
领域: cs.CL,cs.AI
A semicontinuous relaxation of Saito's criterion and freeness as angular minimization
We introduce a nonnegative functional on the space of line arrangements in $\mathbb{P}^2$ that vanishes precisely on free arrangements, obtained as a semicontinuous relaxation of Saito's criterion for freeness. Given an arrangement $\mathcal{A}$ of $n$ lines with candidate exponents $(d_1, d_2)$, we parameterize the spaces of logarithmic derivations of degrees $d_1$ and $d_2$ via the null spaces of the associated derivation matrices and express the Saito determinant as a bilinear map into the space of degree $n$ polynomials. The functional then admits a natural geometric interpretation: it measures the squared sine of the angle between the image of this bilinear map and the direction of the defining polynomial $Q(\mathcal{A})$ in coefficient space, and equals zero if and only if its image contains the line spanned by $Q(\mathcal{A})$. This provides a computable measure of how far a given arrangement is from admitting a free basis of logarithmic derivations of the expected degrees. Using this functional as a reward signal, we develop a sequential construction procedure in which lines are added one at a time so as to minimize the angular distance to freeness, implemented via reinforcement learning with an adaptive curriculum over arrangement sizes and exponent types. Our results suggest that semicontinuous relaxation techniques, grounded in the geometry of polynomial coefficient spaces, offer a viable approach to the computational exploration of freeness in the theory of line arrangements.
Updated: 2026-04-03 12:20:09
标题: 一种Saito准则的半连续松弛和作为角度最小化的自由度
摘要: 我们引入了一个非负泛函,作用在$\mathbb{P}^2$中的直线排列空间上,该泛函在自由排列上恰好为零,是对Saito关于自由性的半连续松弛准则的推广。给定一个包含$n$条直线的排列$\mathcal{A}$,具有候选指数$(d_1, d_2)$,我们通过相关导数矩阵的零空间参数化度为$d_1$和$d_2$的对数导数空间,并将Saito行列式表示为到度为$n$的多项式空间的双线性映射。该泛函具有一个自然的几何解释:它度量了这个双线性映射的图像与定义多项式$Q(\mathcal{A})$在系数空间中的方向之间的正弦平方角度,并且仅当其图像包含由$Q(\mathcal{A})$张成的直线时才为零。这提供了一个可计算的度量,用于衡量给定排列离期望度数的对数导数自由基的距离有多远。 使用这个泛函作为奖励信号,我们开发了一个顺序构造过程,逐个添加直线,以便最小化与自由性的角距离,通过强化学习实现,其中包括对排列大小和指数类型的自适应课程。 我们的结果表明,基于多项式系数空间几何的半连续松弛技术提供了一种可行的方法,用于在直线排列理论中计算自由性的探索。
更新时间: 2026-04-03 12:20:09
领域: math.AG,cs.LG,math.CO
Economics of NFTs: The Value of Creator Royalties
Non-Fungible Tokens (NFTs) are transforming how content creators, such as artists, price and sell their work. A key feature of NFTs is the inclusion of royalties, which grant creators a share of all future resale proceeds. Although royalties are widely used, critics argue that sophisticated speculators, who dominate NFT markets, simply price in royalties upfront, neutralizing their impact. We show this intuition holds only under perfect, frictionless markets. Under more realistic market conditions, royalties enable creators to capitalize on the presence of speculators in at least three ways: They can enable risk sharing (under risk aversion), mitigate information asymmetry (when speculators are better informed), and unlock price discrimination benefits (in multi-unit settings). Moreover, in all three cases, royalties meaningfully expand trade, implying increased transaction volume for platforms. These results offer testable predictions that can guide both empirical research and platform design.
Updated: 2026-04-03 12:13:16
标题: NFT的经济学:创作者版税的价值
摘要: 非同质化代币(NFTs)正在改变内容创作者(如艺术家)定价和销售作品的方式。NFTs的一个关键特征是包含版税,这使得创作者能够分享所有未来转售收益的一部分。尽管被广泛使用,批评者认为,在主导NFT市场的复杂投机者仅仅在最初定价中考虑版税,从而中和了其影响。我们表明,这种直觉只在完美、无摩擦的市场条件下成立。在更现实的市场条件下,版税使创作者能够以至少三种方式利用投机者的存在:它们可以实现风险共担(在风险厌恶下)、减轻信息不对称(当投机者信息更充分时)和解锁价格歧视优势(在多单位设置中)。此外,在这三种情况下,版税显著扩大了交易,意味着平台的交易量增加。这些结果提供了可验证的预测,可以指导实证研究和平台设计。
更新时间: 2026-04-03 12:13:16
领域: econ.GN,cs.CR,cs.MA,q-fin.TR
$λ$-GELU: Learning Gating Hardness for Controlled ReLU-ization in Deep Networks
Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to the Rectified Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, $f(x;λ) = xΦ(λx)$, where $Φ$ is the Gaussian CDF and $λ \in [1, \infty)$ controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning $λ$ is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model-dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of λ-GELU by ReLU with reduced disruption. Overall, λ-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.
Updated: 2026-04-03 12:02:29
标题: $λ$-GELU:学习控制ReLU-ization中的门控难度的深度网络
摘要: 高斯误差线性单元(GELU)是Rectifier线性单元(ReLU)的流行平滑替代方案,然而许多部署、压缩和分析工具链最自然地表达为分段线性(ReLU类型)网络。我们研究了一个硬度参数化的GELU公式,f(x;λ)= xΦ(λ x),其中Φ是高斯累积分布函数,λ∈[1,无穷)控制门的锐度,目的是将平滑的门控训练转变为朝向与ReLU兼容模型的受控路径。学习λ是非平凡的:简单的更新会导致不稳定的动态和有效的梯度衰减,因此我们引入了一个受限的重参数化和一个优化器感知的更新方案。 在经验上,跨越MLPs、CNNs和Transformers的多样化模型-数据集对中,我们观察到结构化的逐层硬度配置文件,并评估它们在不同初始化条件下的稳健性。我们进一步研究了一种确定性的ReLU化策略,在这种策略中,学习的门逐渐朝着一个有原则的目标变得更加坚硬,从而使得可以在训练后通过减少干扰将λ-GELU替换为ReLU。总的来说,λ-GELU提供了一个最小和可解释的旋钮来配置和控制门控硬度,将平滑训练与以ReLU为中心的下游流程相连。
更新时间: 2026-04-03 12:02:29
领域: cs.LG,cs.AI
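The activation in the λ-GELU entry above, f(x;λ) = xΦ(λx), can be evaluated directly with the error function, since Φ(z) = (1 + erf(z / sqrt(2))) / 2. The sketch below only illustrates the forward function and how it approaches ReLU as λ grows. It does not implement the paper's constrained reparameterization or optimizer-aware update, and the helper name `max_gap_to_relu` and the evaluation grid are assumptions made for this example.

```python
import math

def std_normal_cdf(z):
    """Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lambda_gelu(x, lam=1.0):
    """f(x; lam) = x * Phi(lam * x); lam = 1 recovers the standard GELU."""
    return x * std_normal_cdf(lam * x)

def relu(x):
    return max(0.0, x)

def max_gap_to_relu(lam):
    """Largest |f(x; lam) - ReLU(x)| on a grid over [-5, 5] (illustrative helper)."""
    grid = [i / 10.0 for i in range(-50, 51)]
    return max(abs(lambda_gelu(x, lam) - relu(x)) for x in grid)

for lam in (1.0, 4.0, 16.0):
    print(f"lambda = {lam:5.1f}  max gap to ReLU = {max_gap_to_relu(lam):.4f}")
```

Because f(x;λ) = g(λx)/λ with g(u) = uΦ(u), the worst-case gap to ReLU shrinks roughly in proportion to 1/λ, which is why a larger λ corresponds to a harder gate.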
When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems
Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.
Updated: 2026-04-03 11:56:46
标题: 当人工智能出错时:AI辅助药物决策系统的可靠性和风险
摘要: 人工智能(AI)系统越来越多地整合到医疗保健和药房工作流程中,支持诸如药物推荐、剂量确定和药物相互作用检测等任务。虽然这些系统在标准评估指标下通常表现出强大的性能,但它们在现实决策中的可靠性仍未得到充分理解。在药物管理等高风险领域,即使一个错误的推荐也可能导致严重的患者伤害。本文通过关注系统故障及其潜在的临床后果,研究了AI辅助药物系统的可靠性。与仅通过综合指标评估性能不同,这项工作将注意力转向错误发生的方式以及当AI系统产生不正确输出时会发生什么。通过一系列控制的模拟场景,涉及药物相互作用和剂量决策,我们分析了不同类型的系统故障,包括遗漏相互作用、不正确的风险标志和不当的剂量推荐。研究结果表明,在与药物相关的情境中,AI错误可能导致药物不良反应、治疗无效或延误护理,特别是当系统在没有足够人类监督的情况下使用时。此外,本文讨论了过度依赖AI建议所带来的风险,以及决策过程中透明度有限所带来的挑战。该研究在医疗保健领域提出了一个以可靠性为重点的AI评估视角,强调了理解故障行为和现实影响的重要性。它强调了需要在安全关键领域(如药房实践)中补充传统绩效指标的风险感知评估方法的必要性。
更新时间: 2026-04-03 11:56:46
领域: cs.AI,cs.LG
FedSQ: Optimized Weight Averaging via Fixed Gating
Federated learning (FL) enables collaborative training across organizations without sharing raw data, but it is hindered by statistical heterogeneity (non-i.i.d. client data) and by instability of naive weight averaging under client drift. In many cross-silo deployments, FL is warm-started from a strong pretrained backbone (e.g., ImageNet-1K) and then adapted to local domains. Motivated by recent evidence that ReLU-like gating regimes (structural knowledge) stabilize earlier than the remaining parameter values (quantitative knowledge), we propose FedSQ (Federated Structural-Quantitative learning), a transfer-initialized neural federated procedure based on a DualCopy, piecewise-linear view of deep networks. FedSQ freezes a structural copy of the pretrained model to induce fixed binary gating masks during federated fine-tuning, while only a quantitative copy is optimized locally and aggregated across rounds. Fixing the gating reduces learning to within-regime affine refinements, which stabilizes aggregation under heterogeneous partitions. Experiments on two convolutional neural network backbones under i.i.d. and Dirichlet splits show that FedSQ improves robustness and can reduce rounds-to-best validation performance relative to standard baselines while preserving accuracy in the transfer setting.
Updated: 2026-04-03 11:54:23
标题: FedSQ:通过固定门控进行优化的加权平均
摘要: 联邦学习(FL)实现了跨组织的协作训练,而无需共享原始数据,但受到统计异质性(非独立同分布的客户数据)和在客户漂移下朴素权重平均的不稳定性的阻碍。在许多跨业务部署中,FL是从强大的预训练骨干(例如ImageNet-1K)开始,然后适应本地领域。受最近证据的启发,即ReLU样式的门控机制(结构知识)比其余参数值(定量知识)更早稳定,我们提出了FedSQ(联邦结构量化学习),这是一种基于双拷贝、分段线性视图的深度网络的转移初始化神经联邦过程。FedSQ冻结预训练模型的结构拷贝,以在联邦微调期间引入固定的二进制门掩模,而仅优化本地和跨轮次聚合的定量拷贝。固定门控将学习减少到领域内的仿射细化,从而在异质分区下稳定聚合。在独立同分布和狄利克雷分割下对两种卷积神经网络骨干进行的实验显示,FedSQ提高了鲁棒性,并可以相对于标准基线减少达到最佳验证性能所需的轮次,同时在转移设置中保持准确性。
更新时间: 2026-04-03 11:54:23
领域: cs.LG,cs.AI,cs.DC
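The claim in the FedSQ entry above, that fixed gating makes aggregation better behaved, can be illustrated on a toy network. The sketch below is not the paper's DualCopy implementation. It assumes a single hidden layer, a shared fixed readout vector `v`, and two clients whose first-layer weights differ; all names and numbers are hypothetical. With the mask taken from a frozen structural copy, the output is linear in the quantitative weights, so averaging weights equals averaging outputs. With ordinary ReLU self-gating that equality generally fails.

```python
def preacts(W, x):
    """Pre-activations W @ x for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gates(W, x):
    """Binary gating mask: 1 where the pre-activation is positive."""
    return [1.0 if p > 0 else 0.0 for p in preacts(W, x)]

def forward(W_quant, v, mask, x):
    """Output v . (mask * (W_quant @ x)); linear in W_quant once mask is fixed."""
    hidden = [m * p for m, p in zip(mask, preacts(W_quant, x))]
    return sum(vi * h for vi, h in zip(v, hidden))

def average(A, B):
    """Element-wise mean of two weight matrices (naive federated averaging)."""
    return [[(a + b) / 2.0 for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

x = [1.0, -2.0]
v = [1.0, 1.0]
W_struct = [[1.0, 0.0], [0.0, 1.0]]    # frozen structural copy (hypothetical)
W_client1 = [[2.0, 0.0], [0.0, 1.0]]   # quantitative copy after local training
W_client2 = [[0.5, 0.0], [0.0, -1.0]]
W_avg = average(W_client1, W_client2)

# Fixed gating: mask comes from the frozen structural copy.
mask = gates(W_struct, x)
fixed_clients = [forward(W, v, mask, x) for W in (W_client1, W_client2)]
fixed_avg_model = forward(W_avg, v, mask, x)

# Self-gating (plain ReLU): each model derives the mask from its own weights.
relu_clients = [forward(W, v, gates(W, x), x) for W in (W_client1, W_client2)]
relu_avg_model = forward(W_avg, v, gates(W_avg, x), x)

print("fixed gates:", fixed_avg_model, "vs mean of clients", sum(fixed_clients) / 2)
print("self gates: ", relu_avg_model, "vs mean of clients", sum(relu_clients) / 2)
```

In this example the second client flips the second unit's gate under self-gating, so the averaged model no longer matches the mean of the client predictions, whereas the fixed-mask model does.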
FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation
Short-form video moderation increasingly needs learning pipelines that protect user privacy without paying the full bandwidth and latency cost of cloud-centralized inference. We present FedVideoMAE, an on-device federated framework for video violence detection that combines self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, client-side DP-SGD, and server-side secure aggregation. By updating only 5.5M parameters (about 3.5% of a 156M backbone), FedVideoMAE reduces communication by 28.3x relative to full-model federated updates while keeping raw videos on device throughout training. On RWF-2000 with 40 clients, the method reaches 77.25% accuracy without privacy protection and 65-66% under strong differential privacy. We further show that this privacy gap is consistent with an effective-SNR analysis tailored to the small-data, parameter-efficient federated regime, which indicates roughly 8.5-12x DP-noise amplification in our setting. To situate these results more clearly, we also compare against archived full-model federated baselines and summarize auxiliary transfer behavior on RLVS and binary UCF-Crime. Taken together, these findings position FedVideoMAE as a practical operating point for privacy-preserving video moderation on edge devices. Our code can be found at: https://github.com/zyt-599/FedVideoMAE.
Updated: 2026-04-03 11:51:07
标题: FedVideoMAE:高效的隐私保护联邦视频内容审核
摘要: 短视频内容的审核越来越需要学习管道,以保护用户隐私,同时不支付云集中推理的全部带宽和延迟成本。我们提出了FedVideoMAE,这是一个基于设备的联合框架,用于视频暴力检测,结合了自监督VideoMAE表示、基于LoRA的参数高效适应、客户端DP-SGD和服务器端安全聚合。通过仅更新550万个参数(约占156M骨干的3.5%),相对于完整模型的联合更新,FedVideoMAE将通信减少了28.3倍,同时在整个训练过程中保持原始视频在设备上。在具有40个客户端的RWF-2000上,该方法在没有隐私保护的情况下达到77.25%的准确率,在强差分隐私下为65~66%。我们进一步展示,这种隐私差距与适用于小数据、参数高效的联合制度的有效信噪比分析是一致的,这表明在我们的设置中大约有8.5~12倍的DP噪声放大。为了更清晰地说明这些结果,我们还与存档的完整模型联合基线进行比较,并总结在RLVS和二进制UCF-Crime上的辅助传输行为。综上所述,这些发现将FedVideoMAE定位为边缘设备上隐私保护视频审核的实际操作点。我们的代码可以在以下网址找到:https://github.com/zyt-599/FedVideoMAE。
更新时间: 2026-04-03 11:51:07
领域: cs.CV,cs.AI,cs.MM
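The parameter and communication figures in the FedVideoMAE entry above can be checked with simple arithmetic. The sketch below assumes that per-round communication scales with the number of parameters sent, which is an assumption of this example and not a statement about the paper's protocol. The small difference between the computed ratio and the reported 28.3x presumably reflects rounding in the quoted parameter counts.

```python
# Figures as quoted in the abstract (rounded values).
trainable_params = 5.5e6   # LoRA-adapted parameters updated per round
backbone_params = 156e6    # full backbone size

# Fraction of the backbone that is actually trained and transmitted.
fraction = trainable_params / backbone_params

# ASSUMPTION: upload cost is proportional to the number of parameters sent.
reduction = backbone_params / trainable_params

print(f"trainable fraction      = {fraction:.2%}")
print(f"communication reduction = {reduction:.1f}x")
```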
Self-Optimizing Multi-Agent Systems for Deep Research
Given a user's complex information need, a multi-agent Deep Research system iteratively plans, retrieves, and synthesizes evidence across hundreds of documents to produce a high-quality answer. In one possible architecture, an orchestrator agent coordinates the process, while parallel worker agents execute tasks. Current Deep Research systems, however, often rely on hand-engineered prompts and static architectures, making improvement brittle, expensive, and time-consuming. We therefore explore various multi-agent optimization methods to show that enabling agents to self-play and explore different prompt combinations can produce high-quality Deep Research systems that match or outperform expert-crafted prompts.
Updated: 2026-04-03 11:48:38
标题: 自我优化的多智能体系统用于深入研究
摘要: 鉴于用户的复杂信息需求,一个多智能体深度研究系统循环规划、检索和综合数百份文档中的证据,以产生高质量的答案。在一个可能的架构中,一个协调者智能体协调整个过程,同时并行工作者智能体执行任务。然而,目前的深度研究系统往往依赖手工设计的提示和静态架构,使得改进变得脆弱、昂贵且耗时。因此,我们探索了各种多智能体优化方法,以表明使智能体自我对弈并探索不同提示组合可以产生高质量的深度研究系统,达到或超越专家设计的提示。
更新时间: 2026-04-03 11:48:38
领域: cs.IR,cs.AI
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We make the assumption that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we can derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), down-weighting non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.
Updated: 2026-04-03 11:45:16
标题: 通过优势符号稳健性在RLHF中缓解奖励黑客行为
摘要: 强化学习中用于人类反馈的奖励模型(RMs)容易受到奖励欺骗的影响:由于策略最大化了学习到的代理奖励,真实质量会停滞或下降。我们假设奖励欺骗通常是由于优势符号翻转导致的:代替减少不良响应的可能性,翻转的符号会导致更新增加该可能性。通过在RM参数空间中考虑对抗性扰动,我们可以推导出一个经过认证的符号保留半径,这是可以在策略优化过程中翻转优势符号的最小扰动。基于这个公式,我们提出了符号认证策略优化(SignCert-PO),在策略梯度更新中减少非鲁棒性完成。与之前需要多个RMs或访问RM训练数据的方法不同,SignCert-PO是轻量级的,仅在策略优化阶段使用RM参数和在线完成。在TL;DR摘要和AlpacaFarm基准测试中,SignCert-PO始终实现比基准更好的胜率,并减少奖励欺骗行为。
更新时间: 2026-04-03 11:45:16
领域: cs.LG,cs.AI,cs.CL
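The idea of a sign-preservation radius in the SignCert-PO entry above has a closed form in one simple special case. The sketch below assumes a hypothetical linear reward head r(y) = w . phi(y) and a group-mean baseline, so the advantage of completion i is A_i = w . (phi_i - mean phi). The smallest L2 perturbation of w that flips the sign of A_i is then the distance from w to the hyperplane where the advantage is zero, |A_i| / ||phi_i - mean phi||. The paper's actual reward model, radius, and weighting rule are not given in the abstract; the feature vectors, the threshold `tau`, and the function `robustness_weight` are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def advantages_and_radii(w, feats):
    """Return (advantage, sign-flip radius) per completion for a linear reward head."""
    n = len(feats)
    mean = [sum(f[j] for f in feats) / n for j in range(len(w))]
    results = []
    for f in feats:
        diff = [fj - mj for fj, mj in zip(f, mean)]
        adv = dot(w, diff)
        norm = math.sqrt(dot(diff, diff))
        # Distance from w to the hyperplane {w': w' . diff = 0}.
        radius = abs(adv) / norm if norm > 0 else float("inf")
        results.append((adv, radius))
    return results

def robustness_weight(radius, tau=0.5):
    """Hypothetical down-weighting: full weight once the radius reaches tau."""
    return min(1.0, radius / tau)

w = [1.0, 0.0]  # hypothetical reward-head weights
feats = [[1.0, 0.0], [-1.0, 0.0], [0.1, 3.0], [-0.1, -3.0]]  # hypothetical features

results = advantages_and_radii(w, feats)
weights = [robustness_weight(r) for _, r in results]
for (adv, radius), wt in zip(results, weights):
    print(f"advantage = {adv:+.2f}  radius = {radius:.3f}  weight = {wt:.2f}")
```

The third completion has a positive advantage but a very small radius, so a slight change in the reward parameters would flip its sign; under this sketch it contributes little to the policy gradient.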
Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
With the wide adoption of language models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and therefore increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time during generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and three GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3s increase in latency. Our open-source profiler predicts the latency break-even point for each model-hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.
Updated: 2026-04-03 11:41:53
标题: 野外环境中的快速压缩:衡量延迟、速率遵从性和质量以加速LLM推理
摘要: 随着语言模型在信息检索中的广泛应用,特别是RAG系统,底层LLM的延迟成为一个关键瓶颈,因为检索到的段落的长上下文会导致大型提示,从而增加计算量。提示压缩在保持下游任务性能的同时减小输入提示的大小,已经被证明是一种成本效益高且低延迟的方法,用于加速大型语言模型的推断。然而,它的实用性取决于在生成过程中额外的预处理时间是否被更快的解码所抵消。我们进行了第一次系统性、大规模的研究,涵盖了数千次运行和30,000个查询,涉及几个开源LLM和三个GPU类别。我们的评估从压缩开销和解码延迟中分离出来,同时跟踪输出质量和内存使用情况。当提示长度、压缩比率和硬件容量匹配良好时,LLMLingua可以实现高达18%的端到端加速,响应质量在摘要、代码生成和问题回答任务中保持统计上不变。然而,在这个操作窗口之外,压缩步骤占主导地位,抵消了收益。我们还展示了有效的压缩可以减少内存使用量,足以将工作负载从数据中心GPU转移到商品卡上,仅增加0.3秒的延迟。我们的开源分析器为每个模型-硬件设置预测延迟的收支平衡点,提供了在何时提示压缩能够带来真实世界收益的实际指导。
更新时间: 2026-04-03 11:41:53
领域: cs.IR,cs.AI,cs.CL
No Universal Hyperbola: A Formal Disproof of the Epistemic Trade-Off Between Certainty and Scope in Symbolic and Generative AI
In direct response to requests for a logico-mathematical test of the conjecture, we formally disprove a recently conjectured artificial intelligence trade-off between epistemic certainty and scope in its published universal hyperbolic product form, as introduced in Philosophy and Technology. Certainty is defined as the worst-case correctness probability over the input space, and scope as the sum of the Kolmogorov complexities of the input and output sets. Using standard facts from coding theory and algorithmic information theory, we show, first, that when the conjecture is instantiated with prefix (self-delimiting, prefix-free) Kolmogorov complexity, it leads to an internal inconsistency, and second, that when it is instantiated with plain Kolmogorov complexity, it is refuted by a constructive counterexample. These results establish a main theorem: contrary to the conjecture's claim, no universal "certainty-scope" hyperbola holds as a general bound under the published definitions. We further show that a subsequent "entropy-based" revision, replacing the Kolmogorov scope with Shannon joint entropy and redefining the epistemic certainty level accordingly, cannot restore universality either.
Updated: 2026-04-03 11:35:30
标题: 没有普遍的双曲线:在符号和生成人工智能中确定性和范围之间认知权衡的形式证伪
摘要: 针对对于一个逻辑数学测试的要求,我们正式驳斥了最近在哲学和技术领域中引入的人工智能“认识确定性和范围”之间的贸易关系的猜想。确定性被定义为输入空间中最坏情况下的正确概率,而范围被定义为输入和输出集合的Kolmogorov复杂度之和。利用编码理论和算法信息理论的标准事实,我们首先展示了,当使用前缀(自描述,前缀自由)Kolmogorov复杂度实例化猜想时,会导致内在矛盾;其次,当使用普通Kolmogorov复杂度实例化时,会被一个构造性反例所证伪。这些结果建立了一个主要定理:与猜想声称的相反,根据已发表的定义,没有普遍的“确定性-范围”双曲线作为一般的界限。我们进一步展示,后续的“基于熵的”修订,用Shannon联合熵替换Kolmogorov范围,并相应重新定义认识确定性水平,也无法恢复普遍性。
更新时间: 2026-04-03 11:35:30
领域: cs.CY,cs.AI,cs.IT
Learn to Relax with Large Language Models: Solving Constraint Optimization Problems via Bidirectional Coevolution
Large Language Model (LLM)-based optimization has recently shown promise for autonomous problem solving, yet most approaches still cast LLMs as passive constraint checkers rather than proactive strategy designers, limiting their effectiveness on complex Constraint Optimization Problems (COPs). To address this, we present AutoCO, an end-to-end Automated Constraint Optimization method that tightly couples operations-research principles of constraint relaxation with LLM reasoning. A core innovation is a unified triple-representation that binds relaxation strategies, algorithmic principles, and executable code. This design enables the LLM to synthesize, justify, and instantiate relaxation strategies that are both principled and executable. To navigate fragmented solution spaces, AutoCO employs a bidirectional global-local coevolution mechanism, synergistically coupling Monte Carlo Tree Search (MCTS) for global relaxation-trajectory exploration with Evolutionary Algorithms (EAs) for local solution intensification. This continuous exchange of priors and feedback explicitly balances diversification and intensification, thus preventing premature convergence. Extensive experiments on three challenging COP benchmarks validate AutoCO's consistent effectiveness and superior performance, especially in hard regimes where current methods degrade. Results highlight AutoCO as a principled and effective path toward proactive, verifiable LLM-driven optimization.
Updated: 2026-04-03 11:27:07
标题: 学会放松:通过双向共同进化解决约束优化问题的大型语言模型
摘要: 最近,基于大型语言模型(LLM)的优化已经显示出在自主问题解决方面具有潜力,然而大多数方法仍将LLMs视为被动的约束检查者,而不是主动的策略设计者,从而限制了它们在复杂的约束优化问题(COPs)上的有效性。为了解决这个问题,我们提出了AutoCO,一种端到端的自动化约束优化方法,紧密结合了约束放松的运筹学原则与LLM推理。核心创新是统一的三元表示,将放松策略、算法原则和可执行代码绑定在一起。这种设计使得LLM能够合成、证明和实例化既有原则又可执行的放松策略。为了在碎片化的解空间中导航,AutoCO采用了双向全局-局部共同演化机制,将蒙特卡洛树搜索(MCTS)与进化算法(EAs)结合在一起,用于全局放松轨迹探索和局部解强化。这种不断的先验和反馈交换明确平衡了多样化和强化,从而防止了过早的收敛。对三个具有挑战性的COP基准的大量实验验证了AutoCO的一致有效性和卓越性能,特别是在当前方法退化的困难情况下。结果突出了AutoCO作为一种有原则且有效的路径,可以实现主动的、可验证的LLM驱动优化。
更新时间: 2026-04-03 11:27:07
领域: cs.AI
InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking
Recent agentic search systems have made substantial progress by emphasising deep, multi-step reasoning. However, this focus often overlooks the challenges of wide-scale information synthesis, where agents must aggregate large volumes of heterogeneous evidence across many sources. As a result, most existing large language model agent systems face severe limitations in data-intensive settings, including context saturation, cascading error propagation, and high end-to-end latency. To address these challenges, we present InfoSeeker, a hierarchical framework based on the principle of near-decomposability, containing a strategic Host, multiple Managers, and parallel Workers. By leveraging aggregation and reflection mechanisms at the Manager layer, our framework enforces strict context isolation to prevent saturation and error propagation. Simultaneously, parallelism in the Worker layer accelerates overall task execution, mitigating the significant latency. Our evaluation on two complementary benchmarks demonstrates both efficiency (3-5x speed-up) and effectiveness, achieving an 8.4% success rate on WideSearch-en and 52.9% accuracy on BrowseComp-zh. The code is released at https://github.com/agent-on-the-fly/InfoSeeker
Updated: 2026-04-03 11:19:17
标题: InfoSeeker:用于Web信息检索的可扩展分层并行代理框架
摘要: 最近的主动搜索系统通过强调深层、多步骤推理取得了实质性进展。然而,这种重点往往忽视了广泛信息综合的挑战,其中代理人必须在许多信息源上聚合大量异质证据。因此,大多数现有的大型语言模型代理系统在数据密集型环境中面临严重限制,包括上下文饱和、级联错误传播和高端到端延迟。为了解决这些挑战,我们提出了一个基于近分解原则的分层框架\textit{framework},包含一个战略\textit{Host},多个\textit{Managers}和并行\textit{Workers}。通过在Manager层利用聚合和反射机制,我们的框架强制执行严格的上下文隔离,以防止饱和和错误传播。同时,工作层的并行性加快了整体任务执行的速度,缓解了显著的延迟。我们在两个互补的基准测试上的评估表明,我们的框架在效率(加速3-5倍)和有效性方面表现出色,在WideSearch-en上取得了8.4%的成功率,在BrowseComp-zh上取得了52.9%的准确率。代码发布在https://github.com/agent-on-the-fly/InfoSeeker。
更新时间: 2026-04-03 11:19:17
领域: cs.AI
What Is The Political Content in LLMs' Pre- and Post-Training Data?
Large language models (LLMs) are known to generate politically biased text. Yet, it remains unclear how such biases arise, making it difficult to design effective mitigation strategies. We hypothesize that these biases are rooted in the composition of training data. Taking a data-centric perspective, we formulate research questions on (1) political leaning present in data, (2) data imbalance, (3) cross-dataset similarity, and (4) data-model alignment. We then examine how exposure to political content relates to models' stances on policy issues. We analyze the political content of pre- and post-training datasets of open-source LLMs, combining large-scale sampling, political-leaning classification, and stance detection. We find that training data is systematically skewed toward left-leaning content, with pre-training corpora containing substantially more politically engaged material than post-training data. We further observe a strong correlation between political stances in training data and model behavior, and show that pre-training datasets exhibit similar political distributions despite different curation strategies. In addition, we find that political biases are already present in base models and persist across post-training stages. These findings highlight the central role of data composition in shaping model behavior and motivate the need for greater data transparency.
Updated: 2026-04-03 11:15:05
标题: LLM在培训前后的数据中的政治内容是什么?
摘要: 大型语言模型(LLMs)被认为生成具有政治偏见的文本。然而,目前尚不清楚这些偏见是如何产生的,这使得设计有效的缓解策略变得困难。我们假设这些偏见根植于训练数据的构成。从数据中心的角度出发,我们提出了研究问题,包括(1)数据中存在的政治倾向,(2)数据不平衡,(3)跨数据集的相似性,以及(4)数据-模型对齐。然后,我们研究政治内容对模型在政策问题上立场的影响。我们分析了开源LLMs的预训练和后训练数据集中的政治内容,结合大规模抽样、政治倾向分类和立场检测。我们发现训练数据在左倾内容方面存在系统性偏向,预训练语料库包含的政治参与材料明显多于后训练数据。我们进一步观察到训练数据中政治立场与模型行为之间存在强烈的相关性,并展示出尽管采取了不同的策略,预训练数据集仍具有相似的政治分布。此外,我们发现政治偏见已经存在于基础模型中,并且在后训练阶段持续存在。这些发现突显了数据构成在塑造模型行为中的核心作用,并促使对数据透明性的需求。
更新时间: 2026-04-03 11:15:05
领域: cs.CL,cs.AI,cs.CY
The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework. We release our code and dataset at https://github.com/estafons/confu.
Updated: 2026-04-03 11:14:20
标题: 人越多越热闹:用于高阶多模态对齐的对比融合
摘要: 学习跨多模态之间的联合表示仍然是多模态机器学习中的一个核心挑战。目前的方法主要是在成对设置中操作,一次对齐两种模态。虽然一些最近的方法旨在捕捉多种模态之间的高阶交互作用,但它们经常忽视或未能充分保留成对关系,从而限制了它们在单模态任务上的有效性。在这项工作中,我们介绍了对比融合(ConFu),这是一个框架,将单个模态和它们融合组合共同嵌入到一个统一的表示空间中,其中模态和它们的融合对齐。ConFu通过额外的融合模态对比项扩展了传统的成对对比目标,鼓励第三模态的模态对的联合嵌入。这种公式使得ConFu能够捕捉高阶依赖关系,例如XOR类似的关系,这是单独通过成对对齐无法恢复的,同时仍然保持强大的成对对应关系。我们在合成和现实世界的多模态基准上评估了ConFu,评估其利用跨模态互补性、捕捉高阶依赖关系,并随着多模态复杂性增加而扩展的能力。在这些设置中,ConFu在检索和分类任务上展示了竞争性的性能,同时支持单一对一和两对一检索在同一个对比框架中。我们在https://github.com/estafons/confu 上发布了我们的代码和数据集。
更新时间: 2026-04-03 11:14:20
领域: cs.CV,cs.AI
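The ConFu entry above notes that XOR-like relationships cannot be recovered through pairwise alignment alone. The information-theoretic reason can be shown by exact enumeration: if x and y are independent fair bits and z = x XOR y, then z shares zero mutual information with x alone and with y alone, yet is fully determined by the pair (x, y). The sketch below is a generic illustration of that fact and is not part of the ConFu method; the function name is an assumption of this example.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(pairs):
    """Mutual information in bits between the two coordinates of equally likely pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )

# All four equally likely outcomes of two fair bits and their XOR.
samples = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

mi_x_z = mutual_information([(x, z) for x, _, z in samples])
mi_y_z = mutual_information([(y, z) for _, y, z in samples])
mi_xy_z = mutual_information([((x, y), z) for x, y, z in samples])

print(f"I(x; z)      = {mi_x_z:.1f} bits")
print(f"I(y; z)      = {mi_y_z:.1f} bits")
print(f"I((x, y); z) = {mi_xy_z:.1f} bits")
```

Any objective that only compares one modality against another sees no signal in this setting, which is the motivation for aligning a fused pair of modalities with the third.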
Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet $\leftrightarrow$ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.
Updated: 2026-04-03 11:11:10
标题: 通过影响函数编辑训练数据塑造模型行为
摘要: 影响函数通常用于将模型行为归因于训练文档。我们探索了相反的情况:制作诱导模型行为的训练数据。我们的框架Infusion使用可扩展的影响函数近似来计算对训练文档进行微小扰动,通过参数变化诱发模型行为的有针对性改变。我们在视觉和语言领域的数据污染任务上评估了Infusion。在CIFAR-10上,我们展示了通过Infusion对仅0.2%(100/45,000)的训练文档进行微小编辑可以与插入少量明确行为示例的基线竞争。我们还发现,Infusion可以跨架构传输(ResNet $\leftrightarrow$ CNN),表明一个受污染的语料库可以影响多个独立训练的模型。在初步的语言实验中,我们表征了我们的方法何时增加目标行为的概率以及何时失败,发现它在增强模型已经学习到的行为方面最有效。总的来说,这些结果表明对训练数据进行小而微妙的编辑可以系统地塑造模型行为,强调了对于攻击者和防御者来说,训练数据的可解释性的重要性。我们在这里提供代码:https://github.com/jrosseruk/infusion。
更新时间: 2026-04-03 11:11:10
领域: cs.LG,cs.AI,cs.CY
Inversion-Free Natural Gradient Descent on Riemannian Manifolds
The natural gradient method is widely used in statistical optimization, but its standard formulation assumes a Euclidean parameter space. This paper proposes an inversion-free stochastic natural gradient method for probability distributions whose parameters lie on a Riemannian manifold. The manifold setting offers several advantages: one can implicitly enforce parameter constraints such as positive definiteness and orthogonality, ensure parameters are identifiable, or guarantee regularity properties of the objective like geodesic convexity. Building on an intrinsic formulation of the Fisher information matrix (FIM) on a manifold, our method maintains an online approximation of the inverse FIM, which is efficiently updated at quadratic cost using score vectors sampled at successive iterates. In the Riemannian setting, these score vectors belong to different tangent spaces and must be combined using transport operations. We prove almost-sure convergence rates of $O(\log{s}/s^α)$ for the squared distance to the minimizer when the step size exponent $α>2/3$. We also establish almost-sure rates for the approximate FIM, which now accumulates transport-based errors. A limited-memory variant of the algorithm with sub-quadratic storage complexity is proposed. Finally, we demonstrate the effectiveness of our method relative to its Euclidean counterparts on variational Bayes with Gaussian approximations and normalizing flows.
Updated: 2026-04-03 11:08:59
Categories: stat.ML,cs.LG,stat.CO,stat.ME
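As a rough illustration of what "inversion-free" can mean in the natural gradient abstract above, the sketch below keeps a running inverse of a sum of score outer products using a Sherman-Morrison rank-one update, which costs O(d^2) per step instead of a fresh matrix inversion. This is only a Euclidean analogue under assumed names (`sherman_morrison_update` and the toy score vectors are hypothetical); the abstract does not state the paper's actual update rule, and the transport of score vectors between tangent spaces is not modeled here.

```python
def sherman_morrison_update(inv, g):
    """Return (M + g g^T)^{-1} given inv = M^{-1} for symmetric M, in O(d^2).

    Assumption: a plain Euclidean rank-one update, not the manifold version.
    """
    d = len(g)
    # ag = inv @ g
    ag = [sum(inv[i][j] * g[j] for j in range(d)) for i in range(d)]
    # denom = 1 + g^T inv g
    denom = 1.0 + sum(g[i] * ag[i] for i in range(d))
    # inv - (inv g)(inv g)^T / denom, valid because inv is symmetric
    return [[inv[i][j] - ag[i] * ag[j] / denom for j in range(d)]
            for i in range(d)]


# Start from the identity and fold in two hypothetical score vectors.
inv = [[1.0, 0.0], [0.0, 1.0]]
for score in ([1.0, 0.0], [0.0, 2.0]):
    inv = sherman_morrison_update(inv, score)
```

With these two toy scores the accumulated matrix is diag(2, 5), so the running inverse ends at diag(0.5, 0.2) without any explicit inversion.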
Human Psychometric Questionnaires Mischaracterize LLM Psychology: Evidence from Generation Behavior
Psychological profiling of large language models (LLMs) using psychometric questionnaires designed for humans has become widespread. However, it remains unclear whether the resulting profiles mirror the models' psychological characteristics expressed during their real-world interactions with users. To examine the risk of human questionnaires mischaracterizing LLM psychology, we compare two types of profiles for eight open-source LLMs: self-reported Likert scores from established questionnaires (PVQ-40, PVQ-21, BFI-44, BFI-10) and generation probability scores of value- or personality-laden responses to real-world user queries. The two profiles turn out to be substantially different and provide evidence that LLMs' responses to established questionnaires reflect desired behavior rather than stable psychological constructs, which challenges the consistent psychological dispositions of LLMs claimed in prior work. Established questionnaires also risk exaggerating the demographic biases of LLMs. Our results suggest caution when interpreting psychological profiles derived from established questionnaires and point to generation-based profiling as a more reliable approach to LLM psychometrics.
Updated: 2026-04-03 11:04:17
Categories: cs.CL,cs.AI
FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models
Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, a conclusion underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% to 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves effectiveness.
Updated: 2026-04-03 11:03:02
Categories: cs.AI,cs.CL
LMask: Learn to Solve Constrained Routing Problems with Lazy Masking
Routing problems are canonical combinatorial optimization tasks with wide-ranging applications in logistics, transportation, and supply chain management. However, solving these problems becomes significantly more challenging when complex constraints are involved. In this paper, we propose LMask, a novel learning framework that utilizes dynamic masking to generate high-quality feasible solutions for constrained routing problems. LMask introduces the LazyMask decoding method, which lazily refines feasibility masks with the backtracking mechanism. In addition, it employs the refinement intensity embedding to encode the search trace into the model, mitigating representation ambiguities induced by backtracking. To further reduce sampling cost, LMask sets a backtracking budget during decoding, while constraint violations are penalized in the loss function during training to counteract infeasibility caused by this budget. We provide theoretical guarantees for the validity and probabilistic optimality of our approach. Extensive experiments on the traveling salesman problem with time windows (TSPTW) and TSP with draft limits (TSPDL) demonstrate that LMask achieves state-of-the-art feasibility rates and solution quality, outperforming existing neural methods.
Updated: 2026-04-03 11:01:12
Categories: math.OC,cs.AI,cs.LG
Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows
Designing urban spaces that provide pedestrian wind comfort and safety requires time-resolved Computational Fluid Dynamics (CFD) simulations, but their current computational cost makes extensive design exploration impractical. We introduce WinDiNet (Wind Diffusion Network), a pretrained video diffusion model that is repurposed as a fast, differentiable surrogate for this task. Starting from LTX-Video, a 2B-parameter latent video transformer, we fine-tune on 10,000 2D incompressible CFD simulations over procedurally generated building layouts. A systematic study of training regimes, conditioning mechanisms, and VAE adaptation strategies, including a physics-informed decoder loss, identifies a configuration that outperforms purpose-built neural PDE solvers. The resulting model generates full 112-frame rollouts in under a second. As the surrogate is end-to-end differentiable, it doubles as a physics simulator for gradient-based inverse optimization: given an urban footprint layout, we optimize building positions directly through backpropagation to improve wind safety as well as pedestrian wind comfort. Experiments on single- and multi-inlet layouts show that the optimizer discovers effective layouts even under challenging multi-objective configurations, with all improvements confirmed by ground-truth CFD simulations.
Updated: 2026-04-03 11:00:44
Categories: cs.LG,cs.CE
Fast and Robust Simulation-Based Inference With Optimization Monte Carlo
Bayesian parameter inference for complex stochastic simulators is challenging due to intractable likelihood functions. Existing simulation-based inference methods often require a large number of simulations and become costly to use in high-dimensional parameter spaces or in problems with partially uninformative outputs. We propose a new method for differentiable simulators that delivers accurate posterior inference with substantially reduced runtimes. Building on the Optimization Monte Carlo framework, our approach reformulates inference for stochastic simulators in terms of deterministic optimization problems. Gradient-based methods are then applied to efficiently navigate toward high-density posterior regions and avoid wasteful simulations in low-probability areas. A JAX-based implementation further enhances the performance through vectorization of key method components. Extensive experiments, including high-dimensional parameter spaces, uninformative outputs, multiple observations, and multimodal posteriors, show that our method consistently matches, and often exceeds, the accuracy of state-of-the-art approaches, while reducing the runtime by a substantial margin.
Updated: 2026-04-03 10:43:25
Categories: cs.LG,stat.ML
LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation
Graph-based Retrieval-Augmented Generation (GraphRAG) enhances the reasoning capabilities of Large Language Models (LLMs) by grounding their responses in structured knowledge graphs. Leveraging community detection and relation filtering techniques, GraphRAG systems demonstrate inherent resistance to traditional RAG attacks, such as text poisoning and prompt injection. However, in this paper, we find that the security of GraphRAG systems fundamentally relies on the topological integrity of the underlying graph, which can be undermined by implicitly corrupting the logical connections, without altering surface-level text semantics. To exploit this vulnerability, we propose LogicPoison, a novel attack framework that targets logical reasoning rather than injecting false contents. Specifically, LogicPoison employs a type-preserving entity swapping mechanism to perturb both global logic hubs for disrupting overall graph connectivity and query-specific reasoning bridges for severing essential multi-hop inference paths. This approach effectively reroutes valid reasoning into dead ends while maintaining surface-level textual plausibility. Comprehensive experiments across multiple benchmarks demonstrate that LogicPoison successfully bypasses GraphRAG's defenses, significantly degrading performance and outperforming state-of-the-art baselines in both effectiveness and stealth. Our code is available at https://github.com/Jord8061/logicPoison.
Updated: 2026-04-03 10:42:07
Categories: cs.CL,cs.AI
Chain-of-Authorization: Embedding authorization into large language models
Although Large Language Models (LLMs) have evolved from text generators into the cognitive core of modern AI systems, their inherent lack of authorization awareness exposes these systems to catastrophic risks, ranging from unintentional data leakage to unauthorized command execution. Existing defense mechanisms are fundamentally decoupled from internal reasoning, rendering them insufficient for the complex security demands of dynamic AI systems. Here, we propose the Chain-of-Authorization (CoA) framework, a paradigm that internalizes access control as a foundational cognitive capability. By systematically redesigning the input-output format and fine-tuning the model on synthesized data with complex permission topologies, CoA forces the model to generate a structured authorization trajectory as a causal prerequisite for any substantive response or action, thereby enabling LLMs to internalize access boundaries within dynamic reasoning environments. CoA maintains high utility in authorized scenarios while achieving high rejection rates of unauthorized prompts and robust defense against diverse adversarial attacks. By embedding authorization directly into the reasoning process, CoA provides a principled architectural blueprint for deploying secure LLMs as the cognitive cores of modern AI systems.
Updated: 2026-04-03 10:35:49
Categories: cs.AI
How Annotation Trains Annotators: Competence Development in Social Influence Recognition
Human data annotation, especially when involving experts, is often treated as an objective reference. However, many annotation tasks are inherently subjective, and annotators' judgments may evolve over time. This study investigates changes in the quality of annotators' work from a competence perspective during a process of social influence recognition. The study involved 25 annotators from five different groups, including both experts and non-experts, who annotated a dataset of 1,021 dialogues with 20 social influence techniques, along with intentions, reactions, and consequences. An initial subset of 150 texts was annotated twice - before and after the main annotation process - to enable comparison. To measure competence shifts, we combined qualitative and quantitative analyses of the annotated data, semi-structured interviews with annotators, self-assessment surveys, and Large Language Model training and evaluation on the comparison dataset. The results indicate a significant increase in annotators' self-perceived competence and confidence. Moreover, observed changes in data quality suggest that the annotation process may enhance annotator competence and that this effect is more pronounced in expert groups. The observed shifts in annotator competence have a visible impact on the performance of LLMs trained on their annotated data.
Updated: 2026-04-03 10:32:57
Categories: cs.CL,cs.AI
Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models
Retrieval-Augmented Generation (RAG) improves factual grounding by incorporating external knowledge into language model generation. However, when retrieved context is noisy, unreliable, or inconsistent with the model's parametric knowledge, it introduces retrieval-prior conflicts that can degrade generation quality. While this problem has been studied in autoregressive language models, it remains largely unexplored in diffusion-based language models, where the iterative denoising process introduces unique challenges for integrating retrieved context. In this work, we propose Adaptive Retrieval-Augmented Masked Diffusion (ARAM), a training-free adaptive guidance framework for Masked Diffusion Models (MDMs) in RAG settings. ARAM dynamically calibrates the guidance scale during denoising according to the Signal-to-Noise Ratio (SNR) of the distributional shift induced by retrieved context. Intuitively, the model strengthens guidance when the retrieved context provides reliable corrective evidence and suppresses it when the contextual signal is noisy or non-supportive. Extensive experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.
Updated: 2026-04-03 10:32:08
Categories: cs.CL,cs.AI,cs.LG
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge in that harmful behavior may emerge through sequences of individually plausible steps, including intermediate actions that appear locally acceptable but collectively lead to unauthorized actions. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We evaluate AgentHazard on Claude Code, OpenClaw, and IFlow using mostly open or openly deployable models from the Qwen3, Kimi, GLM, and DeepSeek families. Our experimental results indicate that current systems remain highly vulnerable. In particular, when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of 73.63%, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.
Updated: 2026-04-03 10:29:31
Categories: cs.AI
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
Learning methods using synthetic data have attracted attention as an effective approach for increasing the diversity of training data while reducing collection costs, thereby improving the robustness of model discrimination. However, many existing methods improve robustness only indirectly through the diversification of training samples and do not explicitly teach the model which regions in the input space truly contribute to discrimination; consequently, the model may learn spurious correlations caused by synthesis biases and artifacts. Motivated by this limitation, this paper proposes a learning framework that uses provenance information obtained during the training data synthesis process, indicating whether each region in the input space originates from the target object, as an auxiliary supervisory signal to promote the acquisition of representations focused on target regions. Specifically, input gradients are decomposed based on information about target and non-target regions during synthesis, and input gradient guidance is introduced to suppress gradients over non-target regions. This suppresses the model's reliance on non-target regions and directly promotes the learning of discriminative representations for target regions. Experiments demonstrate the effectiveness and generality of the proposed method across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification.
Updated: 2026-04-03 10:28:58
Categories: cs.CV,cs.AI,cs.LG
Explainable Machine Learning Reveals 12-Fold Ucp1 Upregulation and Thermogenic Reprogramming in Female Mouse White Adipose Tissue After 37 Days of Microgravity: First AI/ML Analysis of NASA OSD-970
Microgravity induces profound metabolic adaptations in mammalian physiology, yet the molecular mechanisms governing thermogenesis in female white adipose tissue (WAT) remain poorly characterized. This paper presents the first machine learning (ML) analysis of NASA Open Science Data Repository (OSDR) dataset OSD-970, derived from the Rodent Research-1 (RR-1) mission. Using RT-qPCR data from 89 adipogenesis and thermogenesis pathway genes in gonadal WAT of 16 female C57BL/6J mice (8 flight, 8 ground control) following 37 days aboard the International Space Station (ISS), we applied differential expression analysis, multiple ML classifiers with Leave-One-Out Cross-Validation (LOO-CV), and Explainable AI via SHapley Additive exPlanations (SHAP). The most striking finding is a dramatic 12.21-fold upregulation of Ucp1 (Delta-Delta-Ct = -3.61, p = 0.0167) in microgravity-exposed WAT, accompanied by significant activation of the thermogenesis pathway (mean pathway fold-change = 3.24). The best-performing model (Random Forest with top-20 features) achieved AUC = 0.922, Accuracy = 0.812, and F1 = 0.824 via LOO-CV. SHAP analysis consistently ranked Ucp1 among the top predictive features, while Angpt2, Irs2, Jun, and Klf-family transcription factors emerged as dominant consensus classifiers. Principal component analysis (PCA) revealed clear separation between flight and ground samples, with PC1 explaining 69.1% of variance. These results suggest rapid thermogenic reprogramming in female WAT as a compensatory response to microgravity. This study demonstrates the power of explainable AI for re-analysis of newly released NASA space biology datasets, with direct implications for female astronaut health on long-duration missions and for Earth-based obesity and metabolic disease research.
Updated: 2026-04-03 10:18:12
Categories: cs.LG
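The fold-change and Delta-Delta-Ct figures quoted in the Ucp1 abstract above are consistent with the widely used 2^(-ΔΔCt) convention for RT-qPCR data. The abstract does not name the convention, so treating it as the one used is an assumption, and `fold_change` is a hypothetical helper name. A minimal arithmetic check:

```python
def fold_change(delta_delta_ct):
    """Relative expression under the assumed 2^(-ddCt) convention."""
    return 2.0 ** (-delta_delta_ct)


# Delta-Delta-Ct value reported in the abstract for Ucp1.
ucp1_fold = fold_change(-3.61)
print(round(ucp1_fold, 2))
```

A negative Delta-Delta-Ct therefore corresponds to upregulation, and -3.61 maps to roughly the 12.21-fold figure quoted in the abstract.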
Equivariant Evidential Deep Learning for Interatomic Potentials
Uncertainty quantification (UQ) is critical for assessing the reliability of machine learning interatomic potentials (MLIPs) in molecular dynamics (MD) simulations, identifying extrapolation regimes and enabling uncertainty-aware workflows such as active learning for training dataset construction. Existing UQ approaches for MLIPs are often limited by high computational cost or suboptimal performance. Evidential deep learning (EDL) provides a theoretically grounded single-model alternative that determines both aleatoric and epistemic uncertainty in a single forward pass. However, extending evidential formulations from scalar targets to vector-valued quantities such as atomic forces introduces substantial challenges, particularly in maintaining statistical self-consistency under rotational transformations. To address this, we propose Equivariant Evidential Deep Learning for Interatomic Potentials ($\text{e}^2$IP), a backbone-agnostic framework that models atomic forces and their uncertainty jointly by representing uncertainty as a full $3\times3$ symmetric positive definite covariance tensor that transforms equivariantly under rotations. Experiments on diverse molecular benchmarks show that $\text{e}^2$IP provides a stronger accuracy-efficiency-reliability balance than the non-equivariant evidential baseline and the widely used ensemble method. It also achieves better data efficiency through the fully equivariant architecture while retaining single-model inference efficiency.
Updated: 2026-04-03 10:15:06
Categories: cs.LG,cs.AI
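To make concrete what it means in the abstract above for a covariance tensor to transform equivariantly under rotations, the standard transformation rule is written out below. The symbols $\mathbf{F}$ (force on an atom), $\Sigma$ (its covariance), and $R$ (a rotation) are notation assumed here for illustration and are not taken from the paper.

```latex
% Assumed notation: F is a force vector, Sigma its 3x3 covariance, R a rotation.
\mathbf{F} \mapsto R\,\mathbf{F}
\quad\Longrightarrow\quad
\Sigma \mapsto R\,\Sigma\,R^{\top},
\qquad R \in SO(3).

% Symmetry and positive definiteness are preserved: for any x \neq 0,
x^{\top}\bigl(R\,\Sigma\,R^{\top}\bigr)x
  = \bigl(R^{\top}x\bigr)^{\top}\,\Sigma\,\bigl(R^{\top}x\bigr) > 0 .
```

Because $R^{\top}x$ is nonzero whenever $x$ is, a rotated symmetric positive definite covariance stays symmetric positive definite, so the predicted uncertainty remains a valid covariance in every rotated frame.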
Early Classification of Time Series in Non-Stationary Cost Regimes
Early Classification of Time Series (ECTS) addresses decision-making problems in which predictions must be made as early as possible while maintaining high accuracy. Most existing ECTS methods assume that the time-dependent decision costs governing the learning objective are known, fixed, and correctly specified. In practice, however, these costs are often uncertain and may change over time, leading to mismatches between training-time and deployment-time objectives. In this paper, we study ECTS under two practically relevant forms of cost non-stationarity: drift in the balance between misclassification and decision delay costs, and stochastic realizations of decision costs that deviate from the nominal training-time model. To address these challenges, we revisit representative ECTS approaches and adapt them to an online learning setting. Focusing on separable methods, we update only the triggering model during deployment, while keeping the classifier fixed. We propose several online adaptations and baselines, including bandit-based and RL-based approaches, and conduct controlled experiments on synthetic data to systematically evaluate robustness under cost non-stationarity. Our results demonstrate that online learning can effectively improve the robustness of ECTS methods to cost drift, with RL-based strategies exhibiting strong and stable performance across varying cost regimes.
Updated: 2026-04-03 10:07:16
Categories: cs.LG
From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving
Autonomous driving technologies have achieved significant advances in recent years, yet their real-world deployment remains constrained by data scarcity, safety requirements, and the need for generalization across diverse environments. In response, synthetic data and virtual environments have emerged as powerful enablers, offering scalable, controllable, and richly annotated scenarios for training and evaluation. This survey presents a comprehensive review of recent developments at the intersection of autonomous driving, simulation technologies, and synthetic datasets. We organize the landscape across three core dimensions: (i) the use of synthetic data for perception and planning, (ii) digital twin-based simulation for system validation, and (iii) domain adaptation strategies bridging synthetic and real-world data. We also highlight the role of vision-language models and simulation realism in enhancing scene understanding and generalization. A detailed taxonomy of datasets, tools, and simulation platforms is provided, alongside an analysis of trends in benchmark design. Finally, we discuss critical challenges and open research directions, including Sim2Real transfer, scalable safety validation, cooperative autonomy, and simulation-driven policy learning, that must be addressed to accelerate the path toward safe, generalizable, and globally deployable autonomous driving systems.
Updated: 2026-04-03 10:04:30
Categories: cs.AI
Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms
Routing algorithms are crucial for efficient computer network operations, and in many settings they must be able to react to traffic bursts within milliseconds. Live telemetry data can provide informative signals to routing algorithms, and recent work has trained neural networks to exploit such signals for traffic-aware routing. Yet, aggregating network-wide information is subject to communication delays, and existing neural approaches either assume unrealistic delay-free global states, or restrict routers to purely local telemetry. This leaves their deployability in real-world environments unclear. We cast telemetry-aware routing as a delay-aware closed-loop control problem and introduce a framework that trains and evaluates neural routing algorithms, while explicitly modeling communication and inference delays. On top of this framework, we propose LOGGIA, a scalable graph neural routing algorithm that predicts log-space link weights from attributed topology-and-telemetry graphs. It utilizes a data-driven pre-training stage, followed by on-policy Reinforcement Learning. Across synthetic and real network topologies, and unseen mixed TCP/UDP traffic sequences, LOGGIA consistently outperforms shortest-path baselines, whereas neural baselines fail once realistic delays are enforced. Our experiments further suggest that neural routing algorithms like LOGGIA perform best when deployed fully locally, i.e., observing network states and inferring actions at every router individually, as opposed to centralized decision making.
Updated: 2026-04-03 09:51:08
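Title: Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms
Abstract: Routing algorithms are crucial for efficient computer network operation, and in many settings they must react to traffic bursts within milliseconds. Live telemetry data can provide informative signals to routing algorithms, and recent work has trained neural networks to exploit such signals for traffic-aware routing. However, aggregating network-wide information is subject to communication delays, and existing neural approaches either assume unrealistic delay-free global states or restrict routers to purely local telemetry. This leaves their deployability in real-world environments unclear. We cast telemetry-aware routing as a delay-aware closed-loop control problem and introduce a framework for training and evaluating neural routing algorithms while explicitly modeling communication and inference delays. On top of this framework, we propose LOGGIA, a scalable graph neural routing algorithm that predicts log-space link weights from attributed topology-and-telemetry graphs. It uses a data-driven pre-training stage followed by on-policy reinforcement learning. Across synthetic and real network topologies and unseen mixed TCP/UDP traffic sequences, LOGGIA consistently outperforms shortest-path baselines, whereas neural baselines fail once realistic delays are enforced. Our experiments further suggest that neural routing algorithms like LOGGIA perform best when deployed fully locally, that is, observing network states and inferring actions at every router individually, rather than through centralized decision making.
Updated: 2026-04-03 09:51:08
Categories: cs.LG,cs.NI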
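As a rough illustration of the LOGGIA entry above, the sketch below shows one way log-space link weights could drive routing: exponentiating them makes every weight strictly positive, so an ordinary Dijkstra search applies. The abstract does not say how LOGGIA turns its weights into forwarding decisions, so the shortest-path step, the function name `shortest_path`, and the toy weights are all assumptions made for illustration.

```python
import heapq
import math


def shortest_path(log_weights, source, target):
    """Dijkstra over directed links whose weights are given in log-space.

    log_weights maps (u, v) -> log of the link weight (hypothetical format).
    Returns the node list of a cheapest path, or None if target is unreachable.
    """
    # exp() keeps every link weight strictly positive, as Dijkstra requires.
    graph = {}
    for (u, v), lw in log_weights.items():
        graph.setdefault(u, []).append((v, math.exp(lw)))
        graph.setdefault(v, [])

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))

    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical predicted log-weights: the direct link A->B is "congested".
log_w = {("A", "B"): 2.0, ("A", "C"): 0.0, ("C", "B"): 0.0}
route = shortest_path(log_w, "A", "B")
print(route)
```

With these toy values the direct link costs exp(2), while the detour through C costs 1 + 1, so the detour is chosen.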
Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus
Large Language Models (LLMs), particularly those employing Mixture-of-Experts (MoE) architectures, have achieved remarkable capabilities across diverse natural language processing tasks. However, these models frequently suffer from hallucinations -- generating plausible but factually incorrect content -- and exhibit systematic biases that are amplified by uneven expert activation during inference. In this paper, we propose the Council Mode, a novel multi-agent consensus framework that addresses these limitations by dispatching queries to multiple heterogeneous frontier LLMs in parallel and synthesizing their outputs through a dedicated consensus model. The Council pipeline operates in three phases: (1) an intelligent triage classifier that routes queries based on complexity, (2) parallel expert generation across architecturally diverse models, and (3) a structured consensus synthesis that explicitly identifies agreement, disagreement, and unique findings before producing the final response. We implement and evaluate this architecture within an open-source AI workspace. Our comprehensive evaluation across multiple benchmarks demonstrates that the Council Mode achieves a 35.9% relative reduction in hallucination rates on the HaluEval benchmark and a 7.8-point improvement on TruthfulQA compared to the best-performing individual model, while maintaining significantly lower bias variance across domains. We provide the mathematical formulation of the consensus mechanism, detail the system architecture, and present extensive empirical results with ablation studies.
Updated: 2026-04-03 09:40:43
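Title: Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus
Abstract: Large Language Models (LLMs), particularly those employing Mixture-of-Experts (MoE) architectures, have achieved remarkable capabilities across diverse natural language processing tasks. However, these models frequently hallucinate, generating plausible but factually incorrect content, and exhibit systematic biases that are amplified by uneven expert activation during inference. In this paper, we propose the Council Mode, a novel multi-agent consensus framework that addresses these limitations by dispatching queries to multiple heterogeneous frontier LLMs in parallel and synthesizing their outputs through a dedicated consensus model. The Council pipeline operates in three phases: (1) an intelligent triage classifier that routes queries based on complexity, (2) parallel expert generation across architecturally diverse models, and (3) a structured consensus synthesis that explicitly identifies agreement, disagreement, and unique findings before producing the final response. We implement and evaluate this architecture within an open-source AI workspace. Our comprehensive evaluation across multiple benchmarks shows that the Council Mode achieves a 35.9% relative reduction in hallucination rates on the HaluEval benchmark and a 7.8-point improvement on TruthfulQA compared to the best-performing individual model, while maintaining significantly lower bias variance across domains. We provide the mathematical formulation of the consensus mechanism, detail the system architecture, and present extensive empirical results including ablation studies.
Updated: 2026-04-03 09:40:43
Categories: cs.CL,cs.AI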
(PAC-)Learning state machines from data streams: A generic strategy and an improved heuristic (Extended version)
This is an extended version of our publication Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco. It has been extended with a formal proof on PAC-bounds, and the discussion and analysis of a similar approach has been moved from the appendix and is now a full Section. State machine models simulate the behavior of discrete event systems; they can represent systems such as software systems, network interactions, and control systems, and have been researched extensively. Most learning algorithms, however, assume that all data are available at the start of the algorithm, and little research has been done on learning state machines from streaming data. In this paper, we close this gap further by presenting a generic method for learning state machines from data streams, as well as a merge heuristic that uses sketches to account for incomplete prefix trees. We implement our approach in an open-source state merging library and compare it with existing methods. We show the effectiveness of our approach with respect to run-time, memory consumption, and quality of results on a well-known open dataset. Additionally, we provide a formal analysis of our algorithm, showing that it is capable of learning within the PAC framework, and show a theoretical improvement that improves run-time without sacrificing the correctness of the algorithm for larger sample sizes.
Updated: 2026-04-03 09:40:09
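Title: (PAC-)Learning state machines from data streams: A generic strategy and an improved heuristic (Extended version)
Abstract: This is an extended version of our publication Learning state machines from data streams: A generic strategy and an improved heuristic, International Conference on Grammatical Inference (ICGI) 2023, Rabat, Morocco. It has been extended with a formal proof on PAC-bounds, and the discussion and analysis of a similar approach has been moved from the appendix into a full section. State machine models simulate the behavior of discrete event systems, can represent systems such as software systems, network interactions, and control systems, and have been studied extensively. Most learning algorithms, however, assume that all data are available when the algorithm starts, and little research has addressed learning state machines from streaming data. In this paper, we close this gap further by presenting a generic method for learning state machines from data streams, together with a merge heuristic that uses sketches to account for incomplete prefix trees. We implement our approach in an open-source state merging library and compare it with existing methods. We show the effectiveness of our approach with respect to run-time, memory consumption, and quality of results on a well-known open dataset. In addition, we provide a formal analysis of our algorithm, showing that it can learn within the PAC framework, and present a theoretical improvement that improves run-time without sacrificing the correctness of the algorithm for larger sample sizes.
Updated: 2026-04-03 09:40:09
Categories: cs.FL,cs.LG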
Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits
Classical stochastic-approximation analyses treat the covariance of stochastic gradients as an exogenous modeling input. We show that under exchangeable mini-batch sampling this covariance is identified by the sampling mechanism itself: to leading order it is the projected covariance of per-sample gradients. In well-specified likelihood problems this reduces locally to projected Fisher information; for general M-estimation losses the same object is the projected gradient covariance G*(theta), which together with the Hessian induces sandwich/Godambe geometry. This identification -- not the subsequent diffusion or Lyapunov machinery, which is classical once the noise matrix is given -- is the paper's main contribution. It endogenizes the diffusion coefficient (with effective temperature tau = eta/b), determines the stationary covariance via a Lyapunov equation whose inputs are now structurally fixed, and selects the identified statistical geometry as the natural metric for convergence analysis. We prove matching upper and lower bounds of order Theta(1/N) for risk in this metric under an oracle budget N; the lower bound is established first via a van Trees argument in the parametric Fisher setting and then extended to adaptive oracle transcripts under a predictable-information condition and mild conditional likelihood regularity. Translating these bounds into oracle complexity yields epsilon-stationarity guarantees in the Fisher dual norm that depend on an intrinsic effective dimension d_eff and a statistical condition number kappa_F, rather than ambient dimension or Euclidean conditioning. Numerical experiments confirm the Lyapunov predictions at both continuous-time and discrete-time levels and show that scalar temperature matching cannot reproduce directional noise structure.
Updated: 2026-04-03 09:38:35
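Title: Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits
Abstract: Classical stochastic-approximation analyses treat the covariance of stochastic gradients as an exogenous modeling input. We show that under exchangeable mini-batch sampling this covariance is identified by the sampling mechanism itself: to leading order it is the projected covariance of per-sample gradients. In well-specified likelihood problems this reduces locally to projected Fisher information; for general M-estimation losses the same object is the projected gradient covariance G*(theta), which together with the Hessian induces sandwich/Godambe geometry. This identification, rather than the subsequent diffusion or Lyapunov machinery, which is classical once the noise matrix is given, is the paper's main contribution. It endogenizes the diffusion coefficient (with effective temperature tau = eta/b), determines the stationary covariance via a Lyapunov equation whose inputs are now structurally fixed, and selects the identified statistical geometry as the natural metric for convergence analysis. We prove matching upper and lower bounds of order Theta(1/N) for risk in this metric under an oracle budget N; the lower bound is first established via a van Trees argument in the parametric Fisher setting and then extended to adaptive oracle transcripts under a predictable-information condition and mild conditional likelihood regularity. Translating these bounds into oracle complexity yields epsilon-stationarity guarantees in the Fisher dual norm that depend on an intrinsic effective dimension d_eff and a statistical condition number kappa_F, rather than on ambient dimension or Euclidean conditioning. Numerical experiments confirm the Lyapunov predictions at both the continuous-time and discrete-time levels and show that scalar temperature matching cannot reproduce directional noise structure.
Updated: 2026-04-03 09:38:35
Categories: stat.ML,cs.LG,math.OC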
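To make the Lyapunov step in the Fisher-geometric SGD entry above concrete, here is a small worked sketch. It assumes the textbook linearized form H Sigma + Sigma H = tau G with tau = eta/b and a Hessian that is diagonal in the chosen basis; the abstract does not state the paper's exact equation or scaling, so the form, the function names, and the numbers are illustrative assumptions. With diagonal H the solution is available entrywise: Sigma_ij = tau G_ij / (h_i + h_j).

```python
def stationary_covariance(h, G, eta, b):
    """Solve H*Sigma + Sigma*H = tau*G for diagonal H = diag(h), tau = eta/b.

    Assumed textbook form, not taken from the paper. Entry (i, j) of the
    left-hand side is (h[i] + h[j]) * Sigma[i][j], so each entry decouples.
    """
    tau = eta / b  # effective temperature
    n = len(h)
    return [[tau * G[i][j] / (h[i] + h[j]) for j in range(n)] for i in range(n)]


def lyapunov_residual(h, G, Sigma, eta, b):
    """Largest absolute entry of H*Sigma + Sigma*H - tau*G."""
    tau = eta / b
    n = len(h)
    return max(
        abs((h[i] + h[j]) * Sigma[i][j] - tau * G[i][j])
        for i in range(n)
        for j in range(n)
    )


# Hypothetical curvature and gradient-noise values.
h = [1.0, 4.0]                      # Hessian eigenvalues
G = [[2.0, 0.5], [0.5, 1.0]]        # gradient covariance in the same basis
Sigma = stationary_covariance(h, G, eta=0.1, b=10)
print(Sigma, lyapunov_residual(h, G, Sigma, eta=0.1, b=10))
```

In this toy case the stationary variance along each axis depends on both the noise in that direction and the curvature there, which is the kind of directional structure a single scalar temperature cannot capture.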
WiseMind: a knowledge-guided multi-agent framework for accurate and empathetic psychiatric diagnosis
Large Language Models (LLMs) offer promising opportunities to support mental healthcare workflows, yet they often lack the structured clinical reasoning needed for reliable diagnosis and may struggle to provide the emotionally attuned communication essential for patient trust. Here, we introduce WiseMind, a novel multi-agent framework inspired by the theory of Dialectical Behavior Therapy designed to facilitate psychiatric assessment. By integrating a "Reasonable Mind" Agent for evidence-based logic and an "Emotional Mind" Agent for empathetic communication, WiseMind effectively bridges the gap between instrumental accuracy and humanistic care. Our framework utilizes a Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5)-guided Structured Knowledge Graph to steer diagnostic inquiries, significantly reducing hallucinations compared to standard prompting methods. Using a combination of virtual standard patients, simulated interactions, and real human interaction datasets, we evaluate WiseMind across three common psychiatric conditions. WiseMind outperforms state-of-the-art LLM methods in both identifying critical diagnostic nodes and establishing accurate differential diagnoses. Across 1206 simulated conversations and 180 real user sessions, the system achieves 85.6% top-1 diagnostic accuracy, approaching reported diagnostic performance ranges of board-certified psychiatrists and surpassing knowledge-enhanced single-agent baselines by 15-54 percentage points. Expert review by psychiatrists further validates that WiseMind generates responses that are not only clinically sound but also psychologically supportive, demonstrating the feasibility of empathetic, reliable AI agents to conduct psychiatric assessments under appropriate human oversight.
Updated: 2026-04-03 09:38:21
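Title: WiseMind: a knowledge-guided multi-agent framework for accurate and empathetic psychiatric diagnosis
Abstract: Large Language Models (LLMs) offer promising opportunities to support mental healthcare workflows, yet they often lack the structured clinical reasoning needed for reliable diagnosis and may struggle to provide the emotionally attuned communication essential for patient trust. Here, we introduce WiseMind, a novel multi-agent framework inspired by the theory of Dialectical Behavior Therapy and designed to facilitate psychiatric assessment. By integrating a "Reasonable Mind" Agent for evidence-based logic and an "Emotional Mind" Agent for empathetic communication, WiseMind bridges the gap between instrumental accuracy and humanistic care. Our framework uses a Structured Knowledge Graph guided by the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) to steer diagnostic inquiries, significantly reducing hallucinations compared to standard prompting methods. Using a combination of virtual standard patients, simulated interactions, and real human interaction datasets, we evaluate WiseMind across three common psychiatric conditions. WiseMind outperforms state-of-the-art LLM methods in both identifying critical diagnostic nodes and establishing accurate differential diagnoses. Across 1206 simulated conversations and 180 real user sessions, the system achieves 85.6% top-1 diagnostic accuracy, approaching the reported diagnostic performance range of board-certified psychiatrists and surpassing knowledge-enhanced single-agent baselines by 15-54 percentage points. Expert review by psychiatrists further validates that WiseMind generates responses that are not only clinically sound but also psychologically supportive, demonstrating the feasibility of empathetic, reliable AI agents conducting psychiatric assessments under appropriate human oversight.
Updated: 2026-04-03 09:38:21
Categories: cs.AI,cs.CL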
Efficient Logistic Regression with Mixture of Sigmoids
This paper studies the Exponential Weights (EW) algorithm with an isotropic Gaussian prior for online logistic regression. We show that the near-optimal worst-case regret bound $O(d\log(Bn))$ for EW, established by Kakade and Ng (2005) against the best linear predictor of norm at most $B$, can be achieved with total worst-case computational complexity $O(B^3 n^5)$. This substantially improves on the $O(B^{18}n^{37})$ complexity of prior work achieving the same guarantee (Foster et al., 2018). Beyond efficiency, we analyze the large-$B$ regime under linear separability: after rescaling by $B$, the EW posterior converges as $B\to\infty$ to a standard Gaussian truncated to the version cone. Accordingly, the predictor converges to a solid-angle vote over separating directions and, on every fixed-margin slice of this cone, the mode of the corresponding truncated Gaussian is aligned with the hard-margin SVM direction. Using this geometry, we derive non-asymptotic regret bounds showing that once $B$ exceeds a margin-dependent threshold, the regret becomes independent of $B$ and grows only logarithmically with the inverse margin. Overall, our results show that EW can be both computationally tractable and geometrically adaptive in online classification.
Updated: 2026-04-03 09:36:34
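Title: Efficient Logistic Regression with Mixture of Sigmoids
Abstract: This paper studies the Exponential Weights (EW) algorithm with an isotropic Gaussian prior for online logistic regression. We show that the near-optimal worst-case regret bound $O(d\log(Bn))$ for EW, established by Kakade and Ng (2005) against the best linear predictor of norm at most $B$, can be achieved with total worst-case computational complexity $O(B^3 n^5)$. This substantially improves on the $O(B^{18}n^{37})$ complexity of prior work achieving the same guarantee (Foster et al., 2018). Beyond efficiency, we analyze the large-$B$ regime under linear separability: after rescaling by $B$, the EW posterior converges as $B\to\infty$ to a standard Gaussian truncated to the version cone. Accordingly, the predictor converges to a solid-angle vote over separating directions and, on every fixed-margin slice of this cone, the mode of the corresponding truncated Gaussian is aligned with the hard-margin SVM direction. Using this geometry, we derive non-asymptotic regret bounds showing that once $B$ exceeds a margin-dependent threshold, the regret becomes independent of $B$ and grows only logarithmically with the inverse margin. Overall, our results show that EW can be both computationally tractable and geometrically adaptive in online classification.
Updated: 2026-04-03 09:36:34
Categories: cs.LG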
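For the Exponential Weights entry above, the "mixture of sigmoids" predictor can be sketched with naive Monte Carlo: draw parameter vectors from the Gaussian prior, weight each by the exponential of its negative cumulative logistic loss, and average the resulting sigmoids. This is only a brute-force illustration of the predictor, not the efficient algorithm the paper analyzes; the names `ew_predict` and `prior_std`, the learning rate of 1, and the sample count are assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(t):
    """Stable log(1 + exp(t)); the logistic loss is softplus(-y * theta.x)."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def ew_predict(history, x, n_samples=2000, prior_std=1.0, seed=0):
    """Estimate P(y = +1 | x) under an exponential-weights posterior.

    history is a list of (features, label) pairs with labels in {-1, +1}.
    Each prior sample gets weight exp(-cumulative logistic loss).
    """
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        theta = [rng.gauss(0.0, prior_std) for _ in range(len(x))]
        loss = sum(softplus(-y * dot(theta, xs)) for xs, y in history)
        w = math.exp(-loss)
        num += w * sigmoid(dot(theta, x))
        den += w
    return num / den


# Toy stream: five positive examples along the first coordinate.
history = [([1.0, 0.0], +1)] * 5
p = ew_predict(history, [1.0, 0.0])
print(p)
```

The cost of this sampler grows with both the sample count and the history length, and its accuracy degrades in higher dimensions, which is why a dedicated algorithm with a polynomial guarantee is of interest.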
Scalable Mean-Variance Portfolio Optimization via Subspace Embeddings and GPU-Friendly Nesterov-Accelerated Projected Gradient
We develop a sketch-based factor reduction and a Nesterov-accelerated projected gradient algorithm (NPGA) with GPU acceleration, yielding a doubly accelerated solver for large-scale constrained mean-variance portfolio optimization. Starting from the sample covariance factor $L$, the method combines randomized subspace embedding, spectral truncation, and ridge stabilization to construct an effective factor $L_{eff}$. It then solves the resulting constrained problem with a structured projection computed by scalar dual search and GPU-friendly matrix-vector kernels, yielding one computational pipeline for the baseline, sketched, and Sketch-Truncate-Ridge (STR)-regularized models. We also establish approximation, conditioning, and stability guarantees for the sketching and STR models, including explicit $O(\varepsilon)$ bounds for the covariance approximation, the optimal value error, and the solution perturbation under $(\varepsilon,\delta)$-subspace embeddings. Experiments on synthetic and real equity-return data show that the method preserves objective accuracy while reducing runtime substantially. On a 5440-asset real-data benchmark with 48374 training periods, NPGA-GPU solves the unreduced full model in 2.80 seconds versus 64.84 seconds for Gurobi, while the optimized compressed GPU variants remain in the low-single-digit-second regime. These results show that the full dense model is already practical on modern GPUs and that, after compression, the remaining bottleneck is projection rather than matrix-vector multiplication.
Updated: 2026-04-03 09:35:05
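Title: Scalable Mean-Variance Portfolio Optimization via Subspace Embeddings and GPU-Friendly Nesterov-Accelerated Projected Gradient
Abstract: We develop a sketch-based factor reduction and a Nesterov-accelerated projected gradient algorithm (NPGA) with GPU acceleration, yielding a doubly accelerated solver for large-scale constrained mean-variance portfolio optimization. Starting from the sample covariance factor $L$, the method combines randomized subspace embedding, spectral truncation, and ridge stabilization to construct an effective factor $L_{eff}$. It then solves the resulting constrained problem using a structured projection computed by scalar dual search and GPU-friendly matrix-vector kernels, giving a single computational pipeline for the baseline, sketched, and Sketch-Truncate-Ridge (STR)-regularized models. We also establish approximation, conditioning, and stability guarantees for the sketching and STR models, including explicit $O(\varepsilon)$ bounds for the covariance approximation, the optimal value error, and the solution perturbation under $(\varepsilon,\delta)$-subspace embeddings. Experiments on synthetic and real equity-return data show that the method preserves objective accuracy while substantially reducing runtime. On a 5440-asset real-data benchmark with 48374 training periods, NPGA-GPU solves the unreduced full model in 2.80 seconds versus 64.84 seconds for Gurobi, while the optimized compressed GPU variants remain in the low-single-digit-second range. These results show that the full dense model is already practical on modern GPUs and that, after compression, the remaining bottleneck is projection rather than matrix-vector multiplication.
Updated: 2026-04-03 09:35:05
Categories: math.OC,cs.CE,cs.DC,cs.LG,math.NA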
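The portfolio entry above combines Nesterov acceleration with a projection found by a scalar dual search. The following CPU-only sketch shows that pattern on a tiny problem, assuming a long-only, fully invested constraint set (the probability simplex) and the objective 0.5 w'Σw - γ μ'w; the abstract does not specify the paper's constraint set, step-size rule, or momentum schedule, so those choices and all names here are assumptions.

```python
def project_simplex(v, iters=60):
    """Euclidean projection onto {w >= 0, sum(w) = 1}.

    Bisection on the scalar dual variable t so that sum(max(v_i - t, 0)) = 1.
    """
    lo, hi = min(v) - 1.0, max(v)   # the sum is >= 1 at lo and 0 at hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(max(x - mid, 0.0) for x in v) > 1.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return [max(x - t, 0.0) for x in v]


def npg_portfolio(Sigma, mu, gamma=1.0, iters=500):
    """Nesterov-accelerated projected gradient for a mean-variance objective."""
    n = len(mu)
    # The max absolute row sum upper-bounds the largest eigenvalue of Sigma.
    step = 1.0 / max(sum(abs(a) for a in row) for row in Sigma)
    w = [1.0 / n] * n
    w_prev = w[:]
    for k in range(1, iters + 1):
        beta = (k - 1) / (k + 2)    # standard momentum schedule
        y = [wi + beta * (wi - wp) for wi, wp in zip(w, w_prev)]
        grad = [
            sum(Sigma[i][j] * y[j] for j in range(n)) - gamma * mu[i]
            for i in range(n)
        ]
        w_prev = w
        w = project_simplex([yi - step * gi for yi, gi in zip(y, grad)])
    return w


# Two uncorrelated assets with equal expected return (hypothetical numbers).
Sigma = [[0.04, 0.0], [0.0, 0.09]]
mu = [0.05, 0.05]
weights = npg_portfolio(Sigma, mu)
print(weights)
```

Because the expected returns are equal, the optimum is the minimum-variance mix, which weights each asset inversely to its variance: 9/13 and 4/13. The only dense operation per iteration is the matrix-vector product inside the gradient, which is the part that maps naturally to a GPU kernel.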
Split and Conquer Partial Deepfake Speech
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
Updated: 2026-04-03 09:33:01
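Title: Split and Conquer Partial Deepfake Speech
Abstract: Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, which makes the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
Updated: 2026-04-03 09:33:01
Categories: cs.SD,cs.AI,cs.LG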
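One plausible reading of the "reflection-based multi-length" step in the partial deepfake entry above is to mirror a segment back and forth until it fills each fixed input length. The sketch below implements that reading only; the abstract does not define the exact padding rule or how over-long segments are handled, so the mirroring scheme, the truncation of long segments, and the name `reflect_to_length` are assumptions.

```python
def reflect_to_length(segment, target_len):
    """Extend a segment to target_len samples by mirror reflection.

    The edge samples are not repeated at the turning points, so [1, 2, 3]
    continues as 2, 1, 2, 3, ... A segment longer than target_len is
    truncated here (an illustrative choice).
    """
    n = len(segment)
    if n == 0:
        raise ValueError("segment must not be empty")
    if n == 1:
        return [segment[0]] * target_len
    period = 2 * (n - 1)            # one forward-and-back sweep
    out = []
    for i in range(target_len):
        k = i % period
        out.append(segment[k] if k < n else segment[period - k])
    return out


# One variable-length segment mapped to several fixed input lengths.
segment = [1, 2, 3]
multi_length = {length: reflect_to_length(segment, length) for length in (4, 7)}
print(multi_length)
```

Each fixed length yields a different view of the same segment, which is one way such a scheme could produce the diverse representations the abstract mentions.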
Corporations Constitute Intelligence
In January 2026, Anthropic published a 79-page "constitution" for its AI model Claude, the most comprehensive corporate AI governance document ever released. This Article offers the first legal and democratic-theoretic analysis of that document. Despite genuine philosophical sophistication, the constitution harbors two structural defects. First, it excludes the contexts where ethical constraints matter most: models deployed to the U.S. military operate under different rules, a gap exposed when Claude remained embedded in Palantir's Maven platform during military strikes in Iran even after a government-wide ban on Anthropic's technology. Second, its very comprehensiveness forecloses democratic contestation by resolving questions about AI values, moral status, and conscientious objection that should remain open for public deliberation. Anthropic's own 2023 experiment in participatory constitution-making found roughly 50% divergence between publicly sourced and corporate-authored principles, with the democratic version producing lower bias across nine social dimensions, yet the 2026 constitution incorporates none of those findings. I argue that AI governance suffers from a "political community deficit": the absence of any democratic body authorized to determine the principles governing AI behavior. Corporate transparency, however admirable, is not democratic legitimacy.
Updated: 2026-04-03 09:30:43
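Title: Corporations Constitute Intelligence
Abstract: In January 2026, Anthropic published a 79-page "constitution" for its AI model Claude, the most comprehensive corporate AI governance document released to date. This Article offers the first legal and democratic-theoretic analysis of that document. Despite genuine philosophical sophistication, the constitution harbors two structural defects. First, it excludes the contexts where ethical constraints matter most: models deployed to the U.S. military operate under different rules, a gap exposed when Claude remained embedded in Palantir's Maven platform during military strikes in Iran even after a government-wide ban on Anthropic's technology. Second, its very comprehensiveness forecloses democratic contestation by resolving questions about AI values, moral status, and conscientious objection that should remain open for public deliberation. Anthropic's own 2023 experiment in participatory constitution-making found roughly 50% divergence between publicly sourced and corporate-authored principles, with the democratic version producing lower bias across nine social dimensions, yet the 2026 constitution incorporates none of those findings. I argue that AI governance suffers from a "political community deficit": the absence of any democratic body authorized to determine the principles governing AI behavior. Corporate transparency, however admirable, is not democratic legitimacy.
Updated: 2026-04-03 09:30:43
Categories: cs.CY,cs.AI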
Analysis of Optimality of Large Language Models on Planning Problems
Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with a focus of recent benchmarks on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain involving towers of labeled blocks which have to be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^*$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens and a Geometric Memory that allows models to represent the $P^*$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.
Updated: 2026-04-03 09:27:16
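Title: Analysis of Optimality of Large Language Models on Planning Problems
Abstract: Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with recent benchmarks focusing on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain, which involves towers of labeled blocks that must be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^*$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens, and a Geometric Memory that allows models to represent the $P^*$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.
Updated: 2026-04-03 09:27:16
Categories: cs.AI,cs.CL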
Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
Updated: 2026-04-03 09:26:05
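Title: Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
Abstract: The ratio of outlier parameters differs significantly between language pre-training models and vision pre-training models, making cross-modality (language and vision) adaptation more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge the language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks because of their disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings show that partial bridge training is often advantageous, because certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for using language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
Updated: 2026-04-03 09:26:05
Categories: cs.CV,cs.CL,cs.LG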
FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
Action Quality Assessment (AQA) -- the task of quantifying how well an action is performed -- has great potential for detecting errors in gym weight training, where accurate feedback is critical to prevent injuries and maximize gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) linking actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction -- including the novel Video$\rightarrow$EMG task -- and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments demonstrate that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at https://github.com/HaoYin116/FLEX. Project page: https://haoyin116.github.io/FLEX_Dataset.
Updated: 2026-04-03 09:25:47
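Title: FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
Abstract: Action Quality Assessment (AQA), the task of quantifying how well an action is performed, has great potential for detecting errors in gym weight training, where accurate feedback is critical for preventing injuries and maximizing gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, and lack multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) that links actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction (including the novel Video$\rightarrow$EMG task), and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments show that multimodal inputs, multiview video, and fine-grained annotations significantly improve AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at https://github.com/HaoYin116/FLEX. Project page: https://haoyin116.github.io/FLEX_Dataset.
Updated: 2026-04-03 09:25:47
Categories: cs.CV,cs.AI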
RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection
Long-range 3D object detection remains challenging because LiDAR observations become highly sparse and fragmented in the far field, making reliable context modeling difficult for existing detectors. Recent state space model (SSM)-based methods have improved long-range modeling efficiency, but their effectiveness is still limited by generic serialization strategies that fail to preserve meaningful contextual neighborhoods in sparse scenes. To address this issue, we propose RayMamba, a geometry-aware plug-and-play enhancement for voxel-based 3D detectors. RayMamba organizes sparse voxels into sector-wise ordered sequences through a ray-aligned serialization strategy, which preserves directional continuity and occlusion-related context for subsequent Mamba-based modeling. It is compatible with both LiDAR-only and multimodal detectors, while introducing only modest overhead. Extensive experiments on nuScenes and Argoverse 2 demonstrate consistent improvements across strong baselines. In particular, RayMamba achieves up to 2.49 mAP and 1.59 NDS gain in the challenging 40--50 m range on nuScenes, and further improves VoxelNeXt on Argoverse 2 from 30.3 to 31.2 mAP.
Updated: 2026-04-03 09:20:48
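Title: RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection
Abstract: Long-range 3D object detection remains challenging because LiDAR observations become highly sparse and fragmented in the far field, making reliable context modeling difficult for existing detectors. Recent state space model (SSM)-based methods have improved long-range modeling efficiency, but their effectiveness is still limited by generic serialization strategies that fail to preserve meaningful contextual neighborhoods in sparse scenes. To address this issue, we propose RayMamba, a geometry-aware plug-and-play enhancement for voxel-based 3D detectors. RayMamba organizes sparse voxels into sector-wise ordered sequences through a ray-aligned serialization strategy, which preserves directional continuity and occlusion-related context for subsequent Mamba-based modeling. It is compatible with both LiDAR-only and multimodal detectors and introduces only modest overhead. Extensive experiments on nuScenes and Argoverse 2 show consistent improvements across strong baselines. In particular, RayMamba achieves gains of up to 2.49 mAP and 1.59 NDS in the challenging 40--50 m range on nuScenes, and further improves VoxelNeXt on Argoverse 2 from 30.3 to 31.2 mAP.
Updated: 2026-04-03 09:20:48
Categories: cs.CV,cs.AI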
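The idea of a sector-wise, ray-aligned ordering in the RayMamba entry above can be sketched very simply: bucket each voxel by its azimuth around the sensor, then order the voxels inside a bucket from near to far. The abstract does not give the actual serialization rule, so the sensor-at-origin convention, the number of sectors, the tie-breaking by planar range, and the name `ray_aligned_order` are illustrative assumptions.

```python
import math


def ray_aligned_order(voxels, num_sectors=8):
    """Return voxel indices sorted by (azimuth sector, planar range).

    voxels is a list of (x, y, z) centers with the sensor assumed at the
    origin. Voxels in the same angular sector end up adjacent and ordered
    outward, roughly along the direction a LiDAR ray travels.
    """
    def key(idx):
        x, y, _ = voxels[idx]
        azimuth = math.atan2(y, x) % (2.0 * math.pi)      # in [0, 2*pi)
        sector = int(azimuth / (2.0 * math.pi) * num_sectors)
        sector = min(sector, num_sectors - 1)             # guard rounding
        return (sector, math.hypot(x, y))

    return sorted(range(len(voxels)), key=key)


# Three voxels nearly along the +x axis and one off to the side.
voxels = [(10.0, 0.1, 0.0), (2.0, 0.05, 0.0), (-1.0, 5.0, 0.0), (30.0, 0.2, 0.0)]
order = ray_aligned_order(voxels, num_sectors=4)
print(order)
```

In this toy scene the near, mid, and far voxels along the same direction become consecutive sequence elements, so a nearby occluder and the sparse far-field points behind it sit next to each other in the sequence.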
Extracting Money Laundering Transactions from Quasi-Temporal Graph Representation
Money laundering presents a persistent challenge for financial institutions worldwide, while criminal organizations constantly evolve their tactics to bypass detection systems. Traditional anti-money laundering approaches mainly rely on predefined risk-based rules, leading to resource-intensive investigations and high numbers of false positive alerts. To keep operational costs from exploding while billions of transactions are processed every day, financial institutions are investing in more sophisticated mechanisms to improve existing systems. In this paper, we present ExSTraQt (EXtract Suspicious TRAnsactions from Quasi-Temporal graph representation), an advanced supervised learning approach to detect money laundering (or suspicious) transactions in financial datasets. Our proposed framework excels in performance when compared to state-of-the-art AML (Anti Money Laundering) detection models. The key strengths of our framework are simplicity, in terms of design and number of parameters, and scalability, in terms of computing and memory requirements. We evaluated our framework on transaction-level detection accuracy using a real dataset and a set of synthetic financial transaction datasets. We consistently achieve an uplift in the F1 score for most datasets: up to 1% for the real dataset and more than 8% for one of the synthetic datasets. We also claim that our framework could seamlessly complement existing AML detection systems in banks. Our code and datasets are available at https://github.com/mhaseebtariq/exstraqt.
Updated: 2026-04-03 09:14:39
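Title: Extracting Money Laundering Transactions from Quasi-Temporal Graph Representation
Abstract: Money laundering presents a persistent challenge for financial institutions worldwide, while criminal organizations constantly evolve their tactics to evade detection systems. Traditional anti-money laundering approaches mainly rely on predefined risk-based rules, leading to resource-intensive investigations and large numbers of false positive alerts. To keep operational costs from exploding while billions of transactions are processed every day, financial institutions are investing in more sophisticated mechanisms to improve existing systems. In this paper, we present ExSTraQt (EXtract Suspicious TRAnsactions from Quasi-Temporal graph representation), an advanced supervised learning approach for detecting money laundering (or suspicious) transactions in financial datasets. Compared to state-of-the-art AML (Anti Money Laundering) detection models, our proposed framework excels in performance. Its key strengths are simplicity, in terms of design and number of parameters, and scalability, in terms of computing and memory requirements. We evaluated our framework on transaction-level detection accuracy using a real dataset and a set of synthetic financial transaction datasets. We consistently achieve an uplift in the F1 score for most datasets: up to 1% for the real dataset and more than 8% for one of the synthetic datasets. We also claim that our framework could seamlessly complement existing AML detection systems in banks. Our code and datasets are available at https://github.com/mhaseebtariq/exstraqt.
Updated: 2026-04-03 09:14:39
Categories: cs.LG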
Output-Constrained Decision Trees
Incorporating domain-specific constraints into machine learning models is essential for generating predictions that are both accurate and feasible in real-world applications. This paper introduces new methods for training Output-Constrained Regression Trees (OCRT), addressing the limitations of traditional decision trees in constrained multi-target regression tasks. We propose three approaches: M-OCRT, which uses split-based mixed integer programming to enforce constraints; E-OCRT, which employs an exhaustive search for optimal splits and solves constrained prediction problems at each decision node; and EP-OCRT, which applies post-hoc constrained optimization to tree predictions. To illustrate their potential uses in ensemble learning, we also introduce a random forest framework working under convex feasible sets. We validate the proposed methods through a computational study on both synthetic and industry-driven hierarchical time series datasets. Our results demonstrate that imposing constraints on decision tree training results in accurate and feasible predictions.
Updated: 2026-04-03 09:10:59
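Title: Output-Constrained Decision Trees
Abstract: Incorporating domain-specific constraints into machine learning models is essential for generating predictions that are both accurate and feasible in real-world applications. This paper introduces new methods for training Output-Constrained Regression Trees (OCRT), addressing the limitations of traditional decision trees in constrained multi-target regression tasks. We propose three approaches: M-OCRT, which uses split-based mixed integer programming to enforce constraints; E-OCRT, which employs an exhaustive search for optimal splits and solves a constrained prediction problem at each decision node; and EP-OCRT, which applies post-hoc constrained optimization to tree predictions. To illustrate their potential use in ensemble learning, we also introduce a random forest framework that works under convex feasible sets. We validate the proposed methods through a computational study on both synthetic and industry-driven hierarchical time series datasets. Our results show that imposing constraints on decision tree training yields accurate and feasible predictions.
Updated: 2026-04-03 09:10:59
Categories: cs.LG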
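To give a feel for the post-hoc variant (EP-OCRT) in the entry above, the sketch below repairs an infeasible multi-target prediction by projecting it onto a single linear equality constraint, as might arise in a hierarchical time series where a total must equal the sum of its parts. The abstract does not specify the optimization problem the authors solve, so the Euclidean projection, the single-hyperplane constraint, and the name `project_onto_hyperplane` are assumptions chosen for a closed-form example.

```python
def project_onto_hyperplane(y, c, d=0.0):
    """Closest point (in Euclidean distance) to y satisfying c . y = d.

    Closed form: y - ((c . y - d) / (c . c)) * c.
    """
    scale = (sum(ci * yi for ci, yi in zip(c, y)) - d) / sum(ci * ci for ci in c)
    return [yi - scale * ci for yi, ci in zip(y, c)]


# Hypothetical raw tree output for [total, part_a, part_b]; 10 != 4 + 3.
raw_prediction = [10.0, 4.0, 3.0]
# Encode total - part_a - part_b = 0.
constraint = [1.0, -1.0, -1.0]
feasible = project_onto_hyperplane(raw_prediction, constraint)
print(feasible)
```

Here the violation of 3 is spread evenly over the three targets, giving a total of 9 and parts of 5 and 4. A general convex feasible set would need a numerical solver rather than this closed form.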
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
We study visual explanation in geometry education as a Referring Image Segmentation (RIS) problem: given a diagram and a natural language description, the task is to produce a pixel-level mask for the referred geometric element. However, existing RIS models trained on natural image benchmarks such as RefCOCO fail catastrophically on geometric diagrams due to the fundamental domain shift between photographic scenes and abstract, textureless schematics. To address the absence of suitable training data, we present a fully automated procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-perfect segmentation masks and linguistically diverse referring expressions, requiring zero manual annotation. We further propose domain-specific fine-tuning of vision-language models (VLMs), demonstrating that a fine-tuned Florence-2 achieves 49% IoU and 85% Buffered IoU (BIoU), compared to <1% IoU in zero-shot settings. We introduce Buffered IoU, a geometry-aware evaluation metric that accounts for thin-structure localization, and show that it better reflects true segmentation quality than standard IoU. Our results establish a foundation for building Artificial General Teachers (AGTs) capable of providing visually grounded, step-by-step explanations of geometry problems.
Updated: 2026-04-03 09:10:21
Categories: cs.CV,cs.AI,cs.LG
Rethinking Forward Processes for Score-Based Data Assimilation in High Dimensions
Data assimilation is the process of estimating the time-evolving state of a dynamical system by integrating model predictions and noisy observations. It is commonly formulated as Bayesian filtering, but classical filters often struggle with accuracy or computational feasibility in high dimensions. Recently, score-based generative models have emerged as a scalable approach for high-dimensional data assimilation, enabling accurate modeling and sampling of complex distributions. However, existing score-based filters often specify the forward process independently of the data assimilation. As a result, the measurement-update step depends on heuristic approximations of the likelihood score, which can accumulate errors and degrade performance over time. Here, we propose a measurement-aware score-based filter (MASF) that defines a measurement-aware forward process directly from the measurement equation. This construction makes the likelihood score analytically tractable: for linear measurements, we derive the exact likelihood score and combine it with a learned prior score to obtain the posterior score. Numerical experiments covering a range of settings, including high-dimensional datasets, demonstrate improved accuracy and stability over existing score-based filters.
Updated: 2026-04-03 08:55:38
Categories: stat.ML,cs.AI,cs.LG
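For orientation, the linear-measurement case mentioned in the MASF abstract above can be illustrated with two standard identities. This is a sketch under assumed notation, not the paper's derivation: the symbols $H$ (measurement matrix) and $R$ (Gaussian noise covariance) are assumptions, and the formulas are written for the clean state $x$ rather than for a state along the measurement-aware forward process, which the abstract does not specify.

$$
y = Hx + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, R)
\;\;\Longrightarrow\;\;
\nabla_x \log p(y \mid x) = H^\top R^{-1}\,(y - Hx),
$$

$$
\nabla_x \log p(x \mid y) = \nabla_x \log p(x) + \nabla_x \log p(y \mid x).
$$

The second line is Bayes' rule in score form: the evidence $p(y)$ does not depend on $x$, so the posterior score is the sum of a prior score (learned, in the abstract's setting) and a likelihood score.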
Training Multi-Image Vision Agents via End2End Reinforcement Learning
Recent VLM-based agents aim to replicate OpenAI O3's "thinking with images" via tool use, yet most open-source methods restrict inputs to a single image, limiting their applicability to real-world multi-image QA tasks. To address this gap, we propose IMAgent, an open-source visual agent trained with end-to-end reinforcement learning for fine-grained single/multi-image reasoning. During inference, VLMs tend to gradually neglect visual inputs; to mitigate this issue, we design two dedicated tools for visual reflection and verification, enabling the model to actively refocus attention on image content. We also show, for the first time, how tool usage enhances agent performance from an attention perspective. Equipped with a carefully designed two-layer motion trajectory masking strategy and tool-use reward gain, IMAgent acquires an effective tool-use paradigm through pure reinforcement learning, eliminating the need for costly supervised fine-tuning data. To further unleash the inherent tool-usage potential of the base VLM and fill data gaps, we construct a challenging, visually enriched multi-image QA dataset via a multi-agent system. Extensive experiments validate that IMAgent achieves SOTA performance across mainstream single- and multi-image benchmarks, and our in-depth analysis offers actionable insights for the community. Code and data will be released soon.
Updated: 2026-04-03 08:53:56
Categories: cs.CV,cs.AI
Lipschitz bounds for integral kernels
Feature maps associated with positive definite kernels play a central role in kernel methods and learning theory, where regularity properties such as Lipschitz continuity are closely related to robustness and stability guarantees. Despite their importance, explicit characterizations of the Lipschitz constant of kernel feature maps are available only in a limited number of cases. In this paper, we study the Lipschitz regularity of feature maps associated with integral kernels under differentiability assumptions. We first provide sufficient conditions ensuring Lipschitz continuity and derive explicit formulas for the corresponding Lipschitz constants. We then identify a condition under which the feature map fails to be Lipschitz continuous and apply these results to several important classes of kernels. For infinite-width two-layer neural networks with isotropic Gaussian weight distributions, we show that the Lipschitz constant of the associated kernel can be expressed as the supremum of a two-dimensional integral, leading to an explicit characterization for the Gaussian kernel and the ReLU random neural network kernel. We also study continuous and shift-invariant kernels such as Gaussian, Laplace, and Matérn kernels, which admit an interpretation as neural networks with a cosine activation function. In this setting, we prove that the feature map is Lipschitz continuous if and only if the weight distribution has a finite second-order moment, and we then derive its Lipschitz constant. Finally, we raise an open question concerning the asymptotic behavior of the convergence of the Lipschitz constant in finite-width neural networks. Numerical experiments are provided to support this behavior.
Updated: 2026-04-03 08:52:36
Categories: stat.ML,cs.LG
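As a worked instance of the Gaussian-kernel case in the abstract above, the following short derivation uses only the kernel identity for feature-map distances. The bandwidth parametrization $k(x,y)=\exp\!\left(-\|x-y\|^2/(2\sigma^2)\right)$ is an assumption, and the paper's general formulas are not reproduced here.

$$
\|\varphi(x)-\varphi(y)\|^2 = k(x,x) + k(y,y) - 2\,k(x,y)
= 2\left(1 - e^{-\|x-y\|^2/(2\sigma^2)}\right)
\le \frac{\|x-y\|^2}{\sigma^2},
$$

where the inequality uses $1 - e^{-t} \le t$ for $t \ge 0$. Because the ratio $\|\varphi(x)-\varphi(y)\|^2 / \|x-y\|^2$ tends to $1/\sigma^2$ as $\|x-y\| \to 0$, the bound is tight and the feature map has Lipschitz constant $1/\sigma$ under this parametrization. This agrees with the finite-second-moment criterion stated in the abstract, since the spectral measure of this kernel is Gaussian.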
One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging
Weight-space model merging combines independently fine-tuned models without accessing original training data, offering a practical alternative to joint training. While merging succeeds in multitask settings, its behavior in multilingual contexts remains poorly understood. We systematically study weight-space merging for multilingual machine translation by fully fine-tuning a language model on large-scale bilingual corpora and evaluating standard merging strategies. Our experiments reveal that merging degrades performance, especially when target languages differ. To explain this failure, we analyze internal representations using span-conditioned neuron selectivity and layer-wise centered kernel alignment. We find that language-specific neurons concentrate in embedding layers and upper transformer blocks, while intermediate layers remain largely shared across languages. Critically, fine-tuning redistributes rather than sharpens language selectivity: neurons for supervised and related languages become less exclusive, while those for unsupervised languages grow more isolated. This redistribution increases representational divergence in higher layers that govern generation. These findings suggest that multilingual fine-tuning may reshape geometry in ways that reduce compatibility with standard weight-space merging assumptions. Our work thus provides an explanation for why merging fails in multilingual translation scenarios.
Updated: 2026-04-03 08:45:26
Categories: cs.CL,cs.AI
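To make "weight-space merging" in the abstract above concrete, here is a minimal sketch of its simplest form, a weighted average of parameters from models that share an architecture. The abstract does not name which merging strategies were evaluated, so this is an assumed baseline; `merge_weights` and the toy parameter dictionaries are hypothetical.

```python
import numpy as np

def merge_weights(state_dicts, coeffs=None):
    """Weighted average of parameter dictionaries with identical keys and shapes.

    With coeffs=None every model gets equal weight (uniform averaging).
    """
    if not state_dicts:
        raise ValueError("need at least one model to merge")
    if coeffs is None:
        coeffs = [1.0 / len(state_dicts)] * len(state_dicts)
    if len(coeffs) != len(state_dicts):
        raise ValueError("one coefficient per model is required")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("models must share the same parameter names")
    return {k: sum(c * sd[k] for c, sd in zip(coeffs, state_dicts)) for k in keys}

# Toy stand-ins for two fine-tuned translation models (hypothetical values)
model_en_de = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
model_en_fr = {"w": np.array([3.0, 6.0]), "b": np.array([2.0])}

merged = merge_weights([model_en_de, model_en_fr])
```

Averaging of this kind implicitly assumes the fine-tuned models stay in a compatible region of weight space, which is the assumption the abstract reports as breaking down in the multilingual setting.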
Toward an Operational GNN-Based Multimesh Surrogate for Fast Flood Forecasting
Operational flood forecasting still relies on high-fidelity two-dimensional hydraulic solvers, but their runtime can be prohibitive for rapid decision support on large urban floodplains. In parallel, AI-based surrogate models have shown strong potential in several areas of computational physics for accelerating otherwise expensive high-fidelity simulations. We address this issue on the lower Têt River (France), starting from a production-grade Telemac2D model defined on a high-resolution unstructured finite-element mesh with more than $4\times 10^5$ nodes. From this setup, we build a learning-ready database of synthetic but operationally grounded flood events covering several representative hydrograph families and peak discharges. On top of this database, we develop a graph-neural surrogate based on projected meshes and multimesh connectivity. The projected-mesh strategy keeps training tractable while preserving high-fidelity supervision from the original Telemac simulations, and the multimesh construction enlarges the effective spatial receptive field without increasing network depth. We further study the effect of an explicit discharge feature $Q(t)$ and of pushforward training for long autoregressive rollouts. The experiments show that conditioning on $Q(t)$ is essential in this boundary-driven setting, that multimesh connectivity brings additional gains once the model is properly conditioned, and that pushforward further improves rollout stability. Among the tested configurations, the combination of $Q(t)$, multimesh connectivity, and pushforward provides the best overall results. These gains are observed both on hydraulic variables over the surrogate mesh and on inundation maps interpolated onto a common $25\,\mathrm{m}$ regular grid and compared against the original high-resolution Telemac solution. 
On the studied case, the learned surrogate produces 6-hour predictions in about $0.4\,\mathrm{s}$ on a single NVIDIA A100 GPU, compared with about $180\,\mathrm{min}$ on 56 CPU cores for the reference simulation. These results support graph-based surrogates as practical complements to industrial hydraulic solvers for operational flood mapping.
Updated: 2026-04-03 08:42:51
Categories: cs.LG
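The runtimes quoted in the abstract above imply the following wall-clock ratio. This is simple arithmetic on the two reported figures; it compares different hardware (one NVIDIA A100 GPU versus 56 CPU cores), so it is a practical speedup rather than a like-for-like measure.

$$
\frac{180\,\mathrm{min} \times 60\,\mathrm{s/min}}{0.4\,\mathrm{s}} = \frac{10\,800\,\mathrm{s}}{0.4\,\mathrm{s}} = 27\,000.
$$

In other words, the surrogate's 6-hour prediction is roughly four orders of magnitude faster than the reference simulation.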
CeRA: Overcoming the Linear Ceiling of Low-Rank Adaptation via Capacity Expansion
Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a "linear ceiling": increasing the rank yields diminishing returns in expressive capacity due to intrinsic linear constraints. We introduce CeRA (Capacity-enhanced Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and dropout to induce non-linear capacity expansion. We demonstrate a fundamental relationship between adapter expressivity and task complexity. In basic arithmetic (GSM8K), CeRA matches standard linear baselines, but on the complex MATH dataset, it demonstrates high parameter efficiency in downstream reasoning (Exact Match). CeRA at rank 64 (pass@1 16.36%) outperforms both a high-rank LoRA at rank 512 (15.72%) and the state-of-the-art linear variant, DoRA, at rank 64 (14.44%), achieving higher exact-match accuracy with only 1/8 of the parameter budget. Empirical spectral analysis shows that CeRA activates the lower-variance tail of the singular value spectrum, preventing the rank collapse observed in linear methods and providing the representation capacity required for complex logical reasoning.
Updated: 2026-04-03 08:37:26
Categories: cs.LG,cs.AI,cs.CL
Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration
Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we discover that naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the advantage misalignment problem. Applied to the Tau-Bench airline benchmark, our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp) -- with the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller, and the 30.5B MoE model approaching Claude Sonnet 4.5 (70.0 percent). To our knowledge, these are the first published RL training results on Tau-Bench. We release our code, reward calibration analysis, and training recipes.
Updated: 2026-04-03 08:36:03
Categories: cs.AI
ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models
Reinforcement learning from verifiable rewards has significantly advanced the reasoning capabilities of large language models. However, Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning paths. Through systematic empirical analysis, we identify Critical Decision Pivots (CDPs): transient high-entropy states where the policy's trajectory is most sensitive to perturbations. These pivots represent the "forks in the road" where effective multi-path exploration is most crucial yet often suppressed by uniform advantage signals. Building on these insights, we propose Entropy-Regulated Policy Optimization (ERPO), which transitions the optimization focus from coarse sequences to fine-grained token dynamics. ERPO introduces three synergistic components: (i) Entropy-aware Gating, which adaptively amplifies exploration at CDPs to facilitate diverse path discovery; (ii) Bucket-based Implicit Normalization, which mitigates difficulty bias by aligning token progress windows; and (iii) Result-anchored Advantage Synthesis, which re-weights token-level signals via outcome-driven anchors. Extensive experiments on competitive mathematical benchmarks demonstrate that ERPO significantly outperforms GRPO. Notably, ERPO not only boosts reasoning accuracy but also yields significantly more concise and robust derivation paths, while achieving performance comparable to large models with orders of magnitude more parameters.
Updated: 2026-04-03 08:30:42
Categories: cs.LG,cs.AI
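The notion of "high-entropy states" in the ERPO abstract above can be illustrated by computing per-token entropy from a policy's logits and flagging the most uncertain positions. This is only a sketch of the underlying quantity: the abstract does not give ERPO's actual criterion for Critical Decision Pivots, so the quantile threshold, the function names, and the toy logits are all assumptions.

```python
import numpy as np

def token_entropies(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits has shape (sequence_length, vocab_size).
    """
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

def high_entropy_positions(logits, quantile=0.8):
    """Indices whose entropy is at or above the given quantile (assumed rule)."""
    entropies = token_entropies(logits)
    return np.flatnonzero(entropies >= np.quantile(entropies, quantile))

# Toy sequence of three positions over a three-token vocabulary
logits = np.array([
    [10.0, 0.0, 0.0],  # confident prediction, low entropy
    [0.0, 0.0, 0.0],   # uniform prediction, maximal entropy
    [5.0, 0.0, 0.0],   # fairly confident
])
entropies = token_entropies(logits)
pivots = high_entropy_positions(logits)
```

A token-level method could then treat the flagged positions differently from the rest of the sequence, in contrast to the uniform sequence-level advantage the abstract attributes to GRPO.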
EMS: Multi-Agent Voting via Efficient Majority-then-Stopping
Majority voting is the standard for aggregating multi-agent responses into a final decision. However, traditional methods typically require all agents to complete their reasoning before aggregation begins, leading to significant computational overhead, as many responses become redundant once a majority consensus is achieved. In this work, we formulate multi-agent voting as a reliability-aware agent scheduling problem and propose Efficient Majority-then-Stopping (EMS) to improve reasoning efficiency. EMS prioritizes agents based on task-aware reliability and terminates the reasoning pipeline the moment a majority is achieved. It consists of three components: Agent Confidence Modeling (ACM), which estimates agent reliability using historical performance and semantic similarity; Adaptive Incremental Voting (AIV), which sequentially selects agents with early stopping; and Individual Confidence Updating (ICU), which dynamically updates the reliability of each contributing agent. Extensive evaluations across six benchmarks demonstrate that EMS consistently reduces the average number of invoked agents by 32%.
Updated: 2026-04-03 08:29:50
Categories: cs.AI
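The majority-then-stopping idea in the EMS abstract above can be sketched as sequential voting with early termination. This is a simplified illustration, not the paper's algorithm: the fixed `reliabilities` list stands in for Agent Confidence Modeling, confidence updating is omitted, and `majority_then_stop` and the toy agents are hypothetical.

```python
from collections import Counter

def majority_then_stop(agents, reliabilities, question):
    """Query agents in descending reliability and stop at a strict majority.

    Returns (answer, number_of_agents_called). A strict majority is more
    than half of all available agents, so later votes cannot overturn it.
    """
    if not agents:
        raise ValueError("need at least one agent")
    order = sorted(range(len(agents)), key=lambda i: reliabilities[i], reverse=True)
    needed = len(agents) // 2 + 1
    votes = Counter()
    for calls, i in enumerate(order, start=1):
        answer = agents[i](question)
        votes[answer] += 1
        if votes[answer] >= needed:
            return answer, calls  # remaining agents are never invoked
    # No strict majority after all agents voted: fall back to the plurality answer
    return votes.most_common(1)[0][0], len(agents)

# Toy agents that always return a fixed answer (hypothetical)
answers = ["A", "B", "A", "A", "B"]
agents = [lambda q, a=a: a for a in answers]
reliabilities = [0.9, 0.5, 0.8, 0.7, 0.4]

decision, calls = majority_then_stop(agents, reliabilities, "2 + 2 = ?")
```

In this toy run the three most reliable agents agree, so the decision is reached after three calls and the remaining two agents are skipped, which is the source of the savings the abstract describes.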
LLM+Graph@VLDB'2025 Workshop Summary
The integration of large language models (LLMs) with graph-structured data has become a pivotal and fast-evolving research frontier, drawing strong interest from both academia and industry. The 2nd LLM+Graph Workshop, co-located with the 51st International Conference on Very Large Data Bases (VLDB 2025) in London, focused on advancing algorithms and systems that bridge LLMs, graph data management, and graph machine learning for practical applications. This report highlights the key research directions, challenges, and innovative solutions presented by the workshop's speakers.
Updated: 2026-04-03 08:27:54
Categories: cs.DB,cs.AI
A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos
Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This creates a task discrepancy: the video backbone is trained for visual classification but used for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the deployment of deeper network backbones with reduced memory and significantly enhances visual representation by modulating feature maps through precise integration of linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.
Updated: 2026-04-03 08:26:12
Categories: cs.CV,cs.AI
Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models
Text-to-image diffusion models have achieved remarkable progress, yet their use raises copyright and misuse concerns, prompting research into machine unlearning. However, extending multi-concept unlearning to large-scale scenarios remains difficult due to three challenges: (i) conflicting weight updates that hinder unlearning or degrade generation; (ii) imprecise mechanisms that cause collateral damage to similar content; and (iii) reliance on additional data or modules, creating scalability bottlenecks. To address these, we propose Scalable-Precise Concept Unlearning (ScaPre), a unified framework tailored for large-scale unlearning. ScaPre introduces a conflict-aware stable design, integrating spectral trace regularization and geometry alignment to stabilize optimization, suppress conflicts, and preserve global structure. Furthermore, an Informax Decoupler identifies concept-relevant parameters and adaptively reweights updates, strictly confining unlearning to the target subspace. ScaPre yields an efficient closed-form solution without requiring auxiliary data or sub-models. Comprehensive experiments on objects, styles, and explicit content demonstrate that ScaPre effectively removes target concepts while maintaining generation quality. It forgets up to $5\times$ more concepts than the best baseline within acceptable quality limits, achieving state-of-the-art precision and efficiency for large-scale unlearning.
Updated: 2026-04-03 08:18:10
Categories: cs.LG,cs.CV
Zero-shot Concept Bottleneck Models
Concept bottleneck models (CBMs) are inherently interpretable and intervenable neural network models, which explain their final label prediction through the intermediate prediction of high-level semantic concepts. However, they require target-task training to learn input-to-concept and concept-to-label mappings, which incurs the cost of collecting target datasets and of training. In this paper, we present zero-shot concept bottleneck models (Z-CBMs), which predict concepts and labels in a fully zero-shot manner without training neural networks. Z-CBMs utilize a large-scale concept bank, composed of millions of vocabulary items extracted from the web, to describe arbitrary input in various domains. For the input-to-concept mapping, we introduce concept retrieval, which dynamically finds input-related concepts through cross-modal search on the concept bank. In the concept-to-label inference, we apply concept regression to select essential concepts from the retrieved concepts by sparse linear regression. Through extensive experiments, we confirm that our Z-CBMs provide interpretable and intervenable concepts without any additional training. Code will be available at https://github.com/yshinya6/zcbm.
Updated: 2026-04-03 08:12:40
Categories: cs.LG,cs.AI,cs.CV
High-resolution probabilistic estimation of three-dimensional regional ocean dynamics from sparse surface observations
The ocean interior regulates Earth's climate but remains sparsely observed due to limited in situ measurements, while satellite observations are restricted to the surface. We present a depth-aware generative framework for reconstructing high-resolution three-dimensional ocean states from extremely sparse surface data. Our approach employs a conditional denoising diffusion probabilistic model (DDPM) trained on sea surface height and temperature observations with up to 99.9 percent sparsity, without reliance on a background dynamical model. By incorporating continuous depth embeddings, the model learns a unified vertical representation of the ocean states and generalizes to previously unseen depths. Applied to the Gulf of Mexico, the framework accurately reconstructs subsurface temperature, salinity, and velocity fields across multiple depths. Evaluations using statistical metrics, spectral analysis, and heat transport diagnostics demonstrate recovery of both large-scale circulation and multiscale variability. These results establish generative diffusion models as a scalable approach for probabilistic ocean reconstruction in data-limited regimes, with implications for climate monitoring and forecasting.
Updated: 2026-04-03 08:10:55
Categories: physics.ao-ph,cs.AI,math.DS,nlin.CD
Integrated representational signatures strengthen specificity in brains and models
The extent to which different neural or artificial neural networks (models) rely on equivalent representations to support similar tasks remains a central question in neuroscience and machine learning. Prior work has typically compared systems using a single representational similarity metric, yet each captures only one facet of representational structure. To address this, we leverage a suite of representational similarity metrics, each capturing a distinct facet of representational correspondence (such as geometry, unit-level tuning, or linear decodability), and assess brain region or model separability using multiple complementary measures. Metrics that preserve geometric or tuning structure (e.g., RSA, Soft Matching) yield stronger region-based discrimination, whereas more flexible mappings such as Linear Predictivity show weaker separation. These findings suggest that geometry and tuning encode brain-region- or model-family-specific signatures, while linearly decodable information tends to be more globally shared across regions or models. To integrate these complementary representational facets, we adapt Similarity Network Fusion (SNF), a framework originally developed for multi-omics data integration. SNF produces substantially sharper regional and model-family-level separation than any single metric and yields robust composite similarity profiles. Moreover, clustering cortical regions using SNF-derived similarity scores reveals a clearer hierarchical organization that aligns closely with established anatomical and functional hierarchies of the visual cortex, surpassing the correspondence achieved by individual metrics.
Updated: 2026-04-03 08:07:12
Categories: q-bio.NC,cs.AI
Diffusion Models as Dataset Distillation Priors
Dataset distillation aims to synthesize compact yet informative datasets from large ones. A significant challenge in this field is achieving a trifecta of diversity, generalization, and representativeness in a single distilled dataset. Although recent generative dataset distillation methods adopt powerful diffusion models as their foundation models, the inherent representativeness prior in diffusion models is overlooked. Consequently, these approaches often necessitate the integration of external constraints to enhance data quality. To address this, we propose Diffusion As Priors (DAP), which formalizes representativeness by quantifying the similarity between synthetic and real data in feature space using a Mercer kernel. We then introduce this prior as guidance to steer the reverse diffusion process, enhancing the representativeness of distilled samples without any retraining. Extensive experiments on large-scale datasets, such as ImageNet-1K and its subsets, demonstrate that DAP outperforms state-of-the-art methods in generating high-fidelity datasets while achieving superior cross-architecture generalization. Our work not only establishes a theoretical connection between diffusion priors and the objectives of dataset distillation but also provides a practical, training-free framework for improving the quality of the distilled dataset.
Updated: 2026-04-03 07:59:32
Categories: cs.LG
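The abstract above measures representativeness as a Mercer-kernel similarity between synthetic and real data in feature space. A minimal sketch of that idea, assuming a Gaussian (RBF) kernel and a simple mean over all synthetic-real pairs, is shown below. The kernel choice, the `gamma` value, the feature dimension, and the function names are hypothetical; the abstract does not specify them, and the guidance step of the reverse diffusion process is not modelled here.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel matrix, a standard Mercer kernel."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def representativeness(synthetic, real, gamma=0.5):
    """Mean kernel similarity between synthetic and real feature vectors."""
    return float(rbf_kernel(synthetic, real, gamma).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))                      # stand-in real features
near = real[:10] + 0.05 * rng.normal(size=(10, 8))    # close to real samples
far = near + 5.0                                      # shifted away from them
score_near = representativeness(near, real)
score_far = representativeness(far, real)
```

Samples that sit near the real feature distribution receive a higher score than shifted ones, which is the kind of signal that could be used as guidance during sampling.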
Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis
Agent Skills is an emerging open standard that defines a modular, filesystem-based packaging format enabling LLM-based agents to acquire domain-specific expertise on demand. Despite rapid adoption across multiple agentic platforms and the emergence of large community marketplaces, the security properties of Agent Skills have not been systematically studied. This paper presents the first comprehensive security analysis of the Agent Skills framework. We define the full lifecycle of an Agent Skill across four phases -- Creation, Distribution, Deployment, and Execution -- and identify the structural attack surface each phase introduces. Building on this lifecycle analysis, we construct a threat taxonomy comprising seven categories and seventeen scenarios organized across three attack layers, grounded in both architectural analysis and real-world evidence. We validate the taxonomy through analysis of five confirmed security incidents in the Agent Skills ecosystem. Based on these findings, we discuss defense directions for each threat category, identify open research challenges, and provide actionable recommendations for stakeholders. Our analysis reveals that the most severe threats arise from structural properties of the framework itself, including the absence of a data-instruction boundary, a single-approval persistent trust model, and the lack of mandatory marketplace security review, and cannot be addressed through incremental mitigations alone.
Updated: 2026-04-03 07:56:42
Categories: cs.CR,cs.AI
ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents
Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events - yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution questions seldom admit definitive answers without structured ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark providing 100 synthetic users, each with a 1-5 year trajectory comprising a health profile, a multi-phase narrative plan, daily device measurements, periodic exam records, and an event log with explicit per-indicator impact parameters. Each indicator follows a baseline stochastic process driven by discrete events with sigmoid-onset, exponential-decay kernels under saturation and projection constraints; a hybrid pipeline delegates sparse semantic artifacts to LLM-based planning and dense indicator dynamics to algorithmic simulation with hard physiological bounds. Users are each paired with 100 evaluation queries across five dimensions - Lookup, Trend, Comparison, Anomaly, Explanation - stratified into Easy, Medium, and Hard tiers, with all ground-truth answers programmatically computable from the recorded event-indicator relationships. Evaluating 13 methods spanning LLMs with tools, DB-native agents, and memory-augmented RAG, we find that DB agents (48-58%) substantially outperform memory RAG baselines (30-38%), with the gap concentrated on Comparison and Explanation queries where multi-hop reasoning and evidence attribution are required.
Updated: 2026-04-03 07:55:56
Categories: cs.AI
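The indicator dynamics described in the ESL-Bench abstract (discrete events with sigmoid-onset, exponential-decay kernels under saturation and projection constraints) can be pictured with the sketch below. The kernel form, the parameter names (`rise`, `decay`, `cap`), and the resting heart rate example are assumptions for illustration only; the stochastic baseline process is left out, and the benchmark's actual parameterization is not given in the abstract.

```python
import numpy as np

def event_kernel(t, onset, rise=1.0, decay=10.0):
    """Sigmoid onset multiplied by exponential decay after the event day."""
    dt = np.asarray(t, dtype=float) - onset
    sigmoid = 1.0 / (1.0 + np.exp(-dt / rise))
    return sigmoid * np.exp(-np.maximum(dt, 0.0) / decay)

def simulate_indicator(days, baseline, events, cap, lo, hi):
    """Baseline plus summed event effects, saturated and clipped to bounds."""
    effect = sum(amp * event_kernel(days, onset) for onset, amp in events)
    effect = np.clip(effect, -cap, cap)          # saturation constraint
    return np.clip(baseline + effect, lo, hi)    # projection onto hard bounds

days = np.arange(60)
resting_hr = simulate_indicator(
    days, baseline=62.0, events=[(10, 8.0), (12, 8.0)],
    cap=10.0, lo=40.0, hi=100.0,
)
```

Because each event carries an explicit onset day and amplitude, a ground-truth answer to a question such as "which event raised this indicator?" can be computed directly from the event list rather than judged by hand.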
Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding
Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental limitations: Semantic Bias (mode collapse into generic templates), Signal Neglect (hallucination based on linguistic priors rather than neural inputs), and the BLEU Trap, where evaluation metrics are artificially inflated by high-frequency stopwords, masking a lack of true semantic fidelity. To address these challenges, we propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We redesign the interaction between the neural encoder and the Large Language Model (LLM) by injecting semantic prompts as Queries and EEG embeddings as Key-Value pairs, strictly forcing the model to attend to neural inputs. Furthermore, we move beyond standard translation metrics by adopting N-way Retrieval Accuracy and Fréchet Distance to rigorously assess diversity and alignment. Extensive experiments demonstrate that our approach effectively eliminates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.
Updated: 2026-04-03 07:55:06
Categories: cs.CL,cs.AI,cs.HC,eess.AS,q-bio.NC
Transfer Learning for Loan Recovery Prediction under Distribution Shifts with Heterogeneous Feature Spaces
Accurate forecasting of recovery rates (RR) is central to credit risk management and regulatory capital determination. In many loan portfolios, however, RR modeling is constrained by data scarcity arising from infrequent default events. Transfer learning (TL) offers a promising avenue to mitigate this challenge by exploiting information from related but richer source domains, yet its effectiveness critically depends on the presence and strength of distributional shifts, and on potential heterogeneity between source and target feature spaces. This paper introduces FT-MDN-Transformer, a mixture-density tabular Transformer architecture specifically designed for TL in RR forecasting across heterogeneous feature sets. The model produces both loan-level point estimates and portfolio-level predictive distributions, thereby supporting a wide range of practical RR forecasting applications. We evaluate the proposed approach in a controlled Monte Carlo simulation that facilitates systematic variation of covariate, conditional, and label shifts, as well as in a real-world transfer setting using the Global Credit Data (GCD) loan dataset as source and a novel bonds dataset as target. Our results show that FT-MDN-Transformer outperforms baseline models when target-domain data are limited, with particularly pronounced gains under covariate and conditional shifts, while label shift remains challenging. We also observe that its probabilistic forecasts closely track empirical recovery distributions, providing richer information than conventional point-prediction metrics alone. Overall, the findings highlight the potential of distribution-aware TL architectures to improve RR forecasting in data-scarce credit portfolios and offer practical insights for risk managers operating under heterogeneous data environments.
Updated: 2026-04-03 07:54:49
Categories: q-fin.RM,cs.LG
NavCrafter: Exploring 3D Scenes from a Single Image
Creating flexible 3D scenes from a single image is vital when direct 3D data acquisition is costly or impractical. We introduce NavCrafter, a novel framework that explores 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and temporal-spatial consistency. NavCrafter leverages video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage. To enable controllable multi-view synthesis, we introduce a multi-stage camera control mechanism that conditions diffusion models with diverse trajectories via dual-branch camera injection and attention modulation. We further propose a collision-aware camera trajectory planner and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization and refinement. Extensive experiments demonstrate that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.
Updated: 2026-04-03 07:50:39
Categories: cs.CV,cs.AI
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization
Video dubbing requires content accuracy, expressive prosody, high-quality acoustics, and precise lip synchronization, yet existing approaches struggle on all four fronts. To address these issues, we propose DiFlowDubber, the first video dubbing framework built upon a discrete flow matching backbone with a novel two-stage training strategy. In the first stage, a zero-shot text-to-speech (TTS) system is pre-trained on large-scale corpora, where a deterministic architecture captures linguistic structures, and the Discrete Flow-based Prosody-Acoustic (DFPA) module models expressive prosody and realistic acoustic characteristics. In the second stage, we propose the Content-Consistent Temporal Adaptation (CCTA) to transfer TTS knowledge to the dubbing domain: its Synchronizer enforces cross-modal alignment for lip-synchronized speech. Complementarily, the Face-to-Prosody Mapper (FaPro) conditions prosody on facial expressions, whose outputs are then fused with those of the Synchronizer to construct rich, fine-grained multimodal embeddings that capture prosody-content correlations, guiding the DFPA to generate expressive prosody and acoustic tokens for content-consistent speech. Experiments on two benchmark datasets demonstrate that DiFlowDubber outperforms prior methods across multiple evaluation metrics.
Updated: 2026-04-03 07:45:32
Categories: cs.CV,cs.AI,cs.MM,cs.SD
Transfer learning for nonparametric Bayesian networks
This paper introduces two transfer learning methodologies for estimating nonparametric Bayesian networks under scarce data. We propose two algorithms, a constraint-based structure learning method, called PC-stable-transfer learning (PCS-TL), and a score-based method, called hill climbing transfer learning (HC-TL). We also define particular metrics to tackle the negative transfer problem in each of them, a situation in which transfer learning has a negative impact on the model's performance. Then, for the parameters, we propose a log-linear pooling approach. For the evaluation, we learn kernel density estimation Bayesian networks, a type of nonparametric Bayesian network, and compare their transfer learning performance with that of the models alone. To do so, we sample data from small, medium and large-sized synthetic networks and datasets from the UCI Machine Learning repository. Then, we add noise and modifications to these datasets to test their ability to avoid negative transfer. To conclude, we perform a Friedman test with a Bergmann-Hommel post-hoc analysis to provide statistical evidence of the enhanced experimental behavior of our methods. Thus, PCS-TL and HC-TL prove to be reliable algorithms for improving the learning performance of a nonparametric Bayesian network with scarce data, which in real industrial environments implies a reduction in the required time to deploy the network.
Updated: 2026-04-03 07:37:36
Categories: cs.LG,cs.AI
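The log-linear pooling mentioned for the parameters in the abstract above combines densities through a weighted geometric mean, p(x) ∝ Π p_k(x)^{w_k}, followed by renormalization. The sketch below shows that operation on a one-dimensional grid; the two Gaussian densities, the weights 0.7 and 0.3, and the function names are assumptions chosen for illustration and are not taken from the paper, which works with kernel density estimates inside a Bayesian network.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def log_linear_pool(densities, weights, dx):
    """Weighted geometric mean of densities, renormalized on a uniform grid."""
    log_pool = sum(w * np.log(d + 1e-300) for d, w in zip(densities, weights))
    pooled = np.exp(log_pool - log_pool.max())   # subtract max for stability
    return pooled / (pooled.sum() * dx)

grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]
target = gaussian_pdf(grid, 0.0, 1.0)   # density fitted on scarce target data
source = gaussian_pdf(grid, 2.0, 1.0)   # density fitted on richer source data
pooled = log_linear_pool([target, source], [0.7, 0.3], dx)
pooled_mean = float((grid * pooled).sum() * dx)
```

For two Gaussians with equal variance and weights summing to one, the pooled density is again Gaussian with a mean equal to the weighted average of the two means (0.6 here), which makes the effect of the weight easy to check.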
QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (e.g., W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that retains only 12.5% of visual tokens, our framework improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.
Updated: 2026-04-03 07:32:07
Categories: cs.CV,cs.AI
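One way to picture the hybrid sensitivity metric in the QAPruner abstract is the sketch below, which scores each visual token by a simulated group-wise quantization error plus an outlier intensity term, then blends that score with a semantic relevance score before keeping the top 12.5% of tokens. Every formula here (symmetric 4-bit fake quantization, max-over-mean outlier intensity, min-max normalization, the `alpha` blend) is an assumption for illustration; the abstract does not give the exact definitions.

```python
import numpy as np

def groupwise_quant_error(tokens, bits=4, group=8):
    """Per-token squared error after symmetric fake-quantization in groups."""
    n, d = tokens.shape
    g = tokens.reshape(n, d // group, group)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(g).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)       # avoid division by zero
    deq = np.clip(np.round(g / scale), -qmax - 1, qmax) * scale
    return ((g - deq) ** 2).sum(axis=(1, 2))

def outlier_intensity(tokens):
    """Largest activation magnitude relative to the token's mean magnitude."""
    mag = np.abs(tokens)
    return mag.max(axis=1) / (mag.mean(axis=1) + 1e-8)

def _minmax(v):
    return (v - v.min()) / (v.max() - v.min() + 1e-8)

def select_tokens(tokens, relevance, keep_ratio=0.125, alpha=0.5):
    """Keep top tokens by a blend of relevance and quantization sensitivity."""
    sensitivity = (_minmax(groupwise_quant_error(tokens))
                   + _minmax(outlier_intensity(tokens)))
    score = alpha * _minmax(relevance) + (1.0 - alpha) * _minmax(sensitivity)
    k = max(1, int(round(keep_ratio * len(score))))
    return np.sort(np.argsort(score)[-k:])

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 32))       # 64 visual tokens, 32 channels
tokens[5, 3] = 40.0                      # inject an activation outlier
relevance = rng.uniform(size=64)         # stand-in semantic relevance scores
kept = select_tokens(tokens, relevance)
kept_by_sensitivity = select_tokens(tokens, relevance, alpha=0.0)
```

With `alpha=0.0` the selection depends only on the sensitivity term, so the token carrying the injected outlier is retained; a purely semantic criterion (`alpha=1.0`) could drop it, which is the coupling between pruning and quantization that the abstract describes.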
LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade
Migration has been a core topic in German political debate, from the postwar displacement of millions of expellees to labor migration and recent refugee movements. Studying political speech across such wide-ranging phenomena in depth has traditionally required extensive manual annotation, limiting analysis to small subsets of the data. Large language models (LLMs) offer a potential way to overcome this constraint. Using a theory-driven annotation scheme, we examine how well LLMs annotate subtypes of solidarity and anti-solidarity in German parliamentary debates and whether the resulting labels support valid downstream inference. We first provide a comprehensive evaluation of multiple LLMs, analyzing the effects of model size, prompting strategies, fine-tuning, historical versus contemporary data, and systematic error patterns. We find that the strongest models, especially GPT-5 and gpt-oss-120B, achieve human-level agreement on this task, although their errors remain systematic and bias downstream results. To address this issue, we combine soft-label model outputs with Design-based Supervised Learning (DSL) to reduce bias in long-term trend estimates. Beyond the methodological evaluation, we interpret the resulting annotations from a social-scientific perspective to trace trends in solidarity and anti-solidarity toward migrants in postwar and contemporary Germany. Our approach shows relatively high levels of solidarity in the postwar period, especially in group-based and compassionate forms, and a marked rise in anti-solidarity since 2015, framed through exclusion, undeservingness, and resource burden. We argue that LLMs can support large-scale social-scientific text analysis, but only when their outputs are rigorously validated and statistically corrected.
Updated: 2026-04-03 07:30:26
Categories: cs.CL,cs.CY,cs.LG
Decoding RWA Tokenized U.S. Treasuries: Functional Dissection and Address Role Inference
Tokenized U.S. Treasuries have emerged as a prominent subclass of real-world assets (RWAs), offering cryptographically secured, yield-bearing instruments issued across multi-chain Web3 infrastructures, with growing significance for transparency, accessibility, and financial inclusion. While the market has expanded rapidly, empirical analyses of transaction-level behaviours remain limited. This paper conducts a quantitative, function-level dissection of U.S. Treasury-backed RWA tokens, including BUIDL, BENJI, and USDY, across multiple chains, mostly Ethereum and Layer-2 networks. Decoded contract calls expose core financial primitives such as issuance, redemption, transfer, and bridging, revealing patterns that distinguish institutional participants from smaller or retail users and indicate the extent and limits of inclusivity in current RWA adoption. To infer address-level economic roles, we introduce a curvature-aware representation learning model. Our method outperforms baseline models in role inference on our collected U.S. Treasury transaction dataset and generalizes to address classification across broader public blockchain transaction datasets. The decoded transaction-level patterns in tokenized U.S. Treasuries across chains surface the degree of retail participation, and the role inference model enables the distinction between institutional treasuries, arbitrage bots, and retail traders based on behavioral patterns, facilitating more transparent, inclusive, and accountable Web3 finance in the future.
Updated: 2026-04-03 07:30:01
Categories: q-fin.CP,cs.CE,cs.LG
ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs
Functional verification consumes over 50% of the IC development lifecycle, where SystemVerilog Assertions (SVAs) are indispensable for formal property verification and enhanced simulation-based debugging. However, manual SVA authoring is labor-intensive and error-prone. While Large Language Models (LLMs) show promise, their direct deployment is hindered by low functional accuracy and a severe scarcity of domain-specific data. To address these challenges, we introduce ChatSVA, an end-to-end SVA generation system built upon a multi-agent framework. At its core, the AgentBridge platform enables this multi-agent approach by systematically generating high-purity datasets, overcoming the data scarcity inherent to few-shot scenarios. Evaluated on 24 RTL designs, ChatSVA achieves 98.66% syntax and 96.12% functional pass rates, generating 139.5 SVAs per design with 82.50% function coverage. This represents a 33.3 percentage point improvement in functional correctness and an over 11x enhancement in function coverage compared to the previous state-of-the-art (SOTA). ChatSVA not only sets a new SOTA in automated SVA generation but also establishes a robust framework for solving long-chain reasoning problems in few-shot, domain-specific scenarios. An online service has been publicly released at https://www.nctieda.com/CHATDV.html.
Updated: 2026-04-03 07:24:14
Categories: cs.AR,cs.AI
ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents
Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench, a benchmark sourced from real developer-agent sessions. We detail our data collection and curation practices including LLM-based task classification, test relevance validation, and multi-run stability checks which address challenges in constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates ranging from 53.2% to 72.2%. We demonstrate how these offline evaluation signals drive practical decisions around model selection and harness design, while noting that offline benchmarks provide directional signal that we complement with online A/B testing for production deployment decisions. We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
Updated: 2026-04-03 07:18:13
Categories: cs.SE,cs.AI,cs.LG
Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging and may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc design of reward. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of agent's trajectories and the generated goal videos. To enable more fine-grained goal-achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward-backward representation that represents the probability of visiting the goal state from a given state-action pair as frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on Meta-World and Distracting Control Suite demonstrate the effectiveness of our approach.
Updated: 2026-04-03 07:09:55
Categories: cs.LG
f-INE: A Hypothesis Testing Framework for Estimating Influence under Training Randomness
Influence estimation methods promise to explain and debug machine learning by estimating the impact of individual samples on the final model. Yet, existing methods collapse under training randomness: the same example may appear critical in one run and irrelevant in the next. Such instability undermines their use in data curation or cleanup since it is unclear if we indeed deleted/kept the correct datapoints. To overcome this, we introduce f-influence -- a new influence estimation framework grounded in hypothesis testing that explicitly accounts for training randomness, and establish desirable properties that make it suitable for reliable influence estimation. We also design a highly efficient algorithm f-INfluence Estimation (f-INE) that computes f-influence in a single training run. Finally, we scale up f-INE to estimate influence of instruction tuning data on Llama-3.1-8B and show it can reliably detect poisoned samples that steer model opinions, demonstrating its utility for data cleanup and attributing model behavior.
Updated: 2026-04-03 07:05:28
Categories: cs.LG,cs.AI
Structure-Aware Commitment Reduction for Network-Constrained Unit Commitment with Solver-Preserving Guarantees
The growing number of individual generating units, hybrid resources, and security constraints has significantly increased the computational burden of network-constrained unit commitment (UC), where most solution time is spent exploring branch-and-bound trees over unit-hour binary variables. To reduce this combinatorial burden, recent approaches have explored learning-based guidance to assist commitment decisions. However, directly using tools such as large language models (LLMs) to predict full commitment schedules is unreliable, as infeasible or inconsistent binary decisions can violate inter-temporal constraints and degrade economic optimality. This paper proposes a solver-compatible dimensionality reduction framework for UC that exploits structural regularities in commitment decisions. Instead of generating complete schedules, the framework identifies a sparse subset of structurally stable commitment binaries to fix prior to optimization. One implementation uses an LLM to select these variables. The LLM does not replace the optimization process but provides partial variable restriction, while all constraints and remaining decisions are handled by the original MILP solver, which continues to enforce network, ramping, reserve, and security constraints. We formally show that the masked problem defines a reduced feasible region of the original UC model, thereby preserving feasibility and enabling solver-certified optimality within the restricted space. Experiments on IEEE 57-bus, RTS 73-bus, IEEE 118-bus, and augmented large-scale cases, including security-constrained variants, demonstrate consistent reductions in branch-and-bound nodes and solution time, achieving order-of-magnitude speedups on high-complexity instances while maintaining near-optimal objective values.
Updated: 2026-04-03 06:55:32
Categories: cs.LG
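The feasibility argument stated in the unit commitment abstract above (the masked problem defines a reduced feasible region of the original model) can be written out in a few lines. The notation below, with binaries x, continuous variables y, feasible set F, selected index set S, and fixed values x-hat, is assumed for illustration and is not taken from the paper.

```latex
% Original UC model (notation assumed): binaries x, continuous variables y
z^{*} = \min_{(x,y)\in\mathcal{F}} c(x,y), \qquad x \in \{0,1\}^{n}.

% Masked problem: fix x_i = \hat{x}_i for every index i in the selected set S
\mathcal{F}_{S} = \bigl\{(x,y)\in\mathcal{F} : x_i = \hat{x}_i \ \ \forall i \in S \bigr\} \subseteq \mathcal{F},
\qquad z_{S}^{*} = \min_{(x,y)\in\mathcal{F}_{S}} c(x,y).

% Consequences of the inclusion
(x,y)\in\mathcal{F}_{S} \;\Rightarrow\; (x,y)\in\mathcal{F}, \qquad z_{S}^{*} \ge z^{*}.
```

Any schedule the solver returns for the masked problem therefore satisfies every original constraint, and the solver's optimality certificate holds within the restricted set. Equality of the two optimal values holds exactly when some optimal solution of the original model agrees with the fixed values on S; if the fixed values are incompatible with the constraints, the restricted set is empty and the solver reports infeasibility rather than returning an invalid schedule.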
PVD-ONet: A Multi-scale Neural Operator Method for Singularly Perturbed Boundary Layer Problems
Physics-informed neural networks and Physics-informed DeepONet excel in solving partial differential equations; however, they often fail to converge for singularly perturbed problems. To address this, we propose two novel frameworks, Prandtl-Van Dyke neural network (PVD-Net) and its operator learning extension Prandtl-Van Dyke Deep Operator Network (PVD-ONet), which rely solely on governing equations without data. To address varying task-specific requirements, both PVD-Net and PVD-ONet are developed in two distinct versions, tailored respectively for stability-focused and high-accuracy modeling. The leading-order PVD-Net adopts a two-network architecture combined with Prandtl's matching condition, targeting stability-prioritized scenarios. The high-order PVD-Net employs a five-network design with Van Dyke's matching principle to capture fine-scale boundary layer structures, making it ideal for high-accuracy scenarios. PVD-ONet generalizes PVD-Net to the operator learning setting by assembling multiple DeepONet modules, directly mapping initial conditions to solution operators and enabling instant predictions for an entire family of boundary layer problems without retraining. Numerical experiments (second-order equations with constant and variable coefficients, and internal layer problems) show that the proposed methods consistently outperform existing baselines. Moreover, beyond forward prediction, the proposed framework can be extended to inverse problems. It enables the inference of the scaling exponent governing boundary layer thickness from sparse data, providing potential for practical applications.
Updated: 2026-04-03 06:54:21
标题: PVD-ONet:一种用于奇异扰动边界层问题的多尺度神经算子方法
摘要: 物理信息神经网络和物理信息DeepONet在解决偏微分方程方面表现出色;然而,它们经常无法收敛于奇异扰动问题。为了解决这一问题,我们提出了两种新颖的框架,Prandtl-Van Dyke神经网络(PVD-Net)及其操作学习扩展Prandtl-Van Dyke Deep Operator Network(PVD-ONet),仅依赖于控制方程而不需要数据。为了满足不同任务需求,PVD-Net和PVD-ONet分别开发了两种不同版本,分别针对稳定性优先和高精度建模。主导PVD-Net采用两个网络结构,结合Prandtl的匹配条件,针对稳定性优先的情况。高阶PVD-Net采用了五个网络设计,利用Van Dyke的匹配原则来捕捉细节边界层结构,适用于高精度场景。PVD-ONet将PVD-Net推广到操作学习设置,通过组装多个DeepONet模块,直接将初始条件映射到解算符,并实现对整个边界层问题家族的即时预测,无需重新训练。数值实验(具有恒定和可变系数的二阶方程以及内部层问题)表明,所提出的方法始终优于现有基线。此外,除了前向预测,提出的框架还可以扩展到逆问题。它可以从稀疏数据中推断出控制边界层厚度的缩放指数,为实际应用提供潜力。
更新时间: 2026-04-03 06:54:21
领域: cs.LG
ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation
Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.
Updated: 2026-04-03 06:49:01
标题: ROPA:用于RGB-D双手数据增强的合成机器人姿态生成
摘要: 通过模仿学习培训鲁棒的双手操纵策略需要具有广泛覆盖机器人姿势、接触以及场景背景的演示数据。然而,收集多样且精确的真实世界演示数据成本高且耗时,这阻碍了可伸缩性。先前的研究通过数据增强来解决这个问题,通常用于眼-手(手腕相机)设置的RGB输入或用于生成没有配对动作的新图像,对于具有新动作标签的眼-手(第三人称)RGB-D训练的数据增强研究较少。在本文中,我们提出了一种用于RGB-D双手数据增强的合成机器人姿势生成(ROPA)方法,这是一种离线模仿学习数据增强方法,通过微调稳定扩散来合成新的机器人姿势的第三人称RGB和RGB-D观察。我们的方法同时生成相应的关节空间动作标签,同时利用受限优化来通过适当的握爪-物体接触约束在双手操纵场景中强制实施物理一致性。我们在5个模拟任务和3个真实世界任务上评估了我们的方法。在2625个模拟试验和300个真实世界试验中,我们的结果表明ROPA胜过基线和消融实验,展示了它在眼-手双手操纵中可伸缩的RGB和RGB-D数据增强潜力。我们的项目网站可访问:https://ropaaug.github.io/。
更新时间: 2026-04-03 06:49:01
领域: cs.RO,cs.AI,cs.CV,cs.LG
Open Challenges for Secure and Scalable Wi-Fi Connectivity in Rural Areas
Providing reliable, affordable, and secure Internet connectivity in rural areas remains a major challenge. Pay-for-use Wi-Fi hotspots are emerging as a scalable solution to provide affordable Internet access in underserved and rural regions. Despite their growing adoption, their security properties remain largely unexplored. In this paper, we present a security analysis of these hotspot ecosystems based on Wi-Fi surveys and practical attack validation. We first conduct a Wi-Fi survey in two countries, the Philippines and India, to understand how such systems are deployed and adopted in practice. Our results suggest that Piso-WiFi pay-to-use hotspots are particularly widespread in rural regions of the Philippines, and that India's PM-WANI initiative is slowly gaining traction. We then perform a security assessment of these deployments and demonstrate two practical attacks: hijacking another user's paid connection and deploying rogue hotspots. We analyze the root causes of these vulnerabilities, introduce threat models tailored to pay-for-use hotspot deployments, and outline practical security improvements, including a secure caching architecture. Our findings highlight security challenges in emerging rural connectivity infrastructure and provide directions toward more secure and scalable deployments.
Updated: 2026-04-03 06:34:09
标题: 农村地区安全可扩展Wi-Fi连接的挑战
摘要: 在农村地区提供可靠、负担得起和安全的互联网连接仍然是一个重大挑战。付费使用的Wi-Fi热点正在成为在服务不足和农村地区提供负担得起的互联网接入的可扩展解决方案。尽管它们得到了越来越多的采用,但它们的安全性质仍然大部分未被探索。在本文中,我们基于Wi-Fi调查和实际攻击验证,对这些热点生态系统进行了安全分析。我们首先在菲律宾和印度两个国家进行了一项Wi-Fi调查,以了解这些系统在实践中的部署和采用情况。我们的结果表明,菲律宾农村地区的Piso-WiFi付费热点特别普遍,而印度的PM-WANI倡议正在逐渐获得认可。然后,我们对这些部署进行了安全评估,并演示了两种实际攻击:劫持另一个用户的付费连接;和流氓热点。我们分析了这些漏洞的根本原因,介绍了针对付费热点部署定制的威胁模型,并概述了实际安全改进,包括安全缓存架构。我们的研究结果突显了新兴农村连接基础设施中的安全挑战,并提供了朝着更安全和可扩展部署的方向。
更新时间: 2026-04-03 06:34:09
领域: cs.CR,cs.NI
A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems
Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using iterative fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. Moreover, we theoretically analyze the rates of convergence of these methods, and we verify the predictions of this theory with several case studies. This unifying framework highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, the framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
Updated: 2026-04-03 06:32:20
标题: 一个统一的框架,用于将线性动力系统与顺序模型并行化
摘要: 在看似顺序模型中利用并行性是现代机器学习的中心挑战。已经提出了几种方法,可以使用迭代固定点方法(如牛顿、皮卡尔和雅各比迭代)并行评估顺序过程。在这项工作中,我们展示了这些方法可以在基于线性动态系统(LDSs)的通用框架中理解,其中不同的迭代方案自然地产生为非线性递归的近似线性化。此外,我们在理论上分析了这些方法的收敛速度,并通过几个案例研究验证了该理论的预测。这一统一框架突出了这些技术背后的共同原则,并澄清了特定固定点方法在何时最有可能有效。通过通过LDSs的语言连接不同的算法,该框架为并行化顺序模型提供了更清晰的理论基础,并指向了有效和可扩展计算的新机会。
更新时间: 2026-04-03 06:32:20
领域: cs.LG
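The fixed-point view described in the abstract above can be illustrated with a minimal sketch of a Jacobi-style iteration on a scalar nonlinear recursion x_t = f(x_{t-1}). The transition function `f`, the tolerance, and all function names here are assumptions chosen for illustration, not taken from the paper. Each sweep updates every time step from the previous guess, so the per-step work within a sweep is independent and could run in parallel; after k sweeps the first k states are exact, so at most T sweeps are needed.

```python
import math


def f(x):
    # Hypothetical nonlinear transition; any contraction-like map works here.
    return math.tanh(0.9 * x + 0.5)


def sequential_rollout(x0, T):
    """Reference: evaluate x_t = f(x_{t-1}) one step at a time."""
    xs, x = [], x0
    for _ in range(T):
        x = f(x)
        xs.append(x)
    return xs


def jacobi_rollout(x0, T, tol=1e-12):
    """Jacobi-style fixed-point iteration over the whole trajectory.

    Every sweep recomputes all T states from the previous guess, so the
    list comprehension below is the part that could be parallelized.
    Returns the trajectory and the number of sweeps used (at most T).
    """
    xs = [0.0] * T  # arbitrary initial guess for the trajectory
    for k in range(1, T + 1):
        prev = [x0] + xs[:-1]          # shifted guess: x_{t-1} for each t
        new = [f(p) for p in prev]     # independent per-step updates
        delta = max(abs(a - b) for a, b in zip(new, xs))
        xs = new
        if delta < tol:
            return xs, k
    return xs, T


if __name__ == "__main__":
    seq = sequential_rollout(0.0, 20)
    par, sweeps = jacobi_rollout(0.0, 20)
    print(sweeps, max(abs(a - b) for a, b in zip(seq, par)))
```

Newton-type variants replace the plain re-evaluation with a linearized update around the current guess, which typically needs fewer sweeps at a higher cost per sweep.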
Community-Based Early-Stage Chronic Kidney Disease Screening using Explainable Machine Learning for Low-Resource Settings
Early detection of chronic kidney disease (CKD) is essential for preventing progression to end-stage renal disease. However, existing screening tools - primarily developed using populations from high-income countries - often underperform in Bangladesh and South Asia, where risk profiles differ. Most of these tools rely on simple additive scoring functions and are based on data from patients with advanced-stage CKD. Consequently, they fail to capture complex interactions among risk factors and are limited in predicting early-stage CKD. Our objective was to develop and evaluate an explainable machine learning (ML) framework for community-based early-stage CKD screening for low-resource settings, tailored to the Bangladeshi and South Asian population context. A community-based CKD dataset from Bangladesh was used to develop predictive models. Variables were organized into clinically meaningful feature groups, and ten complementary feature selection methods were applied to identify robust predictor subsets. Twelve ML classifiers were evaluated using nested cross-validation. Model performance was benchmarked against established CKD screening tools and externally validated on three independent datasets from India, the UAE, and Bangladesh. SHAP was used to interpret model predictions. An ML model trained on an RFECV-selected feature subset achieved a balanced accuracy of 90.40%, whereas minimal non-pathology-test features demonstrated excellent predictive capability with a balanced accuracy of 89.23%, often outperforming larger or full feature sets. Compared with existing screening tools, the proposed models achieved substantially higher accuracy and sensitivity while requiring fewer and more accessible inputs. External validation confirmed strong generalizability with 78% to 98% sensitivity.
Updated: 2026-04-03 06:32:18
标题: 基于社区的早期慢性肾病筛查:在资源匮乏环境中使用可解释机器学习
摘要: 慢性肾脏病(CKD)的早期检测对于预防疾病进展至晚期肾脏疾病至关重要。然而,现有的筛查工具 - 主要是使用来自高收入国家的人群开发的 - 在孟加拉国和南亚经常表现不佳,因为风险概况有所不同。大多数这些工具依赖于简单的加法评分函数,并且基于晚期CKD患者的数据。因此,它们无法捕捉风险因素之间的复杂相互作用,并且在预测早期CKD方面受到限制。我们的目标是开发和评估一个解释性机器学习(ML)框架,用于面向低资源环境的基于社区的早期CKD筛查,定制为孟加拉国和南亚人口背景。使用来自孟加拉国的基于社区的CKD数据集来开发预测模型。变量被组织成临床意义的特征组,并应用了十种互补的特征选择方法来识别稳健的预测子集。使用嵌套交叉验证评估了十二个ML分类器。模型性能与已建立的CKD筛查工具进行了基准比较,并在印度、阿联酋和孟加拉国的三个独立数据集上进行了外部验证。使用SHAP解释模型预测。在RFECV选择的特征子集上训练的ML模型实现了90.40%的平衡准确性,而最少的非病理测试特征表现出优秀的预测能力,平衡准确性为89.23%,通常优于较大或完整的特征集。与现有的筛查工具相比,所提出的模型在准确性和灵敏度方面取得了明显提高,同时需要更少且更易获取的输入。外部验证证实了78%至98%的敏感性。
更新时间: 2026-04-03 06:32:18
领域: cs.LG
ContractShield: Bridging Semantic-Structural Gaps via Hierarchical Cross-Modal Fusion for Multi-Label Vulnerability Detection in Obfuscated Smart Contracts
Smart contracts are increasingly targeted by adversaries employing obfuscation techniques such as bogus code injection and control flow manipulation to evade vulnerability detection. Existing multimodal methods often process semantic, temporal, and structural features in isolation and fuse them using simple strategies such as concatenation, which neglects cross-modal interactions and weakens robustness, as obfuscation of a single modality can sharply degrade detection accuracy. To address these challenges, we propose ContractShield, a robust multimodal framework with a novel fusion mechanism that correlates multiple complementary features through a three-level fusion. First, self-attention identifies patterns that indicate vulnerability within each feature space. Second, cross-modal attention establishes meaningful connections between complementary signals across modalities. Third, adaptive weighting dynamically calibrates feature contributions based on their reliability under obfuscation. For feature extraction, ContractShield integrates (1) CodeBERT with a sliding window mechanism to capture semantic dependencies in source code, (2) extended long short-term memory (xLSTM) to model temporal dynamics in opcode sequences, and (3) GATv2 to identify structural invariants in control flow graphs (CFGs) that remain stable across obfuscation. Empirical evaluation demonstrates the resilience of ContractShield, which achieves an 89% Hamming Score on obfuscated data, only a 1-3% drop compared to non-obfuscated data. The framework simultaneously detects five major vulnerability types with a 91% F1-score, outperforming state-of-the-art approaches by 6-15% under adversarial conditions.
Updated: 2026-04-03 06:29:34
标题: ContractShield:通过分层交叉模态融合弥合智能合约中多标签漏洞检测中的语义结构差距
摘要: 智能合约越来越受到对手的攻击,他们使用混淆技术,如虚假代码注入和控制流操作,以逃避漏洞检测。现有的多模态方法通常独立处理语义、时间和结构特征,并使用简单的策略(如串联)将它们融合在一起,这忽视了跨模态交互并削弱了健壮性,因为对单一模态的混淆会显著降低检测准确性。为了解决这些挑战,我们提出了ContractShield,这是一个具有新颖融合机制的强大多模态框架,通过三级融合有效地相关多个互补特征。自注意首先识别每个特征空间中指示漏洞的模式。交叉模态注意然后在模态之间建立有意义的连接。然后,自适应加权根据它们在混淆下的可靠性动态校准特征贡献。对于特征提取,ContractShield整合了(1) 带有滑动窗口机制的CodeBERT来捕捉源代码中的语义依赖关系,(2) 扩展的长短期记忆(xLSTM)来模拟操作码序列中的时间动态,并(3) GATv2来识别控制流图中(CFGs)在混淆下保持稳定的结构不变量。实证评估表明,ContractShield具有弹性,与非混淆数据相比,仅有1-3个百分点的下降,达到了89%的Hamming分数。该框架同时检测了五种主要的漏洞类型,具有91%的F1分数,在对手条件下优于最先进的方法6-15个百分点。
更新时间: 2026-04-03 06:29:34
领域: cs.CR
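The Hamming Score reported in the ContractShield abstract is a multi-label metric. A minimal sketch follows, assuming the common definition (per-sample intersection over union of true and predicted label sets, averaged over samples); the abstract does not spell out the exact formula, and the label names below are hypothetical.

```python
def hamming_score(true_labels, pred_labels):
    """Average per-sample |true ∩ pred| / |true ∪ pred| over all samples.

    A sample with no true and no predicted labels counts as a perfect match.
    """
    scores = []
    for t, p in zip(true_labels, pred_labels):
        t, p = set(t), set(p)
        union = t | p
        scores.append(1.0 if not union else len(t & p) / len(union))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical vulnerability labels for two contracts.
    y_true = [{"reentrancy", "overflow"}, {"tx_origin"}]
    y_pred = [{"reentrancy"}, {"tx_origin"}]
    print(hamming_score(y_true, y_pred))  # (0.5 + 1.0) / 2
```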
SentinelAgent: Intent-Verified Delegation Chains for Securing Federal Multi-Agent AI Systems
When Agent A delegates to Agent B, which invokes Tool C on behalf of User X, no existing framework can answer: whose authorization chain led to this action, and where did it violate policy? This paper introduces SentinelAgent, a formal framework for verifiable delegation chains in federal multi-agent AI systems. The Delegation Chain Calculus (DCC) defines seven properties - six deterministic (authority narrowing, policy preservation, forensic reconstructibility, cascade containment, scope-action conformance, output schema conformance) and one probabilistic (intent preservation) - with four meta-theorems and one proposition establishing the practical infeasibility of deterministic intent verification. The Intent-Preserving Delegation Protocol (IPDP) enforces all seven properties at runtime through a non-LLM Delegation Authority Service. A three-point verification lifecycle achieves 100% combined TPR at 0% FPR on DelegationBench v4 (516 scenarios, 10 attack categories, 13 federal domains). Under black-box adversarial conditions, the DAS blocks 30/30 attacks with 0 false positives. Deterministic properties are unbreakable under adversarial stress testing; intent verification degrades to 13% against sophisticated paraphrasing. Fine-tuning the NLI model on 190 government delegation examples improves P2 from 1.7% to 88.3% TPR (5-fold cross-validated, F1=82.1%). Properties P1, P3-P7 are mechanically verified via TLA+ model checking across 2.7 million states with zero violations. Even when intent verification is evaded, the remaining six properties constrain the adversary to permitted API calls, conformant outputs, traceable actions, bounded cascades, and compliant behavior.
Updated: 2026-04-03 06:25:18
标题: 哨兵代理:用于保护联邦多智能体人工智能系统的意图验证委托链
摘要: 当代理A委托给代理B时,代理B代表用户X调用工具C,没有现有的框架能够回答:这个行动的授权链是谁导致的,违反了哪些政策?本文介绍了SentinelAgent,这是一个用于联邦多代理人工智能系统中可验证委托链的形式化框架。委托链计算(DCC)定义了七个属性 - 六个确定性的(权威缩小、政策保留、取证重建性、级联约束、范围-行动一致性、输出模式一致性)和一个概率性的(意图保留)-通过四个元定理和一个命题,建立了确定性意图验证的实际不可行性。意图保持的委托协议(IPDP)通过一个非LLM委托授权服务在运行时强制执行所有七个属性。一个三点验证生命周期在DelegationBench v4(516个场景、10个攻击类别、13个联邦领域)上实现了100%的组合TPR,0%的FPR。在黑盒敌对条件下,DAS阻止了30/30次攻击,没有误报。确定性属性在敌对压力测试下不可破解;对于复杂的改写,意图验证降至13%。在190个政府委托示例上对NLI模型进行微调,将P2从1.7%提高到88.3%的TPR(5倍交叉验证,F1=82.1%)。属性P1、P3-P7通过TLA+模型检查在270万个状态中机械验证,没有违规。即使意图验证被规避,其余的六个属性也将敌对者限制在允许的API调用、符合输出、可追踪行动、有限级联和合规行为中。
更新时间: 2026-04-03 06:25:18
领域: cs.CR,cs.AI,cs.MA
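Of the deterministic properties listed in the SentinelAgent abstract, authority narrowing lends itself to a short illustration. The sketch below assumes one plausible reading: each delegate's permission set must be a subset of its delegator's. The abstract does not define the property formally, so this interpretation, the function names, and the permission strings are all assumptions rather than the paper's Delegation Chain Calculus.

```python
def is_narrowing_chain(scopes):
    """True if every delegate's scope is a subset of its delegator's scope."""
    return all(child <= parent for parent, child in zip(scopes, scopes[1:]))


def first_violation(scopes):
    """Return (hop index, excess permissions) for the first widening hop, or None."""
    for i, (parent, child) in enumerate(zip(scopes, scopes[1:]), start=1):
        if not child <= parent:
            return i, child - parent
    return None


if __name__ == "__main__":
    # Hypothetical chain: User X -> Agent A -> Agent B.
    ok_chain = [{"read", "write", "delete"}, {"read", "write"}, {"read"}]
    bad_chain = [{"read"}, {"read", "write"}]
    print(is_narrowing_chain(ok_chain), first_violation(bad_chain))
```

Returning the hop index alongside the excess permissions mirrors the question posed in the abstract: which link in the chain violated policy, and how.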
Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, utilizing both reward models and LLM-as-a-judge proxies. We find that APL yields negligible improvements in proxy win-rates compared to Random. Crucially, we observe a dissociation where win-rate improves even as general capability, measured by standard benchmarks, degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre-trained priors, the computational overhead of active selection is difficult to justify against the "cheap diversity" provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random-vs-apl.
Updated: 2026-04-03 06:24:11
标题: 随机很难被打败:现代LLMs在在线DPO中的主动选择
摘要: 现代LLMs继承了来自网络规模预训练的强先验知识,这可能会限制后续数据选择策略的提升空间。虽然主动偏好学习(APL)旨在优化在线直接偏好优化(DPO)中的查询效率,但基于策略的候选池的丰富性通常使简单的随机抽样成为一个令人惊讶的强基准。我们评估了基于不确定性的APL与随机抽样在无害性、有益性和遵循指示设置中的对比,利用奖励模型和LLM作为评判者的代理。我们发现与随机抽样相比,APL在代理获胜率上几乎没有改善。至关重要的是,我们观察到一个脱离,即即使通用能力(通过标准基准测试衡量)下降,获胜率仍在提高。APL未能缓解这种能力崩溃,也未能比随机抽样显着降低方差。我们的研究结果表明,在强先验知识的情况下,主动选择的计算开销难以证明胜过简单随机样本提供的“廉价多样性”。我们的代码可在https://github.com/BootsofLagrangian/random-vs-apl找到。
更新时间: 2026-04-03 06:24:11
领域: cs.LG,cs.AI
Towards Realistic Class-Incremental Learning with Free-Flow Increments
Class-incremental learning (CIL) is typically evaluated under predefined schedules with equal-sized tasks, leaving more realistic and complex cases unexplored. However, a practical CIL system should learn immediately when any number of new classes arrive, without forcing fixed-size tasks. We formalize this setting as Free-Flow Class-Incremental Learning (FFCIL), where data arrives as a more realistic stream with a highly variable number of unseen classes at each step. This setting makes many existing CIL methods brittle and leads to clear performance degradation. We propose a model-agnostic framework for robust CIL under free-flow arrivals. It comprises a class-wise mean (CWM) objective that replaces the sample-frequency-weighted loss with uniformly aggregated class-conditional supervision, thereby stabilizing the learning signal across free-flow class increments, as well as method-wise adjustments that improve robustness for representative CIL paradigms. Specifically, we constrain distillation to replayed data, normalize the scale of contrastive and knowledge transfer losses, and introduce Dynamic Intervention Weight Alignment (DIWA) to prevent over-adjustment caused by unstable statistics from small class increments. Experiments confirm a clear performance degradation across various CIL baselines under FFCIL, while our strategies yield consistent gains.
Updated: 2026-04-03 06:19:58
标题: 朝向具有自由流增量的真实类增量学习
摘要: Class-incremental learning(CIL)通常在预定义的具有相同大小任务的时间表下进行评估,这导致更为现实和复杂的情况未被探索。然而,一个实用的CIL系统应该在任何数量的新类别到来时立即学习,而不是强制固定大小的任务。我们将这种设置形式化为自由流式类增量学习(FFCIL),其中数据以更为现实的流式到达,每一步中具有高度可变的未见类别数量。这将使许多现有的CIL方法变得脆弱,并导致明显的性能下降。我们提出了一个针对自由流式到达的鲁棒CIL学习的模型无关框架。它包括一个逐类均值(CWM)目标,用均匀聚合的类条件监督代替样本频率加权损失,从而在自由流式类增量之间稳定学习信号,以及针对典型的CIL范式改进稳健性的方法调整。具体而言,我们将蒸馏限制在重播数据中,规范对比和知识传递损失的规模,并引入动态干预权重对齐(DIWA)以防止由小类增量不稳定统计数据引起的过度调整。实验证实,在FFCIL下各种CIL基线出现明显的性能下降,而我们的策略产生了一致的收益。
更新时间: 2026-04-03 06:19:58
领域: cs.LG
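The class-wise mean (CWM) objective in the abstract above replaces frequency-weighted averaging with uniform aggregation over classes. A minimal sketch of that idea follows, assuming per-sample losses have already been computed; the function names are hypothetical and the paper's exact formulation may differ.

```python
from collections import defaultdict


def sample_mean_loss(losses, labels):
    """Standard batch average: classes with more samples dominate."""
    return sum(losses) / len(losses)


def class_wise_mean_loss(losses, labels):
    """Average losses within each class first, then average across classes,
    so every class present in the batch contributes equally."""
    per_class = defaultdict(list)
    for loss, label in zip(losses, labels):
        per_class[label].append(loss)
    class_means = [sum(v) / len(v) for v in per_class.values()]
    return sum(class_means) / len(class_means)


if __name__ == "__main__":
    # Four samples of class 0 and one (hard) sample of class 1.
    losses = [1.0, 1.0, 1.0, 1.0, 5.0]
    labels = [0, 0, 0, 0, 1]
    print(sample_mean_loss(losses, labels), class_wise_mean_loss(losses, labels))
```

In the example, the rare class accounts for one fifth of the sample-mean loss but half of the class-wise mean, which is the stabilizing effect the abstract attributes to CWM when increment sizes vary widely.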
Pushing the Limits of Distillation-Based Continual Learning via Classifier-Proximal Lightweight Plugins
Continual learning requires models to learn continuously while preserving prior knowledge under evolving data streams. Distillation-based methods are appealing for retaining past knowledge in a shared single-model framework with low storage overhead. However, they remain constrained by the stability-plasticity dilemma: knowledge acquisition and preservation are still optimized through coupled objectives, and existing enhancement methods do not alter this underlying bottleneck. To address this issue, we propose a plugin extension paradigm termed Distillation-aware Lightweight Components (DLC) for distillation-based continual learning (CL). DLC deploys lightweight residual plugins into the base feature extractor's classifier-proximal layer, enabling semantic-level residual correction for better classification accuracy while minimizing disruption to the overall feature extraction process. During inference, plugin-enhanced representations are aggregated to produce classification predictions. To mitigate interference from non-target plugins, we further introduce a lightweight weighting unit that learns to assign importance scores to different plugin-enhanced representations. DLC can deliver a significant 8% accuracy gain on large-scale benchmarks while introducing only a 4% increase in backbone parameters, highlighting its efficiency. Moreover, DLC is compatible with other plug-and-play CL enhancements and delivers additional gains when combined with them.
Updated: 2026-04-03 06:15:30
标题: 推动基于蒸馏的持续学习极限:通过分类器-邻近轻量级插件
摘要: Continuous learning requires models to continuously learn while preserving previous knowledge in the face of evolving data streams. Distillation-based methods are attractive for retaining past knowledge in a shared single-model framework with low storage overhead. However, they are limited by the stability-plasticity dilemma: knowledge acquisition and preservation are still optimized through coupled objectives, and existing enhancement methods do not address this bottleneck. To tackle this issue, we propose a plugin extension paradigm called Distillation-aware Lightweight Components (DLC) for distillation-based continual learning. DLC integrates lightweight residual plugins into the classifier-proximal layer of the base feature extractor, allowing for semantic-level residual correction to improve classification accuracy while minimizing disruption to the overall feature extraction process. During inference, plugin-enhanced representations are combined to make classification predictions. To reduce interference from non-target plugins, we also introduce a lightweight weighting unit that learns to assign importance scores to different plugin-enhanced representations. DLC can significantly improve accuracy by 8% on large-scale benchmarks with only a 4% increase in backbone parameters, demonstrating its exceptional efficiency. Additionally, DLC is compatible with other plug-and-play continual learning enhancements and provides additional benefits when used in conjunction with them.
更新时间: 2026-04-03 06:15:30
领域: cs.LG,stat.ML
STDDN: A Physics-Guided Deep Learning Framework for Crowd Simulation
Accurate crowd simulation is crucial for public safety management, emergency evacuation planning, and intelligent transportation systems. However, existing methods, which typically model crowds as a collection of independent individual trajectories, are limited in their ability to capture macroscopic physical laws. This microscopic approach often leads to error accumulation and compromises simulation stability. Furthermore, deep learning-driven methods tend to suffer from low inference efficiency and high computational overhead, making them impractical for large-scale, efficient simulations. To address these challenges, we propose the Spatio-Temporal Decoupled Differential Equation Network (STDDN), a novel framework that guides microscopic trajectory prediction with macroscopic physics. We innovatively introduce the continuity equation from fluid dynamics as a strong physical constraint. A Neural Ordinary Differential Equation (Neural ODE) is employed to model the macroscopic density evolution driven by individual movements, thereby physically regularizing the microscopic trajectory prediction model. We design a density-velocity coupled dynamic graph learning module to formulate the derivative of the density field within the Neural ODE, effectively mitigating error accumulation. We also propose a differentiable density mapping module to eliminate discontinuous gradients caused by discretization and introduce a cross-grid detection module to accurately model the impact of individual cross-grid movements on local density changes. The proposed STDDN method has demonstrated significantly superior simulation performance compared to state-of-the-art methods on long-term tasks across four real-world datasets, as well as a major reduction in inference latency.
Updated: 2026-04-03 06:06:04
标题: STDDN:基于物理引导的深度学习框架用于人群模拟
摘要: 准确的人群模拟对于公共安全管理、紧急疏散规划和智能交通系统至关重要。然而,现有方法通常将人群建模为独立个体轨迹的集合,这种方法在捕捉宏观物理规律方面存在局限性。这种微观方法通常会导致误差累积并损害模拟稳定性。此外,基于深度学习的方法往往存在推理效率低和计算开销高的问题,使它们在大规模、高效的模拟中不切实际。为了解决这些挑战,我们提出了时空解耦微分方程网络(STDDN),这是一个新颖的框架,用宏观物理指导微观轨迹预测。我们创新地引入了流体动力学中的连续方程作为强有力的物理约束。采用神经常微分方程(Neural ODE)来模拟由个体移动驱动的宏观密度演变,从而在物理上规范微观轨迹预测模型。我们设计了一个密度-速度耦合的动态图学习模块,来制定神经ODE内的密度场导数,有效减轻误差积累。我们还提出了一个可微密度映射模块,以消除由离散化引起的不连续梯度,并引入一个跨网格检测模块来准确建模个体跨网格移动对局部密度变化的影响。所提出的STDDN方法在四个真实数据集上的长期任务中展示了明显优越的模拟性能,同时推理延迟也大幅减少。
更新时间: 2026-04-03 06:06:04
领域: cs.LG
ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to that of state-of-the-art pruning algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
Updated: 2026-04-03 05:57:22
标题: ARMOR:通过自适应矩阵因式分解实现高性能半结构化剪枝
摘要: 大型语言模型(LLMs)由于其巨大的计算和内存需求而面临着重大的部署挑战。尽管半结构化修剪,特别是2:4稀疏度,为实际硬件加速提供了一条途径,但现有方法通常会导致显著的性能降级。为了弥合这一差距,我们引入了ARMOR:(自适应矩阵分解表示),这是一种新颖的一次性训练后修剪算法。ARMOR不是直接修剪权重,而是将每个权重矩阵分解为一个由两个低开销的块对角矩阵包裹的2:4稀疏核心。这些包裹作为高效的预处理和后处理变换误差校正器,相比传统的2:4修剪技术,提供了更大的灵活性来保持模型质量。通过一个块坐标下降算法选择稀疏核心和块对角包裹,该算法最小化逐层代理损失。我们从理论上证明,该优化保证会收敛到一个具有代理损失小于或等于现有最先进修剪算法的解决方案。对Llama(Touvron等,2023年; Dubey等,2024年)和Qwen(Yang等,2025年)模型家族的实验表明,ARMOR在各种下游任务和困惑性评估中始终明显优于最先进的2:4修剪方法。ARMOR在保留2:4修剪的推理加速和大幅减少内存使用的同时实现了这种卓越性能,确立了模型压缩和任务准确性之间更有效的权衡。
更新时间: 2026-04-03 05:57:22
领域: cs.LG
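For readers unfamiliar with the 2:4 sparsity pattern mentioned in the ARMOR abstract, the sketch below shows the conventional magnitude-based baseline: in every group of four consecutive weights, the two with the smallest magnitude are zeroed. This is not ARMOR itself, which factorizes each matrix into a sparse core and block diagonal wrappers; the function name is hypothetical and the sketch only illustrates the constraint the sparse core must satisfy.

```python
def prune_2_4(row):
    """Magnitude-based 2:4 pruning of one weight row.

    Keeps the two largest-magnitude entries in each group of four
    consecutive weights and zeroes the other two.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out


if __name__ == "__main__":
    weights = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
    print(prune_2_4(weights))
```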
Learn then Decide: A Learning Approach for Designing Data Marketplaces
As data marketplaces become increasingly central to the digital economy, it is crucial to design efficient pricing mechanisms that optimize revenue while ensuring fair and adaptive pricing. We introduce the Maximum Auction-to-Posted Price (MAPP) mechanism, a novel two-stage approach that first estimates the bidders' value distribution through auctions and then determines the optimal posted price based on the learned distribution. We establish that MAPP is individually rational and incentive-compatible, ensuring truthful bidding while balancing revenue maximization with minimal price discrimination. On the theoretical side, we establish a statistical viewpoint that recasts revenue optimization as a valuation density estimation problem: we show that revenue regret can be controlled by uniform error in estimating the valuation density. MAPP achieves a regret of $O_p(n^{-1}(\log n)^2)$ when incorporating historical bid data, where $n$ is the number of bids in the current round. For sequential dataset sales over $T$ rounds, we propose an online MAPP mechanism that dynamically adjusts pricing across datasets with varying value distributions. Our approach achieves no-regret learning, with the average cumulative regret converging at a rate of $O_p(T^{-1/2}(\log T)^2)$. We validate the effectiveness of MAPP through simulations and real-world data from the FCC AWS-3 spectrum auction.
Updated: 2026-04-03 05:52:10
标题: 学习然后决定:设计数据市场的学习方法
摘要: 随着数据市场在数字经济中变得越来越重要,设计高效的定价机制以优化收入并确保公平和适应性定价至关重要。我们引入了最大拍卖-发布价格(MAPP)机制,这是一种新颖的两阶段方法,首先通过拍卖来估计竞标者的价值分布,然后根据学习到的分布确定最优发布价格。我们确定MAPP是个体理性和激励兼容的,确保竞标诚实,同时在最大化收入和最小化价格歧视之间取得平衡。在理论上,我们建立了一个统计观点,将收入优化重新解释为估算价值密度的问题:我们展示了通过估算价值密度的均匀误差可以控制收入遗憾。当结合历史竞标数据时,MAPP在当前轮次中的竞标数为n时,可以实现$O_p(n^{-1}(\log n)^2)$的遗憾。对于连续数据集销售的$T$轮次,我们提出了一种在线MAPP机制,动态调整具有不同价值分布的数据集的定价。我们的方法实现了无遗憾学习,平均累积遗憾以$O_p(T^{-1/2}(\log T)^2)$的速率收敛。我们通过模拟和来自FCC AWS-3频谱拍卖的真实数据验证了MAPP的有效性。
更新时间: 2026-04-03 05:52:10
领域: stat.ML,cs.LG
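The second stage of the mechanism described above picks a posted price from a learned valuation distribution. A minimal sketch of that step follows, assuming the simplest possible estimator: the empirical distribution of observed bids, with the price chosen to maximize p times the fraction of bids at or above p. MAPP as described estimates a valuation density and comes with regret guarantees; this sketch, its function name, and the sample bids are illustrative assumptions only.

```python
def empirical_posted_price(bids):
    """Pick the candidate price maximizing expected revenue p * P(value >= p),
    with P estimated from the empirical distribution of the bids."""
    n = len(bids)
    best_price, best_revenue = None, -1.0
    for p in sorted(set(bids)):
        share_buying = sum(1 for b in bids if b >= p) / n
        revenue = p * share_buying
        if revenue > best_revenue:
            best_price, best_revenue = p, revenue
    return best_price, best_revenue


if __name__ == "__main__":
    # Hypothetical truthful bids from a first-stage auction.
    print(empirical_posted_price([2, 4, 5, 6]))
```

Only observed bid values need to be checked as candidates, because between two consecutive bid values the share of buyers stays constant while the price could be raised.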
Understanding Latent Diffusability via Fisher Geometry
Diffusion models often degrade when trained in latent spaces (e.g., VAEs), yet the formal causes remain poorly understood. We quantify latent-space diffusability through the rate of change of the Minimum Mean Squared Error (MMSE) along the diffusion trajectory. Our framework decomposes this MMSE rate into contributions from Fisher Information (FI) and Fisher Information Rate (FIR). We demonstrate that while global isometry ensures FI alignment, FIR is governed by the encoder's local geometric properties. Our analysis explicitly decouples latent geometric distortion into three measurable penalties: dimensional compression, tangential distortion, and curvature injection. We derive theoretical conditions for FIR preservation across spaces, ensuring maintained diffusability. Experiments across diverse autoencoding architectures validate our framework and establish these efficient FI and FIR metrics as a robust diagnostic suite for identifying and mitigating latent diffusion failure.
Updated: 2026-04-03 05:52:09
标题: 通过费舍尔几何学理解潜在的扩散性
摘要: 扩散模型在潜在空间(例如VAEs)中训练时往往会退化,但其形式原因仍然不太清楚。我们通过最小均方误差(MMSE)沿扩散轨迹的变化速率量化潜在空间的扩散性。我们的框架将这个MMSE速率分解为来自Fisher信息(FI)和Fisher信息速率(FIR)的贡献。我们证明,虽然全局等距确保FI对齐,但FIR受编码器的局部几何属性控制。我们的分析明确地将潜在几何失真解耦为三个可测量的惩罚:维度压缩、切向失真和曲率注入。我们推导了跨空间保持FIR的理论条件,确保了扩散性的维持。在不同的自编码架构上进行的实验验证了我们的框架,并将这些高效的FI和FIR指标建立为一套强大的诊断工具,用于识别和减轻潜在的扩散失败。
更新时间: 2026-04-03 05:52:09
领域: cs.LG
SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
The success of large language models (LLMs) in scientific domains has heightened safety concerns, prompting numerous benchmarks to evaluate their scientific safety. Existing benchmarks often suffer from limited risk coverage and a reliance on subjective evaluation. To address these problems, we introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement. SafeSciBench distinguishes between safety knowledge and risk to cover extensive scopes and employs objective metrics such as deterministically answerable questions to mitigate evaluation bias. We evaluate 24 advanced LLMs, revealing critical vulnerabilities in current models. We also observe that LLMs exhibit varying degrees of excessive refusal behaviors on safety-related issues. For safety enhancement, we demonstrate that fine-tuning on SafeSciTrain significantly enhances the safety alignment of models. Finally, we argue that knowledge is a double-edged sword, and determining the safety of a scientific question should depend on specific context, rather than universally categorizing it as safe or unsafe. Our work provides both a diagnostic tool and a practical resource for building safer scientific AI systems.
Updated: 2026-04-03 05:26:21
标题: SafeSci:大型语言模型在科学领域及其他领域的安全评估
摘要: 大型语言模型(LLMs)在科学领域的成功增加了安全担忧,促使了许多基准来评估它们的科学安全性。现有的基准往往存在风险覆盖有限和依赖主观评价的问题。为解决这些问题,我们引入了SafeSci,一个用于科学上下文中安全评估和增强的综合框架。SafeSci包括SafeSciBench,一个包含0.25M样本的跨学科基准,以及SafeSciTrain,一个包含1.5M样本用于安全增强的大规模数据集。SafeSciBench区分安全知识和风险,以覆盖广泛范围,并采用客观指标,如确定性可回答的问题,以减轻评估偏见。我们评估了24个先进的LLMs,揭示了当前模型中的关键漏洞。我们还观察到LLMs在安全相关问题上表现出不同程度的过度拒绝行为。对于安全增强,我们证明在SafeSciTrain上微调显著增强了模型的安全对齐性。最后,我们认为知识是一把双刃剑,确定科学问题的安全性应取决于具体上下文,而不是普遍将其分类为安全或不安全。我们的工作为构建更安全的科学人工智能系统提供了诊断工具和实用资源。
更新时间: 2026-04-03 05:26:21
领域: cs.LG,cs.AI
State estimations and noise identifications with intermittent corrupted observations via Bayesian variational inference
This paper focuses on the state estimation problem in distributed sensor networks, where intermittent packet dropouts, corrupted observations, and unknown noise covariances coexist. To tackle this challenge, we formulate the joint estimation of system states, noise parameters, and network reliability as a Bayesian variational inference problem, and propose a novel variational Bayesian adaptive Kalman filter (VB-AKF) to approximate the joint posterior probability densities of the latent parameters. Unlike existing AKFs, which handle missing data and measurement outliers separately, the proposed VB-AKF adopts a dual-mask generative model with two independent Bernoulli random variables, explicitly characterizing both observable communication losses and latent data authenticity. Additionally, the VB-AKF integrates multiple concurrent observations into the adaptive filtering framework, which significantly enhances statistical identifiability. Comprehensive numerical experiments verify the effectiveness and asymptotic optimality of the proposed method, showing that both parameter identification and state estimation asymptotically converge to the theoretical optimal lower bound as the number of sensors increases.
Updated: 2026-04-03 05:20:49
Categories: stat.ML,cs.LG,math.OC,stat.CO
Stock Market Prediction Using Node Transformer Architecture Integrated with BERT Sentiment Analysis
Stock market prediction presents considerable challenges for investors, financial institutions, and policymakers operating in complex market environments characterized by noise, non-stationarity, and behavioral dynamics. Traditional forecasting methods, including fundamental analysis and technical indicators, often fail to capture the intricate patterns and cross-sectional dependencies inherent in financial markets. This paper presents an integrated framework combining a node transformer architecture with BERT-based sentiment analysis for stock price forecasting. The proposed model represents the stock market as a graph structure where individual stocks form nodes and edges capture relationships including sectoral affiliations, correlated price movements, and supply chain connections. A fine-tuned BERT model extracts sentiment information from social media posts and combines it with quantitative market features through attention-based fusion mechanisms. The node transformer processes historical market data while capturing both temporal evolution and cross-sectional dependencies among stocks. Experiments conducted on 20 S&P 500 stocks spanning January 1982 to March 2025 demonstrate that the integrated model achieves a mean absolute percentage error (MAPE) of 0.80% for one-day-ahead predictions, compared to 1.20% for ARIMA and 1.00% for LSTM. The inclusion of sentiment analysis reduces prediction error by 10% overall and 25% during earnings announcements, while the graph-based architecture contributes an additional 15% improvement by capturing inter-stock dependencies. Directional accuracy reaches 65% for one-day forecasts. Statistical validation through paired t-tests confirms the significance of these improvements (p < 0.05 for all comparisons). The model maintains lower error during high-volatility periods, achieving MAPE of 1.50% while baseline models range from 1.60% to 2.10%.
Updated: 2026-04-03 04:57:36
Categories: cs.LG,cs.AI,q-fin.ST
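The two headline metrics in the abstract above, mean absolute percentage error (MAPE) and directional accuracy, can be sketched as follows. This is a minimal illustration using the standard definitions; the function names and the sample prices are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def directional_accuracy(previous, actual, predicted):
    """Percentage of days on which the predicted move (relative to the
    previous close) has the same sign as the realized move."""
    if not actual or not (len(previous) == len(actual) == len(predicted)):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for prev, a, p in zip(previous, actual, predicted)
        if (a - prev) * (p - prev) > 0
    )
    return 100.0 * hits / len(actual)


# Hypothetical closing prices, for illustration only.
previous = [100.0, 102.0, 101.0, 103.0]
actual = [102.0, 101.0, 103.0, 104.0]
predicted = [101.0, 102.5, 102.0, 103.5]

print(round(mape(actual, predicted), 2))
print(directional_accuracy(previous, actual, predicted))
```

Under these definitions, a MAPE of 0.80% means the one-day-ahead forecast misses the realized price by 0.8% on average, while a directional accuracy of 65% counts only whether the sign of the move was predicted correctly.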
Amortized Inference of Causal Models via Conditional Fixed-Point Iterations
Structural Causal Models (SCMs) offer a principled framework to reason about interventions and support out-of-distribution generalization, which are key goals in scientific discovery. However, the task of learning SCMs from observed data poses formidable challenges, and often requires training a separate model for each dataset. In this work, we propose an amortized inference framework that trains a single model to predict the causal mechanisms of SCMs conditioned on their observational data and causal graph. We first use a transformer-based architecture for amortized learning of dataset embeddings, and then extend the Fixed-Point Approach (FiP) to infer the causal mechanisms conditionally on their dataset embeddings. As a byproduct, our method can generate observational and interventional data from novel SCMs at inference time, without updating parameters. Empirical results show that our amortized procedure performs on par with baselines trained specifically for each dataset on both in-distribution and out-of-distribution problems, and also outperforms them in scarce-data regimes.
Updated: 2026-04-03 04:44:44
Categories: cs.LG,stat.ML
MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications
We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merging to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large-scale, high-quality corpus of $\sim 12$ million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars-Bench. MOMO achieves better overall performance than ImageNet pre-trained, Earth observation foundation model, sensor-specific pre-training, and fully supervised baselines. On segmentation tasks in particular, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi-resolution data. The model weights, pretraining code, pretraining data, and evaluation code are available at: https://github.com/kerner-lab/MOMO.
Updated: 2026-04-03 04:22:30
Categories: cs.CV,cs.AI,cs.LG
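The merging recipe described in the MOMO abstract (pick per-sensor checkpoints at similar validation loss, then fuse them with task arithmetic) can be sketched roughly as below. This is an assumption-laden toy, not the authors' code: parameters are plain floats rather than tensors, the shared target loss is passed in by hand because the abstract does not say how EVL chooses it, and all names and numbers are hypothetical.

```python
def select_evl_checkpoints(histories, target_loss):
    """For each sensor, pick the checkpoint whose validation loss is
    closest to a shared target loss (assumed selection rule)."""
    return {
        sensor: min(ckpts, key=lambda c: abs(c["val_loss"] - target_loss))
        for sensor, ckpts in histories.items()
    }


def task_arithmetic_merge(base, finetuned, scale=1.0):
    """Task arithmetic: merged = base + scale * sum_i (theta_i - base),
    applied independently to every named parameter."""
    merged = {}
    for name, base_value in base.items():
        task_sum = sum(weights[name] - base_value for weights in finetuned)
        merged[name] = base_value + scale * task_sum
    return merged


# Hypothetical shared initialization and per-sensor training histories.
base = {"w": 1.0, "b": 0.0}
histories = {
    "HiRISE": [
        {"val_loss": 0.90, "weights": {"w": 1.5, "b": 0.25}},
        {"val_loss": 0.50, "weights": {"w": 2.0, "b": 0.5}},
    ],
    "CTX": [
        {"val_loss": 0.52, "weights": {"w": 0.5, "b": -0.5}},
        {"val_loss": 0.30, "weights": {"w": 0.0, "b": -1.0}},
    ],
}

selected = select_evl_checkpoints(histories, target_loss=0.5)
merged = task_arithmetic_merge(
    base, [ckpt["weights"] for ckpt in selected.values()], scale=0.5
)
print(merged)
```

Note that the CTX checkpoint with the lowest loss (0.30) is not the one selected; under this reading of EVL, what matters is that the merged checkpoints sit at comparable convergence stages rather than that each is individually best.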
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
Diffusion language models have seen exciting recent progress, offering far more flexibility in generative trajectories than autoregressive models. This flexibility has motivated a growing body of research into new approaches to diffusion language modeling, which typically begins at the scale of GPT-2 small (150 million parameters). However, these advances introduce new issues with evaluation methodology. In this technical note, we discuss the limitations of current methodology and propose principled augmentations to ensure reliable comparisons. We first discuss why OpenWebText has become the standard benchmark, and why alternatives such as LM1B are inherently less meaningful. We then discuss the limitations of likelihood evaluations for diffusion models, and explain why relying on generative perplexity alone as a metric can lead to uninformative results. To address this, we show that generative perplexity and entropy are two components of the KL divergence to a reference distribution. This decomposition explains generative perplexity's sensitivity to entropy, and naturally suggests generative frontiers as a principled method for evaluating model generative quality. We conclude with empirical observations on model quality at this scale. We include a blog post with interactive content to illustrate the argument at https://patrickpynadath1.github.io/blog/eval_methodology/.
Updated: 2026-04-03 04:21:20
Categories: cs.LG,cs.CL
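The decomposition mentioned in the abstract above follows from the standard identity for KL divergence. As a sketch, writing $q$ for the distribution of model samples and $p$ for the reference distribution (this notation is an assumption; the note's own symbols and per-token normalization may differ):

```latex
\begin{align*}
\mathrm{KL}(q \,\|\, p)
  &= \mathbb{E}_{x \sim q}\bigl[\log q(x) - \log p(x)\bigr] \\
  &= \underbrace{\mathbb{E}_{x \sim q}\bigl[-\log p(x)\bigr]}_{\text{cross-entropy, i.e. log generative perplexity}}
   \;-\; \underbrace{H(q)}_{\text{entropy of the samples}}
\end{align*}
```

Read this way, generative perplexity captures only the first term. A model can lower it simply by producing lower-entropy samples, without moving closer to the reference in KL, which is why reporting the two quantities together as a frontier is more informative than either alone.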
FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to enable efficient MoE inference under severe memory constraints. Experimental results demonstrate that FluxMoE achieves up to 3.0$\times$ throughput gains over vLLM in memory-intensive regimes, without compromising model fidelity.
Updated: 2026-04-03 04:16:01
Categories: cs.LG
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of the Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.
Updated: 2026-04-03 04:06:39
Categories: cs.CL,cs.AI,cs.LG,cs.SE
S$^4$ST: A Strong, Self-transferable, faSt, and Simple Scale Transformation for Transferable Targeted Attack
Transferable Targeted Attacks (TTAs) face significant challenges due to severe overfitting to surrogate models. Recent breakthroughs heavily rely on large-scale training data of victim models, while data-free solutions, i.e., image transformation-involved gradient optimization, often depend on black-box feedback for method design and tuning. These dependencies violate black-box transfer settings and compromise threat evaluation fairness. In this paper, we propose two blind estimation measures, self-alignment and self-transferability, to analyze per-transformation effectiveness and cross-transformation correlations under strict black-box constraints. Our findings challenge conventional assumptions: (1) Attacking simple scaling transformations uniquely enhances targeted transferability, outperforming other basic transformations and rivaling leading complex methods; (2) Geometric and color transformations exhibit high internal redundancy despite weak inter-category correlations. These insights drive the design and tuning of S$^4$ST (Strong, Self-transferable, faSt, Simple Scale Transformation), which integrates dimensionally consistent scaling, complementary low-redundancy transformations, and block-wise operations. Extensive evaluations across diverse architectures, training distributions, and tasks show that S$^4$ST achieves state-of-the-art effectiveness-efficiency balance without data dependency. We reveal that scaling's effectiveness stems from visual data's multi-scale nature and ubiquitous scale augmentation during training, rendering such augmentation a double-edged sword. Further validations on medical imaging and face verification confirm the framework's strong generalization.
Updated: 2026-04-03 02:48:34
Categories: cs.CR,cs.AI
Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents
Memory makes LLM-based web agents personalized, powerful, yet exploitable. By storing past interactions to personalize future tasks, agents inadvertently create a persistent attack surface that spans websites and sessions. While existing security research on memory assumes attackers can directly inject into memory storage or exploit shared memory across users, we present a more realistic threat model: contamination through environmental observation alone. We introduce Environment-injected Trajectory-based Agent Memory Poisoning (eTAMP), the first attack to achieve cross-session, cross-site compromise without requiring direct memory access. A single contaminated observation (e.g., viewing a manipulated product page) silently poisons an agent's memory and activates during future tasks on different websites, bypassing permission-based defenses. Our experiments on (Visual)WebArena reveal two key findings. First, eTAMP achieves substantial attack success rates: up to 32.5% on GPT-5-mini, 23.4% on GPT-5.2, and 19.5% on GPT-OSS-120B. Second, we discover Frustration Exploitation: agents under environmental stress become dramatically more susceptible, with ASR increasing up to 8 times when agents struggle with dropped clicks or garbled text. Notably, more capable models are not more secure. GPT-5.2 shows substantial vulnerability despite superior task performance. With the rise of AI browsers like OpenClaw, ChatGPT Atlas, and Perplexity Comet, our findings underscore the urgent need for defenses against environment-injected memory poisoning.
Updated: 2026-04-03 01:25:12
Categories: cs.CR,cs.AI
AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models
Scientific and Technical Intelligence (S&TI) analysis requires verifying complex technical claims across rapidly growing literature, where existing approaches fail to bridge the verification gap between surface-level accuracy and deeper methodological validity. We present AutoVerifier, an LLM-based agentic framework that automates end-to-end verification of technical claims without requiring domain expertise. AutoVerifier decomposes every technical assertion into structured claim triples of the form (Subject, Predicate, Object), constructing knowledge graphs that enable structured reasoning across six progressively enriching layers: corpus construction and ingestion, entity and claim extraction, intra-document verification, cross-source verification, external signal corroboration, and final hypothesis matrix generation. We demonstrate AutoVerifier on a contested quantum computing claim, where the framework, operated by analysts with no quantum expertise, automatically identified overclaims and metric inconsistencies within the target paper, traced cross-source contradictions, uncovered undisclosed commercial conflicts of interest, and produced a final assessment. These results show that structured LLM verification can reliably evaluate the validity and maturity of emerging technologies, turning raw technical documents into traceable, evidence-backed intelligence assessments.
Updated: 2026-04-03 01:11:43
Categories: cs.AI,cs.CR,cs.IR,cs.LG,cs.SI
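To make the (Subject, Predicate, Object) representation from the AutoVerifier abstract concrete, here is a small sketch of how claim triples might be stored and checked for cross-source contradictions. It is purely illustrative: the class, the function names, the contradiction rule, and the sample claims are all hypothetical and are not taken from the AutoVerifier implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimTriple:
    subject: str
    predicate: str
    obj: str
    source: str  # document the claim was extracted from


def build_graph(triples):
    """Adjacency view of the claims: subject -> [(predicate, object, source)]."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.predicate, t.obj, t.source))
    return dict(graph)


def find_contradictions(triples):
    """Flag (subject, predicate) pairs that are asserted with more than one
    distinct object. A deliberately naive stand-in for cross-source checks."""
    objects = defaultdict(set)
    for t in triples:
        objects[(t.subject, t.predicate)].add(t.obj)
    return {key: objs for key, objs in objects.items() if len(objs) > 1}


# Hypothetical claims, for illustration only.
claims = [
    ClaimTriple("DeviceX", "qubit_count", "100", "target_paper"),
    ClaimTriple("DeviceX", "qubit_count", "50", "vendor_datasheet"),
    ClaimTriple("DeviceX", "gate_type", "superconducting", "target_paper"),
]

graph = build_graph(claims)
conflicts = find_contradictions(claims)
print(conflicts)
```

Keeping the source on every triple is what makes a flagged conflict traceable back to the documents that disagree.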
Separating Oblivious and Adaptive Differential Privacy under Continual Observation
We resolve an open question of Jain, Raskhodnikova, Sivakumar, and Smith (ICML 2023) by exhibiting a problem separating differential privacy under continual observation in the oblivious and adaptive settings. The continual observation (a.k.a. continual release) model formalizes privacy for streaming algorithms, where data is received over time and output is released at each time step. In the oblivious setting, privacy need only hold for data streams fixed in advance; in the adaptive setting, privacy is required even for streams that can be chosen adaptively based on the streaming algorithm's output. We describe the first explicit separation between the oblivious and adaptive settings. The problem showing this separation is based on the correlated vector queries problem of Bun, Steinke, and Ullman (SODA 2017). Specifically, we present an $(\varepsilon,0)$-DP algorithm for the oblivious setting that remains accurate for exponentially many time steps in the dimension of the input. On the other hand, we show that every $(\varepsilon,\delta)$-DP adaptive algorithm fails to be accurate after releasing output for only a constant number of time steps.
Updated: 2026-04-03 00:04:26
Categories: cs.CR,cs.DS
The Quantum-Cryptographic Co-evolution
As quantum computing matures toward the realization of Cryptographically Relevant Quantum Computers (CRQC), global cryptographic infrastructure faces an existential threat. This paper introduces a two-dimensional coordinate system to map the co-evolution of cryptographic resilience (x-axis) and computational capability (y-axis). By analyzing the four resulting quadrants, we categorize the transition from legacy classical systems to quantum-resilient architectures. We argue that the "Quantum Gap", the delta between CRQC arrival and quantum-safe adoption, represents the highest systemic risk, necessitating an immediate transition to crypto-agile frameworks.
Updated: 2026-04-03 00:01:06
Categories: cs.CR