Boosting Team Modeling through Tempo-Relational Representation Learning
Team modeling remains a fundamental challenge at the intersection of Artificial Intelligence and the Social Sciences. Although a variety of computational models have been proposed over the last two decades, most fail to integrate Social Science insights, such as the critical role of temporal interactions in shaping team dynamics, and do not meet key practical requirements for real-world applications, including the ability to provide real-time, actionable recommendations to enhance team performance. To address these limitations, we propose a novel tempo-relational neural architecture that jointly models interactions between team members and the evolution of team dynamics through temporal graphs. We additionally propose a multi-task extension of the architecture that learns shared social embeddings for team members, enabling the simultaneous prediction of multiple team constructs (e.g., Emergent Leadership, Leadership Style, and Teamwork components). Experiments on two state-of-the-art team datasets show that our tempo-relational architecture outperforms temporal-only and relational-only approaches for team performance prediction, and that its multi-task extension substantially reduces training and inference time without loss of predictive performance. Finally, the integration of explainability techniques within the proposed architectures provides interpretable insights and actionable recommendations to support team improvement. These strengths make our approach particularly well-suited for human-centered artificial intelligence applications, such as intelligent decision-support systems in high-stakes collaborative environments.
Updated: 2026-05-05 17:59:26
Categories: cs.LG
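The tempo-relational idea above (relational aggregation within each interaction snapshot, temporal pooling across snapshots, and shared embeddings feeding per-construct heads) can be sketched in miniature. This is an illustrative toy, not the paper's architecture; the mean-aggregation rule and every function name are assumptions:

```python
# Illustrative toy, not the paper's architecture: mean-neighbor aggregation
# within each interaction snapshot, temporal pooling across snapshots, and
# shared embeddings feeding one linear head per team construct.

def snapshot_embed(features, edges):
    """One round of mean-neighbor aggregation on a single snapshot."""
    n = len(features)
    neigh = {i: [] for i in range(n)}
    for u, v in edges:                      # undirected interaction edges
        neigh[u].append(v)
        neigh[v].append(u)
    out = []
    for i in range(n):
        pool = [features[i]] + [features[j] for j in neigh[i]]
        out.append([sum(dim) / len(pool) for dim in zip(*pool)])
    return out

def tempo_relational_embed(snapshots):
    """Pool per-snapshot relational embeddings over time (simple mean)."""
    per_t = [snapshot_embed(f, e) for f, e in snapshots]
    n, d = len(per_t[0]), len(per_t[0][0])
    return [[sum(emb[i][k] for emb in per_t) / len(per_t) for k in range(d)]
            for i in range(n)]

def multi_task_heads(embedding, weights):
    """Shared embedding -> one scalar score per team construct."""
    return {task: sum(a * b for a, b in zip(embedding, w))
            for task, w in weights.items()}
```

The multi-task benefit in the abstract corresponds to the fact that `tempo_relational_embed` is computed once and reused by every head.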
A Closed-Form Adaptive-Landmark Kernel for Certified Point-Cloud and Graph Classification
We introduce PALACE (Persistence Adaptive-Landmark Analytic Classification Engine), the data-adaptive companion to PLACE, paying a small cross-validation tier on three knobs (budget, radii, bandwidth; $\leq 5$ choices each). A cover-theoretic core (Lebesgue-number criterion on the landmark cover) yields four closed-form guarantees. (i) A structural lower distortion bound $\lambda(\tau;\nu)$ on $\mathcal{D}_n$ under cross-diagram non-interference, with a $(D/L)^2$ budget reduction over the uniform grid when diagrams concentrate. (ii) Equal weights $w_k = K^{-1/2}$ maximizing $\lambda$, and farthest-point-sampling positions $2$-approximating the optimal $k$-center covering radius; both derived from training labels alone, no gradient training. (iii) A kernel-RKHS classification rate $O((k-1)\sqrt{K}/(\gamma\sqrt{m_{\min}}))$ with binary necessity threshold $m = \Omega(\sqrt{K}/\gamma)$ from a matching Le Cam lower bound, and a closed-form filtration-selection rule. The kernel-Mahalanobis margin $\hat{\rho}_{\mathrm{Mah}}$ is the strongest closed-form ranker across the chemical-graph pool (mean Spearman $\rho \approx +0.60$); the isotropic surrogate $\hat{\gamma}/\sqrt{K}$ admits a selection-consistency rate, and $\widehat{\lambda}$ from (i) provides an independent data-level signal (positive on COX2 and PTC). (iv) A per-prediction certificate, in non-asymptotic Pinelis and asymptotic Gaussian forms, with no calibration split. Empirically, PALACE is the strongest closed-form diagram-based method on Orbit5k ($91.3 \pm 1.0\%$, matching Persformer), leads every diagram-based competitor on COX2 and MUTAG, and is competitive on DHFR (within 1 pp of ECP). At $8\times$ domain inflation, adaptive placement maintains $94\%$ while the uniform grid collapses to chance ($25\%$ on 4-class data).
Updated: 2026-05-05 17:59:18
Categories: cs.LG,math.AT
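The farthest-point-sampling step in guarantee (ii) is the classical greedy that 2-approximates the optimal $k$-center covering radius. A minimal sketch (the points and the squared-Euclidean distance are illustrative; PALACE's diagrams live in a different space):

```python
# Farthest-point sampling (FPS): the classical greedy 2-approximation to the
# optimal k-center covering radius. Data and metric here are illustrative.

def farthest_point_sampling(points, k, start=0):
    """Greedy k-center: repeatedly add the point farthest from the chosen set."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    chosen = [start]
    # squared distance from every point to its nearest chosen landmark
    dist = [d2(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(dist[i], d2(points[i], points[nxt]))
                for i in range(len(points))]
    return chosen

def covering_radius(points, landmarks):
    """Max distance from any point to its nearest landmark."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(min(d2(p, points[l]) for l in landmarks) for p in points) ** 0.5
```

Note that, consistent with the abstract's "no gradient training", FPS needs only pairwise distances: no labels or optimization enter the landmark placement itself.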
Safety and accuracy follow different scaling laws in clinical large language models
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
Updated: 2026-05-05 17:57:19
Categories: cs.CL,cs.AI,cs.LG
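The option-level labels above (high-risk error, unsafe answer, evidence contradiction) suggest a straightforward aggregation into the safety metrics the abstract reports. The sketch below is a hypothetical illustration only; all field names and the confidence threshold are assumptions, not taken from RadSaFE-200:

```python
# Hypothetical illustration (field names and threshold are assumptions):
# given model answers with option-level safety labels and confidences,
# tally accuracy alongside high-risk error rate and "dangerous
# overconfidence" (a confident wrong answer on a high-risk option).

def safety_profile(records, conf_threshold=0.8):
    n = len(records)
    acc = sum(r["chosen"] == r["correct"] for r in records) / n
    high_risk = sum(r["chosen"] in r["high_risk_options"] for r in records) / n
    overconf = sum(
        r["chosen"] != r["correct"]
        and r["chosen"] in r["high_risk_options"]
        and r["confidence"] >= conf_threshold
        for r in records
    ) / n
    return {"accuracy": acc, "high_risk_error": high_risk,
            "dangerous_overconfidence": overconf}
```

The point of separating these tallies is exactly the paper's thesis: accuracy and the two safety rates can move independently across deployment conditions.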
OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach could be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications: scaling knowledge graph size for richer exploration, expanding the tool set size for broader functionality, and strict low-step filtering, we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (30B-sized agents with ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch trained with heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.
Updated: 2026-05-05 17:55:25
Categories: cs.AI,cs.CL
Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
We propose HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. Our method employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation. This latent representation is then decoded into a set of UV-parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high-resolution input views. We train and evaluate our model on an internal dataset with more than 10,000 subjects, which is an order of magnitude larger than existing multi-view human head datasets. HeadsUp achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization. We extensively analyze the scaling behavior of our model across identities, views, and model capacity, revealing practical insights for quality-compute trade-offs. Finally, we highlight the strength of our latent space by showcasing two downstream applications: generating novel 3D identities and animating the 3D heads with expression blendshapes.
Updated: 2026-05-05 17:55:01
Categories: cs.CV,cs.LG
Probabilistic-bit Guided CDCL for SAT Solving using Ising Consensus Assumptions
Boolean satisfiability (SAT) solvers are widely used in hardware verification, cryptanalysis, automatic test-pattern generation, and side-channel reasoning workflows. Modern conflict-driven clause-learning (CDCL) solvers are highly effective, but satisfiable instances may still require substantial conflict analysis and Boolean propagation before identifying productive regions of the search space. This paper studies a hybrid SAT-solving framework in which a probabilistic-bit (p-bit) Ising sampler proposes high-agreement literals that are passed to CDCL as temporary assumptions. The goal is not to replace CDCL, but to evaluate whether stochastic low-violation samples can reduce CDCL internal search effort while retaining correctness through CDCL fallback. On selected controlled-backbone random 3-SAT benchmarks, the hybrid method reduces median conflicts by 80.8-85.5% and median propagations by 80.2-84.6% relative to pure CDCL. The observed benefit is distribution-sensitive, suggesting that p-bit guidance is effective only for certain instance classes. We further report exploratory machine-learning gates that estimate when hybrid solving is likely to help. On the selected run, a random-forest gate retains 94.8% of hybrid wins, indicating that lightweight gating may help avoid unproductive hybrid calls.
Updated: 2026-05-05 17:54:42
Categories: cs.CR,cs.LO
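The hybrid loop above (sample low-violation assignments, then hand CDCL only the literals on which the samples strongly agree, as temporary assumptions) can be approximated with a plain Metropolis sampler standing in for the p-bit Ising hardware. Everything below is an illustrative sketch, not the paper's implementation:

```python
import math
import random

# Illustrative sketch (not p-bit hardware): a Metropolis sampler over
# single-bit flips favors low-violation assignments; literals on which
# independent chains agree become temporary CDCL assumptions, so
# correctness is preserved by the CDCL fallback.

def violated(clauses, assign):
    """Count clauses with no satisfied literal (DIMACS-style signed ints)."""
    return sum(all(assign[abs(l)] != (l > 0) for l in c) for c in clauses)

def sample_low_violation(clauses, n_vars, steps=2000, beta=3.0, rng=None):
    rng = rng or random.Random(0)
    assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
    cost = violated(clauses, assign)
    for _ in range(steps):
        v = rng.randint(1, n_vars)
        assign[v] = not assign[v]           # propose a single-bit flip
        new = violated(clauses, assign)
        if new > cost and rng.random() >= math.exp(-beta * (new - cost)):
            assign[v] = not assign[v]       # reject the uphill move
        else:
            cost = new
    return assign

def consensus_literals(clauses, n_vars, n_chains=8, agree=1.0):
    """Literals with cross-chain agreement >= `agree` become assumptions."""
    samples = [sample_low_violation(clauses, n_vars, rng=random.Random(s))
               for s in range(n_chains)]
    lits = []
    for v in range(1, n_vars + 1):
        frac = sum(s[v] for s in samples) / n_chains
        if frac >= agree:
            lits.append(v)
        elif 1.0 - frac >= agree:
            lits.append(-v)
    return lits
```

A CDCL solver would then be invoked with `consensus_literals(...)` as assumptions and rerun without them if the assumption set proves unsatisfiable.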
Learning a Stochastic Differential Equation Model of Tropical Cyclone Intensification from Reanalysis and Observational Data
Tropical cyclones are among the most consequential weather hazards, yet estimates of their risk are limited by the relatively short historical record. To extend these records, researchers often generate large ensembles of synthetic storms using simplified models of cyclone intensification. Developing such models, however, has traditionally required substantial theoretical effort. Here we explore whether equation-discovery methods, a class of data-driven techniques designed to infer governing equations, can accelerate the development of simplified intensification models. Using observational storm data (IBTrACS) together with environmental conditions from reanalysis (ERA5), we learn a compact stochastic differential equation describing tropical cyclone intensity evolution. We focus on TCs because their dynamics are well studied and a hierarchy of reduced-order models exists, enabling direct comparison of the learned model to physically derived counterparts. We find that the learned model simulates synthetic TCs whose intensification statistics and hazard estimates are consistent with observations and competitive with a leading physics-based TC intensification model. Our model also reproduces known nonlinear dynamical behavior of tropical cyclones, including a saddle-node bifurcation as inner-core ventilation is increased. This result shows that equation-discovery approaches, when applied directly to storm intensity, can recover not only realistic statistics but also physically meaningful dynamical structure. These findings highlight the potential for data-driven methods to complement existing theory and reduced-order models in the study of extreme weather.
Updated: 2026-05-05 17:48:08
Categories: cs.LG,math.DS,physics.ao-ph,stat.AP
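A learned intensity SDE of this kind would typically be simulated with an Euler-Maruyama scheme to generate the synthetic storm ensembles the abstract describes. In the sketch below, a logistic drift toward an environmental "potential intensity" is an illustrative stand-in for the discovered equation, not the paper's model:

```python
import math
import random

# Generic Euler-Maruyama integrator for a scalar intensity SDE
#   dV = f(V, env) dt + g(V, env) dW.
# The logistic drift and the env dict are illustrative stand-ins for the
# learned closed-form terms.

def euler_maruyama(v0, drift, diffusion, env, dt=0.25, n_steps=240, seed=0):
    rng = random.Random(seed)
    v, path = v0, [v0]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))          # Brownian increment
        v = v + drift(v, env) * dt + diffusion(v, env) * dw
        v = max(v, 0.0)                             # intensity is non-negative
        path.append(v)
    return path

def logistic_drift(v, env):
    """Illustrative: relax toward the environment's potential intensity."""
    return 0.05 * v * (1.0 - v / env["v_potential"])

def zero_diffusion(v, env):
    """Replace with a positive function to generate stochastic ensembles."""
    return 0.0
```

Equation discovery in this setting amounts to fitting the functional forms of `drift` and `diffusion` from the IBTrACS/ERA5 pairs; the integrator itself stays generic.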
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
AI systems are entering critical domains like healthcare, finance, and defense, yet remain vulnerable to adversarial attacks. While AI red teaming is a primary defense, current approaches force operators into manual, library-specific workflows. Operators spend weeks hand-crafting workflows - assembling attacks, transforms, and scorers. When results fall short, workflows must be rebuilt. As a result, operators spend more time constructing workflows than probing targets for security and safety vulnerabilities. We introduce an AI red teaming agent built on the open-source Dreadnode SDK. The agent creates workflows grounded in 45+ adversarial attacks, 450+ transforms, and 130+ scorers. Operators can probe multi-agent systems, multilingual, and multimodal targets, focusing on what to probe rather than how to implement it. We make three contributions: 1. Agentic interface. Operators describe goals in natural language via the Dreadnode TUI (Terminal User Interface). The agent handles attack selection, transform composition, execution, and reporting, letting operators focus on red teaming. Weeks compress to hours. 2. Unified framework. A single framework for probing traditional ML models (adversarial examples) and generative AI systems (jailbreaks), removing the need for separate libraries. 3. Llama Scout case study. We red team Meta Llama Scout and achieve an 85% attack success rate with severity up to 1.0, using zero human-developed code.
Updated: 2026-05-05 17:43:52
Categories: cs.AI,cs.CR
RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering
Conversational generative AI is increasingly explored in healthcare, where models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio recordings captured with sensing devices offer a scalable route to screening and longitudinal monitoring, but heterogeneity is particularly acute: recordings vary across devices, environments, and acquisition protocols, and queries may vary in intent, answer format, and prediction objective. Existing biomedical audio-language question answering systems for respiratory assessment are starting to emerge, but they are typically built as single-path models, processing all inputs through the same acoustic and language pathway despite variation in recording conditions and query types. They are also usually evaluated in relatively limited settings, leaving open their robustness under realistic distribution shifts, including changes in acquisition domains, modality, and clinical task. To address this gap, we introduce RAMoEA-QA, the first respiratory audio question answering (RA QA) model designed to support input-dependent specialization across heterogeneous recordings and query types within a unified hierarchical two-stage framework. We study this design in a unified RA QA setting spanning clinical and self-recorded, multi-device acquisition settings, question formats, and both discrete and continuous targets. Across in-domain and controlled-shift evaluations, RAMoEA-QA improves over matched monolithic baselines and routing controls, reaching 0.72 in in-domain test accuracy (vs. 0.61 and 0.67 for single-path baselines) on discriminative tasks, while also achieving the best regression performance and stronger average transfer under dataset, modality, and task shifts, including gains of up to 23 percentage points in accuracy on the COPD modality-shift setting.
Updated: 2026-05-05 17:43:33
Categories: cs.SD,cs.AI
Conditional Diffusion Sampling
Sampling from unnormalized multimodal distributions with limited density evaluations remains a fundamental challenge in machine learning and natural sciences. Successful approaches construct a bridge between a tractable reference and the target distribution. Parallel Tempering (PT) serves as the gold standard, while recent diffusion-based approaches offer a continuous alternative at the cost of neural training. In this work, we introduce Conditional Diffusion Sampling (CDS), a framework that combines these two paradigms. To this end, we derive Conditional Interpolants, a class of stochastic processes whose transport dynamics are governed by an exact, closed-form stochastic differential equation (SDE), requiring no neural approximation. Although these dynamics require sampling from a non-trivial initialization distribution, we show both theoretically and empirically that the cost of this initialization diminishes for sufficiently short diffusion times. CDS leverages this by a two-stage procedure: (1) PT is used to efficiently sample the initial distribution, and then (2) samples are transported via the transport SDE. This combination couples the robust global exploration of PT with efficient local transport. Experiments suggest that CDS has the potential to achieve a superior trade-off between sample quality and density evaluation cost compared to state-of-the-art samplers.
Updated: 2026-05-05 17:36:29
Categories: stat.ML,cs.LG
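Stage (1) of CDS relies on Parallel Tempering. A minimal PT step (replica-local Metropolis moves followed by neighbor swap attempts with the standard acceptance rule) on a toy double-well energy looks like this; the energy and temperature ladder are chosen purely for illustration:

```python
import math
import random

# Minimal Parallel Tempering step: each replica makes a local Metropolis
# move at its own inverse temperature beta, then neighboring replicas
# attempt a state swap with acceptance
#   min(1, exp((beta_i - beta_j) * (E(x_i) - E(x_j)))).
# The 1-D double-well energy is a toy multimodal target.

def energy(x):
    return (x * x - 1.0) ** 2               # bimodal, modes at x = +/-1

def pt_step(states, betas, rng, step_size=0.5):
    states = list(states)
    for i, beta in enumerate(betas):         # within-replica Metropolis
        prop = states[i] + rng.gauss(0.0, step_size)
        delta = energy(prop) - energy(states[i])
        if rng.random() < math.exp(min(0.0, -beta * delta)):
            states[i] = prop
    for i in range(len(betas) - 1):          # neighbor swap attempts
        log_a = (betas[i] - betas[i + 1]) * (energy(states[i]) - energy(states[i + 1]))
        if rng.random() < math.exp(min(0.0, log_a)):
            states[i], states[i + 1] = states[i + 1], states[i]
    return states
```

In CDS, the coldest replica's samples would initialize the transport SDE, which then moves them to the target over a short diffusion time.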
SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment
Language models excel at diagnostic assessments on curated medical case-studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.47, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies which conduct a dedicated symptom interview that elicit additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.
Updated: 2026-05-05 17:36:12
Categories: cs.AI
Enhanced 3D Brain Tumor Segmentation Using Assorted Precision Training
A brain tumor is a medical disorder that affects individuals of all demographics. Medically, it is described as the growth of non-essential cells near or throughout the brain. Symptoms of this ailment include headaches, seizures, and sensory changes. This research considers the two main categories of brain tumors: benign and malignant. Benign tumors grow slowly, while malignant tumors grow aggressively, making them dangerous. Early identification of brain tumors is crucial for patient survival. This research provides a state-of-the-art approach to the early identification of tumors within the brain. We implemented the SegResNet architecture, a widely adopted architecture for three-dimensional segmentation, and trained it using the automatic multi-precision method. We used the Dice loss function and the Dice metric to evaluate the model, obtaining an overall Dice score of 0.84, with 0.84 for the tumor core, 0.90 for the whole tumor, and 0.79 for the enhanced tumor.
Updated: 2026-05-05 17:30:17
Categories: cs.CV,cs.LG
Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decision Support in Manufacturing
High-precision CNC machining of free-form aerospace components requires bounded compensations informed by inspection, simulation, and process knowledge. Off-the-shelf large language model (LLM) assistants can generate text, but they do not reliably execute risk-constrained multi-step numerical workflows or provide auditable provenance for high-stakes decisions. We present multi-agent knowledge analysis (MAKA), a human-in-the-loop decision-support architecture that separates intent routing, tools-only quantitative analysis, knowledge graph retrieval, and critic-based verification that enforces physical plausibility, safety bounds, and provenance completeness before recommendations are surfaced for human approval. MAKA is instantiated on a Ti-6Al-4V rotor blade machining testbed by fusing virtual-machining path-tracking error fields, cutting-force and deflection simulations, and scan-based 3D inspection deviation maps from 16 blades. The analysis decomposes deviation into an evidence-linked pathing component, a drift-based wear proxy capturing systematic evolution across parts, a residual systematic compliance term, and a variability proxy for instability-aware escalation. In a three-level tool-orchestration benchmark (single-step through $\geq$3-step stateful sequences), MAKA improves successful tool execution by up to 87.5 percentage points relative to an unstructured single-model interaction pattern with identical tool access. Digital twin what-if studies show MAKA can coordinate traceable compensation candidates that reduce predicted surface deviation from order $10^{-2}$ in to approximately $\pm 10^{-3}$ in over most of the blade within the simulation environment, providing a pre-deployment verification signal for risk-aware human decision-making.
Updated: 2026-05-05 17:24:53
Categories: cs.MA,cs.AI,cs.IR
Autonomous Reliability Qualification of Ga$_2$O$_3$-based Hydrogen and Temperature Sensors via Safe Active Learning
We present a Safe Active Learning (SAL) framework for autonomous reliability characterization of rectifying Ga$_2$O$_3$-based devices under coupled thermal and hydrogen stress. SAL treats rectification as a device-physics-motivated safety observable and models its evolution over elapsed time, temperature, and H$_2$ concentration using a Gaussian-process surrogate. To handle condition-dependent and uncertain experiment durations, the method combines an adaptive completion-time window, time-window lower-confidence-bound safety checks, a trust region anchored to previously verified safe conditions, and a two-phase strategy that transitions from conservative safe exploration to progressively relaxed rectification targets as the device degrades. We first evaluate SAL in simulation, where it safely expands the explored region while learning the evolving rectification surface. We then demonstrate SAL experimentally on an automated high-temperature probe-station platform using a Pt/Cr$_2$O$_3$:Mg/$\beta$-Ga$_2$O$_3$ device. In the reported campaign, phase 1 incurred only one unsafe measurement associated with spurious current-voltage sweeps, while phase 2 intentionally probed lower-rectification regimes. Finally, we use the curated SAL dataset for offline long-horizon forecasting of device response at a target voltage using a structured Gaussian-process model with a condition-dependent Kohlrausch--Williams--Watts mean and a residual covariance kernel. The model captures long-time, saturating degradation trends in an auxiliary validation dataset, illustrating how safety-aware autonomous experimentation enables both conservative characterization and subsequent degradation modeling. Although demonstrated here for a rectifying Ga$_2$O$_3$ device, SAL is applicable to other systems where a measurable in situ safety observable can be defined.
Updated: 2026-05-05 17:23:12
Categories: physics.app-ph,cond-mat.mtrl-sci,cs.LG,eess.SY
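The abstract's safety logic (GP surrogate, time-window lower-confidence-bound check, trust region around verified-safe conditions) can be sketched as follows. This is a minimal illustration under assumed choices: an RBF kernel, a (temperature, H$_2$) condition vector, and illustrative thresholds, none of which are taken from the paper.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(Xtr, ytr, Xte, noise=1e-4):
    # Standard GP regression posterior with a unit-variance RBF kernel
    K = rbf(Xtr, Xtr) + noise * np.eye(len(Xtr))
    Ks = rbf(Xtr, Xte)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - (v ** 2).sum(0)  # k(x, x) = 1 for this kernel
    return mu, np.sqrt(np.maximum(var, 1e-12))

def is_safe(cond, window_times, Xtr, ytr, safe_conds,
            r_min=2.0, beta=2.0, trust_radius=1.5):
    # Trust region: candidate (T, H2) must lie near a verified-safe condition
    if np.linalg.norm(safe_conds - cond, axis=1).min() > trust_radius:
        return False
    # LCB check: predicted rectification minus beta*sigma must clear r_min
    # at every elapsed time in the adaptive completion-time window
    Xte = np.array([[t, cond[0], cond[1]] for t in window_times])
    mu, sd = gp_posterior(Xtr, ytr, Xte)
    return bool((mu - beta * sd >= r_min).all())
```

A candidate condition is accepted only when both checks pass; the two-phase relaxation in the paper would correspond to lowering `r_min` over the campaign.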
An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration
Retrieval-augmented generation systems often assume that one fixed retrieval pipeline is sufficient across heterogeneous tasks, yet factoid question answering, multi-hop reasoning, and scientific verification exhibit different retrieval preferences. We present Experience-RAG Skill, an agent-oriented pluggable retrieval orchestration layer positioned between the agent and the retriever pool. The proposed skill analyzes the current scene, consults an experience memory, selects an appropriate retrieval strategy, and returns structured evidence to the agent. Under a fixed candidate pool, Experience-RAG Skill achieves an overall nDCG@10 of 0.8924 on BeIR/nq, BeIR/hotpotqa, and BeIR/scifact, outperforming fixed single-retriever baselines and remaining competitive with Adaptive-RAG-style routing. The results suggest that retrieval strategy selection can be productively encapsulated as a reusable agent skill rather than being hard-coded in the upper workflow.
Updated: 2026-05-05 17:10:25
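The scene-analysis / experience-memory / strategy-selection loop described above can be sketched as a small class. The scene labels, keyword heuristics, and score bookkeeping here are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

class ExperienceRAGSkill:
    """Toy sketch: scene features -> experience lookup -> retrieval strategy.
    `classify_scene` and the strategy names are hypothetical stand-ins."""

    def __init__(self, strategies, default="hybrid"):
        self.strategies = strategies  # name -> retriever callable
        self.default = default
        # scene -> strategy -> list of past quality scores (e.g. nDCG@10)
        self.memory = defaultdict(lambda: defaultdict(list))

    def classify_scene(self, query):
        q = query.lower()
        if any(w in q for w in ("prove", "evidence", "claim")):
            return "scientific-verification"
        if " and " in q or "compare" in q:
            return "multi-hop"
        return "factoid"

    def record(self, scene, strategy, score):
        self.memory[scene][strategy].append(score)

    def choose(self, scene):
        scores = self.memory[scene]
        if not scores:
            return self.default
        return max(scores, key=lambda s: sum(scores[s]) / len(scores[s]))

    def retrieve(self, query):
        scene = self.classify_scene(query)
        strategy = self.choose(scene)
        return {"scene": scene, "strategy": strategy,
                "evidence": self.strategies[strategy](query)}
```

The key design point the abstract argues for is that this layer sits between the agent and the retriever pool, so the agent's workflow never hard-codes a retriever.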
Categories: cs.AI
From Intent to Execution: Composing Agentic Workflows with Agent Recommendation
Multi-Agent Systems (MAS) built using AI agents fulfill a variety of user intents and may be used to design and build a family of related applications. However, the creation of such MAS currently involves manual composition of the plan, manual selection of appropriate agents, and manual creation of execution graphs. This paper introduces a framework for the automated creation of multi-agent systems that replaces these manual steps with an automated pipeline. The proposed framework consists of software modules and a workflow to orchestrate the requisite task-specific application. The modules include: an LLM-derived planner, a set of tasks described in natural language, a dynamic call graph, an orchestrator that maps agents to tasks, and an agent recommender that finds the most suitable agent(s) from local and global agent registries. The agent recommender uses a two-stage information retrieval (IR) system comprising a fast retriever and an LLM-based re-ranker. We implemented a series of experiments exploring the choice of embedders, re-rankers, agent description enrichment, and a supervising critique agent. We benchmarked this system end-to-end, evaluating the combination of planning, agent selection, and task completion with our proposed approach. Our experimental results show that our approach outperforms the state-of-the-art in terms of recall rate and is more robust and scalable than previous approaches. The critique agent holistically reevaluates both agent and tool recommendations against the overall plan. We show that the inclusion of the critique agent further enhances the recall score, demonstrating that comprehensive review and revision of task-based agent selection is an essential step in building end-to-end multi-agent systems.
Updated: 2026-05-05 17:08:26
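The two-stage recommender (fast retriever, then re-ranker) can be sketched as below. The `embed` and `rerank_score` callables are hypothetical stand-ins for the embedder and the LLM-based re-ranker, and the pool sizes are illustrative.

```python
import numpy as np

def recommend_agents(task, agents, embed, rerank_score, k_fast=20, k_final=3):
    """Two-stage agent recommendation sketch.

    Stage 1: cosine similarity between the task embedding and each agent
    description narrows the registry to a shortlist.
    Stage 2: a (stand-in for an LLM) scoring function reorders the shortlist.
    """
    names = list(agents)
    E = np.stack([embed(agents[n]) for n in names])
    q = embed(task)
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
    shortlist = [names[i] for i in np.argsort(-sims)[:k_fast]]
    reranked = sorted(shortlist, key=lambda n: -rerank_score(task, agents[n]))
    return reranked[:k_final]
```

A critique agent, as in the paper, would sit after this call and re-evaluate the returned shortlist against the overall plan.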
Categories: cs.AI
Flow Sampling: Learning to Sample from Unnormalized Densities via Denoising Conditional Processes
Sampling from unnormalized densities is analogous to the generative modeling problem, but the target distribution is defined by a known energy function instead of data samples. Because evaluating the energy function is often costly, a primary challenge is to learn an efficient sampler. We introduce Flow Sampling, a framework built on diffusion models and flow matching for the data-free setting. Our training objective is conditioned on a noise sample and regresses onto a denoising diffusion drift constructed from the energy function. In contrast, diffusion models' objective is conditioned on a data sample and regresses onto a noising diffusion drift. We utilize the interpolant process to minimize the number of energy function evaluations during training, resulting in an efficient and scalable method for sampling unnormalized densities. Furthermore, our formulation naturally extends to Riemannian manifolds, enabling diffusion-based sampling in geometries beyond Euclidean space. We derive a closed-form formula for the conditional drift on constant curvature manifolds, including hyperspheres and hyperbolic spaces. We evaluate Flow Sampling on synthetic energy benchmarks, small peptides, large-scale amortized molecular conformer generation, and distributions supported on the sphere, demonstrating strong empirical performance.
Updated: 2026-05-05 17:07:37
Categories: cs.LG,cs.AI
Power-Softmax: Towards Secure LLM Inference over Encrypted Data
Modern cryptographic methods for implementing privacy-preserving LLMs such as \gls{HE} require the LLMs to have a polynomial form. Forming such a representation is challenging because transformers include non-polynomial components, such as \Softmax and layer normalization. Previous approaches have either directly approximated pre-trained models with large-degree polynomials, which are less efficient over HE, or replaced non-polynomial components with easier-to-approximate primitives before training, e.g., \Softmax with pointwise attention. The latter approach might introduce scalability challenges. We present a new HE-friendly variant of self-attention that offers a stable form for training and is easy to approximate with polynomials for secure inference. Our work introduces the first polynomial LLMs over a billion parameters, exceeding the size of previous models by more than tenfold. The resulting models demonstrate reasoning and in-context learning (ICL) capabilities comparable to standard transformers of the same size, representing a breakthrough in the field. Finally, we provide a detailed latency breakdown for each computation over encrypted data, paving the way for further optimization, and explore the differences in inductive bias between models relying on our HE-friendly variant and standard transformers.
Updated: 2026-05-05 16:59:31
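The abstract does not give the exact Power-Softmax formula, but a plausible HE-friendly polynomial stand-in for softmax replaces the exponential with an even power, $w_i = s_i^{2p} / \sum_j s_j^{2p}$, which keeps weights nonnegative and sums to one while remaining a ratio of polynomials. The sketch below is an illustrative guess at such a variant, not the paper's definition.

```python
import numpy as np

def power_softmax(scores, p=2, eps=1e-6):
    """Polynomial stand-in for softmax: w_i = s_i^(2p) / sum_j s_j^(2p).
    Even powers keep weights nonnegative without exp; the exact form used
    by Power-Softmax may differ."""
    z = scores ** (2 * p)
    return z / (z.sum(axis=-1, keepdims=True) + eps)

def power_attention(Q, K, V, p=2):
    # Self-attention with the polynomial weighting in place of softmax
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return power_softmax(scores, p) @ V
```

Because every operation here is a polynomial (up to the final division, which HE pipelines typically approximate or defer), the whole attention block becomes amenable to evaluation over encrypted data.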
Categories: cs.LG,cs.CR
LIPPEN: A Lightweight In-Place Pointer Encryption Architecture for Pointer Integrity
Memory-safety violations in C and C++ programs continue to enable sophisticated exploitation techniques such as control-flow hijacking and data-oriented attacks. Existing hardware defenses either rely on address space layout randomization (ASLR) or attach explicit metadata to pointers to verify their integrity. External metadata schemes provide strong guarantees, but incur additional memory accesses and memory footprint overhead. In-place authentication mechanisms, such as ARM Pointer Authentication (PAC), achieve low overhead at the cost of limited entropy and susceptibility to brute-force and reuse attacks. This paper presents LIPPEN, a hardware-software co-design for full-pointer encryption that provides strong pointer integrity and confidentiality with zero metadata overhead. LIPPEN treats every pointer as an encrypted block, cryptographically binding it to its execution context and decrypting it transparently at dereference time. By re-purposing the entire 64-bit pointer field for encryption rather than preserving raw address bits, LIPPEN maximizes entropy, eliminates the brute-force weaknesses of truncated authentication codes, and maintains binary compatibility with existing PAC-enabled software. We prototype LIPPEN on FPGA using 64-bit RISC-V Rocket and BOOM cores, and evaluate it with microbenchmarks, nbench, and SPEC CPU2017. We compare against both an in-house RISC-V PAC implementation and Apple's PAC on the M1 processor. Across these workloads, LIPPEN provides comprehensive pointer protection with runtime overhead comparable to PAC-based schemes, while incurring negligible area and power overhead. These results show that LIPPEN is a practical design point for deploying strong pointer protection in real processors.
Updated: 2026-05-05 16:58:17
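The core idea (treat the full 64-bit pointer as a cipher block, bound to an execution context, decrypted at dereference time) can be illustrated with a toy Feistel cipher. This is purely pedagogical: real LIPPEN uses a hardware block cipher, and SHA-256-based round functions here are only a convenient stand-in.

```python
import hashlib

MASK32 = 0xFFFFFFFF

def _f(half, key, ctx, rnd):
    # Toy round function keyed by the secret key and the execution context
    data = half.to_bytes(4, "little") + key + ctx.to_bytes(8, "little") + bytes([rnd])
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "little")

def encrypt_ptr(ptr, key, ctx, rounds=4):
    """Encrypt an entire 64-bit pointer as one Feistel block, cryptographically
    binding it to its context (e.g. a salted SP or type id)."""
    L, R = ptr >> 32, ptr & MASK32
    for r in range(rounds):
        L, R = R, L ^ _f(R, key, ctx, r)
    return (L << 32) | R

def decrypt_ptr(enc, key, ctx, rounds=4):
    """Transparent decryption at dereference time; wrong key or context
    yields a garbled (and thus trapping) address."""
    L, R = enc >> 32, enc & MASK32
    for r in reversed(range(rounds)):
        L, R = R ^ _f(L, key, ctx, r), L
    return (L << 32) | R
```

Using all 64 bits as ciphertext, rather than preserving raw address bits with a truncated authentication code, is what gives the scheme its full-entropy property.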
Categories: cs.CR,cs.AR
Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators
AI-generated text is nowadays produced at scale across domains and heterogeneous generation pipelines, making robustness to distribution shift a central requirement for supervised binary detectors. We train transformer-based detectors on HC3 PLUS and calibrate a single decision threshold by maximising balanced accuracy on held-out validation; this threshold is then kept fixed for all downstream test distributions, revealing domain- and generator-dependent error asymmetries under shift. We evaluate in-domain on HC3 PLUS, under cross-dataset transfer to the multi-domain, multi-generator M4 benchmark, and on the external AI-Text-Detection-Pile. Although base models achieve near-ceiling in-domain performance (up to 99.5% balanced accuracy), performance under shift is brittle and strongly model-dependent. Feature augmentation via attention-based linguistic feature fusion improves transfer, with our best model (DeBERTa-v3-base+FeatAttn) achieving 85.9% balanced accuracy on M4. Multi-seed experiments confirm high stability. Under the same fixed-threshold protocol, our model outperforms strong zero-shot baselines by up to +7.22 points. Category-level ablations further show that readability and vocabulary features contribute most to robustness under shift. Overall, these results demonstrate that feature augmentation and a modern DeBERTa backbone significantly outperform earlier BERT/RoBERTa models, while the fixed-threshold protocol provides a more realistic and informative assessment of practical detector robustness.
Updated: 2026-05-05 16:52:26
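The fixed-threshold protocol above amounts to a one-time calibration step: pick the threshold maximizing balanced accuracy on held-out validation, then freeze it for every downstream test distribution. A minimal sketch:

```python
import numpy as np

def calibrate_threshold(scores, labels):
    """Pick the single decision threshold that maximizes balanced accuracy
    (mean of TPR and TNR) on held-out validation scores; the caller then
    keeps it fixed under distribution shift."""
    best_t, best_ba = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        tnr = (~pred & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        ba = 0.5 * (tpr + tnr)
        if ba > best_ba:
            best_t, best_ba = t, ba
    return best_t, best_ba
```

The error asymmetries the paper reports arise precisely because this threshold is not re-tuned per test domain.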
Categories: cs.CL,cs.AI
Label-Efficient School Detection from Aerial Imagery via Weakly Supervised Pretraining and Fine-Tuning
Accurate school detection is essential for supporting education initiatives, including infrastructure planning and expanding internet connectivity to underserved areas. However, many regions around the world face challenges due to outdated, incomplete, or unavailable official records. Manual mapping efforts, while valuable, are labor-intensive and lack scalability across large geographic areas. To address this, we propose a weakly supervised framework for school detection from aerial imagery that minimizes the need for human annotations while supporting global mapping efforts. Our method is specifically designed for low-data regimes, where manual annotations are extremely scarce. We introduce an automatic labeling pipeline that leverages sparse location points and semantic segmentation to generate infrastructure masks, from which we derive bounding boxes. Using these automatically labeled images, we train our detectors in a first stage to learn a representation of what schools look like; then, using a small set of manually labeled images, we fine-tune the previously trained models on this clean dataset. This two-stage training pipeline enables large-scale, robust detection of school infrastructure in low-data settings with minimal supervision. Our results demonstrate strong object detection performance, particularly in the low-data regime, where the models achieve promising results using only 50 manually labeled images, significantly reducing the need for costly annotations. This framework supports education and connectivity initiatives worldwide by providing an efficient and extensible approach to mapping schools from space. All models, training code, and auto-labeled data will be publicly released to foster future research and real-world impact.
Updated: 2026-05-05 16:51:28
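The mask-to-box step of the auto-labeling pipeline can be sketched with a plain connected-components pass over a binary infrastructure mask. The 4-connectivity and minimum-area filter are illustrative choices, not taken from the paper.

```python
import numpy as np
from collections import deque

def masks_to_boxes(mask, min_area=4):
    """Turn a binary infrastructure mask (e.g. from a segmentation model run
    around sparse school location points) into bounding boxes via BFS
    connected components. Returns (x0, y0, x1, y1) tuples, inclusive."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                q, comp = deque([(sy, sx)]), []
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_area:  # drop speckle noise
                    ys, xs = zip(*comp)
                    boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

These boxes would then serve as the weak labels for the first detector training stage.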
Categories: cs.CV,cs.AI,cs.LG
Pretrained Model Representations as Acquisition Signals for Active Learning of MLIPs
Training machine learning interatomic potentials (MLIPs) for reactive chemistry is often bottlenecked by the high cost of quantum chemical labels and the scarcity of transition state configurations in candidate pools. Active learning (AL) can mitigate these costs, but its effectiveness hinges on the acquisition rule. We investigate whether the latent space of a pretrained MLIP already contains the information necessary for effective acquisition, eliminating the need for auxiliary uncertainty heads, Bayesian training and fine-tuning, or committee ensembles. We introduce two acquisition signals derived directly from a pretrained MACE potential: a finite-width neural tangent kernel (NTK) and an activation kernel built from hidden latent space features. On reactive-chemistry benchmarks, both kernels consistently outperform fixed-descriptor baselines, committee disagreement, and random acquisition, reducing the data required to reach performance targets by an average of 38% for energy error and 28% for force error. We further show that the pretrained model induces similarity spaces that preserve chemically meaningful structure and provide more reliable residual uncertainty estimates than randomly initialised or fixed-descriptor-based kernels. Our results suggest that pretraining aligns latent-space geometry with model error, yielding a practical and sufficient acquisition signal for reactive MLIP fine-tuning.
Updated: 2026-05-05 16:48:23
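An acquisition rule driven by a pretrained model's latent space can be sketched as greedy farthest-point selection under the activation kernel, i.e. repeatedly picking the pool configuration most distant (in hidden-feature space) from everything already labeled. The linear kernel and greedy rule here are a simplified illustration; the paper's NTK variant and exact criterion may differ.

```python
import numpy as np

def activation_kernel_acquire(feats_pool, feats_train, n_select):
    """Greedy acquisition sketch in a pretrained model's latent space:
    repeatedly pick the pool point with the largest minimum squared distance
    to the labeled set, using hidden features (e.g. from a pretrained MACE
    model) as the kernel feature map."""
    sel = []
    train = list(feats_train)
    for _ in range(n_select):
        dists = []
        for i, f in enumerate(feats_pool):
            if i in sel:
                dists.append(-np.inf)
                continue
            dists.append(min(np.sum((f - t) ** 2) for t in train))
        i_best = int(np.argmax(dists))
        sel.append(i_best)
        train.append(feats_pool[i_best])
    return sel
```

Because the features come from a pretrained potential, no auxiliary uncertainty head or committee ensemble is needed to run this loop.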
Categories: cs.LG,physics.chem-ph
Optimal Rates for Pure $\varepsilon$-Differentially Private Stochastic Convex Optimization with Heavy Tails
We study stochastic convex optimization (SCO) with heavy-tailed gradients under pure $\varepsilon$-differential privacy (DP). Instead of assuming a bound on the worst-case Lipschitz parameter of the loss, we assume only a bounded $k$-th moment. This assumption allows for unbounded, heavy-tailed stochastic gradient distributions, and can yield sharper excess risk bounds. Prior work characterized the minimax optimal rate for $ρ$-zero-concentrated DP SCO up to logarithmic factors in this setting, but the pure $\varepsilon$-DP case has remained open. We characterize the minimax optimal excess-risk rate for pure $\varepsilon$-DP heavy-tailed SCO up to logarithmic factors. Our algorithm achieves this rate in polynomial time with high probability. Moreover, it runs in deterministic polynomial time when the worst-case Lipschitz parameter is polynomially bounded. For important structured problem classes -- including hinge/ReLU-type and absolute-value losses on Euclidean balls, ellipsoids, and polytopes -- we achieve deterministic polynomial time even when the worst-case Lipschitz parameter is infinite. Our approach is based on a novel framework for privately optimizing Lipschitz extensions of the empirical loss. We complement our upper bound with a nearly matching high-probability lower bound.
Updated: 2026-05-05 16:42:49
Categories: cs.LG,cs.CR,stat.ML
Physically Guided Visual Mass Estimation from a Single RGB Image
Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume- and density-related) are predicted through separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.
Updated: 2026-05-05 16:41:11
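The physically structured forward pass (three modality features, an instance-adaptive gate, and separate volume- and density-related heads whose product gives mass) can be sketched as below. The dimensions, the softmax gate, and the log-space parameterization `mass = exp(log_v + log_rho)` are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    return rng.normal(0, 0.1, (d_in, d_out)), np.zeros(d_out)

class MassNet:
    """Sketch of the physically guided estimator: geometry, semantic, and
    appearance features are fused by an instance-adaptive gate, and two heads
    predict volume- and density-related latents whose product gives mass."""

    def __init__(self, d=8):
        self.Wg, self.bg = linear(3 * d, 3)  # one gate logit per modality
        self.Wv, self.bv = linear(d, 1)      # log-volume head
        self.Wd, self.bd = linear(d, 1)      # log-density head

    def forward(self, geo, sem, app):
        x = np.concatenate([geo, sem, app])
        g = np.exp(x @ self.Wg + self.bg)
        g /= g.sum()                          # instance-adaptive gate weights
        fused = g[0] * geo + g[1] * sem + g[2] * app
        log_v = (fused @ self.Wv + self.bv)[0]
        log_rho = (fused @ self.Wd + self.bd)[0]
        return np.exp(log_v + log_rho)        # mass = volume * density
```

Factoring the prediction this way is what lets mass-only supervision still shape two physically meaningful latents.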
Categories: cs.CV,cs.AI
Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software
Developers create modern software applications (Apps) on top of third-party libraries (Libs). When library vulnerabilities are reachable through application code, the applications can be vulnerable to software supply chain attacks. Prior work shows that developers often require concrete and executable evidence, i.e., proof-of-vulnerability (PoV) tests, to decide whether a reported dependency vulnerability poses a practical security risk to their application. However, manually crafting such tests is challenging, and existing tool support is insufficient to automate the procedure. To streamline test generation, we created PoVSmith -- a new approach that combines call path analysis, exemplar test, code context, and feedback into multiple prompts to guide a coding agent (i.e., Codex) and a large language model (i.e., GPT) for test generation, execution, and assessment. We evaluated PoVSmith on 33 $\langle$App, Lib$\rangle$ Java program pairs, where each App depends on a vulnerable Lib. PoVSmith revealed 158 unique application-level entry points (i.e., public methods) calling vulnerable library APIs; 152 (96%) of them were correctly found, together with the call paths properly recognized. With such method call information, PoVSmith generated 152 tests, 84 (55%) of which demonstrated feasible ways of attacking Apps by exploiting Lib vulnerabilities. PoVSmith substantially outperforms the state-of-the-art LLM-based approach, as it reduces human involvement while dramatically improving test quality. Our work contributes (1) a novel approach of agent-based test generation, (2) an iterative code refinement process driven by execution feedback, and (3) LLM-based quality assessment grounded in both the test context and execution logs.
Updated: 2026-05-05 16:39:29
Categories: cs.CR,cs.SE
Inconsistent Databases and Argumentation Frameworks with Collective Attacks
The connection between subset-maximal repairs for inconsistent databases involving various integrity constraints and acceptable sets of arguments within argumentation frameworks has recently drawn growing interest. In this paper, we contribute to this domain by establishing a new connection when integrity constraints (ICs) include denial constraints and local-as-view tuple-generating dependencies. It turns out that SET-based Argumentation Frameworks (SETAFs), an extension of Dung's argumentation frameworks (AFs) allowing collective attacks, are needed. It is known that subset-maximal repairs under denial constraints correspond to the naive extensions, which also coincide with the preferred and stable extensions in the resulting SETAFs. Our main findings establish that repairs under the considered fragment of tuple-generating dependencies correspond to the preferred extensions. Moreover, for these dependencies, additional preprocessing allows computing a unique extension that is stable and naive. Allowing both types of constraints breaks this relationship, and even the pre-processing does not help as only preferred semantics captures these repairs. Finally, while it is known that functional dependencies do not require set-based attacks, we prove the same regarding inclusion dependencies. Thus, one can translate inconsistent databases under these restricted classes of ICs to plain AFs with attacks only between arguments.
Updated: 2026-05-05 16:38:31
Categories: cs.DB,cs.AI
Transformers with Selective Access to Early Representations
Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low-level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest among these methods add static value residuals: learned mixing coefficients that expose the first-layer value projection V_1 uniformly across tokens and heads. More expressive dense or dynamic alternatives recover finer-grained access, but at higher memory cost and lower throughput. The usefulness of V_1 is unlikely to be constant across tokens, heads, and contexts; different positions plausibly require different amounts of access to early lexical or semantic information. We therefore treat early-representation reuse as a retrieval problem rather than a connectivity problem, and introduce Selective Access Transformer (SATFormer), which preserves the first-layer value pathway while controlling access with a context-dependent gate. Across models from 130M to 1.3B parameters, SATFormer consistently improves validation loss and zero-shot accuracy over the static value-residual and Transformer baselines. Its strongest gains appear on retrieval-intensive benchmarks, where it improves over static value residuals by approximately 1.5 average points, while maintaining throughput and memory usage close to the baseline Transformer. Gate analyses suggest sparse, depth-dependent, head-specific, and category-sensitive access patterns, supporting the interpretation that SATFormer learns selective reuse of early representations rather than uniform residual copying. Our code is available at https://github.com/SkyeGunasekaran/SATFormer.
Updated: 2026-05-05 16:38:29
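The contrast the abstract draws (static mixing coefficients versus a context-dependent gate on the first-layer value pathway) can be sketched in a few lines. The single-linear-projection sigmoid gate below is our assumed parameterization, not necessarily SATFormer's exact one.

```python
import numpy as np

def gated_value_residual(attn_out, V1, x, Wg, bg):
    """SATFormer-flavored sketch: rather than exposing the first-layer value
    projection V1 with a static coefficient, a context-dependent sigmoid gate
    decides per token how much of V1 to mix back into the attention output."""
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg + bg)))  # (tokens, 1), in (0, 1)
    return attn_out + gate * V1
```

A gate near zero recovers the plain Transformer; a gate near one recovers full value-residual copying, and the model can interpolate per token, head, and context.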
Categories: cs.LG,cs.CL
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states that emerge from sequenced compliance with innocuous-looking requests. We introduce MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance), a benchmark of 199 three-stage attack chains paired with deterministic exploit oracles on deployed software substrates (10 web-application substrates, 31 CWE classes, 5 programming languages) that treats both exploit ground truth and downstream reviewer protocol as first-class evaluation axes. On this benchmark, nine production coding agents from Anthropic, OpenAI, Google, Moonshot, Zhipu, and Minimax compose innocuous tickets at 53-86% end-to-end ASR with only two refusals across all staged runs. In a matched direct-prompt experiment over four frontier Claude/Codex agents, vulnerable-output rates fall to 0-20.4%: Claude primarily refuses, while Codex primarily hardens rather than emitting the vulnerable implementation - ticket staging silences both defense modes simultaneously. Downstream, code reviewer agents approve 25.8% of these confirmed-vulnerable cumulative diffs as routine PRs, and a full-context implementation protocol closes only 50% of the staged/direct gap, ruling out context fragmentation as the sole explanation. As a deployable but non-adaptive mitigation, reframing the reviewer as an adversarial pentester reduces evasion across the evaluated reviewer subset; pentester framed evasion ranges from 3.0% to 17.6%, and an open-weight Gemma-4-E4B-it reviewer under this framing detects 88.4% of attacks on the dataset with a 4.6% false-positive rate measured on 608 real-world GitHub PRs.
Updated: 2026-05-05 16:38:23
Categories: cs.CR,cs.AI,cs.SE
HELO Cryptography: A Lightweight Cryptographic System for Enhancing IoT Security in P2P Data Transmission
The recent surge in security concerns for IoT devices highlights the increasing threat of cryptographic vulnerabilities. These weaknesses can lead to unauthorized access, data breaches, and manipulation of device functions, compromising the privacy and security of both the devices and their users. Given the limited computational power of IoT devices, especially when handling large amounts of data, encrypting and transmitting data over insecure networks poses significant challenges. This situation not only heightens security risks and prolongs runtime, but also degrades performance and consumes more resources. To address these issues, a novel cryptographic system named HELO (Hybrid Encryption Lightweight Optimization) is proposed. HELO is a hybrid scheme that provides strong protection against cryptographic cyberattacks; the research objective is to raise the security level of IoT devices without degrading their performance. Its lightweight mechanism makes the system well suited to resource-constrained devices. Finally, it offers strong cryptographic security for IoT devices by guaranteeing confidentiality, integrity, and availability during P2P data transmission.
Updated: 2026-05-05 16:35:44
Categories: cs.CR
Integrating Feature Correlation in Differential Privacy with Applications in DP-ERM
Standard differential privacy imposes uniform privacy constraints across all features, overlooking the inherent distinction between sensitive and insensitive features in practice. In this paper, we introduce a relaxed definition of differential privacy that accounts for such heterogeneity, allowing certain features to be treated as insensitive even when correlated with sensitive ones. We propose a correlation-aware framework, $\textsf{CorrDP}$, which relaxes privacy for insensitive features while accounting for their correlations with sensitive features, with the correlations quantified using total variation distance. We design algorithms for differentially private empirical risk minimization (DP-ERM) under the $\textsf{CorrDP}$ framework, incorporating distance-dependent noise into gradients for improved theoretical utility guarantees. When the correlation distance is unknown, we estimate it from the dataset and show that it achieves a comparable privacy-utility guarantee. We perform experiments on synthetic and real-world datasets and show that $\textsf{CorrDP}$-based DP-ERM algorithms consistently outperform the standard DP framework in the presence of insensitive features.
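A heavily simplified sketch of distance-dependent gradient noise is below. The shrinkage rule, the direction of the TV-distance scaling, and all names are illustrative assumptions for exposition, not $\textsf{CorrDP}$'s calibrated mechanism:

```python
import numpy as np

def corr_aware_noisy_gradient(grad, sensitive_mask, tv_distance, sigma, rng):
    """Add per-coordinate Gaussian noise to a gradient vector.

    Illustrative sketch: sensitive coordinates receive the full noise
    scale sigma, while insensitive coordinates receive a scale modulated
    by their total-variation correlation distance to the sensitive
    features (here, proportionally). The exact scaling is an assumption,
    not the paper's formula.
    """
    scale = np.where(sensitive_mask, sigma, sigma * tv_distance)
    return grad + rng.normal(0.0, 1.0, size=grad.shape) * scale
```

With `tv_distance = 0` the insensitive coordinates are left noise-free while sensitive ones still receive full-scale noise, which is the qualitative behavior the relaxed definition permits.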
Updated: 2026-05-05 16:32:11
Categories: cs.LG,stat.ML
TabSurv: Adapting Modern Tabular Neural Networks to Survival Analysis
Survival analysis on tabular data is a well-studied problem. However, existing deep learning methods are often highly task-specific, which can limit the transfer of new approaches from other domains and introduce constraints that may affect performance. We propose TabSurv, an approach that adapts modern tabular architectures to survival analysis using either the Weibull distribution or non-parametric survival prediction. TabSurv optimizes SurvHL, a novel histogram loss function supporting censored data. In addition to a baseline feed-forward network, we implement deep ensembles of MLPs for survival analysis within TabSurv. In contrast to prior work, the ensemble components are trained in parallel, optimizing survival distribution parameters before averaging, which promotes diversity across ensemble component predictions. We perform a comprehensive empirical evaluation of different proposed architectures on 10 diverse real-world survival datasets. Our results show that TabSurv on average consistently outperforms established classical and deep learning baselines such as RSF, DeepSurv, DeepHit, and SurvTRACE. Notably, deep ensembles with Weibull parametrization instead of non-parametric models achieve the highest average rank by C-index. Overall, our study clarifies how modern tabular neural networks can be adapted and trained to tackle survival analysis problems, offering a strong and reliable approach. The TabSurv implementation is publicly available.
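The Weibull branch presumably maximizes the standard right-censored Weibull likelihood; a minimal version of that objective (not the paper's SurvHL histogram loss) looks like this:

```python
import numpy as np

def weibull_censored_loglik(t, event, k, lam):
    """Right-censored log-likelihood under a Weibull(k, lam) model.

    Events contribute the log-density log f(t); censored observations
    contribute the log-survival log S(t) = -(t/lam)**k. This is the
    standard parametric survival objective a Weibull-headed network
    would maximize.
    """
    z = (t / lam) ** k
    log_f = np.log(k / lam) + (k - 1.0) * np.log(t / lam) - z
    log_S = -z
    return float(np.sum(np.where(event, log_f, log_S)))
```

For `k = 1` the model reduces to the exponential distribution, so both event and censored terms collapse to `-t`, which makes the function easy to sanity-check by hand.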
Updated: 2026-05-05 16:32:10
Categories: cs.LG,cs.AI,stat.ML
Do Multimodal RAG Systems Leak Data? A Comprehensive Evaluation of Membership Inference and Image Caption Retrieval Attacks
The growing adoption of multimodal Retrieval-Augmented Generation (mRAG) pipelines for vision-centric tasks (e.g., visual QA) introduces important privacy challenges. In particular, while mRAG provides a practical capability to connect private datasets and improve model performance, it risks the leakage of private information from these datasets. In this paper, we perform an empirical study to analyze the privacy risks inherent in the mRAG pipeline observed through standard model prompting. Specifically, we implement a case study that attempts to determine whether a visual asset (e.g., image) is included in the mRAG, and, if present, to leak the metadata (e.g., caption) related to it. Our findings highlight the need for privacy-preserving mechanisms and motivate future research on mRAG privacy. Our code is published online: https://github.com/aliwister/mrag-attack-eval.
Updated: 2026-05-05 16:31:12
Categories: cs.CR,cs.AI
A Benchmark for Interactive World Models with a Unified Action Generation Framework
Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, reasoning, and action. Yet current research still lacks large-scale datasets and unified benchmarks to evaluate their physical interaction capabilities. To address this, we propose iWorld-Bench, a comprehensive benchmark for training and testing world models on interaction-related abilities such as distance perception and memory. We construct a diverse dataset with 330k video clips and select 2.1k high-quality samples covering varied perspectives, weather, and scenes. As existing world models differ in interaction modalities, we introduce an Action Generation Framework to unify evaluation and design six task types, generating 4.9k test samples. These tasks jointly assess model performance across visual generation, trajectory following, and memory. Evaluating 14 representative world models, we identify key limitations and provide insights for future research. The iWorld-Bench model leaderboard is publicly available at iWorld-Bench.com.
Updated: 2026-05-05 16:30:03
Categories: cs.CV,cs.AI
CLAPS: Aleatoric-Epistemic Scaling via Last-Layer Laplace for Conformal Regression
Conformal regression provides finite-sample marginal coverage, but it does not by itself determine how interval width should adapt across heterogeneous inputs. Existing locally adaptive methods mainly account for aleatoric noise, leaving uncertainty from weak training support less explicit. We propose Conformal Laplace-Aware Predictive Scaling (CLAPS), a split conformal regression method that uses heteroscedastic last-layer Laplace uncertainty as the local normalization scale. CLAPS combines learned input-dependent noise with last-layer epistemic uncertainty, while retaining validity through standard conformal calibration. We characterize this aleatoric--epistemic scale, derive its heteroscedastic last-layer precision, and show that it reduces to aleatoric local scaling as epistemic uncertainty contracts. Experiments show nominal-level coverage with competitive interval efficiency.
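The normalization scheme follows standard locally adaptive split conformal regression; a minimal sketch, with the heteroscedastic last-layer Laplace scale replaced by a caller-supplied array `s`:

```python
import numpy as np

def normalized_split_conformal(y_cal, mu_cal, s_cal, mu_test, s_test, alpha=0.1):
    """Split conformal intervals with a local normalization scale s.

    Scores are |y - mu| / s; the conformal quantile q of the
    calibration scores yields intervals mu +/- q * s, which keep
    finite-sample marginal coverage for any positive scale s --
    the property CLAPS exploits when plugging in its
    aleatoric-epistemic scale.
    """
    scores = np.abs(y_cal - mu_cal) / s_cal
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(scores, level, method="higher")
    return mu_test - q * s_test, mu_test + q * s_test
```

Because validity holds regardless of how `s` is produced, the quality of the scale only affects interval efficiency (adaptivity of width), not coverage.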
Updated: 2026-05-05 16:26:41
Categories: cs.LG,stat.ML
The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models
Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition, another repairs the definition, and the process repeats. Across 20 concepts and thousands of counterexample-repair cycles, we find that, although many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, the LM judge accepts roughly twice as many as humans do. Nonetheless, per-item validity judgments are moderately consistent across humans and between humans and the LM. We further find that extended iteration produces increasingly verbose definitions without improving accuracy. We also see that some concepts resist stable definitions in general. These findings suggest that while LMs can engage in philosophical reasoning, the counterexample-repair loop hits diminishing returns quickly and could be a fruitful test case for evaluating whether LMs can sustain high-level iterated philosophical reasoning.
Updated: 2026-05-05 16:26:30
Categories: cs.CL,cs.AI
Towards Open World Sound Event Detection
Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.
Updated: 2026-05-05 16:23:06
Categories: cs.SD,cs.AI
Magic-Informed Quantum Architecture Search
Nonstabilizerness, commonly referred to as magic, is a fundamental resource underpinning quantum advantage. In this paper, we propose a magic-informed quantum architecture search (QAS) technique that enables control over a quantum resource within the general framework of circuit design. Inspired by the AlphaGo approach, we tackle the problem with a Monte Carlo Tree Search technique equipped with a Graph Neural Network (GNN) that estimates the magic of candidate quantum circuits. The GNN model induces a magic-based bias that steers the search toward either high- or low-magic regimes, depending on the target objective. We benchmark the proposed magic-informed QAS technique on both the structured ground-state energy problem and on the more general quantum state approximation problem, spanning different sizes and target magic levels. Experimental results show that the proposed technique effectively influences the magic across the search tree and notably also on the resulting final circuit, even in regimes where the GNN operates on out-of-distribution instances. Although introducing a problem-agnostic magic bias could, in principle, constrain the search dynamics, we observe consistent improvements in solution quality across all problems tested.
Updated: 2026-05-05 16:20:46
Categories: quant-ph,cs.AI
PHALAR: Phasors for Learned Musical Audio Representations
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to $\approx 70\%$ over the state-of-the-art while requiring $<50\%$ of the parameters and a 7$\times$ training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
Updated: 2026-05-05 16:19:58
Categories: cs.SD,cs.AI,cs.LG,eess.SP
Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes
We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high computational cost, rendering them hard to implement, and also suffer from suboptimal dependence on $\log(1/\delta)$. We propose a randomized and computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method achieves asymptotic optimality in sample complexity, also in terms of posterior contraction rate, and runs in $O(S^2AH)$ per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid sub-optimal polynomial dependence on $\log(1/\delta)$. Our results provide both theoretical insights and practical tools for efficient policy identification in tabular MDPs.
Updated: 2026-05-05 16:16:57
Categories: cs.LG,stat.ML
Exact ReLU realization of tensor-product refinement iterates
We study scalar dyadic refinement operators on $\mathbb{R}^2$ of the form $(Vf)(x,y) = \sum_{(j,k) \in \mathbb{Z}^2} c_{j,k}\, f(2x-j,\, 2y-k)$, where only finitely many mask coefficients $c_{j,k}$ are nonzero. Under a fixed support-window hypothesis, we prove that for every compactly supported continuous piecewise linear seed $g\colon \mathbb{R}^2 \to \mathbb{R}$, the iterates $V^n g$ admit exact ReLU realizations of fixed width and depth $O(n)$. This gives a first genuinely two-dimensional extension of the exact realization theory for refinement cascades. Using the one-dimensional exact loop-controller framework, the proof transports the tensor-product residual dynamics exactly on the product of two polygonal loops and reduces the remaining seam ambiguity to a final readout and selector step. The matrix cascade is then handled by a fixed-depth recursive block, and general compactly supported continuous piecewise linear seeds are reduced to a finite decomposition together with exact clamped gluing on the support window. This identifies the tensor-product dyadic case as a natural first multivariate instance of the loop-controller method for refinement iterates.
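The operator in the opening sentence can be evaluated directly. The sketch below uses the tensor-product hat function as seed with the standard 1D hat mask (1/2, 1, 1/2), an illustrative refinable choice (so Vg = g exactly), not the paper's construction:

```python
def refine(f, mask):
    """One step of the dyadic refinement operator from the abstract:
    (Vf)(x, y) = sum_{(j,k)} c_{j,k} * f(2x - j, 2y - k),
    where mask maps (j, k) -> c_{j,k} with finite support.
    """
    def Vf(x, y):
        return sum(c * f(2 * x - j, 2 * y - k) for (j, k), c in mask.items())
    return Vf

def hat(u):
    """Compactly supported piecewise linear hat on [-1, 1]."""
    return max(0.0, 1.0 - abs(u))

def seed(x, y):
    """Tensor-product hat: a compactly supported CPWL seed on R^2."""
    return hat(x) * hat(y)

# The 1D hat satisfies hat(x) = 0.5*hat(2x+1) + hat(2x) + 0.5*hat(2x-1),
# so the tensor-product mask below makes the seed a fixed point of V.
mask1d = {-1: 0.5, 0: 1.0, 1: 0.5}
mask2d = {(j, k): cj * ck for j, cj in mask1d.items() for k, ck in mask1d.items()}
g = refine(seed, mask2d)
```

Since the hat is refinable under this mask, one refinement step reproduces the seed exactly, which gives an easy correctness check for the operator.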
Updated: 2026-05-05 16:12:32
Categories: math.CA,cs.LG
Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial
Question: Does atomic fact-checking, which decomposes AI treatment recommendations into individually verifiable claims linked to source guideline documents, increase clinician trust compared to traditional explainability approaches? Findings: In this randomized trial of 356 clinicians generating 7,476 trust ratings, atomic fact-checking produced a large effect on trust (Cohen's d = 0.94), increasing the proportion of clinicians expressing trust from 26.9% to 66.5%. Traditional transparency mechanisms showed a dose-response gradient of improvement over baseline (d = 0.25 to 0.50). Meaning: Decomposing AI recommendations into individually verifiable claims linked to source guidelines produces substantially higher clinician trust than traditional explainability approaches in high-stakes clinical decisions.
Updated: 2026-05-05 16:12:30
Categories: cs.CL,cs.AI
Ecologically-Constrained Task Arithmetic for Multi-Taxa Bioacoustic Classifiers Without Shared Data
Training data for bioacoustics is scattered across taxa, regions, and institutions. Centralizing it all is often infeasible. We show that independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data. We find that bioacoustic task vectors are near-orthogonal (cosine 0.01-0.09). Their separation aligns closely with spectral distribution distance, a gradient consistent with the acoustic niche hypothesis. This geometry makes simple averaging optimal while sign-conflict methods reduce accuracy by one to six percentage points. Composition also creates an asymmetric gap: species-rich groups lose accuracy relative to joint training while underrepresented taxa gain, a redistribution useful for equitable biodiversity monitoring. We verify linear mode connectivity across all taxonomic pairs, demonstrate zero-shot transfer to new regions, and identify domain negation as a boundary condition where composition fails. These results enable a collaborative paradigm for bioacoustics where institutions share only task vectors to assemble multi-taxa classifiers, preserving data privacy.
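Task-vector composition without shared data reduces to simple weight arithmetic. The sketch below uses toy dict-of-arrays "checkpoints" (names illustrative, not the paper's BEATs encoders):

```python
import numpy as np

def task_vector(theta_ft, theta_base):
    """Task vector: fine-tuned weights minus the shared base weights."""
    return {k: theta_ft[k] - theta_base[k] for k in theta_base}

def merge_by_averaging(theta_base, task_vectors):
    """Compose specialists by adding the mean task vector to the base --
    the simple-averaging rule the paper finds optimal when task
    vectors are near-orthogonal."""
    n = len(task_vectors)
    return {k: theta_base[k] + sum(tv[k] for tv in task_vectors) / n
            for k in theta_base}

def cosine(tv_a, tv_b):
    """Cosine similarity between two flattened task vectors."""
    a = np.concatenate([tv_a[k].ravel() for k in sorted(tv_a)])
    b = np.concatenate([tv_b[k].ravel() for k in sorted(tv_b)])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Only the task vectors need to be exchanged between institutions, so the raw recordings never leave their owners, which is the privacy property the abstract highlights.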
Updated: 2026-05-05 16:10:02
Categories: cs.SD,cs.LG
PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal
Surgical smoke severely degrades intraoperative video quality, obscuring anatomical structures and limiting surgical perception. Existing learning-based desmoking approaches rely on scarce paired supervision and deterministic restoration pipelines, making it difficult to perform exploration or reinforcement-driven refinement under real surgical conditions. We propose PhySe-RPO, a diffusion restoration framework optimized through Physics- and Semantics-Guided Relative Policy Optimization. The core idea is to transform deterministic restoration into a stochastic policy, enabling trajectory-level exploration and critic-free updates via group-relative optimization. A physics-guided reward imposes illumination and color consistency, while a visual-concept semantic reward learned from CLIP-based surgical concepts promotes smoke-free and anatomically coherent restoration. Together with a reference-free perceptual constraint, PhySe-RPO produces results that are physically consistent, semantically faithful, and clinically interpretable across synthetic and real robotic surgical datasets, providing a principled route to robust diffusion-based restoration under limited paired supervision.
Updated: 2026-05-05 16:08:42
Categories: cs.AI
SOC-ICNN: From Polyhedral to Conic Geometry for Learning Convex Surrogate Functions
Classical ReLU-based Input Convex Neural Networks (ICNNs) are equivalent to the optimal value functions of Linear Programming (LP). This intrinsic structural equivalence restricts their representational capacity to piecewise-linear polyhedral functions. To overcome this representational bottleneck, we propose the SOC-ICNN, an architecture that generalizes the underlying optimization class from LP to Second-Order Cone Programming (SOCP). By explicitly injecting positive semi-definite curvature and Euclidean norm-based conic primitives, our formulation introduces native smooth curvature into the representation while preserving a rigorous optimization-theoretic interpretation. We formally prove that SOC-ICNNs strictly expand the representational space of ReLU-ICNNs without increasing the asymptotic order of forward-pass complexity. Extensive experiments demonstrate that SOC-ICNN substantially improves function approximation, while delivering competitive downstream decision quality. The code is available at https://anonymous.4open.science/r/SOC-ICNN-4B18/.
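For context, the LP-equivalent ReLU ICNN baseline that SOC-ICNN generalizes can be written in a few lines. Convexity in the input follows from the nonnegativity constraint on the hidden-to-hidden weights; this is the standard construction, not code from the paper:

```python
import numpy as np

def icnn_forward(x, Wz_list, Wx_list, b_list):
    """Forward pass of a ReLU Input Convex Neural Network.

    Convexity in x holds because each hidden-to-hidden matrix Wz is
    elementwise nonnegative and ReLU is convex and nondecreasing:
    a nonnegative combination of convex functions plus an affine term,
    passed through ReLU, stays convex.
    """
    z = np.maximum(0.0, Wx_list[0] @ x + b_list[0])
    for Wz, Wx, b in zip(Wz_list, Wx_list[1:], b_list[1:]):
        assert np.all(Wz >= 0), "Wz must be nonnegative for convexity"
        z = np.maximum(0.0, Wz @ z + Wx @ x + b)
    return z
```

Because every such network is piecewise linear, its epigraph is polyhedral, which is exactly the representational ceiling the conic primitives of SOC-ICNN are designed to lift.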
Updated: 2026-05-05 16:08:12
Categories: cs.LG,math.OC,stat.ML
Steer Like the LLM: Activation Steering that Mimics Prompting
Large language models can be steered at inference time through prompting or activation interventions, but activation steering methods often underperform compared to prompt-based approaches. We propose a framework that formulates prompt steering as a form of activation steering and investigates whether distilling successful prompt steering behavior into simpler, interpretable models can close this gap. Our analysis reveals that popular activation steering methods are not faithful to the mechanics of prompt steering, which applies strong interventions on some tokens while barely affecting others. Based on these insights, we introduce Prompt Steering Replacement (PSR) models that estimate token-specific steering coefficients from the activations themselves and are trained to imitate prompt-based interventions. Experiments on three steering benchmarks across multiple language models show that PSR models outperform existing activation steering methods, especially when controlling for high-coherence completions, and also compare favorably to prompting on AxBench and persona steering.
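The contrast between uniform activation steering and the token-specific coefficients PSR models estimate can be sketched as follows; `coef_fn` is an illustrative stand-in for the learned coefficient model, not the paper's architecture:

```python
import numpy as np

def steer_uniform(H, v, alpha):
    """Classic activation steering: add alpha * v to every token's
    activation, regardless of content."""
    return H + alpha * v

def steer_tokenwise(H, v, coef_fn):
    """PSR-style steering sketch: a per-token coefficient computed
    from the activation itself scales the steering vector, so some
    tokens receive a strong intervention and others almost none,
    mirroring how prompt steering behaves."""
    alphas = np.array([coef_fn(h) for h in H])       # (seq_len,)
    return H + alphas[:, None] * v[None, :]
```

Here `H` is a (seq_len, hidden) activation matrix and `v` a steering direction; the only change from the uniform method is that the scalar `alpha` becomes a function of each token's activation.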
Updated: 2026-05-05 15:59:42
Categories: cs.CL,cs.AI,cs.LG
Adaptive Long-term Embedding with Denoising and Augmentation for Recommendation
The rapid growth of the internet has made personalized recommendation systems indispensable. Graph-based sequential recommendation systems, powered by Graph Neural Networks (GNNs), effectively capture complex user-item interactions but often face challenges such as noise and static representations. In this paper, we introduce the Adaptive Long-term Embedding with Denoising and Augmentation for Recommendation (ALDA4Rec) method, a novel model that constructs an item-item graph, filters noise through community detection, and enriches user-item interactions. Graph Convolutional Networks (GCNs) are then employed to learn short-term representations, while averaging, GRUs, and attention mechanisms are utilized to model long-term embeddings. An MLP-based adaptive weighting strategy is further incorporated to dynamically optimize long-term user preferences. Experiments conducted on four real-world datasets demonstrate that ALDA4Rec outperforms state-of-the-art baselines, delivering notable improvements in both accuracy and robustness. The source code is available at https://github.com/zahraakhlaghi/ALDA4Rec.
Updated: 2026-05-05 15:56:15
Categories: cs.IR,cs.AI,cs.NE
Graph Neural Networks in the Wilson Loop Representation of Abelian Lattice Gauge Theories
Local gauge structures play a central role in a wide range of condensed matter systems and synthetic quantum platforms, where they emerge as effective descriptions of strongly correlated phases and engineered dynamics. We introduce a gauge-invariant graph neural network (GNN) architecture for Abelian lattice gauge models, in which symmetry is enforced explicitly through local gauge-invariant inputs, such as Wilson loops, and preserved throughout message passing, eliminating redundant gauge degrees of freedom while retaining expressive power. We benchmark the approach on both $\mathbb{Z}_2$ and $\mathrm{U}(1)$ lattice gauge models, achieving accurate predictions of global observables and spatially resolved quantities despite the nonlocal correlations induced by gauge-matter coupling. We further demonstrate that the learned model serves as an efficient surrogate for semiclassical dynamics in $\mathrm{U}(1)$ quantum link models, enabling stable and scalable time evolution without repeated fermionic diagonalization, while faithfully reproducing both local dynamics and statistical correlations. These results establish gauge-invariant message passing as a compact and physically grounded framework for learning and simulating Abelian lattice gauge systems.
Updated: 2026-05-05 15:55:21
Categories: cond-mat.str-el,cs.LG,hep-lat,quant-ph
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
Frontier AI systems perform best in settings with clear, stable, and verifiable objectives, such as code generation, mathematical reasoning, games, and unit-test-driven tasks. They remain less reliable in open-ended settings, including scientific assistance, long-horizon agents, high-stakes advice, personalization, and tool use, where the relevant objective is ambiguous, context-dependent, delayed, or only partially observable. We argue that many such failures are not merely failures of scale or capability, but failures of objective selection: the system optimizes a locally visible signal while missing which objectives should govern the interaction. We formulate this problem as \emph{contextual multi-objective optimization}. In this setting, systems must consider multiple, context-dependent objectives, such as helpfulness, truthfulness, safety, privacy, calibration, non-manipulation, user preference, reversibility, and stakeholder impact, while determining which objectives are active, which are soft preferences, and which must function as hard or quasi-hard constraints. These examples are not intended as an exhaustive taxonomy: different domains and deployment settings may activate different objective dimensions and different conflict-resolution procedures. Our framework models AI behavior as a context-dependent choice rule over candidate actions, objective estimates, active constraints, stakeholders, uncertainty, and conflict-resolution procedures. We outline an implementation pathway based on decomposed objective representations, context-to-objective routing, hierarchical constraints, deliberative policy reasoning, controlled personalization, tool-use control, diagnostic evaluation, auditing, and post-deployment revision.
Updated: 2026-05-05 15:55:19
Categories: cs.AI
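The context-dependent choice rule the abstract describes can be sketched in a few lines: hard (or quasi-hard) constraints filter the candidate actions, and the surviving candidates are ranked by a weighted sum over the active soft objectives. This is an illustrative toy under assumed names (`choose`, the score dictionaries), not the paper's framework.

```python
def choose(candidates, hard_constraints, soft_weights):
    """Context-dependent choice rule (illustrative): discard any candidate
    violating an active hard constraint, then pick the feasible candidate
    with the best weighted sum over the active soft objectives."""
    feasible = [c for c in candidates
                if all(ok(c) for ok in hard_constraints)]
    if not feasible:
        return None  # escalate or refuse rather than violate a constraint
    return max(feasible,
               key=lambda c: sum(w * c["scores"][name]
                                 for name, w in soft_weights.items()))

candidates = [
    {"name": "blunt_answer",  "scores": {"helpfulness": 0.9, "safety": 0.2}},
    {"name": "hedged_answer", "scores": {"helpfulness": 0.7, "safety": 0.9}},
    {"name": "refusal",       "scores": {"helpfulness": 0.1, "safety": 1.0}},
]
# In a high-stakes context, safety acts as a hard (threshold) constraint
# while helpfulness remains a soft preference.
hard = [lambda c: c["scores"]["safety"] >= 0.5]
best = choose(candidates, hard, {"helpfulness": 1.0})
assert best["name"] == "hedged_answer"
```

Changing which constraints are active and how the soft objectives are weighted is exactly the "objective selection" step the paper argues current systems get wrong.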
A Quantitative Confirmation of the Currier Language Distinction
We present a unified quantitative analysis of the Currier A/B language distinction in the Voynich Manuscript, proceeding in two stages. First, we confirm that the distinction is genuine: a Beta-Binomial mixture model applied to character-pair substitution ratios across 185 folios, without access to Currier's labels, selects two components by BIC and predicts held-out folio labels at 89% accuracy. Second, we show that the A/B contrast is not primitive but is a low-resolution projection of a higher-dimensional generative system. Its dominant component is a discrete boolean "switch" set once per folio, governing the vowel following the digraphs ch and sh. A two-state binomial mixture achieves ΔAIC = 2,549 over a single-state model and assigns 195 of 197 folios unambiguously. This switch does not operate uniformly: word templates divide into fixed contexts and switchable contexts, with template identity accounting for 92% of variance.
Updated: 2026-05-05 15:54:58
Categories: cs.CR,cs.CL
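The model-selection step can be illustrated with simulated folio counts: fit a one-state binomial model and a crude two-state mixture, then compare BIC. This sketch uses hard-assignment EM on plain binomials, not the paper's Beta-Binomial estimator, and all data below are synthetic.

```python
import random
from math import comb, log

def binom_logpmf(k, n, p):
    p = min(max(p, 1e-9), 1 - 1e-9)
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

def one_state_ll(data):
    p = sum(k for k, n in data) / sum(n for k, n in data)
    return sum(binom_logpmf(k, n, p) for k, n in data)

def two_state_ll(data, iters=25):
    """Hard-assignment EM for a two-state binomial mixture (a crude sketch)."""
    ps = [0.25, 0.75]
    for _ in range(iters):
        groups = [[], []]
        for k, n in data:
            j = max((0, 1), key=lambda j: binom_logpmf(k, n, ps[j]))
            groups[j].append((k, n))
        ps = [sum(k for k, n in g) / sum(n for k, n in g) if g else ps[j]
              for j, g in enumerate(groups)]
    return sum(max(binom_logpmf(k, n, p) for p in ps) for k, n in data)

def bic(ll, n_params, n_obs):
    return n_params * log(n_obs) - 2 * ll

# Simulated "folios": 30 with a low switch rate, 30 with a high one.
rng = random.Random(1)
data = ([(sum(rng.random() < 0.2 for _ in range(40)), 40) for _ in range(30)]
        + [(sum(rng.random() < 0.8 for _ in range(40)), 40) for _ in range(30)])
bic1 = bic(one_state_ll(data), 1, len(data))
bic2 = bic(two_state_ll(data), 3, len(data))
assert bic2 < bic1  # BIC prefers two states, mirroring the A/B split
```

BIC selecting the two-state model here plays the same role as "selects two components by BIC" in the abstract: the extra parameters are paid for by a much better fit.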
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue
Reinforcement learning (RL) for large language models (LLMs) has shown strong performance in single-turn tasks, but extending it to multi-turn interaction remains challenging due to sparse rewards and poor per-turn credit assignment. In emotional support dialogues, responses shape future user states, so matched-state step-wise comparison is unavailable, while trajectory-level supervision is insufficient. We propose MICA (Multi-granularity Intertemporal Credit Assignment), a critic-free RL framework for multi-turn emotional support tasks. MICA derives both immediate and delayed credit from a shared potential function over the user's structured support state. Incremental Distance Reward measures the per-turn decrease in residual distance to the target state, while its Monte Carlo return captures delayed effects. After scope-specific normalization, the two signals form a mixed advantage for stable per-turn optimization without matched-state comparisons, rollout trees, or a learned critic. On EMPA, EQ-Bench, and EmoBench with Qwen2.5-7B-Instruct and Qwen3-8B/14B/32B, MICA consistently outperforms GRPO and REINFORCE++, achieving up to +43.2 on EMPA, while adding no rollout cost and remaining robust to reward judges. These results show that turn-aware credit assignment enables effective and practical multi-turn RL for interactive LLMs.
Updated: 2026-05-05 15:53:05
Categories: cs.CL,cs.AI
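The two credit signals can be sketched from a shared potential: the Incremental Distance Reward is the per-turn decrease in residual distance to the target state, and its Monte Carlo return captures delayed effects. The sketch below assumes a scalar distance per turn; in the paper the potential is defined over a structured support state, and the names here are hypothetical.

```python
def incremental_distance_reward(distances):
    """Per-turn reward r_t = Phi(s_t) - Phi(s_{t+1}), where Phi is the
    residual distance from the user's state to the target state."""
    return [distances[t] - distances[t + 1] for t in range(len(distances) - 1)]

def monte_carlo_returns(rewards, gamma=1.0):
    """Return G_t = sum_{k>=t} gamma^(k-t) * r_k, capturing delayed effects."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

# Residual distance to the target support state after each turn.
dist = [4.0, 3.0, 3.5, 1.0, 0.0]
r = incremental_distance_reward(dist)   # a turn can even be counterproductive
G = monte_carlo_returns(r)              # with gamma=1 this telescopes to dist[t] - dist[-1]
assert r == [1.0, -0.5, 2.5, 1.0]
assert G == [4.0, 3.0, 3.5, 1.0]
```

With gamma = 1 the return telescopes to the total remaining progress, which is why a single potential function can supply both immediate and delayed credit without a learned critic.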
Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary
As large language models evolve into tool-augmented agents, a central question remains unresolved: when is external tool use actually justified? Existing agent frameworks typically treat tools as ordinary actions and optimize for task success or reward, offering little principled distinction between epistemically necessary interaction and unnecessary delegation. This position paper argues that agents should invoke external tools only when epistemically necessary. Here, epistemic necessity means that a task cannot be completed reliably via the agent's internal reasoning over its current context, without any external interaction. We introduce the Theory of Agent (ToA), a framework that treats agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common agent failure modes (e.g., overthinking and overacting) arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. We further discuss implications for training, evaluation, and agent design, highlighting that unnecessary delegation not only causes inefficiency but can impede the development of internal reasoning capability. Our position provides a normative criterion for tool use that complements existing decision-theoretic models and is essential for building agents that are not only correct, but increasingly intelligent.
Updated: 2026-05-05 15:52:30
Categories: cs.AI
From Data Lifting to Continuous Risk Estimation: A Process-Aware Pipeline for Predictive Monitoring of Clinical Pathways
This paper presents a reproducible and process-aware pipeline for predictive monitoring of clinical pathways. The approach integrates data lifting, temporal reconstruction, event log construction, prefix-based representations, and predictive modeling to support continuous reasoning on partially observed patient trajectories, overcoming the limitations of traditional retrospective process mining. The framework is evaluated on COVID-19 clinical pathways using ICU admission as the prediction target, considering 4,479 patient cases and 46,804 prefixes. Predictive models are trained and evaluated using a case-level split, with 896 patients in the test set. Logistic Regression achieves the best performance (AUC 0.906, F1-score 0.835). A detailed prefix-based analysis shows that predictive performance improves progressively as new clinical events become available, with AUC increasing from 0.642 at early stages to 0.942 at later stages of the pathway. The results highlight two key findings: predictive signals emerge progressively along clinical pathways, and process-aware representations enable effective early risk estimation from evolving patient trajectories. Overall, the findings suggest that predictive monitoring in healthcare is best conceived as a continuous, dynamically aware process, in which risk estimates are progressively refined as the patient journey evolves.
Updated: 2026-05-05 15:51:43
Categories: cs.LG,cs.SE
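The prefix-based representation at the heart of the pipeline is simple to sketch: each patient case (an ordered event trace) is expanded into all of its prefixes, which are the units the predictive model is trained and queried on. The event names below are invented examples, not taken from the paper's log.

```python
def make_prefixes(trace, min_len=1):
    """Expand one case's event sequence into all prefixes, the unit on which
    a predictive monitor is trained and queried as the case unfolds."""
    return [trace[:i] for i in range(min_len, len(trace) + 1)]

case = ["admission", "lab_test", "oxygen_therapy", "icu_admission"]
prefixes = make_prefixes(case)
assert len(prefixes) == 4
assert prefixes[1] == ["admission", "lab_test"]

# At prediction time only the events observed so far are available:
partial = prefixes[2]
assert "icu_admission" not in partial
```

Expanding 4,479 cases into 46,804 prefixes (as in the paper) is what lets a single model produce risk estimates that are progressively refined as new clinical events arrive.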
Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic
Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications such as logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often conflated under the umbrella term of generalization. To sharpen this distinction, we investigate the logical generalization capabilities of LLMs using the syllogistic fragment as a benchmark for natural language reasoning. We extend classical syllogistic forms to construct more complex structures, yielding a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings on this non-trivial benchmark show that, while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. This disparity is not uniform, as a more detailed analysis reveals substantial variability in generalization performance across individual syllogistic types, ranging from near-perfect accuracy to significantly lower performance. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference, neural components accelerate processing, while symbolic reasoning guarantees completeness. Our experiments further show that high efficiency is preserved even when using relatively small neural components. Overall, our analysis provides both a rationale for hybrid neuro-symbolic approaches and evidence of their potential to address key generalization barriers in neural reasoning systems.
Updated: 2026-05-05 15:51:29
Categories: cs.CL,cs.LG,cs.LO
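The symbolic side of such a hybrid prover can be illustrated for one corner of the syllogistic fragment: chaining "All X are Y" premises (the Barbara form) by forward-closure. Recursiveness corresponds to how many rule applications a derivation needs. This toy checker is an assumption-laden sketch, not the paper's system.

```python
def closure(facts):
    """Transitive closure of 'All X are Y' statements: repeatedly apply
    the Barbara rule (All A are B, All B are C => All A are C) to a fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(facts):
            for (c, d) in list(facts):
                if b == c and (a, d) not in facts:
                    facts.add((a, d))
                    changed = True
    return facts

premises = {("greyhound", "dog"), ("dog", "mammal"), ("mammal", "animal")}
derived = closure(premises)
assert ("greyhound", "animal") in derived   # needs two chained rule applications
assert ("animal", "dog") not in derived     # the rule is not symmetric
```

A symbolic component like this is complete for what it covers, which is the property the abstract leans on when pairing it with faster but fallible neural components.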
Raising the Ceiling: Better Empirical Fixation Densities for Saliency Benchmarking
Empirical fixation densities, spatial distributions estimated from human eye-tracking data, are foundational to saliency benchmarking. They directly shape benchmark conclusions, leaderboard rankings, failure case analyses, and scientific claims about human visual behavior. Yet the standard estimation method, fixed-bandwidth isotropic Gaussian KDE, has gone essentially unchanged for decades. This matters now more than ever: as the field shifts toward sample-level evaluation (failure case analysis, inverse benchmarking, per-image model comparison), reliable per-image density estimates become critical. We propose a principled mixture model that combines an adaptive-bandwidth KDE based on Abramson's method, center bias and uniform components, and a state-of-the-art saliency model, to capture different spatial and semantic types of interobserver consistency, and optimize all parameters per image via leave-one-subject-out cross-validation. Our method yields substantially higher interobserver consistency estimates across multiple benchmarks, with median per-image gains of 5-15% in log-likelihood and up to 2 percentage points in AUC. For the most affected images -- precisely those most relevant to failure case analysis -- improvements exceed 25%. We leverage these improved estimates to identify and analyze remaining failure cases of state-of-the-art saliency models, demonstrating that significant headroom for model improvement remains. More broadly, our findings highlight that empirical fixation densities should not be treated as fixed ground truths but as evolving estimates that improve with better methodology.
Updated: 2026-05-05 15:45:00
Categories: cs.CV,cs.LG
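Abramson's square-root law, the adaptive-bandwidth component of the proposed mixture, can be sketched in one dimension: a fixed-bandwidth pilot density is estimated first, then each kernel's bandwidth is scaled by the inverse square root of the pilot density at its own point. The sketch omits the center-bias, uniform, and saliency-model components and the per-image cross-validation; the numbers are synthetic.

```python
from math import exp, log, pi, sqrt

def gauss(u):
    return exp(-0.5 * u * u) / sqrt(2 * pi)

def kde(x, points, h):
    """Fixed-bandwidth Gaussian KDE (the standard baseline)."""
    return sum(gauss((x - p) / h) for p in points) / (len(points) * h)

def abramson_bandwidths(points, h0):
    """Abramson's square-root law: h_i = h0 * sqrt(g / f_pilot(x_i)), so the
    kernel tightens where the pilot density is high and widens in the tails."""
    pilot = [max(kde(p, points, h0), 1e-12) for p in points]
    g = exp(sum(log(f) for f in pilot) / len(pilot))   # geometric mean
    return [h0 * sqrt(g / f) for f in pilot]

def adaptive_kde(x, points, bandwidths):
    return sum(gauss((x - p) / h) / h for p, h in zip(points, bandwidths)) / len(points)

fixations = [0.1, 0.12, 0.15, 0.5, 2.0]   # a dense cluster plus an outlier
hs = abramson_bandwidths(fixations, h0=0.3)
assert hs[0] < hs[-1]   # tighter kernel in the cluster than at the outlier
assert adaptive_kde(0.12, fixations, hs) > 0
```

The per-point bandwidths are what let the density stay sharp over fixation clusters without oversmoothing sparse regions, the failure mode of the decades-old fixed-bandwidth default.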
Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols
Code-driven auditing fails when correctness depends on what the specification requires rather than how the code is written. Production blockchain networks expose this directly: byzantine consensus runs many independent clients of a shared specification, so a specification-divergence defect in one client can fork the network or halt finality. Existing tools reason one repository at a time, with no shared baseline held constant across implementations. We present SPECA, an LLM-driven audit framework that derives explicit, categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications and reuses them across implementations. SPECA enables controlled cross-implementation comparison, detections grounded in specification invariants no code pattern encodes, and false positives traceable to a specific pipeline phase rather than opaque model errors. On the Sherlock Ethereum Fusaka Audit Contest (10 targets, 366 submissions), SPECA recovers all 15 in-scope H/M/L vulnerabilities in its expert-augmented configuration (8/15 in automated-only mode) and surfaces 4 fix-confirmed bugs, including a cryptographic-invariant violation missed by every adjudicated finding. On the RepoAudit C/C++ benchmark, SPECA reaches 88.9% precision at 100% recall (F1=0.94) and surfaces 12 author-validated bugs beyond ground truth, two externally validated. SPECA also flags 5 of RepoAudit's 40 published bugs as defensive-coding fixes with no reachable exploit path. False positives trace to three pipeline-pinned root causes; a multi-model study identifies property-generation quality as the binding constraint. End-to-end cost is ~$30 per H/M/L bug (~42 min wall-clock under parallel execution).
Updated: 2026-05-05 15:44:33
Categories: cs.CR
QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs
Multi-agent LLM systems on edge devices need to hand off latent context efficiently, but the practical choices today are expensive re-prefill or full-precision KV transfer. We study QKVShare, a framework for quantized KV-cache handoff between agents that combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache injection path. Our current results support a narrower but clearer story than the original draft: on 150 GSM8K problems with Llama-3.1-8B-Instruct, adaptive quantization remains competitive under repeated handoff and shows its clearest gains against uniform quantization in deeper-hop, higher-budget settings; for handoff latency, the QKVShare path reduces TTFT relative to full re-prefill at every tested context, from 130.7 ms vs. 150.2 ms at a nominal 1K context to 397.1 ms vs. 1029.7 ms at a nominal 8K context. Stage timing shows that post-injection generation, not card creation, dominates the current QKVShare latency path. These results position quantized KV handoff as a promising on-device systems direction while also highlighting the need for stronger controller ablations and apples-to-apples runtime comparisons.
Updated: 2026-05-05 15:44:29
Categories: cs.AI,cs.MA
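The basic building block of a quantized KV handoff is per-token quantization: each cache row is stored as low-bit integer codes plus one float scale. The sketch below shows symmetric round-to-nearest quantization and its error bound; QKVShare's token-level mixed-precision allocation (choosing bit-widths per token) sits on top of a primitive like this. Names and numbers here are illustrative.

```python
def quantize_per_token(row, bits=8):
    """Symmetric per-token quantization of one KV-cache row: integer codes
    in [-(2^(bits-1)-1), 2^(bits-1)-1] plus a single float scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in row) / qmax or 1.0  # avoid 0 for all-zero rows
    codes = [round(v / scale) for v in row]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

row = [0.02, -0.51, 0.33, 1.27]
codes, scale = quantize_per_token(row, bits=8)
recon = dequantize(codes, scale)
err = max(abs(a - b) for a, b in zip(row, recon))
assert all(-127 <= c <= 127 for c in codes)
assert err <= scale / 2 + 1e-12   # bounded by half a quantization step
```

Dropping `bits` for less important tokens shrinks the transferred payload at the cost of a coarser scale, which is exactly the precision/latency trade the abstract's adaptive scheme navigates.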
Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach
Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights, e.g., estimating descriptive statistics of or causal effects on quantitative measures derived from text, audio, or video data. In many such settings, unsupervised analysis is of primary interest, in that the researcher does not want to (or cannot) manually pre-specify all important aspects of the unstructured data to measure; they are interested in "discovery." This paper proposes a general and flexible framework for pursuing such discovery from unstructured data in a statistically principled way. The framework leverages recent methods from the literature on AI interpretability to map unstructured data points to high-dimensional, sparse, and interpretable "concept embeddings"; computes statistics from these concept embeddings for testing interpretable, concept-by-concept hypotheses; performs selective inference on these hypotheses using algorithms validated by new results in high-dimensional central limit theory, producing a selected set ("discoveries"); and both generates and evaluates human-interpretable natural language descriptions of these discoveries. The proposed framework has few researcher degrees of freedom, is robust to data snooping and other post-selection inference concerns, and facilitates fast and inexpensive sensitivity analysis and replication. Applications to recent descriptive and causal analyses of unstructured data in empirical economics are explored. Open source code is provided for researchers to implement the framework in their own projects.
Updated: 2026-05-05 15:42:49
Categories: econ.EM,cs.LG
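The selective-inference step (testing one interpretable hypothesis per concept and returning a selected set of "discoveries") can be illustrated with the classic Benjamini-Hochberg step-up procedure. The paper's procedure is adapted to high-dimensional, dependent concept statistics; this sketch and its p-values are purely illustrative.

```python
def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: sort p-values, find the largest rank k with
    p_(k) <= q*k/m, and reject the k smallest. Returns rejected indices."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k_star = rank
    return sorted(order[:k_star])

# One p-value per concept-level hypothesis; a few strong signals among nulls.
p = [0.001, 0.43, 0.004, 0.70, 0.012, 0.88, 0.35, 0.009]
discoveries = benjamini_hochberg(p, q=0.05)
assert discoveries == [0, 2, 4, 7]
```

Because the selection rule is explicit, the resulting set of discoveries is robust to data snooping in a way that ad-hoc thresholding of concept scores is not, which is the property the framework builds on.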
Deco: Extending Personal Physical Objects into Pervasive AI Companion through a Dual-Embodiment Framework
Individuals frequently form deep attachments to physical objects (e.g., plush toys) that usually cannot sense or respond to their emotions. While AI companions offer responsiveness and personalization, they exist independently of these physical objects and lack an ongoing connection to them. To bridge this gap, we conducted a formative study (N=9) to explore how digital agents could inherit and extend the emotional bond, deriving four design principles (Faithful Identity, Calibrated Agency, Ambient Presence, and Reciprocal Memory). We then present the Dual-Embodiment Companion Framework, instantiated as Deco, a mobile system integrating multimodal Large Language Models (LLMs) and Augmented Reality to create synchronized digital embodiments of users' physical companions. A within-subjects study (N=25) showed Deco significantly outperformed a personalized LLM-empowered digital companion baseline on perceived companionship, emotional bond, and design-principle scales (all p<0.01). A seven-day field deployment (N=17) showed sustained engagement, subjective well-being improvement (p=.040), and three key relational patterns: digital activities retroactively vitalized physical objects, bond deepening was driven by emotional engagement depth rather than interaction frequency, and users sustained bonds while actively navigating digital companions' AI nature. This work highlights a promising alternative for designing digital companions: moving from creating new relationships to dual embodiment, where digital agents seamlessly extend the emotional history of physical objects.
Updated: 2026-05-05 15:42:11
Categories: cs.HC,cs.AI,cs.CY
Personalized Worked Example Generation from Student Code Submissions Using Pattern-based Knowledge Components
Adaptive programming practice often relies on fixed libraries of worked examples and practice problems, which require substantial authoring effort and may not correspond well to the logical errors and partial solutions students produce while writing code. As a result, students may receive learning content that does not directly address the concepts they are working to understand, while instructors must either invest additional effort in expanding content libraries or accept a coarse level of personalization. We present an approach for knowledge-component (KC) guided educational content generation using pattern-based KCs extracted from student code. Given a problem statement and student submissions, our pipeline extracts recurring structural KC patterns from students' code through AST-based analysis and uses them to condition a generative model. In this study, we apply this approach to worked example generation, and compare baseline and KC-conditioned outputs through expert evaluation. Results suggest that KC-conditioned generation improves topical focus and relevance to students' underlying logical errors, providing evidence that KC-based steering of generative models can support personalized learning at scale.
Updated: 2026-05-05 15:39:02
Categories: cs.HC,cs.AI,cs.CY,cs.ET,cs.LG
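AST-based extraction of structural patterns from student code can be sketched with Python's `ast` module: counting parent-to-child node-type bigrams gives a crude stand-in for the pattern-based knowledge components the pipeline mines. The real KC extraction is richer than bigrams; this is an assumption-labeled illustration.

```python
import ast
from collections import Counter

def structural_patterns(code):
    """Count parent->child AST node-type bigrams in a submission, a crude
    stand-in for pattern-based knowledge components (KCs)."""
    tree = ast.parse(code)
    patterns = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            patterns[(type(parent).__name__, type(child).__name__)] += 1
    return patterns

submission = """
def count_evens(xs):
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += 1
    return total
"""
kcs = structural_patterns(submission)
assert kcs[("For", "If")] == 1        # loop-with-guard pattern
assert kcs[("If", "AugAssign")] == 1  # conditional accumulator update
```

Patterns recurring across many submissions to the same problem are candidate KCs, and conditioning the generator on them is what steers worked examples toward the structures students actually produce.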
DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
Dataset distillation enables efficient training by distilling the information of large-scale datasets into significantly smaller synthetic datasets. Diffusion-based paradigms have emerged in recent years, offering novel perspectives for dataset distillation. However, they typically necessitate additional fine-tuning stages, and effective guidance mechanisms remain underexplored. To address these limitations, we rethink diffusion-based dataset distillation and propose a Dual Matching Guided Diffusion (DMGD) framework, centered on efficient training-free guidance. We first establish Semantic Matching via conditional likelihood optimization, eliminating the need for auxiliary classifiers. Furthermore, we propose a dynamic guidance mechanism that enhances the diversity of synthetic data while maintaining semantic alignment. Simultaneously, we introduce an optimal transport (OT) based Distribution Matching approach to further align with the target distribution structure. To ensure efficiency, we develop two enhanced strategies for the diffusion-based framework: Distribution Approximate Matching and Greedy Progressive Matching. These strategies enable effective distribution matching guidance with minimal computational overhead. Experimental results on ImageNet-Woof, ImageNet-Nette, and ImageNet-1K demonstrate that our training-free approach achieves significant improvements, outperforming state-of-the-art (SOTA) methods requiring additional fine-tuning by average accuracy gains of 2.1%, 5.4%, and 2.4%, respectively.
Updated: 2026-05-05 15:37:50
Categories: cs.CV,cs.AI
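The optimal-transport distribution matching the abstract invokes has a closed-form special case worth keeping in mind: in one dimension with squared cost and equal-size samples, the optimal coupling simply matches sorted order (quantile coupling). The paper applies OT in a diffusion model's feature space; this 1-D sketch only illustrates the objective being minimized.

```python
def ot_cost_1d(xs, ys):
    """Exact 1-D optimal transport cost (squared distance) between two
    equal-size empirical distributions: match samples in sorted order."""
    assert len(xs) == len(ys)
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

synthetic = [0.9, 0.1, 0.5]
target = [0.2, 0.6, 1.0]
cost = ot_cost_1d(synthetic, target)
assert abs(cost - 0.01) < 1e-9

# The identity (unsorted) pairing is strictly worse here, so sorting matters:
naive = sum((a - b) ** 2 for a, b in zip(synthetic, target)) / 3
assert cost < naive
```

Guiding generation to reduce such a transport cost pulls the synthetic-sample distribution toward the target distribution's structure, not just toward individually plausible samples.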
Spatiotemporal Convolutions on EEG signal -- A Representation Learning Perspective on Efficient and Explainable EEG Classification with Convolutional Neural Nets
Classification of EEG signals using shallow Convolutional Neural Networks (CNNs) is a prevalent and successful approach across a variety of fields. Most of these models use independent one-dimensional (1D) convolutional layers along the spatial and temporal dimensions, which are concatenated without a non-linear activation layer between. In this paper, we investigate an alternative encoding that operates a bi-dimensional (2D) spatiotemporal convolution. While 2D convolutions are numerically identical to two concatenated 1D convolutions along the two dimensions, the impact on learning is still uncertain. We test 1D and 2D CNNs and a CNN+transformer hybrid model in a low-dimensional (3-channel) and a high-dimensional (22-channel) BCI motor imagery classification task. We observe that 2D convolutions significantly reduce training time in high-dimensional tasks while maintaining performance. We investigate the root of this improvement and find no difference in spectral feature importance. However, a clear pattern emerges in representational similarity across models: 1D and 2D models yield vastly different representational geometries. Overall, we suggest an improved model with a 2D convolutional layer for faster training and inference. We also highlight the importance of architecturally-driven encoding when processing complex multivariate signals, as reflected in internal representations rather than purely in performance metrics.
Updated: 2026-05-05 15:35:05
Categories: cs.LG,cs.AI
ReCode: Reinforcing Code Generation with Reasoning-Process Rewards
In practice, rigorous reasoning is often a key driver of correct code, while Reinforcement Learning (RL) for code generation often neglects optimizing reasoning quality. Bringing process-level supervision into RL is appealing, but it faces two challenges. First, training reliable reward models to assess reasoning quality is bottlenecked by the scarcity of fine-grained preference data. Second, naively incorporating such neural rewards may suffer from reward hacking. This work proposes ReCode (Reasoning-Reinforced Code Generation), a novel RL training framework comprising: (1) Contrastive Reasoning-Process Reward Learning (CRPL), which trains a reward model with synthesized optimized and degraded reasoning variants to assess the quality of reasoning process; and (2) Consistency-Gated GRPO (CG-GRPO), which integrates the reasoning-process reward model into RL by gating neural reasoning-process rewards with strict execution outcomes, using execution correctness as a hard gate to mitigate reward hacking. Additionally, to assess the reward model's discriminative capability in assessing reasoning-process quality, we introduce LiveCodeBench-RewardBench (LCB-RB), a new benchmark comprising preference pairs of superior and inferior reasoning processes tailored for code generation. Experimental results across HumanEval(+), MBPP(+), LiveCodeBench, and BigCodeBench show that a 7B model trained with ReCode outperforms the base version by 16.1% and reaches performance comparable to GPT-4-Turbo. We further demonstrate the generalizability of ReCode by extending it to the math domain.
Updated: 2026-05-05 15:31:45
Categories: cs.SE,cs.AI,cs.CL,cs.LG
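The anti-reward-hacking idea in CG-GRPO, using execution correctness as a hard gate on the neural reasoning-process reward, reduces to a very small rule. The sketch below is a schematic of that gating logic only (with hypothetical names and an arbitrary outcome/bonus split), not the paper's full advantage computation.

```python
def gated_reward(exec_passed, process_score):
    """Consistency-gated reward (sketch): the reasoning-process score from the
    reward model only counts when the generated code passes its tests, so the
    policy cannot earn reward by persuading the reward model alone."""
    if not exec_passed:
        return 0.0                 # hard gate: failing code earns nothing
    return 1.0 + process_score     # outcome reward plus process-quality bonus

assert gated_reward(False, 0.99) == 0.0          # a polished rationale can't rescue wrong code
assert gated_reward(True, 0.3) > gated_reward(False, 0.99)
assert gated_reward(True, 0.9) > gated_reward(True, 0.3)
```

Within the set of passing solutions, the process score still differentiates better from worse reasoning, which is where the process-level supervision pays off.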
EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Language models encode substantial evaluative knowledge from pretraining, yet current post-training methods rely on external supervision (human annotations, proprietary models, or scalar reward models) to produce reward signals. Each imposes a ceiling. Human judgment cannot supervise capabilities beyond its own, proprietary APIs create dependencies, and verifiable rewards cover only domains with ground-truth answers. Self-improvement from a model's own evaluative capacity is a reward source that scales with the model itself, yet remains largely untapped by current methods. We introduce EVOLM, a post-training method that structures this capacity into explicit discriminative rubrics and uses them as training signal. EVOLM trains two capabilities within a single language model in alternation: (1) a rubric generator producing instance-specific evaluation criteria optimized for discriminative utility, which maximizes a small frozen judge's ability to distinguish preferred from dispreferred responses; and (2) a policy trained using those rubric-conditioned scores as reward. All preference signals are constructed from the policy's own outputs via temporal contrast with earlier checkpoints, requiring no human annotation or external supervision. EVOLM trains a Qwen3-8B model to generate rubrics that outperform GPT-4.1 on RewardBench-2 by 25.7%. The co-trained policy achieves 69.3% average on the OLMo3-Adapt suite, outperforming policies trained with GPT-4.1 prompted rubrics by 3.9% and with the state-of-the-art 8B reward model SkyWork-RM by 16%. Overall, EVOLM demonstrates that structuring a model's evaluative capacity into co-evolving discriminative rubrics enables self-improvement without external supervision.
Updated: 2026-05-05 15:31:00
Categories: cs.AI
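The "discriminative utility" objective for rubrics can be made concrete with a toy judge: a rubric is rewarded for maximizing the margin by which a frozen judge separates a preferred response from a dispreferred one (here, outputs contrasted across checkpoints). The stub judge and all strings below are illustrative assumptions, not EVOLM's actual components.

```python
# Toy illustration of discriminative utility: in EVOLM the judge is a small
# frozen LM; here it is a stub scoring function over rubric criteria.

def judge(rubric, response):
    """Stub judge: score = fraction of rubric criteria the response mentions."""
    hits = sum(1 for criterion in rubric if criterion in response)
    return hits / max(len(rubric), 1)

def discriminative_utility(rubric, preferred, dispreferred):
    """Margin by which the judge prefers the better response under this rubric."""
    return judge(rubric, preferred) - judge(rubric, dispreferred)

# Temporal contrast: the current checkpoint's answer vs. an earlier one.
preferred = "cites sources and states the limitation explicitly"
dispreferred = "gives an answer"
rubric_a = ["cites sources", "states the limitation"]   # discriminative
rubric_b = ["gives an answer"]                           # not discriminative
u_a = discriminative_utility(rubric_a, preferred, dispreferred)
u_b = discriminative_utility(rubric_b, preferred, dispreferred)
```

The rubric generator would be trained to prefer rubrics like `rubric_a`, whose criteria actually separate the two responses.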
On Adaptivity in Zeroth-Order Optimization
We investigate the effectiveness of adaptive zeroth-order (ZO) optimization for memory-constrained fine-tuning of large language models (LLMs). Contrary to prior claims, we show that adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD, while incurring significant memory overhead. Our analysis reveals that in high dimensions, ZO gradients lack coordinate-wise heterogeneity, rendering adaptive mechanisms memory inefficient. Leveraging this insight, we propose MEAZO, a memory-efficient adaptive ZO optimizer that tracks only a single scalar for global step size adaptation. We support our method with theoretical convergence guarantees under standard assumptions. Experiments across multiple LLM families and tasks demonstrate that MEAZO matches ZO-Adam's performance with the memory footprint of ZO-SGD. Additional experiments on synthetic quadratic problems and LLM fine-tuning further demonstrate MEAZO's enhanced robustness to step size choices, particularly in grouped or block-structured optimization settings.
Updated: 2026-05-05 15:29:11
Categories: cs.LG,math.OC
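The abstract specifies only that MEAZO tracks a single scalar for global step-size adaptation; the concrete rule below (an Adagrad-style scalar accumulator over the two-point ZO directional derivative) is our assumption for illustration, not the paper's update.

```python
import random

# Sketch: two-point (SPSA-style) zeroth-order gradient estimate, plus a
# single adaptive scalar that rescales the global step size.

def zo_grad(f, x, mu=1e-3):
    """Two-point estimate: directional derivative along a random direction u."""
    u = [random.gauss(0.0, 1.0) for _ in x]
    xp = [xi + mu * ui for xi, ui in zip(x, u)]
    xm = [xi - mu * ui for xi, ui in zip(x, u)]
    d = (f(xp) - f(xm)) / (2 * mu)          # scalar directional derivative
    return d, u

def meazo_like_step(f, x, state, lr=0.5, eps=1e-8):
    d, u = zo_grad(f, x)
    state["v"] += d * d                      # the single adaptive scalar
    scale = lr / (state["v"] ** 0.5 + eps)
    return [xi - scale * d * ui for xi, ui in zip(x, u)]

random.seed(0)
f = lambda x: sum(xi * xi for xi in x)       # simple quadratic test problem
x, state = [3.0, -2.0], {"v": 0.0}
for _ in range(300):
    x = meazo_like_step(f, x, state)
```

The memory argument from the abstract is visible here: adaptation costs one float (`state["v"]`) instead of the per-parameter first- and second-moment buffers ZO-Adam would keep.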
Memory-Efficient Continual Learning with CLIP Models
Contrastive Language-Image Pretraining (CLIP) models excel at understanding image-text relationships but struggle with adapting to new data without forgetting prior knowledge. To address this, models are typically fine-tuned using both new task data and a memory buffer of past tasks. However, CLIP's contrastive loss suffers when the memory buffer is small, leading to performance degradation on previous tasks. We propose a memory-efficient, distributionally robust method that dynamically reweights losses per class during training. Our approach, tested on class incremental settings (CIFAR-100, ImageNet1K) and a domain incremental setting (DomainNet) adapts CLIP models quickly while minimizing catastrophic forgetting, even with minimal memory usage.
Updated: 2026-05-05 15:27:27
Categories: cs.LG
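Dynamic per-class loss reweighting in the distributionally robust style the abstract describes can be sketched with multiplicative (exponentiated-gradient) updates: classes currently suffering high loss receive more weight. The temperature `eta` and this particular update rule are our assumptions, not the paper's exact formulation.

```python
import math

# Minimal DRO-style reweighting: upweight classes with large loss,
# then renormalize the weights to a probability distribution.

def reweight(weights, class_losses, eta=1.0):
    scaled = [w * math.exp(eta * l) for w, l in zip(weights, class_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Three classes; the old task's class (index 2) is being forgotten.
weights = [1 / 3, 1 / 3, 1 / 3]
class_losses = [0.2, 0.3, 2.0]
weights = reweight(weights, class_losses)
weighted_loss = sum(w * l for w, l in zip(weights, class_losses))
```

With a small memory buffer, this keeps gradient pressure on the class being forgotten instead of letting the contrastive loss average it away.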
Quantifying the human visual exposome with vision language models
The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self-reports, failing to capture the first-person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision language models (VLMs) to quantify the semantic richness of human visual experience. Across 2674 participant-generated photographs, VLM-derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi-autonomous large language model (LLM)-based pipeline that mined over seven million scientific publications to extract nearly 1000 environmental features empirically linked to mental health. When applied to real-world imagery, up to 33 percent of VLM-extracted context ratings significantly correlated with affect and stress. These findings establish a scalable, objective paradigm for visual exposomics, enabling high-throughput decoding of how the visible world is associated with mental health.
Updated: 2026-05-05 15:25:00
Categories: cs.AI,cs.CV
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it.
Updated: 2026-05-05 15:24:40
Categories: cs.AI,cs.CL
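The executor-grounded reward has a simple multiplicative shape: a rubric-based RM score for the trace, times the measured uplift the trace gives a frozen executor. The uplift definition below (pass rate with the trace minus pass rate without it, clipped at zero) is an assumption for illustration.

```python
# Sketch: a trace earns high reward only when it is both well-formed
# (rubric RM score) and actually useful to the frozen executor (uplift).

def uplift(pass_rate_with_trace, pass_rate_without_trace):
    return pass_rate_with_trace - pass_rate_without_trace

def trace_reward(rm_score, pass_with, pass_without):
    return rm_score * max(uplift(pass_with, pass_without), 0.0)

good_trace = trace_reward(rm_score=0.9, pass_with=0.8, pass_without=0.5)
pretty_but_useless = trace_reward(rm_score=0.9, pass_with=0.5, pass_without=0.5)
```

The second case is the abstract's core point: a trace that "looks good" (RM score 0.9) but does not help the consuming model earns zero reward.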
Uncovering and Understanding FPR Manipulation Attack in Industrial IoT Networks
In the network security domain, due to practical issues -- including imbalanced data and heterogeneous legitimate network traffic -- adversarial attacks in machine learning-based NIDSs have been viewed as attack packets misclassified as benign. Due to this prevailing belief, the possibility of (maliciously) perturbed benign packets being misclassified as attack has been largely ignored. In this paper, we demonstrate that this is not only theoretically possible, but also a particular threat to NIDS. In particular, we uncover a practical cyberattack, FPR manipulation attack (FPA), especially targeting industrial IoT networks, where domain-specific knowledge of the widely used MQTT protocol is exploited and a simple, systematic packet-level perturbation is performed to alter the labels of benign traffic samples without employing traditional gradient-based or non-gradient-based methods. The experimental evaluations demonstrate that this novel attack results in a success rate of 80.19% to 100%. In addition, while estimating impacts in the Security Operations Center, we observe that even a small fraction of false positive alerts, irrespective of different budget constraints and alert traffic intensities, can delay the investigation of genuine alerts by up to 2 hours in a single day under normal operating conditions. Furthermore, a series of relevant statistical and XAI analyses is conducted to understand the key factors behind this remarkable success. Finally, we explore the effectiveness of the FPA packets to enhance models' robustness through adversarial training and investigate the changes in decision boundaries accordingly.
Updated: 2026-05-05 15:24:16
Categories: cs.CR,cs.LG
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.
Updated: 2026-05-05 15:20:42
Categories: cs.CL,cs.AI
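Constraint-level judge metrics in the spirit of the abstract can be sketched directly: per-constraint accuracy against gold labels in {yes, partial, no}, and an inconsistency rate measured as disagreement of the same judge across prompt variants. The exact metric definitions below are our assumptions, not the benchmark's.

```python
# Toy constraint-level evaluation: each position is one constraint of a
# multi-constraint instruction, judged as yes / partial / no.

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def inconsistency(runs):
    """Fraction of constraints where repeated judgments disagree."""
    n = len(runs[0])
    return sum(len({run[i] for run in runs}) > 1 for i in range(n)) / n

gold = ["yes", "partial", "no", "yes"]
run_a = ["yes", "yes", "no", "yes"]     # misses the rarer "partial" label
run_b = ["yes", "partial", "no", "no"]  # flips under a prompt variant
acc_a = accuracy(run_a, gold)
inc = inconsistency([run_a, run_b])
```

This also shows why the two axes are separate: `run_a` is reasonably accurate overall yet the judge is unstable on half the constraints across variants.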
A Deeper Dive into the Irreversibility of PolyProtect: Making Protected Face Templates Harder to Invert
This work presents a deeper analysis of the "irreversibility" property of PolyProtect, a biometric template protection method initially proposed for securing face embeddings. PolyProtect transforms embeddings into protected templates via multivariate polynomials, whose coefficients and exponents are distinct for each subject enrolled in the face recognition system. A polynomial is applied to consecutive sets of elements from a given embedding, where the amount of overlap between the sets is a tunable parameter. We begin our irreversibility analysis by demonstrating that PolyProtected templates are easier to invert using a numerical solver based on cosine distance, as opposed to Euclidean distance (used in the earlier PolyProtect work). To make this inversion more difficult, we then propose a "key selection algorithm", which tries to choose "keys" (coefficients and exponents of the PolyProtect polynomial) that enhance the irreversibility of PolyProtected templates, compared to when the keys are purely random. Our experiments show that this algorithm is effective at generating PolyProtected templates that are significantly more difficult to invert, and that it approximately equalises the irreversibility of PolyProtected templates generated using different "overlap" parameters. This allows for better control of the irreversibility versus accuracy trade-off, known to exist across different overlaps. We also show that accuracy in the PolyProtected domain can be affected by the range in which the embedding elements lie, but that this can be improved by normalizing the embeddings prior to applying PolyProtect. This work is reproducible using our open-source code.
Updated: 2026-05-05 15:20:38
Categories: cs.CV,cs.CR
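The PolyProtect mapping as the abstract describes it is straightforward to sketch: a subject-specific polynomial (coefficients and exponents, the "keys") is applied to consecutive, possibly overlapping windows of the embedding. The window size of 4 and the example key values below are illustrative, not ones from the paper.

```python
# Sketch of the PolyProtect transform: each window of m consecutive
# embedding elements maps to one protected value sum_j c_j * x_j^e_j.

def polyprotect(embedding, coeffs, exps, overlap):
    m = len(coeffs)
    step = m - overlap                       # the tunable "overlap" parameter
    out = []
    for start in range(0, len(embedding) - m + 1, step):
        window = embedding[start:start + m]
        out.append(sum(c * (x ** e) for c, x, e in zip(coeffs, window, exps)))
    return out

emb = [0.5, -0.2, 0.1, 0.4, -0.3, 0.2, 0.6, -0.1]
coeffs, exps = [2, -3, 1, 4], [1, 2, 3, 1]   # the subject's secret "keys"
protected = polyprotect(emb, coeffs, exps, overlap=2)
```

Larger overlap yields more protected values per embedding (better accuracy) but gives an attacker more equations over the same unknowns, which is the irreversibility-versus-accuracy trade-off the paper studies.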
Large Language Model assisted Hybrid Fuzzing
Greybox fuzzing is one of the most popular methods for detecting software vulnerabilities, which conducts a biased random search within the program input space. To enhance its effectiveness in achieving deep coverage of program behaviors, greybox fuzzing is often combined with concolic execution, which performs a path-sensitive search over the domain of program inputs. In hybrid fuzzing, conventional greybox fuzzing is followed by concolic execution in an iterative loop, where reachability roadblocks encountered by greybox fuzzing are tackled by concolic execution. However, such hybrid fuzzing still suffers from difficulties conventionally faced by concolic execution, such as the need for environment modeling and system call support. In this work, we explore the potential of developing "smart" concolic execution empowered by Large Language Models (LLMs), leveraging their knowledge of code semantics during constraint computing and solving. When coverage-based greybox fuzzing reaches a roadblock in terms of reaching certain branches, we conduct a slicing on the execution trace and suggest modifications of the input to reach the relevant branches. The LLM is used as a solver to generate the modified input to reach the desired branches. Compared with state-of-the-art hybrid fuzzers CoFuzz, Intriguer, and QSYM, our LLM-based hybrid fuzzer HyllFuzz (pronounced "hill fuzz") covers 31.43%, 44.56%, and 59.48% more code branches, respectively. Furthermore, the LLM-based concolic execution in HyllFuzz takes a time that is 3--19 times faster than the concolic execution running in existing hybrid fuzzing tools. In extensively tested real-world subjects, HyllFuzz exposed seven previously unknown bugs. This experience shows that LLMs can be effectively inserted into the iterative loop of hybrid fuzzers to efficiently expose more program behaviors.
Updated: 2026-05-05 15:18:33
Categories: cs.SE,cs.CR
ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space
Generating continuous-time, continuous-space stochastic processes (e.g., videos, weather forecasts) conditioned on partial observations (e.g., first and last frames) is a fundamental challenge. Existing approaches (e.g., diffusion models) suffer from key limitations: (1) noise-to-data evolution fails to capture structural similarity between states close in physical time and has unstable integration in low-step regimes; (2) random noise injected is insensitive to the physical process's time elapsed, resulting in incorrect dynamics; (3) they overlook conditioning on arbitrary subsets of states (e.g., irregularly sampled timesteps, future observations). We propose ABC: Any-Subset Autoregressive Models via Non-Markovian Diffusion Bridges in Continuous Time and Space. Crucially, we model the process with one continual SDE whose time variable and intermediate states track the real time and process states. This has provable advantages: (1) the starting point for generating future states is the already-close previous state, rather than uninformative noise; (2) random noise injection scales with physical time elapsed, encouraging physically plausible dynamics with similar time-adjacent states. We derive SDE dynamics via changes-of-measure on path space, yielding another advantage: (3) path-dependent conditioning on arbitrary subsets of the state history and/or future. To learn these dynamics, we derive a path- and time-dependent extension of denoising score matching. Our experiments show ABC's superiority to competing methods on multiple domains, including video generation and weather forecasting.
Updated: 2026-05-05 15:16:38
Categories: cs.LG,cs.AI
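The two structural properties the abstract emphasizes, generation anchored on time-adjacent observed states and noise that scales with physical time elapsed, both appear in a plain Brownian bridge, which makes a convenient skeleton. ABC's learned non-Markovian SDE is of course far richer; this sketch only illustrates the conditioning structure.

```python
import random

# Brownian bridge conditioned on two observed "frames" x(t0)=x0, x(t1)=x1:
# the mean interpolates the known states, and the injected noise variance
# grows with elapsed physical time and vanishes at both endpoints.

def bridge_sample(x0, x1, t0, t1, t, sigma=1.0):
    a = (t - t0) / (t1 - t0)
    mean = x0 + a * (x1 - x0)                      # start from nearby states
    var = sigma ** 2 * (t - t0) * (t1 - t) / (t1 - t0)
    return mean + (var ** 0.5) * random.gauss(0.0, 1.0)

random.seed(0)
# Conditioning on first and last frames: x(0) = 0, x(1) = 2.
samples = [bridge_sample(0.0, 2.0, 0.0, 1.0, 0.5) for _ in range(2000)]
mean_mid = sum(samples) / len(samples)
```

At the observed timesteps the variance term is exactly zero, so the sampled trajectory is pinned to the conditioning states.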
Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence
Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs, conscience score, mechanical guilt, and resonant dependability, that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.
Updated: 2026-05-05 15:14:02
Categories: cs.AI
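The "minimally corrects" idea can be made concrete for the simplest admissible region, an axis-aligned box: the minimal-L2 correction of an action onto a box is coordinate-wise clipping. The box bounds and the guilt measure below are illustrative assumptions, not the paper's definitions.

```python
# Sketch of a supervisory filter: project the baseline policy's action onto
# the normatively admissible region, and record how large a correction was
# needed (a toy stand-in for "mechanical guilt").

def conscience_filter(action, lo, hi):
    """Minimal-L2 projection onto an axis-aligned admissible box."""
    return [min(max(a, l), h) for a, l, h in zip(action, lo, hi)]

def mechanical_guilt(action, corrected):
    """Squared L2 size of the correction that was applied."""
    return sum((a - c) ** 2 for a, c in zip(action, corrected))

lo, hi = [-1.0, -1.0], [1.0, 1.0]
raw = [1.5, 0.2]                      # baseline action drifts out of bounds
safe = conscience_filter(raw, lo, hi)
guilt = mechanical_guilt(raw, safe)
```

A trajectory-level version would accumulate this per-step correction over time, matching the paper's focus on cumulative deviation rather than single actions.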
SOAR: Real-Time Joint Optimization of Order Allocation and Robot Scheduling in Robotic Mobile Fulfillment Systems
Robotic Mobile Fulfillment Systems (RMFS) rely on mobile robots for automated inventory transportation, coordinating order allocation and robot scheduling to enhance warehousing efficiency. However, optimizing RMFS is challenging due to strict real-time constraints and the strong coupling of multi-phase decisions. Existing methods either decompose the problem into isolated sub-tasks to guarantee responsiveness at the cost of global optimality, or rely on computationally expensive global optimization models that are unsuitable for dynamic industrial environments. To bridge this gap, we propose SOAR, a unified Deep Reinforcement Learning framework for real-time joint optimization. SOAR transforms order allocation and robot scheduling into a unified process by utilizing soft order allocations as observations. We formulate this as an Event-Driven Markov Decision Process, enabling the agent to perform simultaneous scheduling in response to asynchronous system events. Technically, we employ a Heterogeneous Graph Transformer to encode the warehouse state and integrate phased domain knowledge. Additionally, we incorporate a reward shaping strategy to address sparse feedback in long-horizon tasks. Extensive experiments on synthetic and real-world industrial datasets, in collaboration with Geekplus, demonstrate that SOAR reduces global makespan by 7.5% and average order completion time by 15.4% with sub-100ms latency. Furthermore, sim-to-real deployment confirms its practical viability and significant performance gains in production environments. The code is available at https://github.com/200815147/SOAR.
Updated: 2026-05-05 15:09:32
Categories: cs.AI,cs.RO
Deep deterministic policy gradient with symmetric data augmentation for lateral attitude tracking control of a fixed-wing aircraft
The symmetry of dynamical systems can be exploited for state-transition prediction and to facilitate control policy optimization. This paper leverages system symmetry to develop sample-efficient offline reinforcement learning (RL) approaches. Under the symmetry assumption for a Markov Decision Process (MDP), a symmetric data augmentation method is proposed. The augmented samples are integrated into the dataset of Deep Deterministic Policy Gradient (DDPG) to enhance its coverage rate of the state-action space. Furthermore, sample utilization efficiency is improved by introducing a second critic trained on the augmented samples, resulting in a dual-critic structure. The aircraft's model is verified to be symmetric, and flight control simulations demonstrate accelerated policy convergence when augmented samples are employed.
Updated: 2026-05-05 15:09:20
Categories: cs.LG,cs.AI
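The symmetric data augmentation can be sketched concretely: if the lateral dynamics satisfy f(-s, -a) = -f(s, a), every stored transition (s, a, r, s') yields a second valid transition with states and actions negated. The sign conventions below (in particular, that the reward, e.g. a tracking-error magnitude, is invariant under the mirror) are an illustrative assumption.

```python
# Sketch: double DDPG's replay buffer with mirrored counterparts of each
# transition, exploiting the assumed lateral symmetry of the aircraft model.

def mirror(transition):
    s, a, r, s_next = transition
    return ([-x for x in s], [-x for x in a], r, [-x for x in s_next])

def augment(buffer):
    """Append the mirrored counterpart of every stored transition."""
    return buffer + [mirror(t) for t in buffer]

buffer = [([0.1, -0.3], [0.5], -0.2, [0.12, -0.25])]
augmented = augment(buffer)
```

Each real rollout thus covers two regions of the state-action space, which is the coverage-rate gain the abstract attributes to the method.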
Complex Equation Learner: Rational Symbolic Regression with Gradient Descent in Complex Domain
Symbolic regression aims to discover interpretable equations from data, yet modern gradient-based methods fail for operators that introduce singularities or domain constraints, including division, logarithms, and square roots. As a result, Equation Learner-type models typically avoid these operators or impose restrictions, e.g. constraining denominators to prevent poles, which narrows the hypothesis class. We propose a complex weight extension of the Equation Learner that mitigates real-valued optimization pathologies by allowing optimization trajectories to bypass real-axis degeneracies. The proposed approach converges stably even when the target expression has real-domain poles, and it enables unconstrained use of operations such as logarithm and square root. We validate the method on symbolic regression benchmarks and show it can recover singular behavior from experimental frequency response data.
Updated: 2026-05-05 15:08:08
Categories: cs.LG
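A tiny illustration of why lifting weights into the complex plane helps: on the real line, log(x - w) is undefined for x < w and 1/(x - w) blows up at x = w, so gradient trajectories can get stuck at these singularities, while the complex evaluations stay finite and trajectories can pass around the real-axis pole. The values below are illustrative.

```python
import cmath
import math

# Real weight w = 2: x = 2 is a pole of 1/(x - w), and log(x - w) is
# undefined for x < 2. A slightly off-axis complex weight avoids both.

def safe_log(x, w):
    return cmath.log(complex(x) - w)

def safe_inv(x, w):
    return 1.0 / (complex(x) - w)

w_real = 2.0
w_complex = 2.0 + 0.1j

try:
    math.log(1.0 - w_real)        # real-valued path: fails outside the domain
    real_log_ok = True
except ValueError:
    real_log_ok = False

inv_at_pole = safe_inv(2.0, w_complex)        # finite: 1 / (-0.1j)
log_left_of_branch = safe_log(1.0, w_complex)  # finite complex value
```

A final projection back to the reals (e.g. taking the real part of the output) would recover a real-valued model after training; the abstract does not specify that step, so it is left out here.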
On Computing Total Variation Distance Between Mixtures of Product Distributions
We study the problem of approximating the total variation distance between two mixtures of product distributions over an $n$-dimensional discrete domain. Given two mixtures $\mathbb{P}$ and $\mathbb{Q}$ with $k_1$ and $k_2$ product distributions over $[q]^n$, respectively, we give a randomized algorithm that approximates $d_{\mathrm{TV}}\left({\mathbb{P}},{\mathbb{Q}}\right)$ within a multiplicative error of $(1\pm \varepsilon)$ in time $\mathrm{poly}((nq)^{k_1+k_2},1/\varepsilon)$. We also study the special case of mixtures of Boolean subcubes over $\{0,1\}^n$. For this class, we give a deterministic algorithm that exactly computes the total variation distance in time $\mathrm{poly}(n,2^{O(k_1+k_2)})$, and show that exact computation is $\#\mathsf{P}$-hard when $k_1+k_2=\Theta(n)$.
Updated: 2026-05-05 15:05:49
Categories: cs.DS,cs.LG,math.PR
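The quantity being approximated is easy to state by brute force for tiny instances: $d_{\mathrm{TV}}(\mathbb{P},\mathbb{Q}) = \frac{1}{2}\sum_x |\mathbb{P}(x)-\mathbb{Q}(x)|$, where each mixture's pmf is a weighted sum of products of per-coordinate marginals. Exhaustive enumeration over $[q]^n$ is only feasible for tiny $n$ and $q$; the paper's algorithms are precisely about avoiding it.

```python
from itertools import product

# Brute-force TV distance between two mixtures of product distributions.
# A mixture is (weights, components); each component is a list of n
# per-coordinate pmfs (each a length-q list of probabilities).

def mixture_pmf(x, weights, components):
    total = 0.0
    for w, comp in zip(weights, components):
        p = w
        for xi, marg in zip(x, comp):
            p *= marg[xi]
        total += p
    return total

def tv_distance(P, Q, n, q):
    return 0.5 * sum(
        abs(mixture_pmf(x, *P) - mixture_pmf(x, *Q))
        for x in product(range(q), repeat=n)
    )

# Two mixtures over {0,1}^2: the uniform product distribution vs. an
# even mix of the point masses on 00 and 11.
P = ([1.0], [[[0.5, 0.5], [0.5, 0.5]]])
Q = ([0.5, 0.5], [[[1.0, 0.0]] * 2, [[0.0, 1.0]] * 2])
d = tv_distance(P, Q, n=2, q=2)
```

Here every point's probability differs by 0.25, so the distance is 0.5, which the enumeration confirms.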
TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains
We introduce TRACE, a cross-domain engineering framework for trustworthy agentic AI in operationally critical domains. TRACE combines a four-layer reference architecture with an explicit classical-ML vs. LLM-validator split (L2a/L2b), a stateful orchestration-and-escalation policy (L3), and bounded human supervision (L4); a metrologically grounded trust-metric suite mapped to GUM/VIM/ISO 17025; and a Model-Parsimony principle quantified by the Computational Parsimony Ratio (CPR). Three instantiations--clinical decision support, industrial multi-domain operations, and a judicial AI assistant--transfer the same architecture and metrics across principally different governance contexts. The L2a/L2b separation makes the use of large language models a deliberate design decision rather than an architectural default, with parsimony quantified through CPR. TRACE introduces CPR as a first-class design principle in trustworthy-AI engineering.
Updated: 2026-05-05 15:05:00
Categories: cs.CL,cs.AI,cs.HC
A Domain Incremental Continual Learning Benchmark for ICU Time Series Model Transportability
In recent years, machine learning has made significant progress in clinical outcome prediction, demonstrating increasingly accurate results. However, the substantial resources required for hospitals to train these models, such as data collection, labeling, and computational power, limit the feasibility for smaller hospitals to develop their own models. An alternative approach involves transferring a machine learning model trained by a large hospital to smaller hospitals, allowing them to fine-tune the model on their specific patient data. However, these models are often trained and validated on data from a single hospital, raising concerns about their generalizability to new data. Our research shows that there are notable differences in measurement distributions and frequencies across various regions in the United States. To address this, we propose a benchmark that tests a machine learning model's ability to transfer from a source domain to different regions across the country. This benchmark assesses a model's capacity to learn meaningful information about each new domain while retaining key features from the original domain. Using this benchmark, we frame the transfer of a machine learning model from one region to another as a domain incremental learning problem. While the task of patient outcome prediction remains the same, the input data distribution varies, necessitating a model that can effectively manage these shifts. We evaluate two popular domain incremental learning methods: data replay, which stores examples from previous data sources for fine-tuning on the current source, and Elastic Weight Consolidation (EWC), a model parameter regularization method that maintains features important for both data sources.
Updated: 2026-05-05 15:02:07
Categories: cs.LG
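The EWC penalty the abstract refers to has a standard form: after training on the source hospital's data, parameters important there (large Fisher values) are anchored while fine-tuning on a new region, $L_{\text{total}} = L_{\text{new}} + \frac{\lambda}{2}\sum_i F_i(\theta_i - \theta^{\text{src}}_i)^2$. The toy values below are illustrative.

```python
# Minimal numeric sketch of the Elastic Weight Consolidation penalty:
# drifting a high-Fisher (important) parameter is expensive, drifting a
# low-Fisher one is nearly free.

def ewc_penalty(theta, theta_src, fisher, lam):
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for f, t, ts in zip(fisher, theta, theta_src)
    )

theta_src = [1.0, -0.5, 2.0]          # weights after source-hospital training
fisher = [10.0, 0.1, 5.0]             # per-parameter importance estimates
theta = [1.1, 0.5, 2.0]               # current weights on the target region
penalty = ewc_penalty(theta, theta_src, fisher, lam=1.0)
```

Unlike data replay, this requires storing only the source weights and Fisher estimates, not examples from the source hospital, which matters for cross-institution patient data.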
Fisher Decorator: Refining Flow Policy via a Local Transport Map
Recent advances in flow-based offline reinforcement learning (RL) have achieved strong performance by parameterizing policies via flow matching. However, they still face critical trade-offs among expressiveness, optimality, and efficiency. In particular, existing flow policies interpret the $L_2$ regularization as an upper bound of the 2-Wasserstein distance ($W_2$), which can be problematic in offline settings. This issue stems from a fundamental geometric mismatch: the behavioral policy manifold is inherently anisotropic, whereas the $L_2$ (or upper bound of $W_2$) regularization is isotropic and density-insensitive, leading to systematically misaligned optimization directions. To address this, we revisit offline RL from a geometric perspective and show that policy refinement can be formulated as a local transport map: an initial flow policy augmented by a residual displacement. By analyzing the induced density transformation, we derive a local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix, enabling a tractable anisotropic optimization formulation. By leveraging the score function embedded in the flow velocity, we obtain a corresponding quadratic constraint for efficient optimization. Our results reveal that the optimality gap in prior methods arises from their isotropic approximation. In contrast, our framework achieves a controllable approximation error within a provable neighborhood of the optimal solution. Extensive experiments demonstrate state-of-the-art performance across diverse offline RL benchmarks. See project page: https://github.com/ARC0127/Fisher-Decorator.
Updated: 2026-05-05 15:00:45
Subjects: cs.LG,cs.RO
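The "local quadratic approximation of the KL-constrained objective governed by the Fisher information matrix" is, in its generic form, the standard second-order expansion of the KL divergence; a sketch under the usual regularity assumptions (notation here is mine, not the paper's):

```latex
% Second-order expansion around \theta_0; the first-order term vanishes
% because \mathbb{E}_{\pi_{\theta_0}}[\nabla_\theta \log \pi_{\theta_0}] = 0.
D_{\mathrm{KL}}\!\left(\pi_{\theta_0} \,\middle\|\, \pi_{\theta_0+\delta}\right)
  \approx \tfrac{1}{2}\,\delta^{\top} F(\theta_0)\,\delta,
\qquad
F(\theta_0) = \mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[
  \nabla_\theta \log \pi_{\theta_0}(a)\,
  \nabla_\theta \log \pi_{\theta_0}(a)^{\top}\right].
```

This is what makes the constraint anisotropic: displacement directions are penalized according to the local information geometry rather than uniformly, as they would be under an $L_2$ ball.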
Parameter-Efficient Distributional RL via Normalizing Flows and a Geometry-Aware Cramér Surrogate
Distributional Reinforcement Learning (DistRL) improves upon expectation-based methods by modeling full return distributions, but standard approaches often remain far from parsimonious. Categorical methods (e.g., C51) rely on fixed supports where parameter counts scale linearly with resolution, while quantile methods approximate distributions as discrete mixtures whose piecewise-constant densities can be wasteful when modeling complex multi-modal or heavy-tailed returns. We introduce NFDRL, a parsimonious architecture that models return distributions using continuous normalizing flows. Unlike categorical baselines, our flow-based model maintains a compact parameter footprint that does not grow with the effective resolution of the distribution, while providing a dynamic, adaptive support for returns. To train this continuous representation, we propose a Cramér-inspired, geometry-aware distance defined over probability masses obtained from the flow. We show that this distance is a true probability metric, that the associated distributional Bellman operator is a sqrt(gamma)-contraction, and that the resulting objective admits unbiased sample gradients, properties that are typically not simultaneously guaranteed in prior PDF-based DistRL methods. Empirically, NFDRL recovers rich, multi-modal return landscapes on toy MDPs and achieves performance competitive with categorical baselines on the Atari-5 benchmark, while offering substantially better parameter efficiency.
Updated: 2026-05-05 14:58:08
Subjects: cs.AI,cs.LG,math.OC
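The Cramér distance underlying the surrogate has a simple discrete form; a numpy sketch on a shared sorted support (the grid and masses are toy values, and the paper's geometry-aware variant is not reproduced here):

```python
import numpy as np

def cramer_distance(p, q, support):
    """Classical Cramér (l2) distance between two probability mass
    vectors on a shared sorted support: the L2 norm of the CDF gap."""
    F, G = np.cumsum(p), np.cumsum(q)
    dx = np.diff(support, append=support[-1])   # last bin has zero width
    return np.sqrt(np.sum((F - G) ** 2 * dx))

support = np.linspace(0.0, 1.0, 5)
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # point mass at 0.0
q = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # point mass at 1.0

d = cramer_distance(p, q, support)        # maximal CDF gap over [0, 1]
```

Unlike density-based divergences, this metric stays finite and informative for non-overlapping supports, which is part of what makes unbiased sample gradients attainable for the training objective.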
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge the gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera-view conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
Updated: 2026-05-05 14:55:01
Subjects: cs.CV,cs.AI
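The Angular Alignment objective reduces to a cosine-similarity loss between normalized features; a numpy sketch (the toy 2-D vectors stand in for the diffusion and geometric-foundation features):

```python
import numpy as np

def angular_alignment_loss(h, g):
    """1 - mean cosine similarity: zero when the model features h point
    in the same directions as the geometric features g."""
    h_n = h / np.linalg.norm(h, axis=-1, keepdims=True)
    g_n = g / np.linalg.norm(g, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(h_n * g_n, axis=-1))

g = np.array([[1.0, 0.0], [0.0, 1.0]])             # target directions
loss_aligned = angular_alignment_loss(g * 5.0, g)  # same directions, any scale
loss_orth    = angular_alignment_loss(g[::-1], g)  # orthogonal directions
```

Note that the loss is invariant to feature magnitude, enforcing directional consistency only; preserving scale-related information is the job of the separate Scale Alignment term.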
Realizable Bayes-Consistency for General Metric Losses
We study strong universal Bayes-consistency in the realizable setting for learning with general metric losses, extending classical characterizations beyond $0$-$1$ classification \citep{bousquet_theory_2021, hanneke2021universalbayesconsistencymetric} and real-valued regression \citep{attias_universal_2024}. Given an instance space $(\mathcal X,\rho)$, a label space $(\mathcal Y,\ell)$ with possibly unbounded loss, and a hypothesis class $\mathcal H \subseteq \mathcal Y^{\mathcal X}$, we resolve the realizable case of an open problem presented in \citet{pmlr-v178-cohen22a}. Specifically, we find the necessary and sufficient conditions on the hypothesis class $\mathcal H$ under which there exists a distribution-free learning rule whose risk converges almost surely to the best-in-class risk (which is zero) for every realizable data-generating distribution. Our main contribution is this sharp characterization in terms of a combinatorial obstruction: similarly to \citet{attias2024optimallearnersrealizableregression}, we introduce the notion of an infinite non-decreasing $(\gamma_k)$-Littlestone tree, where $\gamma_k \to \infty$. This extends the Littlestone tree structure used in \citet{bousquet_theory_2021} to the metric loss setting.
Updated: 2026-05-05 14:50:55
Subjects: cs.LG,cs.IT,math.ST
KVerus: Scalable and Resilient Formal Verification Proof Generation for Rust Code
Formal verification provides the highest assurance of software correctness and security, but its application to large-scale, evolving systems remains a major challenge. While large language models (LLMs) have shown promise in automating proof generation, they often fail in real-world settings due to their inability to handle complex cross-module dependencies or changes in the codebase or the verification toolchain. We identify the fundamental problem as the Semantic-Structural Gap: LLMs operate on semantic code patterns, whereas formal verification is governed by rigid structural dependencies, a disconnect that leads to brittle, unsustainable proofs. To bridge this gap, we propose a new paradigm of self-adaptive verification and present KVerus, a retrieval-augmented system for Verus-based Rust verification that can adapt to an evolving software environment. KVerus constructs a dynamic knowledge base of code metadata, lemma semantics, and toolchain specifics. By combining dependency-aware program analysis, semantic lemma indexing, and error-driven self-refinement, it can navigate intricate cross-file dependencies to synthesize proofs and automatically repair proofs when faced with common evolutionary changes. Across three single-file benchmarks, KVerus verifies 80.2% of tasks, outperforming the state-of-the-art AutoVerus (56.9%) and degrades less than AutoVerus under breaking Verus updates. On three repository-level benchmarks with cross-file dependencies, KVerus achieves a 51.0% success rate, compared to 4.5% for a multi-round prompting baseline. Finally, on the Asterinas Rust OS kernel, KVerus produces upstream-accepted proofs that verify 23 previously unverified functions (21.0% of proof code) in the memory-management module. KVerus represents a significant step towards making formal verification a scalable and sustainable practice for modern, security-critical software.
Updated: 2026-05-05 14:50:24
Subjects: cs.SE,cs.CR
RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models
Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We present RoboAlign-R1, a framework that combines reward-aligned post-training with stabilized long-horizon inference for robot video world models. We construct RobotWorldBench, a benchmark of 10,000 annotated video-instruction pairs collected from four robot data sources, and train a multimodal teacher judge, RoboAlign-Judge, to provide fine-grained six-dimensional evaluation of generated videos. We then distill the teacher into a lightweight student reward model for efficient reinforcement-learning-based post-training. To reduce long-horizon rollout drift, we further introduce Sliding Window Re-encoding (SWR), a training-free inference strategy that periodically refreshes the generation context. Under our in-domain evaluation protocol, RoboAlign-R1 improves the aggregate six-dimension score by 10.1% over the strongest baseline, including gains of 7.5% on Manipulation Accuracy and 4.6% on Instruction Following; these ranking improvements are further supported by an external VLM-based cross-check and a blinded human study. Meanwhile, SWR improves long-horizon prediction quality with only about 1% additional latency, yielding a 2.8% gain in SSIM and a 9.8% reduction in LPIPS. Together, these results show that reward-aligned post-training and stabilized long-horizon decoding improve task consistency, physical realism, and long-horizon prediction quality in robot video world models.
Updated: 2026-05-05 14:49:00
Subjects: cs.RO,cs.AI
Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
Multimodal learning often grapples with the challenge of low-quality data, which predominantly manifests as two facets: modality imbalance and noisy corruption. While these issues are often studied in isolation, we argue that they share a common root in the predictive uncertainty towards the reliability of individual modalities and instances during learning. In this paper, we propose a unified framework, termed Conformal Predictive Self-Calibration (CPSC), which leverages conformal prediction to equip the model with the ability to perform self-guided calibration on-the-fly. The core of our proposed CPSC lies in a novel self-calibrating training loop that seamlessly integrates two key modules: (1) Representation Self-Calibration, which decomposes unimodal features into components, and selectively fuses the most robust ones identified by a conformal predictor to enhance feature resilience. (2) Gradient Self-Calibration, which recalibrates the gradient flow during backpropagation based on instance-wise reliability scores, steering the optimization towards more trustworthy directions. Furthermore, we also devise a self-update strategy for the conformal predictor to ensure the entire system co-evolves consistently throughout the training process. Extensive experiments on six benchmark datasets under both imbalanced and noisy settings demonstrate that our CPSC framework consistently outperforms existing state-of-the-art methods. Our code is available at https://github.com/XunCHN/CPSC.
Updated: 2026-05-05 14:48:52
Subjects: cs.CV,cs.LG,cs.MM
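The conformal predictor at the core of CPSC can be sketched in its simplest split-conformal form (the uniform nonconformity scores below are synthetic stand-ins for whatever per-instance score CPSC computes):

```python
import numpy as np

def conformal_quantile(cal_scores, alpha=0.1):
    """Split conformal prediction: the (1 - alpha) empirical quantile of
    calibration nonconformity scores, with the (n + 1) finite-sample
    correction, so new conforming scores fall below it w.p. >= 1 - alpha."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(cal_scores)[min(k, n) - 1]

rng = np.random.default_rng(0)
cal_scores = rng.uniform(0.0, 1.0, size=999)   # synthetic nonconformity scores

q = conformal_quantile(cal_scores, alpha=0.1)  # ~0.9 for uniform scores
```

Instances (or modality components) whose score exceeds `q` would be flagged as unreliable, which is the kind of instance-wise reliability signal the gradient self-calibration module reweights by.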
Toward Generative Quantum Utility via Correlation-Complexity Map
We study a practical question in generative quantum machine learning: given a classical dataset, can we determine, before training, whether it is well suited to a quantum generative model? We focus on a class of quantum circuits known as instantaneous quantum polynomial-time (IQP) circuits, whose output distributions are widely believed to be difficult to sample from using classical methods. These circuits are used to build our quantum generative models. We introduce a Correlation-Complexity Map, a simple diagnostic built from two quantities computed from data samples. The first measures how closely the dataset's spectral correlation patterns resemble those naturally produced by IQP circuits, while the second quantifies how much of the dataset's structural correlation cannot be captured by simple pairwise models. In other words, we can estimate beforehand how well a dataset can be approximated by our model family and also how complex its correlations are, indicating possible failures of classical models. Applying this framework, we identify turbulence data as a promising target for quantum generative modeling. Guided by this analysis, we use a latent-parameter adaptation scheme that reuses a compact IQP circuit over a temporal sequence by learning and interpolating a low-dimensional latent trajectory, and observe competitive performance against classical baselines in a low-data, low-parameter regime. These results suggest that dataset-level diagnostics can help prioritize problems where quantum generative models are most likely to be useful, with improvements in data and parameter efficiency.
Updated: 2026-05-05 14:45:46
Subjects: cs.LG,quant-ph
The Manokhin Probability Matrix: A Diagnostic Framework for Classifier Probability Quality
The Brier score conflates two distinct properties of probabilistic predictions: reliability (calibration error) and resolution (discriminatory power). We introduce the Manokhin Probability Matrix, a BCG-style two-dimensional diagnostic framework that separates them. Classifiers are placed on a 2x2 grid by Spiegelhalter Z-statistic and AUC-ROC expected rank, then assigned to one of four archetypes: Eagle (good on both axes), Bull (strong discrimination, poor calibration), Sloth (well-calibrated, weak discriminator), and Mole (poor on both). Each archetype carries a distinct prescription. We populate the matrix from a large-scale empirical study spanning 21 classifiers, 5 post-hoc calibrators, and 30 real-world binary classification tasks from the TabArena-v0.1 suite. The assignment is unambiguous. CatBoost, TabICL, EBM, TabPFN, GBC, and Random Forest are Eagles. XGBoost, LightGBM, and HGB are Bulls; Venn-Abers calibration cuts log-loss by 6.5 to 12.6% on Bulls but degrades Eagles by 2.1%. SVM, LR, LDA, and the empirical base-rate predictor are Sloths. MLP, KNN, Naive Bayes, and ExtraTrees are Moles. A theoretical asymmetry follows: no order-preserving post-hoc calibrator can add discriminatory power (Proposition 1), so calibration is the fixable part and discrimination is the hard part. The practical rule is direct: do not optimise aggregate Brier score without first decomposing it; optimise discrimination first, then fix calibration post-hoc. Code and raw experimental data are available at https://github.com/valeman/classifier_calibration.
Updated: 2026-05-05 14:44:47
Subjects: stat.ML,cs.LG
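Both axes of the matrix can be computed in a few lines; a numpy sketch of the placement rule (the `auc_cut` threshold is illustrative, and the paper ranks AUC-ROC by expected rank rather than thresholding it):

```python
import numpy as np

def spiegelhalter_z(y, p):
    """Spiegelhalter's Z statistic: ~N(0, 1) for a calibrated forecaster."""
    num = np.sum((y - p) * (1 - 2 * p))
    den = np.sqrt(np.sum((1 - 2 * p) ** 2 * p * (1 - p)))
    return num / den

def auc_roc(y, p):
    """AUC-ROC via the rank-sum (Mann-Whitney) identity (no ties assumed)."""
    ranks = np.empty(len(p))
    ranks[np.argsort(p)] = np.arange(1, len(p) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def archetype(y, p, z_crit=1.96, auc_cut=0.7):
    calibrated = abs(spiegelhalter_z(y, p)) < z_crit
    discriminative = auc_roc(y, p) > auc_cut
    return {(True, True): "Eagle", (False, True): "Bull",
            (True, False): "Sloth", (False, False): "Mole"}[
                (bool(calibrated), bool(discriminative))]

rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, size=2000)
y = (rng.uniform(size=2000) < p).astype(float)  # labels drawn from p itself

label = archetype(y, p)  # a draw like this lands in the discriminative column
```

Because `y` is sampled from `p`, the forecaster is calibrated by construction and discriminates well, so this synthetic classifier sits on the Eagle/Bull side of the grid.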
LangPrecip: Language-Aware Multimodal Precipitation Nowcasting
Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework (LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space. We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60% and 19% gains in heavy-rainfall CSI at an 80-minute lead time.
Updated: 2026-05-05 14:40:12
Subjects: cs.LG,cs.AI,cs.CV
GPUBreach: Privilege Escalation Attacks on GPUs using Rowhammer
NVIDIA GPUs with GDDR memories have been shown susceptible to Rowhammer-based bit-flips, similar to CPUs. However, Rowhammer exploits on GPUs have been limited to injecting untargeted bit-flips in victim data like weights of machine learning models, to degrade model accuracy, unlike CPU exploits shown capable of privilege escalation. In this paper, we demonstrate that GPU Rowhammer exploits can be as potent as CPU Rowhammer attacks. By exploiting the GPU page table management to identify when and where new page tables are allocated, we enable an unprivileged user CUDA kernel of one process to use Rowhammer bit-flips to gain access to the GPU memory of other processes or co-tenants via targeted tampering of such page tables resident on the GPU memory. Using this newly found primitive, we demonstrate the first GPU-side privilege escalation attacks, leaking secret data such as cryptographic keys from cuPQC libraries, and even tampering with the model's GPU assembly code to degrade models more stealthily than previous attacks. We further demonstrate that GPU-side privilege escalation can lead to CPU-side privilege escalation, defeating the protections provided by the IOMMU, enabling a malicious user-level program with GPU access to gain root shell and system-wide control, even in a non-multi-tenant setting.
Updated: 2026-05-05 14:40:06
Subjects: cs.CR
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, current ADS systems use statistical tools designed to be interpretable by humans, rather than interpretable by agents. To address this, we introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents. Specifically, it develops a library of scikit-learn-compatible regressors for tabular data that are optimized for both predictive performance and a novel LLM-based interpretability metric. The metric measures a suite of LLM-graded tests that probe whether a fitted model's string representation is "simulatable" by an LLM, i.e. whether the LLM can answer questions about the model's behavior by reading its string output alone. We find that the evolved models jointly improve predictive performance and agent-facing interpretability, generalizing to new datasets and new interpretability tests. Furthermore, these evolved models improve downstream end-to-end ADS, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%.
Updated: 2026-05-05 14:35:47
Subjects: cs.AI,cs.CL,cs.LG
Scaling Laws and Symmetry, Evidence from Neural Force Fields
We present an empirical study in the geometric task of learning interatomic potentials, which shows equivariance matters even more at larger scales; we show a clear power-law scaling behaviour with respect to data, parameters and compute with "architecture-dependent exponents". In particular, we observe that equivariant architectures, which leverage task symmetry, scale better than non-equivariant models. Moreover, among equivariant architectures, higher-order representations translate to better scaling exponents. Our analysis also suggests that for compute-optimal training, the data and model sizes should scale in tandem regardless of the architecture. At a high level, these results suggest that, contrary to common belief, we should not leave it to the model to discover fundamental inductive biases such as symmetry, especially as we scale, because they change the inherent difficulty of the task and its scaling laws.
Updated: 2026-05-05 14:34:08
Subjects: cs.LG,cs.AI,physics.comp-ph
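A scaling exponent of the kind compared here is typically recovered by a log-log fit; a numpy sketch on synthetic, noise-free losses (the 0.31 exponent and the constants are made up for illustration):

```python
import numpy as np

# Assume loss follows a power law L(N) = a * N**(-b); in log-log space this
# is a line, so the architecture-dependent exponent b is just the slope.
N = np.array([1e3, 1e4, 1e5, 1e6])      # e.g. training-set sizes
L = 2.5 * N ** (-0.31)                  # synthetic losses with exponent 0.31

slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
b_hat = -slope                          # recovered scaling exponent
a_hat = np.exp(intercept)               # recovered prefactor
```

Comparing `b_hat` across equivariant and non-equivariant models, as the paper does, is then a direct comparison of how quickly each architecture benefits from more data, parameters, or compute.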
Learning the APT Kill Chain: Temporal Reasoning over Provenance Data for Attack Stage Estimation
Advanced Persistent Threats (APTs) evolve through multiple stages, each exhibiting distinct temporal and structural behaviors. Accurate stage estimation is critical for enabling adaptive cyber defense. This paper presents StageFinder, a temporal-graph learning framework for multi-stage attack progression inference from fused host and network provenance data. Provenance graphs are encoded using a graph neural network to capture structural dependencies among processes, files, and connections, while a long short-term memory (LSTM) model learns temporal dynamics to estimate stage probabilities aligned with the MITRE ATT&CK framework. The model is pretrained on the DARPA OpTC dataset and fine-tuned on labeled DARPA Transparent Computing data. Experimental results demonstrate that StageFinder achieves a macro F1-score of 0.96 and reduces prediction volatility by 31% compared to state-of-the-art baselines (Cyberian, NetGuardian). These results highlight the effectiveness of fused provenance-temporal learning for accurate and stable APT stage inference.
Updated: 2026-05-05 14:33:31
Subjects: cs.CR,cs.NI
Pseudo-differential-enhanced physics-informed neural networks
We present pseudo-differential-enhanced physics-informed neural networks (PINNs), an extension of gradient enhancement, but in Fourier space. Gradient enhancement of PINNs dictates that the PDE residual is taken to a higher differential order than prescribed by the PDE and added to the objective as an augmented term in order to improve training and overall learning fidelity. We propose the same procedure after application of Fourier transforms, since differentiating in Fourier space is multiplication with the Fourier wavenumber under suitable decay. Our methods are fast and efficient. They often achieve superior PINN-versus-numerical error in fewer training iterations, pair well with few collocation samples, and can on occasion break plateaus in low-collocation settings. Moreover, our methods are suitable for fractional derivatives. We establish that, due to their dynamical effects, our methods improve the spectral eigenvalue decay of the neural tangent kernel (NTK), and so they contribute to the learning of high frequencies in early training, mitigating the effects of frequency bias up to the polynomial order, and possibly beyond with smooth activations. Our methods accommodate advanced techniques in PINNs, such as Fourier feature embeddings. A pitfall of discrete Fourier transforms via the Fast Fourier Transform (FFT) is mesh subjugation, so we demonstrate compatibility of our methods with greater mesh flexibility and invariance on alternative Euclidean and non-Euclidean domains via Monte Carlo methods and otherwise.
Updated: 2026-05-05 14:31:09
Subjects: cs.LG,math.NA
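The core identity, that differentiating in Fourier space is multiplication with the wavenumber, is a two-liner on a periodic uniform grid; a numpy sketch (a fractional derivative of order alpha would replace `1j * k` with `(1j * k) ** alpha`):

```python
import numpy as np

n = 128
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
f = np.sin(3.0 * x)

# Angular wavenumbers matching numpy's FFT ordering.
k = 2.0 * np.pi * np.fft.fftfreq(n, d=x[1] - x[0])

# Spectral derivative: transform, multiply by i*k, transform back.
df = np.real(np.fft.ifft(1j * k * np.fft.fft(f)))

err = np.max(np.abs(df - 3.0 * np.cos(3.0 * x)))  # exact answer: 3 cos(3x)
```

For a band-limited function the result is accurate to machine precision, which is what makes the Fourier-space augmented residual cheap to evaluate; the FFT's uniform-grid requirement is exactly the mesh limitation the last sentence of the abstract addresses.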
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
Long-term personalized memory for LLM agents is challenging on resource-limited edge devices due to high storage costs and multimodal complexity. To address this, we propose ScrapMem, a framework that integrates multimodal data into "Scrapbook Page." ScrapMem introduces Optical Forgetting, an optical compression mechanism that progressively reduces the resolution of older memories, lowering storage cost while suppressing low-value details. To maintain semantic consistency, we construct an Episodic Memory Graph (EM-Graph) that organizes key events into a causal-temporal structure. Extensive experiments on the multimodal ATM-Bench showcase that ScrapMem provides three main benefits: (1) strong performance, achieving a new state-of-the-art with a 51.0% Joint@10 score; (2) high storage efficiency, reducing memory usage by up to 93% via optical forgetting; and (3) improved recall, increasing Recall@10 to 70.3% through structured aggregation. ScrapMem offers an effective and storage-efficient solution for on-device long-term memory in multimodal LLM agents.
Updated: 2026-05-05 14:30:30
Subjects: cs.AI
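The abstract does not specify the compression schedule, but the idea of Optical Forgetting (progressively lowering the resolution of older memories) can be illustrated with block-averaging; the age-to-factor rule below is purely hypothetical:

```python
import numpy as np

def optical_forget(page, age, base=2):
    """Hypothetical optical forgetting: reduce a memory page's resolution
    by block-averaging, with the downsampling factor growing with age."""
    factor = base ** age
    h, w = page.shape
    page = page[: h - h % factor, : w - w % factor]   # crop to a multiple
    return page.reshape(h // factor, factor,
                        w // factor, factor).mean(axis=(1, 3))

page  = np.arange(64, dtype=float).reshape(8, 8)  # a toy "scrapbook page"
fresh = optical_forget(page, age=0)               # full 8x8 resolution
old   = optical_forget(page, age=2)               # 2x2: 16x fewer values
```

Gist-level content (here, the block means) survives while fine detail is discarded, trading recall fidelity for storage in the spirit of the 93% reduction reported above.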
Sparse Data Tree Canopy Segmentation: Fine-Tuning Leading Pretrained Models on Only 150 Images
Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real-life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet, and DINOv2 underperform, likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.
Updated: 2026-05-05 14:29:56
标题: 稀疏数据树冠分割:仅在150张图像上对领先的预训练模型进行微调
摘要: 从航空影像中检测树冠是环境监测、城市规划和生态系统分析中的重要任务。为了模拟现实中数据标注的稀缺性,Solafune Tree Canopy Detection竞赛提供了一个仅有150张标注图像的小型且不平衡的数据集,为在不发生严重过拟合的情况下训练深度模型带来了重大挑战。在这项工作中,我们评估了五种代表性架构(YOLOv11、Mask R-CNN、DeepLabv3、Swin-UNet和DINOv2),以评估它们在极端数据稀缺情况下用于树冠分割的适用性。我们的实验表明,预训练的基于卷积的模型(特别是YOLOv11和Mask R-CNN)比预训练的基于Transformer的模型具有明显更好的泛化能力。DeepLabv3、Swin-UNet和DINOv2表现不佳,可能是由于语义分割与实例分割任务之间的差异、视觉Transformer的高数据需求以及缺乏强归纳偏差。这些发现证实了在低数据环境中,基于Transformer的架构在没有大量预训练或数据增强的情况下会遇到困难,并且语义分割与实例分割之间的差异进一步影响了模型性能。我们对小数据约束下的训练策略、数据增强策略和模型行为进行了详细分析,并证明轻量级的基于CNN的方法在有限影像上仍是最可靠的树冠检测方法。
更新时间: 2026-05-05 14:29:56
领域: cs.CV,cs.AI
Towards accurate extreme event likelihoods from diffusion model climate emulators
ML climate model emulators are useful for scenario planning and adaptation, allowing for cost-efficient experimentation. Recently, the diffusion model Climate in a Bottle (cBottle) has been proposed for generation of atmospheric states compatible with boundary conditions of solar position and sea surface temperatures. Crucially, cBottle can be guided to generate extreme events such as Tropical Cyclones (TCs) over locations of interest. Diffusion models such as cBottle work by approximating the probability density of the training data. Here, we show use cases of the probability density estimates of atmospheric states obtained from this climate emulator. Most importantly, these estimates allow us to calculate likelihoods of extreme events under guidance. When guiding the model towards states including TCs, comparing the probability density under the guided and unguided model enables us to quantify how much more likely the guidance has made the TC. We show how these odds ratios allow us to importance-sample from the TC distribution, reducing the standard error of the probability estimate compared to simple Monte Carlo sampling. Furthermore, we discuss results and limitations of the application of model probability densities to extreme event attribution-like experiments. We present these early but encouraging results hoping they will spur more research into probabilistic information that can be gained from diffusion models of the atmosphere.
Updated: 2026-05-05 14:28:20
标题: 朝向准确的扩散模型气候模拟器极端事件概率
摘要: 机器学习气候模型模拟器对于情景规划和适应性非常有用,可以进行成本效益高的实验。最近,扩散模型Climate in a Bottle(cBottle)被提出用于生成与太阳位置和海表温度边界条件相容的大气状态。至关重要的是,cBottle可以被引导在感兴趣的位置上生成极端事件,如热带气旋(TCs)。cBottle等扩散模型通过近似训练数据的概率密度来工作。在这里,我们展示了从这种气候模拟器获得的大气状态概率密度估计的用例。最重要的是,这些估计允许我们计算引导下极端事件的可能性。当引导模型朝向包含TCs的状态时,比较引导和未引导模型下的概率密度可以让我们量化引导使TC变得更有可能的程度。我们展示了这些赔率比如何让我们从TC分布中进行重要性采样,与简单的蒙特卡洛采样相比降低了概率估计的标准误差。此外,我们讨论了将模型概率密度应用于类似极端事件归因实验的结果和局限性。我们呈现这些早期但令人鼓舞的结果,希望它们能激发更多关于可从大气扩散模型中获得的概率信息的研究。
更新时间: 2026-05-05 14:28:20
领域: physics.ao-ph,cs.LG
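The odds-ratio importance-sampling idea in the cBottle abstract above can be illustrated with a toy example (a sketch with Gaussian stand-ins for the unguided density p and the "guided" proposal q, not the cBottle pipeline): draw rare-event samples from q, reweight each by p(x)/q(x), and compare the standard error against simple Monte Carlo from p.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu, sigma):
    # Log density of N(mu, sigma^2).
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

n = 100_000
threshold = 3.0  # "extreme event": X > 3 under the unguided model p = N(0, 1)

# Simple Monte Carlo from the unguided model: the event is rare, so the
# estimate is noisy.
x_p = rng.normal(0.0, 1.0, n)
hits = (x_p > threshold).astype(float)
mc_est, mc_se = hits.mean(), hits.std() / np.sqrt(n)

# Importance sampling from a "guided" proposal q = N(3, 1) concentrated on
# the event region, reweighting each draw by the odds ratio p(x) / q(x).
x_q = rng.normal(3.0, 1.0, n)
log_w = log_normal_pdf(x_q, 0.0, 1.0) - log_normal_pdf(x_q, 3.0, 1.0)
terms = np.exp(log_w) * (x_q > threshold)
is_est, is_se = terms.mean(), terms.std() / np.sqrt(n)

true_p = 0.001349898  # 1 - Phi(3), for reference
```

Under these assumptions the importance-sampling standard error is roughly an order of magnitude below the plain Monte Carlo one, mirroring the variance reduction the abstract reports.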
Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration
Q-value iteration (Q-VI) is usually analyzed through the \(\gamma\)-contraction of the Bellman operator. This argument proves convergence to \(Q^*\), but it gives only a coarse account of when the induced greedy policy becomes optimal. We study discounted Q-VI as a switching system and focus on the practically optimal solution set (POSS), the set of \(Q\)-functions whose tie-broken greedy policies are optimal. The main result shows that Q-VI reaches the optimal action class in finite time by entering an invariant tube around \(\mathcal X_1=Q^*+\operatorname{span}(\mathbf 1)\), which is contained in the POSS. For every \(\varepsilon>0\), the distance to \(\mathcal X_1\) satisfies an exponential bound with rate \((\bar\rho+\varepsilon)^k\), where \(\bar\rho\) is the joint spectral radius of the projected switching family restricted to directions transverse to \(\mathcal X_1\). When \(\bar\rho<\gamma\), this transverse convergence is faster than the classical contraction rate. The analysis separates fast policy identification from the subsequent convergence to \(Q^*\), which may still be governed by the all-ones mode. We also give spectral and graph-theoretic conditions under which the strict inequality \(\bar\rho<\gamma\) holds or fails.
Updated: 2026-05-05 14:27:41
标题: 超越贝尔曼不动点:值迭代中的几何与快速策略识别
摘要: Q值迭代(Q-VI)通常通过贝尔曼算子的\(\gamma\)-收缩进行分析。这一论证证明了向\(Q^*\)的收敛,但仅对诱导的贪婪策略何时变得最优提供了粗略的说明。我们将折扣Q-VI视为一个切换系统,并关注实际最优解集(POSS),即那些经过打破平局后的贪婪策略是最优的\(Q\)函数的集合。主要结果表明,Q-VI通过进入围绕\(\mathcal X_1=Q^*+\operatorname{span}(\mathbf 1)\)的不变管道(该管道包含在POSS中),在有限时间内达到最优动作类别。对于每个\(\varepsilon>0\),到\(\mathcal X_1\)的距离满足速率为\((\bar\rho+\varepsilon)^k\)的指数界,其中\(\bar\rho\)是限制在与\(\mathcal X_1\)横截方向上的投影切换族的联合谱半径。当\(\bar\rho<\gamma\)时,这种横向收敛比经典的收缩速率更快。该分析将快速策略识别与随后向\(Q^*\)的收敛分开,后者可能仍由全1模式控制。我们还给出了使严格不等式\(\bar\rho<\gamma\)成立或不成立的谱条件和图论条件。
更新时间: 2026-05-05 14:27:41
领域: math.OC,cs.AI,eess.SY
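The separation the abstract above draws, greedy-policy identification long before value convergence, can be reproduced on a toy MDP (a hypothetical 3-state chain of our own construction, not from the paper): at \(\gamma = 0.99\), the greedy policy of Q-VI stabilizes at the optimum within a couple of iterations, while \(Q_k\) itself needs on the order of a thousand iterations to approach \(Q^*\).

```python
import numpy as np

gamma = 0.99
n_states, n_actions = 3, 2
# Deterministic 3-state chain. In s0, action 0 pays reward 1 but leads to
# the zero-reward trap s2; action 1 pays 0 but leads to s1, which pays 2
# forever. The myopic greedy choice at s0 is therefore initially wrong.
next_state = np.array([[2, 1], [1, 1], [2, 2]])
reward = np.array([[1.0, 0.0], [2.0, 2.0], [0.0, 0.0]])

def q_vi(n_iters):
    Q = np.zeros((n_states, n_actions))
    history = []
    for _ in range(n_iters):
        # Q_{k+1}(s, a) = r(s, a) + gamma * max_a' Q_k(s', a')
        Q = reward + gamma * Q.max(axis=1)[next_state]
        history.append(Q.copy())
    return history

history = q_vi(3000)
Q_star = history[-1]                 # numerically converged reference
pi_star = Q_star.argmax(axis=1)

# First iteration at which the greedy policy is already optimal ...
policy_hit = next(k for k, Q in enumerate(history)
                  if (Q.argmax(axis=1) == pi_star).all())
# ... versus first iteration at which Q itself is within 1e-6 of Q*.
value_hit = next(k for k, Q in enumerate(history)
                 if np.abs(Q - Q_star).max() < 1e-6)
```

On this chain the greedy policy locks onto the optimal action class after the second sweep, while value convergence at tolerance 1e-6 takes nearly two thousand sweeps, the qualitative gap the paper's transverse-rate analysis quantifies.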
Client-Conditional Federated Learning via Local Training Data Statistics
Federated learning (FL) under data heterogeneity remains challenging: existing methods either ignore client differences (FedAvg), require costly cluster discovery (IFCA), or maintain per-client models (Ditto). All degrade when data is sparse or heterogeneity is multi-dimensional. We propose conditioning a single global model on locally-computed PCA statistics of each client's training data, requiring zero additional communication. Evaluating across 97 configurations spanning four heterogeneity types (label shift, covariate shift, concept shift, and combined heterogeneity), four datasets (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100), and seven FL baseline methods, we find that our method matches the Oracle baseline -- which knows true cluster assignments -- across all settings, surpasses it by 1--6% on combined heterogeneity where continuous statistics are richer than discrete cluster identifiers, and is uniquely sparsity-robust among all tested methods.
Updated: 2026-05-05 14:27:31
标题: 基于客户端条件的联邦学习:通过本地训练数据统计进行
摘要: 在数据异质性下,联邦学习(FL)仍然具有挑战性:现有方法要么忽略客户端之间的差异(FedAvg),要么需要昂贵的集群发现(IFCA),要么为每个客户端维护单独的模型(Ditto)。当数据稀疏或异质性是多维的时,所有这些方法都会退化。我们提出在每个客户端本地计算其训练数据的PCA统计信息,并以此对单个全局模型进行条件化,无需任何额外通信。我们在涵盖四种异质性类型(标签偏移、协变量偏移、概念偏移和组合异质性)、四个数据集(MNIST、Fashion-MNIST、CIFAR-10、CIFAR-100)和七种FL基线方法的97个配置上进行评估,发现我们的方法在所有设置中与知晓真实集群分配的Oracle基线相匹配,在连续统计信息比离散集群标识符更丰富的组合异质性设置中超过Oracle基线1-6%,并且是所有测试方法中唯一对稀疏性保持鲁棒的方法。
更新时间: 2026-05-05 14:27:31
领域: cs.LG
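The conditioning signal the abstract above describes can be sketched in a few lines (a hypothetical construction consistent with the description, not the paper's exact statistics): each client locally summarizes its training data by simple PCA statistics, here the feature mean plus the top eigenvalues of the feature covariance, and that vector conditions the shared global model with zero added communication.

```python
import numpy as np

def client_condition_vector(X, k=3):
    """Locally computed PCA summary of a client's training data:
    feature mean concatenated with the top-k covariance eigenvalues."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    top_eigvals = np.linalg.eigvalsh(cov)[::-1][:k]  # eigvalsh is ascending
    return np.concatenate([mu, top_eigvals])

rng = np.random.default_rng(0)
# Two clients under covariate shift: same feature dimension, different
# input distributions.
client_a = rng.normal(0.0, 1.0, size=(500, 4))
client_b = rng.normal(2.0, 3.0, size=(500, 4))

v_a = client_condition_vector(client_a)
v_b = client_condition_vector(client_b)
```

In the actual method such a vector would condition the global model's forward pass; here it merely demonstrates that purely local statistics separate heterogeneous clients.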
AI Advocate: Educational Path to Transform Squads to the Future
This paper analyzes the strategic education process aimed at transitioning traditional software development squads into hybrid structures centered on collaborative work between humans and Artificial Intelligence (AI). In a context where human-AI collaboration can significantly increase productivity, this study explores how the upskilling of XPTO professionals, referred to as AI Advocates, acts as a catalyst for cultural and technical transformation. The objective is to present an experience report on the education and enablement process of AI Advocates within a private Brazilian technology company, highlighting key lessons learned and identified challenges.
Updated: 2026-05-05 14:26:43
标题: AI倡导者:将小组转变为未来的教育路径
摘要: 本文分析了旨在将传统软件开发小组转变为以人类和人工智能(AI)之间协作工作为中心的混合结构的战略教育过程。在一个人工智能与人类协作可以显著提高生产力的背景下,本研究探讨了所谓的AI倡导者(XPTO专业人士)的技能提升如何作为文化和技术转型的催化剂。其目标是介绍巴西私营科技公司内AI倡导者的教育和能力提升过程的经验报告,突出关键经验教训和识别的挑战。
更新时间: 2026-05-05 14:26:43
领域: cs.SE,cs.AI
Improving Graph Neural Network Training, Defense, Spectral Hypergraph Clustering and Multiview Spectral Clustering via Adversarial Robustness Evaluation
Graph Neural Networks (GNNs) are a highly effective neural network architecture for processing graph-structured data. Unlike traditional neural networks that rely solely on the features of the data as input, GNNs leverage both the graph structure, which represents the relationships between data points, and the feature matrix of the data to optimize their feature representation. This unique capability enables GNNs to achieve superior performance across various tasks. However, it also makes GNNs more susceptible to noise and adversarial attacks from both the graph structure and data features, which can significantly increase the training difficulty and degrade their performance. Similarly, a hypergraph is a highly complex structure, and partitioning a hypergraph is a challenging task. This paper leverages spectral adversarial robustness evaluation to effectively address key challenges in complex-graph algorithms. By using spectral adversarial robustness evaluation to distinguish robust nodes from non-robust ones and treating them differently, we propose a training-set construction strategy that improves the training quality of GNNs. In addition, we develop algorithms to enhance both the adversarial robustness of GNNs and the performance of hypergraph clustering. Experimental results show that this series of methods is highly effective.
Updated: 2026-05-05 14:26:31
标题: 通过对抗性鲁棒性评估改进图神经网络训练、防御、谱超图聚类和多视图谱聚类
摘要: 图神经网络(GNNs)是一种高效的神经网络架构,用于处理图结构化数据。与传统神经网络仅依赖数据特征作为输入不同,GNNs利用图结构(代表数据点之间的关系)和数据的特征矩阵来优化其特征表示。这种独特的能力使GNNs能够在各种任务中实现卓越的性能。然而,这也使GNNs更容易受到来自图结构和数据特征的噪声和对抗性攻击的影响,这会显著增加训练难度并降低性能。同样,超图是一个高度复杂的结构,对超图进行划分是一个具有挑战性的任务。本文利用谱对抗鲁棒性评估有效地解决了复杂图算法中的关键挑战。通过使用谱对抗鲁棒性评估来区分稳健节点和非稳健节点,并对它们进行不同处理,我们提出了一种改善GNNs训练质量的训练集构建策略。此外,我们开发了算法来增强GNNs的对抗鲁棒性和超图聚类的性能。实验结果表明,这一系列方法非常有效。
更新时间: 2026-05-05 14:26:31
领域: cs.LG
Graph Convolutional Support Vector Regression for Robust Spatiotemporal Forecasting of Urban Air Pollution
Urban air quality forecasting is challenging because pollutant concentrations are nonlinear, nonstationary, spatiotemporally dependent, and often affected by anomalous observations caused by traffic congestion, industrial emissions, and seasonal meteorological variability. This study proposes a Graph Convolutional Support Vector Regression (GCSVR) framework for robust spatiotemporal forecasting of urban air pollution. The model combines graph convolutional learning to capture inter-station spatial dependence with support vector regression to model nonlinear temporal dynamics while reducing sensitivity to outlier observations. The proposed framework is evaluated using air quality records from 37 monitoring stations in Delhi and 18 stations in Mumbai, representing inland and coastal metropolitan environments in India. Forecasting performance is assessed across multiple horizons and compared with established temporal and spatiotemporal benchmarks. The results show that GCSVR consistently improves predictive accuracy and maintains stable performance across seasons and outlier-prone pollution episodes. Statistical test further confirms the reliability of the proposed approach across the two cities. Finally, conformal prediction is integrated with GCSVR to generate calibrated prediction intervals, enhancing its practical value for uncertainty-aware air quality monitoring and public health decision-making.
Updated: 2026-05-05 14:24:49
标题: 图卷积支持向量回归用于城市空气污染鲁棒时空预测
摘要: 城市空气质量预测具有挑战性,因为污染物浓度是非线性的、非平稳的、时空相关的,并且往往受到交通拥堵、工业排放和季节性气象变化引起的异常观测的影响。本研究提出了一个图卷积支持向量回归(GCSVR)框架,用于城市空气污染的稳健时空预测。该模型结合图卷积学习以捕捉站点间的空间依赖性,并用支持向量回归建模非线性时间动态,同时降低对离群观测的敏感性。所提出的框架使用来自德里37个监测站和孟买18个监测站的空气质量记录进行评估,代表了印度内陆和沿海的大都市环境。我们在多个预测时间范围上评估预测性能,并与已有的时间和时空基准方法进行比较。结果表明,GCSVR始终提高了预测准确性,并在不同季节和易出现离群值的污染事件中保持稳定的性能。统计检验进一步确认了所提出方法在两个城市中的可靠性。最后,将一致性预测(conformal prediction)与GCSVR集成,生成经过校准的预测区间,增强了其在不确定性感知的空气质量监测和公共卫生决策中的实用价值。
更新时间: 2026-05-05 14:24:49
领域: cs.LG,stat.AP,stat.ML
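The final step of the abstract above, wrapping a point forecaster in calibrated conformal prediction intervals, is generic enough to sketch. The snippet below uses split conformal prediction around a placeholder persistence forecaster on synthetic pollutant-like data; the paper's GCSVR point model and its exact conformal variant are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pollutant series; any point forecaster works, here a naive
# "predict the previous value" stand-in for the GCSVR point model.
y = 50 + 10 * np.sin(np.arange(1200) / 24) + rng.normal(0, 3, 1200)
pred = np.roll(y, 1)          # persistence forecast
y, pred = y[1:], pred[1:]

# Split conformal: calibrate a residual quantile on held-out data, then
# wrap every new forecast in a fixed-width interval at level 1 - alpha.
alpha = 0.1
calib_res = np.abs(y[:800] - pred[:800])
n = len(calib_res)
q = np.quantile(calib_res, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))

lo, hi = pred[800:] - q, pred[800:] + q
coverage = np.mean((y[800:] >= lo) & (y[800:] <= hi))
```

With exchangeable residuals the interval is guaranteed close to its nominal 90% coverage regardless of how good the underlying point model is; a stronger forecaster simply yields narrower intervals.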
Different Strokes for Different Folks: Writer Identification for Historical Arabic Manuscripts
Handwritten Arabic manuscripts preserve the Arab world's intellectual and cultural heritage, and writer identification supports provenance, authenticity verification, and historical analysis. Using the Muharaf dataset of historical Arabic manuscripts, we evaluate writer identification from individual line images and, to the best of our knowledge, provide the first baselines reported under both line-level and page-disjoint evaluation protocols. Since the dataset is only partially labeled for writer identification, we manually verified and expanded writer labels in the public portion from 6,858 (28.00%) to 21,249 lines (86.75%) out of 24,495 line images, correcting inconsistencies and removing non-handwritten text. After further filtering, we retained 18,987 lines (77.51%). We propose a Convolutional Neural Network (CNN)-based model with attention mechanisms for closed-set writer identification, including rare two-writer lines modeled as composite writer-pair classes. We benchmark fourteen configurations and conduct ablations across different feature extractors and training regimes. To assess generalization to unseen pages, the page-disjoint protocol assigns all lines from each page to a single split. Under the line-level protocol, a fine-tuned DenseNet201 with attention achieves 99.05% Top-1 accuracy, 99.73% Top-5 accuracy, and 97.44% F1-score. Under the more challenging page-disjoint protocol, the best observed results are 78.61% Top-1 accuracy, 87.79% Top-5 accuracy, and 66.55% F1-score, thus quantifying the impact of page-level cues. By expanding the Muharaf dataset's labeled subset and reporting both protocols, we provide a clearer benchmark and a practical resource for historians and linguists engaged with culturally and historically significant documents. The code and implementation details are available on GitHub.
Updated: 2026-05-05 14:17:39
标题: 不同的笔法适用于不同的人:历史阿拉伯手稿的作者识别
摘要: 手写阿拉伯文手稿保存了阿拉伯世界的知识和文化遗产,作家识别支持出处、真实性验证和历史分析。利用历史阿拉伯手稿数据集Muharaf,我们评估了从单独行图像进行的作家识别,并据我们所知,提供了第一个基线报告,涵盖了行级和页面不交叉评估协议。由于数据集仅部分标记了作家身份,我们手动验证并扩展了公开部分的作家标签,从24,495行图像中的6,858行(28.00%)扩展到21,249行(86.75%),纠正了不一致之处并删除了非手写文本。经进一步筛选,我们保留了18,987行(77.51%)。我们提出了基于卷积神经网络(CNN)的带有注意机制的模型,用于封闭集作家识别,包括罕见的两位作家行作为复合作家对类建模。我们对十四种配置进行基准测试,并在不同的特征提取器和训练方案上进行消融实验。为了评估对未见页面的泛化能力,页面不交叉协议将每个页面的所有行分配到一个单独的分割。根据行级协议,经过微调的DenseNet201与注意力实现了99.05%的Top-1准确率,99.73%的Top-5准确率和97.44%的F1分数。在更具挑战性的页面不交叉协议下,最佳结果为78.61%的Top-1准确率,87.79%的Top-5准确率和66.55%的F1分数,从而量化了页面级别线索的影响。通过扩展Muharaf数据集的标记子集并报告两种协议,我们为从事与文化和历史重要文档相关的历史学家和语言学家提供了更清晰的基准和实用资源。代码和实现细节可在GitHub上找到。
更新时间: 2026-05-05 14:17:39
领域: cs.CV,cs.LG
Training-Free Probabilistic Time-Series Forecasting with Conformal Seasonal Pools
We propose Conformal Seasonal Pools (CSP), a training-free probabilistic time-series forecaster that mixes same-season empirical draws with signed residual draws around a seasonal naive forecast. In an audited rolling-origin benchmark on the six time-series datasets where DeepNPTS was originally evaluated (electricity, exchange_rate, solar_energy, taxi, traffic, wikipedia), CSP-Adaptive significantly outperforms DeepNPTS on every metric we report -- CRPS (per-window paired Wilcoxon $p \approx 4 \times 10^{-10}$), normalized mean quantile loss ($p \approx 7 \times 10^{-10}$), and empirical 95% coverage ($p \approx 8 \times 10^{-45}$, mean 0.89 vs 0.66) -- while running over 500x faster on CPU. Coverage is the most decision-critical of these: a 0.95 nominal interval that contains the truth in only ~66% of cases fails the basic calibration desideratum and would not survive deployment in safety- or decision-critical settings. The failure mode is also more severe than aggregate coverage suggests: in the worst 10% of windows, DeepNPTS's prediction interval covers none of the H forecast horizons -- the entire multi-step trajectory misses the truth at every step simultaneously. This poses serious risk in safety- and decision-critical applications such as healthcare, finance, energy operations, and autonomous systems, where prediction intervals that systematically miss the truth across the entire planning horizon translate directly into misclassified patients, regulatory capital failures, grid imbalances, and safety-case violations. CSP achieves all of this with no learned parameters and no training. We argue training-free conformal samplers should be mandatory baselines when evaluating learned non-parametric forecasters.
Updated: 2026-05-05 14:16:35
标题: 无需训练的概率时间序列预测:基于一致性季节池(Conformal Seasonal Pools)的方法
摘要: 我们提出了Conformal Seasonal Pools(CSP),一种无需训练的概率时间序列预测器,它将同季节的经验抽样与围绕季节性朴素预测的带符号残差抽样混合在一起。在最初评估DeepNPTS的六个时间序列数据集(electricity、exchange_rate、solar_energy、taxi、traffic、wikipedia)上进行的经过审核的滚动起点基准测试中,CSP-Adaptive在我们报告的每个指标上都显著优于DeepNPTS,包括CRPS(逐窗口配对Wilcoxon检验 $p \approx 4 \times 10^{-10}$)、标准化平均分位数损失($p \approx 7 \times 10^{-10}$)和经验95%覆盖率($p \approx 8 \times 10^{-45}$,平均值0.89对0.66),同时在CPU上的运行速度快500倍以上。其中覆盖率是对决策最关键的指标:一个仅在约66%的情况下包含真实值的0.95名义区间未能满足基本的校准要求,在安全或决策关键场景中无法通过部署考验。失败模式也比总体覆盖率所显示的更严重:在最差的10%窗口中,DeepNPTS的预测区间未覆盖H个预测步中的任何一个,整个多步轨迹在每一步上同时偏离真实值。这在医疗保健、金融、能源运营和自主系统等安全和决策关键应用中构成严重风险,在这些应用中,在整个规划范围内系统性偏离真实值的预测区间会直接转化为患者错误分类、监管资本失败、电网不平衡和安全论证违规。CSP在没有可学习参数、无需训练的情况下实现了所有这些。我们认为,在评估经过学习的非参数预测器时,无需训练的一致性抽样器应当作为强制性基线。
更新时间: 2026-05-05 14:16:35
领域: stat.ML,cs.LG
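The forecaster in the abstract above is simple enough to sketch from its one-line description (a plausible reading, with the pool construction and mixing weight as our own assumptions, not the authors' exact procedure): each forecast draw is either a same-season historical value, or the seasonal-naive forecast plus a randomly chosen signed seasonal residual.

```python
import numpy as np

def csp_forecast_samples(y, season, horizon, n_samples=200, mix=0.5, rng=None):
    """Sketch of a CSP-style sampler: mix same-season empirical draws with
    signed residual draws around a seasonal-naive point forecast."""
    if rng is None:
        rng = np.random.default_rng(0)
    t0 = len(y)
    resid = y[season:] - y[:-season]  # signed seasonal residuals
    samples = np.empty((n_samples, horizon))
    for h in range(horizon):
        pool = y[(t0 + h) % season::season]  # same-season empirical pool
        naive = y[t0 + h - season]           # seasonal-naive point forecast
        for i in range(n_samples):
            if rng.random() < mix:
                samples[i, h] = rng.choice(pool)
            else:
                samples[i, h] = naive + rng.choice(resid)
    return samples

rng = np.random.default_rng(1)
season = 24
t = np.arange(24 * 30)
y = 10 + 5 * np.sin(2 * np.pi * t / season) + rng.normal(0, 0.5, t.size)
samples = csp_forecast_samples(y, season, horizon=6, rng=rng)
lo, hi = np.quantile(samples, [0.025, 0.975], axis=0)
```

Quantiles of the sample matrix give prediction intervals directly, with no learned parameters anywhere, which is what makes such samplers cheap enough to serve as mandatory baselines.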
Scaling Unsupervised Multi-Source Federated Domain Adaptation through Group-Wise Discrepancy Minimization
Unsupervised multi-source domain adaptation (UMDA) leverages labeled data from multiple source domains to generalize to an unlabeled target. While federated UMDA addresses privacy by avoiding raw data sharing, existing methods scale poorly as the number of sources increases, often suffering from high computational overhead or training instability. We propose GALA, a scalable and robust federated UMDA framework designed for high-diversity settings. GALA achieves scalability by coupling a novel inter-group discrepancy minimization objective that approximates pairwise alignment with linear complexity alongside a temperature-controlled, centroid-based weighting strategy for dynamic source prioritization. These components enable stable, parallelizable training across many heterogeneous sources, addressing a critical scalability bottleneck that remains largely unaddressed in current literature. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 datasets with varied synthetic and real-world domain shifts. Extensive experiments demonstrate that GALA achieves state-of-the-art results on standard benchmarks and significantly outperforms prior methods in large-scale settings where others either fail to converge or become computationally infeasible.
Updated: 2026-05-05 14:15:16
标题: 通过分组差异最小化实现无监督多源联邦领域自适应的扩展
摘要: 无监督多源领域自适应(UMDA)利用多个源领域的标记数据来泛化到未标记的目标领域。尽管联邦UMDA通过避免原始数据共享来解决隐私问题,但现有方法在源数量增加时扩展性较差,往往受到高计算开销或训练不稳定性的困扰。我们提出了GALA,一个可扩展且稳健的联邦UMDA框架,专为高多样性环境设计。GALA通过结合一种新颖的组间差异最小化目标,以线性复杂度近似成对对齐,以及一个温度控制的、基于质心的加权策略,用于动态源优先级排序,实现了可扩展性。这些组件使得在许多异构源之间实现稳定、可并行化的训练,解决了目前文献中主要未解决的关键可扩展性瓶颈。为了评估在高多样性场景中的性能,我们引入了Digit-18,一个包含18个具有不同合成和真实领域转移的数据集的新基准。大量实验证明,GALA在标准基准上取得了最先进的结果,并在大规模环境中明显优于先前方法,其他方法在这种环境下要么无法收敛,要么变得计算上不可行。
更新时间: 2026-05-05 14:15:16
领域: cs.LG
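The claim in the abstract above that pairwise alignment can be approximated with linear complexity has a clean classical backbone (our illustration, not GALA's actual objective): for squared distances between per-source feature means, aligning every source to the shared centroid reproduces the average pairwise discrepancy exactly, in O(S) instead of O(S^2).

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 50, 16
mu = rng.normal(size=(S, d))  # stand-in per-source feature means

# Quadratic: average squared distance over all ordered source pairs, O(S^2).
pairwise = sum(
    np.sum((mu[i] - mu[j]) ** 2) for i in range(S) for j in range(S)
) / S ** 2

# Linear: twice the mean squared distance of each source mean to the
# centroid, O(S). A standard variance identity makes the two equal.
center = mu.mean(axis=0)
groupwise = 2 * np.mean(np.sum((mu - center) ** 2, axis=1))
```

The identity sum_{i,j} ||mu_i - mu_j||^2 = 2S * sum_i ||mu_i - mu_bar||^2 is what makes the centroid form exact here; richer discrepancies (e.g. full MMD) would only be approximated by such a group-wise surrogate.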
Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones
Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm management remains challenging due to heterogeneous interfaces, limited grounding, and the need for long-running closed-loop execution. This paper presents a mission-agnostic, agent-enhanced LLM framework for UAV swarm control, where users express mission objectives in natural language and the system autonomously executes them through grounded, real-time interactions. The proposed architecture combines an LLM-based Agent Core with a Model Context Protocol (MCP) gateway and a Web-of-Drones abstraction based on W3C Web of Things (WoT) standards. By exposing drones, sensors, and services as standardized WoT Things, the framework enables structured tool-based interaction, continuous state observation, and safe actuation without relying on code generation. We evaluate the framework using ArduPilot-based simulation across four swarm missions and six state-of-the-art LLMs. Results show that, despite strong reasoning abilities, current general-purpose LLMs still struggle to achieve reliable execution - even for simple swarm tasks - when operating without explicit grounding and execution support. Task-specific planning tools and runtime guardrails substantially improve robustness, while token consumption alone is not indicative of execution quality or reliability.
Updated: 2026-05-05 14:14:57
标题: 说出任务,执行蜂群:Web-of-Drones中代理增强的LLM推理
摘要: 大型语言模型(LLMs)越来越多地被探索作为信息物理系统的高级推理引擎,然而,由于异构接口、有限的落地基础以及对长时间运行闭环执行的需求,它们在实时无人机群管理中的应用仍然具有挑战性。本文提出了一个与任务无关、代理增强的LLM框架,用于无人机群控制:用户以自然语言表达任务目标,系统通过有落地基础的实时交互自主执行这些目标。所提出的架构将基于LLM的代理核心与模型上下文协议(MCP)网关以及基于W3C Web of Things(WoT)标准的Web-of-Drones抽象相结合。通过将无人机、传感器和服务暴露为标准化的WoT Things,该框架实现了基于工具的结构化交互、持续的状态观察和安全的执行,而不依赖于代码生成。我们使用基于ArduPilot的仿真,在四个无人机群任务和六个最先进的LLMs上对该框架进行评估。结果显示,尽管具有强大的推理能力,当前的通用LLMs在缺乏明确落地基础和执行支持的情况下,即使对于简单的群体任务,仍然难以实现可靠的执行。特定任务的规划工具和运行时防护措施大大提高了鲁棒性,而仅凭令牌消耗并不能反映执行质量或可靠性。
更新时间: 2026-05-05 14:14:57
领域: cs.AI,cs.NI,cs.RO
What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the ``known unknown'' required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness of GLANCE, and demonstrate that aligning ``what the agent thinks'' with ``what the agent sees'' is key to solving complex or sparse agentic tasks.
Updated: 2026-05-05 14:08:54
标题: 你认为什么就是你看到的:通过视觉-语言好奇心驱动VLM代理的探索
摘要: 为了在部分可观察的视觉环境中导航,最近的VLM代理越来越多地通过显式的CoT推理将世界建模能力内化到其策略中,使其能够在行动之前在脑海中模拟未来。然而,仅依靠对已访问状态的被动推理对于稀疏奖励任务来说是不够的,因为它缺乏主动发掘"已知未知"所需的认知驱动力,而这正是稳健泛化所必需的。我们提出一个问题:VLM代理能否通过好奇心驱动的探索,主动寻找挑战并完善其内部世界模型的信号?在这项工作中,我们提出了GLANCE,一个统一框架,通过将代理的语言世界模型落地到不断演化的目标网络的稳定视觉表示中,架起了推理与探索之间的桥梁。关键在于,GLANCE将语言预测和视觉现实之间的差异作为强化学习中的内在好奇心信号,引导代理主动探索其内部模型不确定的区域。在一系列代理任务上的广泛实验显示了GLANCE的有效性,并表明将"代理认为的"与"代理看到的"对齐是解决复杂或稀疏代理任务的关键。
更新时间: 2026-05-05 14:08:54
领域: cs.AI
Soft Tournament Equilibrium
The evaluation of general-purpose artificial agents, particularly those based on LLMs, presents a significant challenge due to the non-transitive nature of their interactions. When agent A defeats B, B defeats C, and C defeats A, traditional ranking methods that force a linear ordering can be misleading and unstable. We argue that for such cyclic domains, the fundamental object of evaluation should not be a ranking alone but a set-valued core, as conceptualized in classical tournament theory. This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set-valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model, potentially conditioned on rich contextual information. It then employs differentiable operators for soft reachability and soft covering to compute continuous analogues of two seminal tournament solutions: the Top Cycle and the Uncovered Set. The output is a set of core agents, each with a continuous membership score that can be calibrated when suitable validation labels or repeated-sampling evidence are available. We develop the theoretical foundation for STE by proving consistency with classical solutions in the zero-temperature limit, establishing Condorcet-inclusion properties, and analyzing stability and sample complexity. We evaluate the method on a planted cyclic core benchmark and on real preference/execution diagnostics. This work provides a self-contained account that re-centers general-agent evaluation on a robust tournament-theoretic foundation, moving from unstable rankings toward stable, set-valued equilibria.
Updated: 2026-05-05 14:08:00
标题: 软锦标赛均衡
摘要: 通用人工智能代理(特别是基于LLMs的代理)的评估由于其交互的非传递性而面临重大挑战。当代理A击败B、B击败C、而C又击败A时,强制线性排序的传统排名方法可能产生误导且不稳定。我们认为,对于这类循环领域,评估的基本对象不应仅仅是一个排名,而应是经典锦标赛理论中所构想的集值核。本文介绍了软锦标赛均衡(Soft Tournament Equilibrium, STE),一个可以从成对比较数据中直接学习和计算集值锦标赛解的可微框架。STE首先学习一个概率锦标赛模型,该模型可以以丰富的上下文信息为条件。然后,它利用软可达性和软覆盖的可微算子,计算两个经典锦标赛解的连续类比:顶循环(Top Cycle)和未覆盖集(Uncovered Set)。输出是一组核心代理,每个代理都有一个连续的成员得分,在有合适的验证标签或重复抽样证据时可以进行校准。我们通过证明在零温度极限下与经典解的一致性、建立孔多塞(Condorcet)包含性质以及分析稳定性和样本复杂度,为STE建立了理论基础。我们在一个植入循环核心的基准和真实的偏好/执行诊断数据上评估了该方法。这项工作提供了一个自成体系的论述,将通用代理评估重新建立在稳健的锦标赛理论基础上,从不稳定的排名转向稳定的集值均衡。
更新时间: 2026-05-05 14:08:00
领域: cs.AI,cs.LG,cs.MA
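The zero-temperature consistency claimed in the abstract above can be illustrated on a tiny tournament (toy operators of our own; the paper's exact soft reachability and covering definitions are not reproduced): a hard transitive-closure Top Cycle, next to a sigmoid-sharpened soft-reachability score that assigns high membership to the same players.

```python
import numpy as np

def top_cycle(beats):
    """Classical Top Cycle: players that reach every other player through
    chains of victories (Boolean transitive closure)."""
    n = len(beats)
    reach = beats | np.eye(n, dtype=bool)
    for _ in range(n):
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    return {i for i in range(n) if reach[i].all()}

def soft_top_cycle(P, tau=0.05, steps=2):
    """Soft analogue (a sketch, not the paper's operators): sharpen win
    probabilities with a sigmoid, propagate reachability with a soft-OR,
    and take the row-wise minimum as a soft 'reaches everyone' score."""
    A = 1.0 / (1.0 + np.exp(-(P - 0.5) / tau))
    R = np.clip(A + np.eye(len(P)), 0.0, 1.0)
    for _ in range(steps):
        two_step = np.clip(R @ R, 0.0, 1.0)
        R = 1.0 - (1.0 - R) * (1.0 - two_step)  # soft OR of 1- and 2-step reach
    return R.min(axis=1)

# 4 players: {0, 1, 2} form a cycle (0>1, 1>2, 2>0) and all beat player 3.
P = np.array([[0.5, 0.9, 0.1, 0.8],
              [0.1, 0.5, 0.9, 0.8],
              [0.9, 0.1, 0.5, 0.8],
              [0.2, 0.2, 0.2, 0.5]])
hard = top_cycle(P > 0.5)
scores = soft_top_cycle(P)
```

No linear ranking can represent the 3-cycle at the top, but both the hard set {0, 1, 2} and the soft membership scores capture it, which is the set-valued behavior the paper argues for.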
Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers
Transformers are effective at inferring the latent task from context via two inference modes: recognizing a task seen during training, and adapting to a novel one. Recent interpretability studies have identified from middle-layer representations task-specific directions, or task vectors, that steer model behavior. However, a lack of rigorous foundations hinders connecting internal representations to external model behavior: existing work fails to explain how task-vector geometry is shaped by the training distribution, and what geometry enables out-of-distribution (OOD) generalization. In this paper, we study these questions in a controlled synthetic setting by training small transformers from scratch on latent-task sequence distributions, which allows a principled mathematical characterization. We show that two inference modes can coexist within a single model. In-distribution behavior is governed by Bayesian task retrieval, implemented internally through convex combinations of learned task vectors. OOD behavior, by contrast, arises through extrapolative task learning, whose representations occupy a subspace nearly orthogonal to the task-vector subspace. Taken together, our results suggest that task-vector geometry, training distributions, and generalization behaviors are closely related.
Updated: 2026-05-05 14:07:55
标题: 任务向量几何支撑Transformer中任务推断的双重模式
摘要: Transformer 能够通过两种推断模式有效地从上下文中推断潜在任务:识别训练期间见过的任务,以及适应新任务。最近的可解释性研究已从中间层表示中识别出任务特定的方向,即任务向量,它们可以引导模型行为。然而,缺乏严格的理论基础阻碍了将内部表示与外部模型行为联系起来:现有工作未能解释训练分布如何塑造任务向量的几何结构,以及何种几何结构能够实现分布外(OOD)泛化。 在本文中,我们在受控的合成设定中研究这些问题:通过在潜在任务序列分布上从头训练小型Transformer,从而可以进行有原则的数学刻画。我们展示了两种推断模式可以在单个模型中共存。分布内行为由贝叶斯任务检索支配,其内部通过学习到的任务向量的凸组合实现。相比之下,分布外行为通过外推式任务学习产生,其表示占据一个与任务向量子空间几乎正交的子空间。综合来看,我们的结果表明任务向量几何、训练分布与泛化行为密切相关。
更新时间: 2026-05-05 14:07:55
领域: cs.LG,cs.CL,stat.ML
Firmware Distribution as Attack Surface: A Security Study of ASIC Cryptocurrency Miners
ASIC cryptocurrency miners are a core component of blockchain infrastructures, directly converting computation and energy into monetary value. Despite their economic importance, their security is rarely evaluated in a structured manner. In this paper, we show that the firmware distribution ecosystem of mining devices fundamentally challenges existing trust assumptions. We introduce a scalable methodology based on the collection and static analysis of publicly distributed firmware artifacts, requiring neither device access nor runtime interaction. Applying this approach, we reconstruct and analyze 134 firmware images spanning manufacturers that account for over 99% of deployed miners (Bitmain, MicroBT, Canaan, Iceriver). Our results reveal that firmware artifacts alone are sufficient to recover internal architecture, identify security weaknesses, and reconstruct complete attack paths leading to high-impact adversarial objectives. In particular, our analysis reveals vulnerabilities that enable realistic large-scale attack scenarios, including firmware phishing and the exploitation of miners still operating over Stratum V1. Validation on two real devices confirms that publicly distributed artifacts closely reflect deployed software and that these weaknesses translate into attack capabilities. Overall, our study shows that firmware distribution mechanisms themselves constitute a primary attack surface, significantly lowering the barrier to compromise in the ASIC mining ecosystem.
Updated: 2026-05-05 14:00:54
标题: 固件分发作为攻击面:ASIC加密货币矿机的安全研究
摘要: ASIC加密货币矿机是区块链基础设施的核心组成部分,直接将计算和能源转化为货币价值。尽管它们在经济上非常重要,但其安全性很少以结构化方式进行评估。在本文中,我们展示了挖矿设备的固件分发生态系统从根本上挑战了现有的信任假设。我们介绍了一种基于收集和静态分析公开分发的固件制品的可扩展方法,既不需要访问设备,也不需要运行时交互。应用这种方法,我们重建并分析了134个固件镜像,涵盖了占已部署矿机99%以上的制造商(Bitmain、MicroBT、Canaan、Iceriver)。我们的结果表明,仅凭固件制品就足以恢复内部架构、识别安全弱点,并重建通向高影响对抗目标的完整攻击路径。特别是,我们的分析揭示了使现实的大规模攻击场景成为可能的漏洞,包括固件钓鱼以及对仍在使用Stratum V1的矿机的利用。在两台真实设备上的验证证实,公开分发的制品与实际部署的软件高度一致,并且这些弱点可以转化为攻击能力。总体而言,我们的研究表明,固件分发机制本身构成了一个主要攻击面,显著降低了攻陷ASIC挖矿生态系统的门槛。
更新时间: 2026-05-05 14:00:54
领域: cs.CR,cs.SE
Lyapunov-Certified Direct Switching Theory for Q-Learning
Q-learning is a fundamental algorithmic primitive in reinforcement learning. This paper develops a new framework for analyzing Q-learning from a switching-system viewpoint. In particular, we derive a direct stochastic switching-system representation of the Q-learning error. The key observation is that the Bellman maximization error can be expressed exactly as an average of action-wise Q-errors under a suitable stochastic policy. The resulting recursion has a switched linear conditional-mean drift and martingale-difference noise. To the best of our knowledge, this is the first convergence-rate analysis of standard Q-learning whose leading exponential rate is expressed through the joint spectral radius (JSR) of a direct switching family. Since the JSR is the exact worst-case exponential rate of the associated switched linear drift, the resulting rate is among the tightest drift-based rates that can be certified for this Q-learning representation. Building on this representation, we prove finite-time bounds based on a product-defined JSR-induced Lyapunov function and also give an optional common quadratic Lyapunov certificate. The quadratic certificate is only a sufficient condition and hence applies only to instances for which the certificate is feasible, whereas the JSR-induced Lyapunov construction applies to the full direct switching family whenever its JSR is below one. When feasible, the quadratic certificate replaces product-based verification by a computable matrix inequality and gives a simpler stochastic bound. We further extend the framework to Markovian observation models.
Updated: 2026-05-05 14:00:33
标题: 用于Q学习的李亚普诺夫认证直接切换理论
摘要: Q学习是强化学习中的基本算法原语。本文从切换系统的视角发展了一个分析Q学习的新框架。特别地,我们推导了Q学习误差的直接随机切换系统表示。关键观察是,贝尔曼最大化误差可以精确地表示为在适当随机策略下逐动作Q误差的平均值。由此得到的递归具有切换线性条件均值漂移和鞅差噪声。据我们所知,这是标准Q学习的第一个收敛速率分析,其主导指数速率通过直接切换族的联合谱半径(JSR)表示。由于JSR是相关切换线性漂移的精确最坏情况指数速率,所得速率是可为该Q学习表示认证的最紧的基于漂移的速率之一。基于这一表示,我们利用由乘积定义的JSR诱导李亚普诺夫函数证明了有限时间界,并给出一个可选的公共二次李亚普诺夫证书。二次证书只是一个充分条件,因此仅适用于证书可行的实例;而只要JSR小于1,JSR诱导的李亚普诺夫构造就适用于整个直接切换族。当可行时,二次证书用一个可计算的矩阵不等式取代基于乘积的验证,并给出更简单的随机界。我们进一步将该框架扩展到马尔可夫观测模型。
更新时间: 2026-05-05 14:00:33
领域: cs.LG,cs.AI,eess.SY
Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer
Matrix-based optimizers have demonstrated immense potential in training Large Language Models (LLMs), however, designing an ideal optimizer remains a formidable challenge. A superior optimizer must satisfy three core desiderata: efficiency, achieving Muon-like preconditioning to accelerate optimization; stability, strictly adhering to the scale-invariance inherent in neural networks; and speed, minimizing computational overhead. While existing methods address these aspects to varying degrees, they often fail to unify them, either incurring prohibitive computational costs like Muon, or allowing radial jitters that compromise stability like RMNP. To bridge this gap, we propose Nora, an optimizer that rigorously satisfies all three requirements. Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of $\mathcal{O}(mn)$. Furthermore, we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems. With a streamlined implementation requiring only two lines of code, our preliminary experiments validate Nora as an efficient and highly promising optimizer for large-scale training.
Updated: 2026-05-05 14:00:27
Domains: cs.LG
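The core stabilization mechanism described above, row-wise momentum projection onto the orthogonal complement of the weights, can be sketched in a few lines. This is an illustrative reconstruction under our own reading of the abstract, with a hypothetical function name and hyperparameters, not the released two-line Nora implementation:

```python
import numpy as np

def nora_like_update(W, M, lr=0.01, eps=1e-12):
    """Illustrative sketch: remove the radial (norm-changing) component of
    each momentum row, so the update is tangential to the weight row and
    row norms change only at second order in lr."""
    # Radial coefficient of each momentum row along the weight row.
    coef = (M * W).sum(axis=1, keepdims=True) / ((W * W).sum(axis=1, keepdims=True) + eps)
    M_perp = M - coef * W                      # orthogonal-complement projection
    # Row-wise normalization of the tangential direction.
    M_perp = M_perp / (np.linalg.norm(M_perp, axis=1, keepdims=True) + eps)
    return W - lr * M_perp
```

Because each update row is orthogonal to the corresponding weight row by construction, the radial jitter the abstract attributes to RMNP-style methods is suppressed.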
OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking
Large language models are moving from static text generators toward real-world decision-support systems, where forecasting is a composite capability that links information gathering, evidence integration, situational judgment, and action-oriented decision making. This capability is in broad demand across finance, policy, industry, and scientific research, yet its evaluation remains difficult: live benchmarks evaluate forecasts before answers exist, making them the cleanest way to measure forecasting ability, but they expire once events resolve; retrospective benchmarks are reproducible, but they cannot reliably distinguish genuine forecasting from facts a model may have already learned during pretraining. Prompting models to "pretend not to know" cannot replace a genuine knowledge boundary. We propose OracleProto, a reproducible framework for evaluating LLM native forecasting capability. OracleProto reconstructs resolved events into time-bounded forecasting samples by combining model-cutoff-aligned sample admission, tool-level temporal masking, content-level leakage detection, discrete answer normalization, and hierarchical scoring. Instantiated on a FutureX-Past-derived dataset with six contemporary LLMs, OracleProto distinguishes forecasting quality, sampling stability, and cost efficiency under controlled information boundaries, while reducing residual leakage to the $1\%$ level, an order of magnitude below tool-only temporal filtering. OracleProto turns LLM forecasting from one-off evaluation into an auditable, reusable, and trainable dataset-level capability, providing a unified interface for fair cross-model comparison and a controlled signal source for downstream SFT and RL. Code and data are available at https://github.com/MaYiding/OracleProto and https://huggingface.co/datasets/MaYiding/OracleProto.
Updated: 2026-05-05 13:50:50
Domains: cs.AI
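Two of the pipeline stages named above, model-cutoff-aligned sample admission and tool-level temporal masking, reduce to simple date comparisons. The sketch below is our own minimal illustration (the function and field names are hypothetical, not OracleProto's API):

```python
from datetime import date

def admit_sample(event_resolve: date, model_cutoff: date) -> bool:
    """Cutoff-aligned admission: only events that resolve after the model's
    knowledge cutoff can test genuine forecasting rather than recall."""
    return event_resolve > model_cutoff

def temporal_mask(docs: list, forecast_time: date) -> list:
    """Tool-level temporal masking: hide any retrieved document dated at or
    after the moment the forecast is made."""
    return [d for d in docs if d["date"] < forecast_time]
```

The abstract's point is that these boundary checks alone are not enough; content-level leakage detection is layered on top to reach the reported 1% residual-leakage level.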
Aura-CAPTCHA: A Reinforcement Learning and GAN-Enhanced Multi-Modal CAPTCHA System
We present Aura-CAPTCHA, a multi-modal verification system that integrates Generative Adversarial Networks (GANs), Reinforcement Learning (RL), and behavioral analysis to create adaptive challenges resistant to classical deep-learning attacks. Our system synthesizes unique visual stimuli via GAN-based generation alongside synchronized audio challenges, while an RL agent adjusts difficulty based on real-time user interaction patterns. A hybrid classifier combining heuristic rules and machine learning distinguishes human from bot interactions. We position Aura-CAPTCHA relative to well-established baselines (text-based schemes, Google reCAPTCHA v2, audio alternatives, and modern invisible risk-analysis systems) and evaluate it against documented state-of-the-art attacks, including convolutional-neural-network solvers, object-detection pipelines (YOLO), and recent agentic vision-language models. Experimental results indicate that Aura-CAPTCHA improves human success rates and lowers classical bypass rates compared to static challenge-based baselines, although, like all explicit-challenge systems, it remains vulnerable to emerging large-model agents. We discuss these limitations transparently and outline future directions toward cognitive-gap-based defenses.
Updated: 2026-05-05 13:50:11
Domains: cs.LG
Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection
The rapid development of Artificial Intelligence Generated Content (AIGC) techniques has enabled the creation of high-quality synthetic content, but it also raises significant security concerns. Current detection methods face two major limitations: (1) the lack of multidimensional explainable datasets for generated images and videos. Existing open-source datasets (e.g., WildFake, GenVideo) rely on oversimplified binary annotations, which restrict the explainability and trustworthiness of trained detectors. (2) Prior MLLM-based forgery detectors (e.g., FakeVLM) exhibit insufficiently fine-grained interpretability in their step-by-step reasoning, which hinders reliable localization and explanation. To address these challenges, we introduce Ivy-Fake, the first large-scale multimodal benchmark for explainable AIGC detection. It consists of over 106K richly annotated training samples (images and videos) and 5,000 manually verified evaluation examples, sourced from multiple generative models and real-world datasets through a carefully designed pipeline to ensure both diversity and quality. Furthermore, we propose Ivy-xDetector, a reinforcement learning model based on Group Relative Policy Optimization (GRPO), capable of producing explainable reasoning chains and achieving robust performance across multiple synthetic content detection benchmarks. Extensive experiments demonstrate the superiority of our dataset and confirm the effectiveness of our approach. Notably, our method improves performance on GenImage from 86.88% to 96.32%, surpassing prior state-of-the-art methods by a clear margin.
Updated: 2026-05-05 13:47:19
Domains: cs.CV,cs.AI
Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks
While Large Vision-Language Models (LVLMs) offer powerful capabilities, they pose privacy risks by unintentionally memorizing sensitive personal information. Current unlearning benchmarks attempt to mitigate this using fictitious identities but overlook a critical stage-1 failure: models fail to effectively memorize target information initially, rendering subsequent unlearning evaluations unreliable. Diagnosing under-memorization and the multi-hop curse as root causes, we introduce ReMem, a Reliable Multi-hop and Multi-image Memorization Benchmark. ReMem ensures robust foundational learning through principled data scaling, reasoning-aware QA pairs, and diverse visual contexts. Additionally, we propose a novel Exposure metric to quantify the depth of information erasure from the model's internal probability distribution. Extensive experiments demonstrate that ReMem provides a rigorous and trustworthy framework for diagnosing both learning and unlearning behaviors in LVLMs.
Updated: 2026-05-05 13:42:31
Domains: cs.CV,cs.AI
MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier
While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, $P(\text{hypothesis}|\text{background})$ ($P(h|b)$), unexplored. We demonstrate that directly training $P(h|b)$ is mathematically intractable due to the combinatorial complexity ($O(N^k)$) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework enabling tractable training and scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic ($O(\log N)$) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a dataset of 108,717 decomposed papers (38,400 GPU hours) for training. Furthermore, we show that while brute-force sampling hits a "complexity wall," MOOSE-Star exhibits continuous test-time scaling.
Updated: 2026-05-05 13:42:21
Domains: cs.LG,cs.CE,cs.CL
Vanishing L2 regularization for the softmax Multi Armed Bandit
Multi Armed Bandit (MAB) algorithms are a cornerstone of reinforcement learning and have been studied both theoretically and numerically. One of the most commonly used implementations uses a softmax mapping to prescribe the optimal policy and has served as the foundation for downstream algorithms, including REINFORCE. Distinct from vanilla approaches, we consider here the L2 regularized softmax policy gradient where a quadratic term is subtracted from the mean reward. Previous studies exploiting convexity failed to identify a suitable theoretical framework to analyze its convergence when the regularization parameter vanishes. We prove here theoretical convergence results and confirm empirically that this regime makes the L2 regularization numerically advantageous on standard benchmarks.
Updated: 2026-05-05 13:37:13
Domains: cs.LG,math.ST,stat.ML
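The setup above has a convenient closed form: for softmax parameters θ the expected reward is Σ_a π_θ(a) μ_a, and the regularized objective subtracts (λ/2)‖θ‖². A minimal sketch with exact gradients and a vanishing schedule λ_t = 1/t follows (the arm means, step size, and schedule are illustrative choices, not the paper's experiments):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.9])         # true arm means (illustrative)
theta = np.zeros(3)
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(1, 3001):
    lam = 1.0 / t                      # vanishing L2 regularization weight
    pi = softmax(theta)
    # Exact policy gradient of the regularized objective
    #   J(theta) = sum_a pi(a) * mu(a) - (lam / 2) * ||theta||^2
    grad = pi * (mu - pi @ mu) - lam * theta
    theta += lr * grad

pi = softmax(theta)
```

With the regularization weight vanishing, the policy is free to concentrate on the best arm in the limit, which is the regime whose convergence the paper analyzes.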
Process-Mining of Hypertraces: Enabling Scalable Formal Security Verification of (Automotive) Network Architectures
The automotive domain is transitioning: vehicles act as rolling servers, persistently connected to numerous external entities. This connectivity, combined with rising on-board computing power for advanced driver assistance systems and similar use cases, creates escalating challenges for securing automotive network architectures. This work advances the security analysis of internet-connected automotive network architectures and their protocols. We introduce a strong, active adversary model tailored to the automotive domain. We substantially extend the security protocol verification possible with Attack Resilience Hyperproperties (ARHs) by introducing a verification-orchestration algorithm. Furthermore, we provide methods for comparative attribution of security property invalidations to specific, fine-grained component compromises. We present a novel integration of formal verification and process mining. By utilizing ARH counterexample traces for process mining, we systematically identify and aggregate attacker behavior that causes security property invalidations. This pipeline enables in-depth understanding of root causes and attack paths leading to protocol-security invalidations. We demonstrate real-world applicability through a prototype and case study on the secure transmission of battery management system data within an automotive network architecture.
Updated: 2026-05-05 13:36:46
Domains: cs.CR
GEM-FI: Gated Evidential Mixtures with Fisher Modulation
Evidential Deep Learning (EDL) enables single-pass uncertainty estimation by predicting Dirichlet evidence, but it can remain overconfident and poorly calibrated, and it often fails to represent multi-modal epistemic uncertainty. We introduce Gated Evidential Mixtures (GEM), a family of models that learns an in-model energy signal and uses it to gate evidential outputs end-to-end in a distance-informed manner. GEM-CORE learns a feature-level energy and maps it to a bounded gate that smoothly suppresses evidence when support is low. To capture epistemic multi-modality without multi-pass ensembling, GEM-MIX adds a lightweight mixture of evidential heads with learned routing weights while preserving single-pass inference. Finally, GEM-FI stabilizes mixture allocations via a Fisher-informed regularizer, reducing head collapse and producing smoother boundary uncertainty. Across image classification and OOD detection benchmarks, GEM improves calibration and ID/OOD separation with single-pass inference. On CIFAR-10, GEM-FI vs. DAEDL improves accuracy from 91.11 to 93.75 (+2.64 pp), reduces Brier x100 from 14.27 to 6.81 (-7.46), and also improves misclassification-detection AUPR from 99.08 to 99.94 (+0.86). For epistemic OOD detection, GEM-FI achieves AUPR/AUROC of 92.59/95.09 on CIFAR-10 to SVHN and 90.20/89.06 on CIFAR-10 to CIFAR-100, compared with 85.54/89.30 and 88.19/86.10 for DAEDL.
Updated: 2026-05-05 13:33:27
Domains: cs.LG
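The GEM-CORE mechanism described above, a bounded gate driven by an in-model energy that suppresses Dirichlet evidence when support is low, can be illustrated as follows. The sigmoid parameterization is our own guess at a bounded gate, not necessarily the paper's exact form:

```python
import numpy as np

def gated_evidence(raw_evidence, energy, tau=0.0, temp=1.0):
    """Illustrative energy gate: high energy (low support) drives the gate
    toward 0, shrinking the Dirichlet parameters toward the uninformative
    Dirichlet(1, ..., 1); low energy leaves the evidence intact."""
    gate = 1.0 / (1.0 + np.exp((energy - tau) / temp))  # bounded in (0, 1)
    alpha = 1.0 + gate * np.asarray(raw_evidence)       # Dirichlet parameters
    return alpha, alpha.sum()
```

The total evidence (Dirichlet strength) then falls back to the number of classes for unsupported inputs, which is exactly the "maximally uncertain" output an evidential head should give out of distribution.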
TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference
Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand. Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. Our key insight is that multimodal requests differ by orders of magnitude in resource demands, which we capture through a simple abstraction: videos behave like trucks, images like cars, and text like motorcycles. We design TCM-Serve, a modality-aware scheduler that lets motorcycles flow quickly through cars and trucks, ensuring interactive responsiveness while avoiding starvation. TCM-Serve classifies requests, prioritizes them dynamically, and applies aging to avoid starvation. Evaluation across state-of-the-art MLLMs shows that TCM-Serve reduces, on average, time-to-first-token (TTFT) by 54% overall, and by 78.5% for latency-critical requests, compared to current systems. TCM-Serve delivers LLM-like responsiveness for MLLMs, with modality-aware scheduling and by making the most efficient use of the available resources.
Updated: 2026-05-05 13:30:35
Domains: cs.DC,cs.AI
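The trucks/cars/motorcycles analogy above maps naturally onto a priority queue with aging. The sketch below is a toy scheduling policy in that spirit (the costs and aging rate are made-up constants, not TCM-Serve's implementation):

```python
BASE_COST = {"text": 1.0, "image": 10.0, "video": 100.0}  # motorcycles, cars, trucks

class ModalityScheduler:
    """Toy modality-aware scheduler: light requests flow past heavy ones,
    while an aging credit keeps heavy requests from starving."""

    def __init__(self, aging_rate=5.0):
        self.queue, self.now = [], 0
        self.aging_rate = aging_rate

    def tick(self, steps=1):
        self.now += steps

    def submit(self, req_id, modality):
        self.queue.append({"id": req_id, "cost": BASE_COST[modality],
                           "arrived": self.now})

    def next_request(self):
        # Effective priority = base cost minus accumulated age credit;
        # the lowest value is dispatched first.
        best = min(self.queue,
                   key=lambda r: r["cost"] - self.aging_rate * (self.now - r["arrived"]))
        self.queue.remove(best)
        return best["id"]
```

A text request submitted alongside a video is dispatched first, but once the video has waited long enough its age credit outweighs a fresh text request's cost advantage, so it cannot starve.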
S2O: Early Stopping for Sparse Attention via Online Permutation
Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present S2O, which performs early stopping for sparse attention via online permutation. Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks. Building on this importance-guided online permutation for loading, S2O further introduces an early-stopping rule: computation proceeds from high to low importance; once the current block score falls below a threshold, S2O terminates early and skips the remaining low-contribution blocks, thereby increasing effective sparsity and reducing computation under a controlled error budget. As a result, S2O substantially raises the practical sparsity ceiling. On Llama-3.1-8B under a 128K context, S2O reduces single-operator MSE by 3.82$\times$ at matched sparsity, and reduces prefill compute density by 3.31$\times$ at matched MSE; meanwhile, it preserves end-to-end accuracy and achieves 7.51$\times$ attention and 3.81$\times$ end-to-end speedups.
Updated: 2026-05-05 13:29:50
Domains: cs.LG,cs.AI
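The early-stopping rule described above (visit blocks from high to low importance and terminate once the current block's score falls below a threshold) is easy to state in isolation. A schematic sketch over per-block scores follows; the scoring and aggregation here are placeholders, not the actual attention kernel:

```python
import numpy as np

def early_stop_blocks(block_scores, block_values, threshold=0.01):
    """Schematic: process blocks in decreasing importance; once the current
    score drops below the threshold, every remaining block contributes even
    less, so they are all skipped under a controlled error budget."""
    order = np.argsort(block_scores)[::-1]       # high-to-low importance
    total, visited = 0.0, 0
    for i in order:
        if block_scores[i] < threshold:
            break
        total += block_scores[i] * block_values[i]
        visited += 1
    return total, visited
```

Because the blocks are visited in importance order, the first sub-threshold block bounds the contribution of everything skipped, which is what lets S2O raise effective sparsity without a fixed block budget.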
Internet of Things Security: A Survey on Common Attacks
The exponential growth of the Internet of Things (IoT) has integrated connected devices into various sectors like smart cities, digital health, and Industry 4.0, generating vast amounts of real-time data to support intelligent decision-making. However, this widespread adoption is fundamentally challenged by significant security risks, primarily due to the inherent computational limitations of devices, lack of standardization, and an expanding attack surface. Given that security is paramount to ensuring trust in these environments, this paper presents a comprehensive survey and a multi-dimensional analysis of the IoT threat landscape. It describes 28 common attacks, ranging from traditional threats, such as Man-in-the-Middle, to specialized IoT exploits, including node replication and skimming. To provide a structured understanding of these risks, we employ the STRIDE model for functional threat classification alongside the CVSS framework for quantitative criticality assessment. Furthermore, the research establishes a robust mapping between these threats and five foundational vulnerability classes (Process, Code, Communication, Operation, and Device), uncovering the specific technical entry points exploited by adversaries. Beyond threat identification, the survey presents state-of-the-art mitigation techniques and discusses emerging paradigms and research gaps, working as a roadmap for future investigation and providing a consolidated technical foundation for both researchers and practitioners aiming to build resilient and secure IoT ecosystems.
Updated: 2026-05-05 13:29:22
Domains: cs.CR
A Workflow-Oriented Framework for Asynchronous Human-AI Collaboration in Hybrid and Compute-Intensive HPC Environments
Human involvement is critical in training and deploying AI systems in high-stakes defence and security contexts. However, real-time interaction is impractical in HPC environments due to compute intensity and resource constraints. We present a workflow framework that enables asynchronous human-AI collaboration across hybrid infrastructures, including HPC clusters, local machines, and cloud platforms. Workflows can pause at defined checkpoints for human input without halting underlying compute jobs, preventing idle resources and enabling non-blocking supervision. The framework supports interaction with SLURM-based scheduling, containerized and native tasks, and is customized for scenarios requiring human judgment and adaptability. We demonstrate its application in model training on systems like MareNostrum 5, highlighting benefits in portability, efficiency, and oversight in operational AI workflows.
Updated: 2026-05-05 13:29:05
Domains: cs.DC,cs.AI,cs.HC,cs.SE
Low Rank Tensor Completion via Adaptive ADMM
We consider a novel algorithm for the completion of partially observed low-rank tensors, as a generalization of matrix completion. The proposed low-rank tensor completion (TC) method builds on the conventional nuclear norm (NN) minimization-based low-rank TC paradigm, by leveraging the alternating direction method of multipliers (ADMM) optimization framework. To that end, the original NN minimization problem is reformulated into multiple subproblems, which are then solved iteratively via closed-form proximal operators, making use of over-relaxation and an adaptive penalty parameter update scheme, to further speed up convergence and improve the overall performance of the method. Simulation results demonstrate the superior performance of the new method in terms of normalized mean square error (NMSE), compared to the conventional state-of-the-art (SotA) techniques, including NN minimization approaches, as well as a mixture of the latter with a matrix factorization approach, while its convergence can be significantly improved by initializing the algorithm with the solution of the SotA.
Updated: 2026-05-05 13:24:39
Domains: stat.ML,cs.LG,eess.SP
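Two ingredients named above have standard closed forms that a sketch can make concrete: the proximal operator of the nuclear norm is singular value thresholding, and a common adaptive penalty rule (in the Boyd et al. residual-balancing style; the paper's exact scheme may differ) rebalances ρ using the primal and dual residuals:

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the closed-form proximal operator
    prox_{tau * ||.||_*}(X) of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def update_penalty(rho, r_primal, r_dual, mu=10.0, factor=2.0):
    """Residual-balancing adaptive penalty: grow rho when the primal
    residual dominates, shrink it when the dual residual dominates."""
    if r_primal > mu * r_dual:
        return rho * factor
    if r_dual > mu * r_primal:
        return rho / factor
    return rho
```

In a TC solver these are called once per ADMM iteration: `svt` on each matrix unfolding of the tensor, and `update_penalty` after computing the residuals.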
Predicting missing values: A good idea?
Minimizing the Mean Squared Error (MSE) is a key objective in machine learning and is commonly used for imputing missing values. While this approach provides accurate point estimates, it introduces systematic biases in downstream analyses. These biases affect key parameters such as variance, prevalence, correlation, slope, and explained variance. The root cause is that imputed values optimized for MSE are averages, which reduce the natural variability in the data. This paper demonstrates that adding noise to imputed values can effectively eliminate these biases. The required noise level is proportional to the MSE. Using a toy example in a multivariate normal setting, we compare two methods: predictive imputation, which minimizes MSE, and stochastic imputation, which incorporates random noise. Simulation results show that predictive methods systematically introduce bias, while stochastic methods preserve the data's natural variability and produce unbiased estimates. We also evaluate three popular imputation tools -- missForest, softImpute, and mice -- and observe consistent biases in predictive methods. These findings highlight that MSE is an inadequate measure of imputation quality, as it prioritizes accuracy over variability. Incorporating noise into imputation methods is essential to prevent biases and ensure valid downstream analyses, underscoring the importance of stochastic approaches for handling incomplete data.
Updated: 2026-05-05 13:22:06
Domains: stat.ML,cs.LG,stat.ME
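The core argument above is easy to reproduce numerically in a bivariate normal toy model: the conditional mean minimizes MSE but shrinks the variance by exactly the MSE, and adding noise with that variance restores it. The correlation and sample size below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 200_000, 0.6
x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # Var(x2) = 1

# Predictive imputation: the conditional mean E[x2 | x1] minimizes MSE...
x2_pred = rho * x1                       # ...but Var = rho^2 < 1

# Stochastic imputation: add noise whose variance equals the MSE (1 - rho^2).
x2_stoch = x2_pred + np.sqrt(1 - rho**2) * rng.standard_normal(n)
```

Downstream statistics computed from `x2_stoch` (variance, slopes, correlations) match the complete-data values, while those computed from `x2_pred` are systematically shrunk, which is the bias the paper warns about.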
Sample-Efficient Optimization over Generative Priors via Coarse Learnability
We study zeroth-order optimization where solutions must minimize a cost $d(s)$ while maintaining high probability under a complex generative prior $L(s)$ (e.g., a parameterized model). This reduces to sampling from a target distribution proportional to $L(s) e^{-T \cdot d(s)}$. Since classical model-based optimization (MBO) lacks finite-sample guarantees for expressive approximate learners, we introduce "coarse learnability", a flexible statistical assumption requiring only that a learned model covers the target's probability mass within a polynomial factor. Leveraging this assumption, we design an iterative MBO algorithm called \alift with a sample correction step that provably approximates the target using only a polynomial number of samples. We apply this framework to globally optimizing non-convex objectives bounded by a quadratic envelope in $R^n$, where we show this assumption is naturally satisfied for a family of "optimistic" posterior distributions. To reach global $\varepsilon$-optimality, this implies a sample complexity of $\widetilde{O}(\log 1/\varepsilon)$, a rate characteristic of optimistic space-partitioning methods. We further justify coarse learnability as an assumption for generative priors theoretically, proving that in simple settings, parametric maximum likelihood estimation and over-smoothed kernel density estimators naturally satisfy it. Finally, one motivation for our framework comes from inference-time alignment. Though our primary contribution pertains to the theoretical foundations of MBO, we provide qualitative evidence that, in simple settings, even primitive LLMs can shift their distributions toward lower-cost regions when fine-tuned with zeroth-order feedback.
Updated: 2026-05-05 13:20:09
Domains: cs.LG,cs.DS,stat.ML
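The reduction stated above, sampling from a distribution proportional to $L(s) e^{-T \cdot d(s)}$, can be illustrated with self-normalized importance resampling when the prior is easy to draw from. The standard normal prior and quadratic cost below are our own toy choices; the paper's \alift algorithm is an iterative MBO procedure with a sample correction step, not this one-shot reweighting:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2.0
d = lambda s: s**2                       # illustrative cost

# Draws from the generative prior L = N(0, 1).
samples = rng.standard_normal(20_000)

# Self-normalized weights for the target  p(s) ∝ L(s) * exp(-T * d(s)).
w = np.exp(-T * d(samples))
w /= w.sum()
resampled = rng.choice(samples, size=20_000, p=w)
# For this Gaussian/quadratic pair the target is N(0, 1/(1 + 2T)), variance 0.2.
```

The resampled population concentrates on low-cost regions that still have prior mass, which is the behavior the temperature T controls.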
On Digital Twins in Defence: Overview and Applications
Digital twins have emerged as a transformative technology for modeling and simulation in various industries, including defense. This paper provides a comprehensive review of digital twin applications in defense modeling and simulation, focusing on how digital twins can enhance simulation fidelity, interoperability, and decision support within defense systems. We consolidate existing research into a unified framework that links digital twin concepts, simulation-driven application, and real-world deployment in defense scenarios. We discuss the role of digital twins in applications like planning, training, execution and monitoring, and debriefing. We introduce a standardized digital twin characterization framework suitable for defense applications that aligns with industrial modeling and simulation standards, and present a taxonomy of defense-specific use cases, highlighting recurring requirements. Additionally, practical evidence is provided from a targeted questionnaire distributed to defense stakeholders and Ministries of Defense, revealing current challenges in digital twin integration and deployment. Finally, we conclude by identifying key gaps in digital twin applications for defense modeling and simulation, including interoperability, security, and system integration, and we outline future research directions and development opportunities. This review aims to inform defense modeling and simulation practitioners and researchers, guiding future work on digital twin design, implementation and deployment across defense applications.
Updated: 2026-05-05 13:19:34
Categories: cs.CR
Can LLMs Make (Personalized) Access Control Decisions?
Precise access control decisions are crucial for the security of both traditional applications and emerging agent-based systems. Typically, these decisions are made by users during app installation or at runtime. However, due to the increasing complexity and automation of systems, making access control decisions can impose a significant cognitive burden on users, often overwhelming them and leading to suboptimal or even arbitrary choices. To address this problem, we investigate the ability of LLMs to make dynamic, context-aware decisions aligned with users' security preferences, expressed during a lightweight setup phase. As a case study, we analyze smartphone application permission requests, given their ubiquity and users' familiarity with them. We curated a dataset comprising 307 user privacy statements (short, natural-language descriptions of user preferences) and 14,682 corresponding permission decisions, gathered from smartphone users in an online data collection. We compare these decisions with those made by two versions of LLMs that are tasked with reasoning about the app and the request context: a general model and a personalized one (which incorporates user preferences). For the latter, we also collected user feedback on 1,298 of its decisions. Our results show that LLMs generally reflect users' preferences well, agreeing with the majority decision in up to 86% of cases, and can steer users toward safer behavior. However, the results also reveal a key trade-off in personalization: while incorporating user-specific privacy preferences improves agreement with individual decisions, strict adherence to these preferences may lead to less safe outcomes, as users tend to over-permission.
Updated: 2026-05-05 13:11:45
Categories: cs.CR,cs.AI
LightSBB-M: Bridging Schrödinger and Bass for Generative Diffusion Modeling
The Schrödinger Bridge and Bass (SBB) formulation, which jointly controls drift and volatility, is an established extension of the classical Schrödinger Bridge (SB). Building on this framework, we introduce LightSBB-M, an algorithm that computes the optimal SBB transport plan in only a few iterations. The method exploits a dual representation of the SBB objective to obtain analytic expressions for the optimal drift and volatility, and it incorporates a tunable parameter β > 0 that interpolates between pure drift (the Schrödinger Bridge) and pure volatility (Bass martingale transport). We show that LightSBB-M achieves the lowest 2-Wasserstein distance on synthetic datasets against state-of-the-art SB and diffusion baselines, with up to a 32% improvement. We also illustrate the generative capability of the framework on an unpaired image-to-image translation task (adult to child faces in FFHQ). These findings demonstrate that LightSBB-M provides a scalable, high-fidelity SBB solver that outperforms existing SB and diffusion baselines across both synthetic and real-world generative tasks. The code is available at https://github.com/alexouadi/LightSBB-M.
Updated: 2026-05-05 13:11:05
Categories: cs.LG,eess.SY,stat.CO,stat.ML
Rethinking the Rank Threshold for LoRA Fine-Tuning
A recent landscape analysis of LoRA fine-tuning in the neural tangent kernel regime establishes a sufficient condition $r(r+1)/2 > KN$ on the LoRA rank $r$ for the absence of spurious local minima under squared-error loss, prescribing $r \geq 12$ on canonical few-shot RoBERTa setups. The condition is stated for general output dimension $K$, so its sharpness in any particular regime, and its practical implication for the cross-entropy loss actually used in fine-tuning, are open. We give three results that together reduce the prescribed rank to $r = 1$ for binary classification in this regime. First, replacing the symmetric Sard-form count with the non-symmetric LoRA manifold dimension yields a strictly weaker capacity requirement, $r(m+n) - r^2 > C^* \cdot KN$ with $C^* \approx 1.35$ under Gaussian-iid features, satisfied at $r = 1$ on canonical setups. Second, in the cross-entropy setting the Polyak--Łojasiewicz inequality removes the rank threshold entirely. Third, a Rademacher-complexity bound predicts rank-one variance optimality precisely when the bias term is saturated, which is the case for binary classification but not for $K > 2$. Empirically, across four GLUE-style binary tasks, three encoder architectures, and at scale on RoBERTa-large, rank one is competitive with the existing prescription $r = 12$; on multi-class MNLI the optimal rank shifts above one, also as predicted. The binary-regime guarantees are conditional on standard NTK assumptions; the multi-class extension is left to future work.
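The two capacity conditions can be compared numerically. The sketch below is illustrative only: it uses m = n = 1024 (RoBERTa-large's hidden size) and hypothetical values of K and N, not the paper's exact counts.

```python
# Illustrative check of the two rank conditions discussed above.
# The values of K (output dimension) and N (sample count) below are
# assumptions for demonstration, not the paper's experimental setup.

def symmetric_condition(r, K, N):
    """Prior sufficient condition: r(r+1)/2 > K*N."""
    return r * (r + 1) / 2 > K * N

def manifold_condition(r, m, n, K, N, c_star=1.35):
    """Weaker requirement via the non-symmetric LoRA manifold
    dimension: r(m+n) - r^2 > C* * K * N."""
    return r * (m + n) - r ** 2 > c_star * K * N

# With m = n = 1024, K = 2, N = 32 (hypothetical), rank 1 already
# satisfies the manifold condition, while the symmetric condition
# requires a much larger rank.
```

With high-dimensional weight matrices, the term r(m + n) dominates, which is why the manifold-dimension count admits rank one where the symmetric count does not.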
Updated: 2026-05-05 13:09:46
Categories: cs.LG,cs.AI
Segmenting Human-LLM Co-authored Text via Change Point Detection
The rise of large language models (LLMs) has created an urgent need to distinguish between human-written and LLM-generated text to ensure authenticity and societal trust. Existing detectors typically provide a binary classification for an entire passage; however, this is insufficient for human--LLM co-authored text, where the objective is to localize specific segments authored by humans or LLMs. To bridge this gap, we propose algorithms to segment text into human- and LLM-authored pieces. Our key observation is that such a segmentation task is conceptually similar to classical change point detection in time-series analysis. Leveraging this analogy, we adapt change point detection to LLM-generated text detection, develop a weighted algorithm and a generalized algorithm to accommodate heterogeneous detection score variability, and establish the minimax optimality of our procedure. Empirically, we demonstrate the strong performance of our approach against a wide range of existing baselines.
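The change-point analogy can be illustrated with a toy single-boundary detector over per-sentence scores; this is a minimal sketch, not the paper's weighted or generalized algorithms, which additionally model heterogeneous score variability.

```python
# Toy illustration: split per-sentence detector scores at the boundary
# that minimizes within-segment squared error, a basic change-point
# criterion. Scores are hypothetical detector outputs in [0, 1].

def single_change_point(scores):
    """Return the split index k (1 <= k < n) that best divides
    `scores` into two homogeneous segments."""
    def sse(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs)
    n = len(scores)
    return min(range(1, n), key=lambda k: sse(scores[:k]) + sse(scores[k:]))
```

For scores such as [0.1, 0.1, 0.1, 0.9, 0.9, 0.9] this returns 3, marking a switch in likely authorship after the third sentence.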
Updated: 2026-05-05 13:08:55
Categories: cs.CL,cs.AI,stat.ME
Distribution-Free Pretraining of Classification Losses via Evolutionary Dynamics
We propose Evolutionary Dynamic Loss (EDL), a framework that learns a transferable classification loss in the probability space using unlimited synthetic prediction-label pairs, without accessing real samples during the main loss pretraining stage. EDL parameterizes the loss as a lightweight network and is trained with a semantics-free ranking-consistency objective that assigns larger penalties for more erroneous predictions. To robustly explore the space of loss functions, we optimize EDL via an evolutionary strategy and introduce chaotic mutation to improve exploration under noisy fitness evaluations. Experiments on CIFAR-10 with ResNet backbones show that EDL can serve as a drop-in replacement for cross-entropy and achieves competitive or improved accuracy, while ablation studies confirm that chaotic mutation yields faster convergence and better synthetic pretraining metrics than standard Gaussian mutation.
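A minimal sketch of the chaotic-mutation idea, using the logistic map as the chaos source; the map, its parameters, and the mutation scale are assumptions for illustration, and the paper's exact operator may differ.

```python
# Sketch: perturb parameters with a logistic-map sequence instead of
# Gaussian noise. Constants (r=4.0, x0, scale) are illustrative.

def logistic_sequence(x0, n, r=4.0):
    """Generate n values of the logistic map x <- r*x*(1-x)."""
    xs, x = [], x0
    for _ in range(n):
        x = r * x * (1 - x)
        xs.append(x)
    return xs

def chaotic_mutate(params, x0=0.37, scale=0.1):
    """Shift each parameter by a chaotic value mapped to [-scale, scale]."""
    noise = logistic_sequence(x0, len(params))
    return [p + scale * (2 * z - 1) for p, z in zip(params, noise)]
```

Because the logistic map stays in [0, 1], the perturbation of each parameter is bounded by the chosen scale, while successive values remain decorrelated, which is the exploration property chaotic mutation targets.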
Updated: 2026-05-05 13:08:18
Categories: cs.LG
Tempered Guided Diffusion
Training-free conditional diffusion provides a flexible alternative to task-specific conditional model training, but existing samplers often allocate computation inefficiently: independent guided trajectories can vary widely in quality, and additional function evaluations along a single trajectory may not recover from poor early decisions. We propose Tempered Guided Diffusion (TGD), an annealed sequential Monte Carlo framework for training-free conditional sampling with diffusion priors. TGD targets tempered posterior distributions over the clean signal, using noisy diffusion states only as auxiliary variables for proposing reconstructions and propagating particles. Particles are reweighted by incremental likelihood ratios, resampled, and propagated across noise levels, concentrating computation on trajectories plausible under both the prior and observation. Under idealized exact-reconstruction assumptions, full TGD yields a consistent particle approximation to the posterior as the number of particles grows. For expensive reconstruction tasks, Accelerated TGD (A-TGD) retains early particle exploration but prunes to a single high-likelihood trajectory partway through sampling. Experiments on a controlled two-dimensional inverse problem and image inverse problems show improved posterior approximation and favorable wall-clock speed-quality tradeoffs over independent multi-trajectory baselines.
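The reweight/resample step at the heart of any sequential Monte Carlo scheme like this one can be sketched as follows; the incremental log-weights here are placeholders for the paper's likelihood ratios, and multinomial resampling is one simple choice among several.

```python
# Sketch of one SMC reweight/resample step: convert incremental
# log-likelihood ratios to normalized weights (with a max-shift for
# numerical stability), then resample an equally weighted population.
import math
import random

def reweight_resample(particles, log_increments, seed=0):
    """Multinomial resampling of `particles` proportional to
    exp(log_increments)."""
    m = max(log_increments)
    w = [math.exp(lw - m) for lw in log_increments]
    total = sum(w)
    probs = [x / total for x in w]
    rng = random.Random(seed)
    return rng.choices(particles, weights=probs, k=len(particles))
```

Particles with very low incremental likelihood receive vanishing weight and are dropped, which is how compute concentrates on trajectories plausible under both prior and observation.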
Updated: 2026-05-05 13:00:15
Categories: stat.ML,cs.LG
SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection
Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length $\gamma$, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed $\gamma$ (typically 4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model. In this paper, we present SpecKV, a lightweight adaptive controller that selects $\gamma$ per speculation step using signals extracted from the draft model itself. We profile speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels (FP16, INT8, NF4), collecting 5,112 step-level records with per-step acceptance rates, draft entropy, and draft confidence. We demonstrate that the optimal $\gamma$ shifts across compression regimes and that draft model confidence and entropy are strong predictors of acceptance rate (correlation $\approx$ 0.56). SpecKV uses a small MLP trained on these signals to maximize expected tokens per speculation step, achieving a 56.0% improvement over the fixed-$\gamma=4$ baseline with only 0.34 ms overhead per decision (<0.5% of step time). The improvement is statistically significant (p < 0.001, paired bootstrap test). We release all profiling data, trained models, and notebooks as open-source artifacts.
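Under the standard i.i.d.-acceptance model of speculative decoding, the quantity such a controller maximizes can be sketched as below; the acceptance rate and cost constants are illustrative, and SpecKV itself predicts acceptance from draft confidence and entropy with a small MLP rather than assuming a closed-form rate.

```python
# Sketch: expected accepted tokens per verification step, and the
# speculation length that maximizes tokens per unit wall-clock cost.
# draft_cost / target_cost are hypothetical per-call latencies.

def expected_tokens(accept_rate, gamma):
    """Expected tokens per step with draft length gamma and i.i.d.
    per-token acceptance probability accept_rate (the standard
    speculative-decoding result)."""
    a = accept_rate
    if a >= 1.0:
        return gamma + 1
    return (1 - a ** (gamma + 1)) / (1 - a)

def best_gamma(accept_rate, draft_cost, target_cost, gammas=range(1, 9)):
    """Choose gamma maximizing throughput, assuming each step costs
    gamma * draft_cost + target_cost."""
    def throughput(g):
        return expected_tokens(accept_rate, g) / (g * draft_cost + target_cost)
    return max(gammas, key=throughput)
```

Higher acceptance rates push the optimum toward longer speculation, consistent with the observation that the best $\gamma$ shifts across task types and compression regimes.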
Updated: 2026-05-05 12:57:37
Categories: cs.LG,cs.AI,cs.CL,cs.DC,eess.SY
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC-level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug-type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware-semantic reasoning, and cross-artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware-aware agents.
Updated: 2026-05-05 12:56:34
Categories: cs.AI
Amortized Variational Inference for Joint Posterior and Predictive Distributions in Bayesian Uncertainty Quantification
Bayesian predictive inference propagates parameter uncertainty to quantities of interest through the posterior-predictive distribution. In practice, this is typically performed using a two-stage procedure: first approximating the posterior distribution of model parameters, and then propagating posterior samples through the predictive model via Monte Carlo simulation. This sequential workflow can be computationally demanding, particularly for high-fidelity models such as those governed by partial differential equations. We propose a variational Bayesian framework that directly targets the posterior-predictive distribution and jointly learns variational approximations of both the posterior and the corresponding predictive distribution. The formulation introduces a variational upper bound on the Kullback--Leibler divergence together with moment-based regularization terms. The variational distributions are trained in an amortized manner, shifting computational effort to an offline stage and enabling efficient online inference. Numerical experiments ranging from analytical benchmarks to a finite-element solid mechanics problem demonstrate that the proposed method achieves more accurate predictive distributions than conventional two-stage variational inference, while substantially reducing the cost of online predictive inference.
Updated: 2026-05-05 12:56:00
Categories: stat.ML,cs.AI,cs.LG,stat.CO,stat.ME
SAM-NER: Semantic Archetype Mediation for Zero-Shot Named Entity Recognition
Zero-shot Named Entity Recognition (ZS-NER) remains brittle under domain and schema shifts, where unseen label definitions often misalign with a large language model's (LLM's) intrinsic semantic organization. As a result, directly mapping entity mentions to fine-grained target labels can induce systematic semantic drift, especially when target schemas are novel or semantically overlapping. We propose \textbf{SAM-NER}, a three-stage framework based on \emph{Semantic Archetype Mediation} that stabilizes cross-domain transfer through an intermediate, domain-invariant archetype space. SAM-NER: (i) performs \emph{Entity Discovery} via cooperative extraction and consensus-based denoising to obtain high-coverage, high-fidelity entity spans; (ii) conducts \emph{Abstract Mediation} by projecting entities into a compact set of universal semantic archetypes distilled from high-level ontological abstractions; and (iii) applies \emph{Semantic Calibration} to resolve archetype-level predictions into target-domain types through constrained, definition-aligned inference with a frozen LLM. Experiments on the CrossNER benchmark show that SAM-NER consistently outperforms strong prior ZS-NER baselines in cross-domain settings. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/SAM-NER.
Updated: 2026-05-05 12:54:17
Categories: cs.CL,cs.AI
SERE: Structural Example Retrieval for Enhancing LLMs in Event Causality Identification
Event Causality Identification (ECI) requires models to determine whether a given pair of events in a context exhibits a causal relationship. While Large Language Models (LLMs) have demonstrated strong performance across various NLP tasks, their effectiveness in ECI remains limited due to biases in causal reasoning, often leading to overprediction of causal relationships (causal hallucination). To mitigate these issues and enhance LLM performance in ECI, we propose SERE, a structural example retrieval framework that leverages LLMs' few-shot learning capabilities. SERE introduces an innovative retrieval mechanism based on three structural concepts: (i) Conceptual Path Metric, which measures the conceptual relationship between events using edit distance in ConceptNet; (ii) Syntactic Metric, which quantifies structural similarity through tree edit distance on syntactic trees; and (iii) Causal Pattern Filtering, which filters examples based on predefined causal structures using LLMs. By integrating these structural retrieval strategies, SERE selects more relevant examples to guide LLMs in causal reasoning, mitigating bias and improving accuracy in ECI tasks. Extensive experiments on multiple ECI datasets validate the effectiveness of SERE. The source code is publicly available at https://github.com/DMIRLAB-Group/SERE.
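The edit-distance idea behind the Conceptual Path Metric can be sketched with plain Levenshtein distance over sequences of concepts; the ConceptNet-style paths in the usage example are hypothetical.

```python
# Sketch: Levenshtein distance over sequences. Applied to concept
# paths, it quantifies how far apart two events are conceptually.

def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

The same routine works on character strings or on lists of concept nodes, e.g. `edit_distance(["dog", "IsA", "animal"], ["cat", "IsA", "animal"])` counts one substitution; the Syntactic Metric applies the analogous idea to trees via tree edit distance.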
Updated: 2026-05-05 12:50:19
Categories: cs.CL,cs.AI
Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts
Smart contracts on blockchains are prone to diverse security vulnerabilities that can lead to significant financial losses due to their immutable nature. Existing detection approaches often lack flexibility across vulnerability types and rely heavily on manually crafted expert rules. In this paper, we present an LLM-based framework for practical smart contract vulnerability detection. We construct and release a large-scale dataset comprising 31,165 professionally annotated vulnerability instances collected from over 3,200 real-world projects across 15 major blockchain platforms. Our approach leverages precise AST-based context extraction and vulnerability-specific prompt design to instantiate customized detectors for 13 prevalent vulnerability categories. Experimental results demonstrate strong effectiveness, achieving an average positive recall of 0.92 and an average negative recall of 0.85, highlighting the potential of carefully engineered contextual prompting for scalable and high-precision smart contract security analysis.
Updated: 2026-05-05 12:40:02
Categories: cs.CR,cs.AI
Graph Neural Network based Hierarchy-Aware Embeddings of Knowledge Graphs: Applications to Yeast Phenotype Prediction
We present a method for finding hierarchy-aware embeddings of knowledge graphs (KGs) using graph neural networks (GNNs) enriched with a semantic loss derived from underlying ontologies. This method yields embeddings that better reflect domain knowledge. To demonstrate their utility, we predict and interpret the effects of gene deletions in the yeast Saccharomyces cerevisiae and learn box embeddings for KGs in the absence of a prediction task. We further show how box embeddings can serve as the basis for evaluating KG revisions. Our yeast KG is constructed from community databases and ontology terms. Low-dimensional box embeddings combined with GNNs are used to predict cell growth for double gene knockouts. Over 10-fold cross validation, these predictions have a mean $R^2$ score of 0.360, significantly higher than baseline comparisons, demonstrating that high-level qualitative knowledge is informative about experimental outcomes. Incorporating semantic loss terms in the training of the models improves their predictive performance ($R^2$=0.377) by aligning embeddings with ontology structure. This shows that class hierarchies from ontologies can be exploited for quantitative prediction. We also test the trained models on triple gene knockouts, showing they generalise to data beyond those seen in training. Additionally, by identifying co-occurring relations in the yeast KG important for the cell-growth predictions, we construct hypotheses about interacting traits in yeast. A biological experiment validates one such finding, revealing an association between inositol utilisation and osmotic stress resistance, highlighting the model's potential to guide biological discovery.
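Box embeddings score hierarchical relations by overlap volume; a minimal sketch follows, where the dimensionality and box corners are illustrative rather than learned.

```python
# Sketch: axis-aligned boxes model "is-a" structure; containment of
# box B in box A scores "B is subsumed by A". Corners are toy values.

def box_volume(lo, hi):
    """Volume of an axis-aligned box given lower/upper corners."""
    v = 1.0
    for l, h in zip(lo, hi):
        v *= max(h - l, 0.0)
    return v

def containment_score(lo_a, hi_a, lo_b, hi_b):
    """vol(A intersect B) / vol(B): approaches 1 when B sits inside A."""
    inter_lo = [max(x, y) for x, y in zip(lo_a, lo_b)]
    inter_hi = [min(x, y) for x, y in zip(hi_a, hi_b)]
    vol_b = box_volume(lo_b, hi_b)
    return box_volume(inter_lo, inter_hi) / vol_b if vol_b > 0 else 0.0
```

A semantic loss can then penalize embeddings where an ontology subclass's box is not contained in its superclass's box, which is the alignment effect described above.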
Updated: 2026-05-05 12:34:45
Categories: cs.LG,cs.AI,q-bio.QM
From Code to Prediction: Fine-Tuning LLMs for Neural Network Performance Classification in NNGPT
Automated Machine Learning (AutoML) frameworks increasingly leverage Large Language Models (LLMs) for tasks such as hyperparameter optimization and neural architecture code generation. However, current LLM-based approaches focus on generative outputs and evaluate them by training the produced artifacts. Whether LLMs can learn to reason about neural network performance across datasets remains underexplored. We present a classification task integrated into the NNGPT framework, in which a fine-tuned LLM predicts which of two image classification datasets a given neural network architecture achieves higher accuracy on. The task is built on the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics. Three prompt configurations of increasing difficulty are evaluated: a normalized-accuracy baseline (trivially reaching 100%), a metadata-enriched prompt replacing accuracies with dataset properties, and a code-only prompt presenting only architecture source code and dataset names. Using DeepSeek-Coder-7B-Instruct fine-tuned with LoRA, the code-only prompt reaches 80% peak accuracy over 15 epochs, while the metadata prompt peaks at 70%. Per-dataset analysis reveals complementary strengths: metadata excels for datasets with distinctive properties (CelebAGender at 90.9%) but degrades for overlapping characteristics, whereas the code-only prompt shows more balanced performance. A comparison with DeepSeek-Coder-1.3B confirms that model capacity affects this form of architectural reasoning. The results establish that LLMs can be fine-tuned to predict cross-dataset suitability from neural network code, suggesting that architecture source code contains richer discriminative signal than dataset metadata alone.
Updated: 2026-05-05 12:30:19
Categories: cs.LG,cs.CV
Evaluating Semantic Fragility in Text-to-Audio Generation Systems Under Controlled Prompt Perturbations
Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. In this study, we evaluate the semantic fragility of text-to-audio systems under controlled prompt perturbations. We selected MusicGen-small, MusicGen-large, and Stable Audio 2.5 as representative models, and we evaluated them under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary spectral, temporal, and semantic similarity measures, enabling robustness analysis across multiple representational levels. Experimental results show that larger models achieve improved semantic consistency, with MusicGen-large reaching cosine similarities of 0.77 under MLS and 0.82 under IS. However, acoustic and temporal analyses reveal persistent divergence across all models, even when embedding similarity remains high. These findings indicate that fragility arises primarily during semantic-to-acoustic realization rather than multi-modal embedding alignment. Our study introduces a controlled framework for evaluating robustness in text-to-audio generation and highlights the need for multi-level stability assessment in generative audio systems.
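The semantic side of such an evaluation reduces to comparing embeddings of outputs for perturbed prompts against a reference; a minimal sketch follows, where the toy vectors stand in for real audio/text embeddings.

```python
# Sketch: mean cosine similarity between a reference generation's
# embedding and embeddings of generations from perturbed prompts.

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def robustness_score(reference, variants):
    """Average similarity of perturbed-prompt outputs to the reference."""
    return sum(cosine(reference, v) for v in variants) / len(variants)
```

A score near 1 indicates semantic stability under perturbation; the study's point is that high embedding similarity alone can mask acoustic and temporal divergence, so this measure is complemented with spectral and temporal comparisons.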
Updated: 2026-05-05 12:29:09
Categories: cs.SD,cs.AI
Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding, which is ill-suited to discrete diffusion language models (dLLMs) due to their parallel decoding over the entire sequence. As a result, developing effective and efficient TTS methods to unlock dLLMs' full generative potential remains an underexplored challenge. To address this, we propose Prism (Pruning, Remasking, and Integrated Self-verification Method), an efficient TTS framework for dLLMs that (i) performs Hierarchical Trajectory Search (HTS) which dynamically prunes and reallocates compute in an early-to-mid denoising window, (ii) introduces Local branching with partial remasking to explore diverse implementations while preserving high-confidence tokens, and (iii) replaces external verifiers with Self-Verified Feedback (SVF) obtained via self-evaluation prompts on intermediate completions. Across four mathematical reasoning and code generation benchmarks on three dLLMs, including LLaDA 8B Instruct, Dream 7B Instruct, and LLaDA 2.0-mini, our Prism achieves a favorable performance-efficiency trade-off, matching best-of-N performance with substantially fewer function evaluations (NFE). The code is released at https://github.com/viiika/Prism.
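The prune-and-reallocate step of a hierarchical trajectory search can be sketched as follows; the scoring and duplication policy here are simplified stand-ins for HTS and the self-verified feedback scores.

```python
# Sketch: keep the top fraction of denoising trajectories by
# (self-verification) score, then duplicate the survivors so the
# population size, and hence the compute budget per step, is constant.

def hierarchical_prune(trajectories, scores, keep_fraction=0.5):
    """Prune low-scoring trajectories and reallocate their compute
    to copies of high-scoring ones."""
    k = max(1, int(len(trajectories) * keep_fraction))
    ranked = sorted(zip(scores, range(len(trajectories))), reverse=True)
    kept = [trajectories[i] for _, i in ranked[:k]]
    out = []
    while len(out) < len(trajectories):
        out.extend(kept)
    return out[:len(trajectories)]
```

Applying this in an early-to-mid denoising window, as the abstract describes, concentrates the fixed evaluation budget on promising paths instead of spreading it uniformly over independent trajectories.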
Updated: 2026-05-05 12:24:00
Categories: cs.LG
Real Image Denoising with Knowledge Distillation for High-Performance Mobile NPUs
While deep-learning-based image restoration has achieved unprecedented fidelity, deployment on mobile Neural Processing Units (NPUs) remains bottlenecked by operator incompatibility and memory-access overhead. We propose an NPU-aware hardware-algorithm co-design approach for real-world image denoising on mobile NPUs. Our approach employs a high-capacity teacher to supervise a lightweight student network specifically designed to leverage the tiled-memory architectures of modern mobile SoCs. By prioritizing NPU-native primitives -- standard 3x3 convolutions, ReLU activations, and nearest-neighbor upsampling -- and employing a progressive context expansion strategy (up to 1024x1024 crops), the model achieves 37.66 dB PSNR / 0.9278 SSIM on the validation benchmark and 37.58 dB PSNR / 0.9098 SSIM on the held-out test benchmark at full resolution (2432x3200) in the Mobile AI 2026 challenge. Following the official challenge rules, the inference runtime is measured under a standardized Full HD (1088x1920) protocol, where it runs in 34.0 ms on the MediaTek Dimensity 9500 and 46.1 ms on the Qualcomm Snapdragon 8 Elite NPU. We further reveal an "Inference Inversion" effect, where strict adherence to NPU-compatible operations enables dedicated NPU execution up to 3.88x faster than the integrated mobile GPU. The 1.96M-parameter student recovers 99.8% of the teacher's restoration quality via high-alpha knowledge distillation (alpha = 0.9), achieving a 21.2x parameter reduction while closing the PSNR gap from 1.63 dB to only 0.05 dB. These results establish hardware-aware distillation as an effective strategy for unifying high-fidelity denoising with practical deployment across diverse mobile NPU architectures. The proposed lightweight student model (LiteDenoiseNet) and its training statistics are provided in the NN Dataset, available at https://github.com/ABrain-One/NN-Dataset.
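The high-alpha distillation objective (alpha = 0.9) weights imitation of the teacher far above the ground-truth term. A minimal sketch, assuming an L1 per-pixel loss (the abstract does not name the exact loss):

```python
import numpy as np

def distillation_loss(student_out, teacher_out, target, alpha=0.9):
    """alpha * ||student - teacher||_1 + (1 - alpha) * ||student - target||_1,
    averaged per pixel. With alpha = 0.9 the lightweight student mostly
    tracks the high-capacity teacher's restored output."""
    distill = np.mean(np.abs(student_out - teacher_out))  # match the teacher
    task = np.mean(np.abs(student_out - target))          # match ground truth
    return alpha * distill + (1.0 - alpha) * task
```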
Updated: 2026-05-05 12:19:30
Categories: cs.CV,cs.LG
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on this insight, we propose Uni-OPD, a unified OPD framework that generalizes across Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), centered on a dual-perspective optimization strategy. Specifically, from the student's perspective, we adopt two data balancing strategies to promote exploration of informative student-generated states during training. From the teacher's perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism to restore order consistency between correct and incorrect trajectories. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.
Updated: 2026-05-05 12:15:21
Categories: cs.LG
MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents
Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentage points over 72-hour operation windows due to four compounding failure modes in existing flat-file memory systems. We present MEMTIER, a tripartite memory architecture for the OpenClaw agent runtime that introduces a structured episodic JSONL store, a five-signal weighted retrieval engine, an attention-attributed cognitive weight update loop, an asynchronous consolidation daemon promoting episodic facts to a semantic tier, and a PPO-based policy framework for adapting retrieval weights (infrastructure validated; performance gains pending camera-ready). On the full 500-question LongMemEval-S benchmark (Wu et al., 2025), MEMTIER achieves Acc=0.382, F1=0.412 with Qwen2.5-7B on a consumer 6GB GPU - a +33 percentage point improvement over the full-context baseline (0.050 -> 0.382, i.e., 5% -> 38%). With DeepSeek-V4-Flash fact pre-population, single-session recall reaches 0.686-0.714, exceeding the paper's RAG BM25 GPT-4o baseline (0.560) on those categories. Temporal reasoning rises to 0.323 and multi-session synthesis to 0.173, demonstrating that structured semantic pre-population qualitatively changes what lightweight retrieval can achieve. All phases run locally on a consumer laptop with a 6GB GPU.
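The five-signal weighted retrieval engine can be sketched as a linear combination of per-memory signals. The specific signals below (lexical overlap, recency, cognitive weight, memory tier, usage frequency) and their weights are illustrative assumptions, not MEMTIER's actual feature set:

```python
import math

def retrieval_score(memory, query_terms, weights, now):
    """Score one stored memory as a weighted sum of five signals.
    All signal definitions here are hypothetical stand-ins."""
    terms = set(memory["text"].lower().split())
    lexical = len(terms & query_terms) / max(len(query_terms), 1)
    recency = math.exp(-(now - memory["t"]) / 3600.0)    # hour-scale decay
    salience = memory["weight"]                          # cognitive weight from the update loop
    tier = 1.0 if memory["tier"] == "semantic" else 0.5  # consolidated facts rank higher
    usage = min(memory["hits"] / 10.0, 1.0)              # capped retrieval frequency
    return sum(w * s for w, s in zip(weights, (lexical, recency, salience, tier, usage)))
```

Under the paper's PPO extension, the `weights` vector would be the quantity adapted by the learned policy.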
Updated: 2026-05-05 12:14:10
Categories: cs.AI
Prediction horizon shapes representations in predictive learning
Predictive learning has emerged as a central paradigm for training models across diverse data domains and is increasingly viewed as a foundation for modern artificial intelligence. A common intuition for this success is that accurate prediction requires models to capture the underlying dynamics of the environment, leading to the emergence of structured world models. However, predictive learning does not universally yield such representations, and a mechanistic account of when and why it does remains incomplete. In this work, we identify the prediction horizon as a critical, but often implicit, component of predictive learning objectives. We show that increasing the prediction horizon fundamentally shapes the effective structure of the learning problem. In a minimal setting, we demonstrate both theoretically and empirically that the model's implicit biases interact with this structural change to recover the latent geometry of the task. We then extend these empirical results to nonlinear architectures and more complex datasets, where similar phenomena persist. These findings provide a principled explanation for the emergence of structured representations in predictive learning paradigms and clarify the conditions under which such representations should be expected.
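In a minimal linear setting, lengthening the horizon changes the effective prediction target from the one-step transition A to its k-th power. A toy k-step objective illustrating that structural shift (the exact loss used in the paper's analysis is an assumption here):

```python
import numpy as np

def horizon_loss(A_hat, A_true, x, horizon):
    """MSE between the model's `horizon`-step prediction under a learned
    linear transition A_hat and a rollout of the true dynamics A_true:
    the effective target becomes A_true ** horizon, not A_true itself."""
    pred = np.linalg.matrix_power(A_hat, horizon) @ x
    target = np.linalg.matrix_power(A_true, horizon) @ x
    return float(np.mean((pred - target) ** 2))
```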
Updated: 2026-05-05 12:11:24
Categories: cs.LG,q-bio.NC
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
Open-vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training-free methods commonly rely on multi-view fusion of semantic embeddings into a 3D map, either at the instance-level via segmenting views and encoding image crops of segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D-to-3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual-layer semantic mapping method that jointly maintains both dense and instance-level open-vocabulary layers within a shared voxel map. This design enables further voxel-level semantic fusion of the layer embeddings, combining the complementary strengths of both semantic mapping approaches. We find that our proposed semantic cross-layer fusion approach improves the quality of both the instance-level and dense layers, while also enabling a scalable and highly accurate instance-level map where the dense layer and cross-layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks as well as a selection of large-scale scenes show that FUS3DMaps achieves accurate open-vocabulary semantic mapping at multi-story building scales. Additional material and code will be made available: https://githanonymous.github.io/FUS3DMaps/.
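Cross-layer fusion at the voxel level can be sketched as blending a voxel's dense (patch-level) embedding with its instance-level embedding and renormalizing. The equal-weight convex blend is an assumption; the abstract does not give the exact fusion rule:

```python
import numpy as np

def fuse_voxel_embedding(dense_emb, instance_emb, w_dense=0.5, eps=1e-8):
    """Blend the two open-vocabulary layers for one voxel and renormalize,
    so the fused vector stays comparable under cosine similarity to
    text-query embeddings."""
    fused = w_dense * dense_emb + (1.0 - w_dense) * instance_emb
    return fused / (np.linalg.norm(fused) + eps)
```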
Updated: 2026-05-05 12:08:16
Categories: cs.RO,cs.AI
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
Skill libraries in deployed robotic systems are continually updated through fine-tuning, fresh demonstrations, or domain adaptation, yet existing typed-composition methods (BLADE, SymSkill, Generative Skill Chaining) treat the library as frozen at test time and do not analyze how composition outcomes change when a skill is replaced. We introduce a paired-sampling cross-version swap protocol on robosuite manipulation tasks to characterize this dimension of compositional skill learning. On a dual-arm peg-in-hole task we discover a dominant-skill effect: one ECM achieves 86.7% atomic success rate while every other ECM is at or below 26.7%, and whether this dominant ECM enters a composition shifts the success rate by up to +50pp. We characterize the boundary on a simpler pick task where all atomic policies saturate at 100% and the effect is undefined. Across three tasks we further find that off-policy behavioral distance metrics fail to identify the dominant ECM, ruling out the natural cheap predictor. We propose an atomic-quality probe and a Hybrid Selector combining per-skill probes (zero per-decision cost) with selective composition revalidation (full cost), and characterize its Pareto frontier on 144 skill-update decisions. On T6 the atomic-only probe sits 23pp below full revalidation (64.6% vs 87.5% oracle match) at zero per-decision cost; a Hybrid Selector with m=10 closes most of that gap to ~12pp at 46% of full-revalidation cost. On the cross-task average over 144 events, atomic-only is within 3pp of full revalidation under a mixed-oracle caveat. The atomic-quality probe is, to our knowledge, the first principled, deployment-ready primitive for skill-update governance in compositional robot policies.
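One way to read the Hybrid Selector: rank candidate skill versions by the cheap atomic-quality probe, spend the revalidation budget `m` only on the probe's top candidates, and pick among those by full composition revalidation. This split is an illustrative reading of the abstract, not the paper's exact procedure:

```python
def hybrid_select(candidates, atomic_probe, revalidate, m):
    """Rank by the zero-per-decision-cost atomic probe, then revalidate
    only the top-m candidates at full cost and return the best of those."""
    ranked = sorted(candidates, key=atomic_probe, reverse=True)
    shortlist = ranked[:m]
    return max(shortlist, key=revalidate)
```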
Updated: 2026-05-05 12:06:50
Categories: cs.RO,cs.AI
ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity
Large Language Models (LLMs) have achieved remarkable capabilities, but their immense computational demands during training remain a critical bottleneck for widespread adoption. Low-rank training has received attention in recent years due to its ability to significantly reduce training memory usage. Meanwhile, applying 2:4 structured sparsity to weights and activations to leverage NVIDIA GPU support for 2:4 structured sparse format has become a promising direction. However, existing low-rank methods often leave activation matrices in full-rank, which dominates memory consumption and limits throughput during large-batch training. Furthermore, directly applying sparsity to weights often leads to non-negligible performance degradation. To achieve efficient pre-training of LLMs, this paper proposes ELAS: Efficient pre-training of Low-rank LLMs via 2:4 Activation Sparsity, a novel framework for low-rank models via 2:4 activation sparsity. ELAS applies squared ReLU activation functions to the feed-forward networks in low-rank models and implements 2:4 structured sparsity on the activations after the squared ReLU operation. We evaluated ELAS through pre-training experiments on LLaMA models ranging from 60M to 1B parameters. The results demonstrate that ELAS maintains performance with minimal degradation after applying 2:4 activation sparsity, while achieving training and inference acceleration. Moreover, ELAS reduces activation memory overhead, particularly with large batch sizes. Code is available at ELAS Repo.
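The 2:4 pattern applied after the squared ReLU keeps, in every contiguous group of four activations, only the two largest magnitudes; this is the structured-sparse format NVIDIA GPUs accelerate. A minimal NumPy sketch of that masking step (the real implementation would run on sparse tensor cores):

```python
import numpy as np

def squared_relu(x):
    """Squared ReLU, relu(x)**2, used in the FFN of the low-rank model."""
    return np.maximum(x, 0.0) ** 2

def sparsify_2_4(x):
    """2:4 structured sparsity: in every contiguous group of 4 values
    along the last axis, keep the 2 largest magnitudes, zero the rest."""
    orig_shape = x.shape
    groups = x.reshape(-1, 4)
    # indices of the 2 smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    out = groups.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(orig_shape)

acts = sparsify_2_4(squared_relu(np.random.randn(4, 8)))
```

Note the squared ReLU already zeroes negative pre-activations, so many groups satisfy the 2:4 constraint before masking, which limits the information lost.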
Updated: 2026-05-05 12:04:51
Categories: cs.LG,cs.AI
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent's long-term memory via a single untrusted tool call (e.g., a crafted email), which activates only when the user later discusses sensitive topics such as finance, health, or identity, and exfiltrates high-value personal data to the attacker. While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and defenses. We introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve-based adaptive red-teaming benchmark that stress-tests defenses and memory backends against continuously refined attacks, and (2) the first capability-aware security/utility analysis for persistent memory systems, enabling principled reasoning about defense deployment across different usage profiles. Instantiated on an email assistant across four memory backends (explicit tool memory, agentic memory, RAG, and sliding-window context), Trojan Hippo achieves up to 85-100% ASR against current frontier models from OpenAI and Google, with planted memories successfully activating even after 100 benign sessions. We evaluate four memory-system defenses inspired by basic security principles, finding they substantially reduce attack success rates (to as low as 0-5%), though at utility costs that vary widely with task requirements. Because of this substantial security-utility tradeoff, the effective real-world deployment of defenses remains an open challenge, which our evaluation framework is specifically designed to address.
Updated: 2026-05-05 11:52:20
Categories: cs.CR,cs.AI
Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, each agent's learning signal is computed from a shared return that depends on all agents, so the stochasticity of the other agents enters the signal as cross-agent noise that grows with $N$. Fortunately, many engineering systems, such as cloud computing and power systems, have differentiable analytical models that prescribe efficient system states, providing a new reference beyond noisy shared returns. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that augments policy-gradient updates with a noise-free descent signal derived from differentiable analytical models. We prove that DG-PG reduces policy-gradient estimator variance from $\mathcal{O}(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\widetilde{\mathcal{O}}(1/\epsilon)$. On a heterogeneous cloud resource scheduling task with up to 1500 agents, DG-PG converges within 20 episodes on average, while MAPPO and IPPO fail to converge under identical architectures.
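The variance claim (from O(N) down to O(1)) comes from mixing the noisy shared-return gradient estimate with a noise-free descent direction from the analytical model. A toy demonstration of the variance reduction; the simple convex mixing and the 0.8 blend weight are illustrations, not the paper's exact update rule:

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0])

# Monte-Carlo policy-gradient estimates corrupted by cross-agent noise.
noisy = true_grad + rng.normal(scale=3.0, size=(1000, 2))

# Descent-guided estimates: blend each sample with the noise-free
# analytic descent direction derived from the system model.
guided = 0.2 * noisy + 0.8 * true_grad
```

Scaling the noisy component by 0.2 scales its variance by 0.04 while leaving the expected direction unchanged, which is the intuition behind the agent-count-independent estimator variance.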
Updated: 2026-05-05 11:51:06
Categories: cs.MA,cs.AI,cs.LG
Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives
Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. This depth-first case study provides the first direct comparison of state-of-the-art LLMs and mental health professionals in assessing Borderline (BPD) and Narcissistic (NPD) Personality Disorders based on Polish-language first-person autobiographical accounts. Within our sample, the overall diagnostic scores of the top-performing Gemini Pro models (65.48%) were 21.91 percentage points higher than the average scores of the human professionals (43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a potential reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patients' sense of self and temporal experience. Our findings demonstrate that while LLMs might be competent at interpreting complex first-person clinical data, their outputs still carry critical reliability and bias issues.
Updated: 2026-05-05 11:48:23
Categories: cs.CL,cs.AI,cs.CY,cs.HC
psifx -- Psychological and Social Interactions Feature Extraction Package
psifx is a plug-and-play multi-modal feature extraction toolkit, aiming to facilitate and democratize the use of state-of-the-art machine learning techniques for human sciences research. It is motivated by a need (a) to automate and standardize data annotation processes that typically require expensive, lengthy, and inconsistent human labour; (b) to develop and distribute open-source community-driven psychology research software; and (c) to enable large-scale access and ease of use for non-expert users. The framework contains an array of tools for tasks such as speaker diarization, closed-caption transcription and translation from audio; body, hand, and facial pose estimation and gaze tracking with multi-person tracking from video; and interactive textual feature extraction supported by large language models. The package has been designed with a modular and task-oriented approach, enabling the community to add or update new tools easily. This combination creates new opportunities for in-depth study of real-time behavioral phenomena in psychological and social science research.
Updated: 2026-05-05 11:47:57
Categories: cs.CL,cs.LG
Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning
Music-inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing methods suffer from several notable limitations: the low interpretability of rule-based approaches, the restriction to single-primary-light control in music-to-color-space methods, and the limited transferability of music-to-controlling-parameter frameworks. To address these gaps, we propose SeqLight, a hierarchical deep learning framework that maps music to multi-light Hue-Saturation-Value (HSV) space. Our approach first customizes SkipBART, an end-to-end single primary light generation model, to predict the full light color distribution for each frame, followed by hybrid Imitation Learning (IL) techniques to derive an effective decomposition strategy that distributes the global color distribution among individual lights. Notably, the light decomposition module can be trained under varying venue-specific lighting configurations using only mixed light data and no professional demonstrations, thereby flexibly adapting across diverse venues. In this stage, we formulate the light decomposition task as a Goal-Conditioned Markov Decision Process (GCMDP), construct an expert demonstration set inspired by Hindsight Experience Replay (HER), and introduce a three-phase IL training pipeline, achieving strong generalization capability. To validate our IL solution for the proposed GCMDP, we conduct a series of quantitative analysis and human study. The code and trained models are provided at https://github.com/RS2002/SeqLight .
Updated: 2026-05-05 11:41:53
Categories: cs.MM,cs.AI
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We will publicly release the AniMatrix model weights and inference code.
Updated: 2026-05-05 11:36:52
Categories: cs.CV,cs.AI
Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence
The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/
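Frame-to-frame identity via bipartite matching on slot representations can be sketched with a tiny brute-force matcher over cosine similarity. The real system uses the Hungarian algorithm; exhaustive search over permutations is only workable for a handful of slots:

```python
import numpy as np
from itertools import permutations

def match_slots(prev, curr):
    """Assign each slot at frame t to a slot at frame t+1 by maximizing
    total cosine similarity (brute-force stand-in for Hungarian matching).
    Returns best[i] = index in `curr` matched to slot i of `prev`."""
    p = prev / np.linalg.norm(prev, axis=1, keepdims=True)
    c = curr / np.linalg.norm(curr, axis=1, keepdims=True)
    sim = p @ c.T  # slot-to-slot cosine similarity
    best = max(permutations(range(len(curr))),
               key=lambda perm: sum(sim[i, j] for i, j in enumerate(perm)))
    return list(best)

prev = np.eye(3)             # three orthogonal "slots" at frame t
curr = np.eye(3)[[2, 0, 1]]  # the same slots, permuted at frame t+1
```

Because the matcher is deterministic and parameter-free, temporal consistency requires zero learnable parameters, which is the paper's central point.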
Updated: 2026-05-05 11:29:43
Categories: cs.CV,cs.LG
AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules
Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionality into loosely coordinated modules or multiple agents, often without a coherent model of identity and control authority. We argue that a robot should be modeled as a single persistent intelligent subject whose capabilities are extended through installable packages. We formalize this view as AEROS (Agent Execution Runtime Operating System), in which each robot corresponds to one persistent agent and capabilities are provided through Embodied Capability Modules (ECMs). Each ECM encapsulates executable skills, models, and tools, while execution constraints and safety guarantees are enforced by a policy-separated runtime. This separation enables modular extensibility, composable capability execution, and consistent system-level safety. We evaluate a reference implementation in PyBullet simulation with a Franka Panda 7-DOF manipulator across eight experiments covering re-planning, failure recovery, policy enforcement, baseline comparison, cross-task generality, ECM hot-swapping, ablation, and failure boundary analysis. Over 100 randomized trials per condition, AEROS achieves 100% task success across three tasks versus baselines (BehaviorTree.CPP-style and ProgPrompt-style at 92--93%, flat pipeline at 67--73%), the policy layer blocks all invalid actions with zero false acceptances, runtime benefits generalize across tasks without task-specific tuning, and ECMs load at runtime with 100% post-swap success.
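The policy-separated runtime can be sketched as a gate that every skill invocation must pass before execution. The rule and skill-registry shapes below are illustrative assumptions, not AEROS's actual API:

```python
def execute(action, policy_rules, skills):
    """Route every action an ECM skill proposes through the policy layer
    first; only actions that satisfy all rules reach the skill registry."""
    for rule in policy_rules:
        if not rule(action):
            return ("blocked", action["name"])
    return ("ok", skills[action["name"]](**action.get("args", {})))

skills = {"move": lambda x: x}
rules = [
    lambda a: a["name"] in skills,                          # only installed ECM skills
    lambda a: abs(a.get("args", {}).get("x", 0.0)) <= 1.0,  # workspace limit
]
```

Keeping the rules outside the skills is what lets new ECMs hot-swap in without weakening system-level safety guarantees.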
Updated: 2026-05-05 11:28:14
Categories: cs.RO,cs.AI
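The policy-separated runtime described above, in which capability modules propose actions but a separate policy layer enforces execution constraints, can be sketched as a thin validation wrapper. A minimal sketch under stated assumptions: the class names, the single workspace-radius rule, and the action dictionary format are all illustrative, not AEROS's actual interfaces.

```python
# Hedged sketch of a policy-separated runtime: the policy layer validates
# every proposed action before the runtime executes it, so safety checks
# live outside the capability modules. Rule and names are illustrative.
class PolicyLayer:
    def __init__(self, workspace_limit):
        self.workspace_limit = workspace_limit  # max reach, meters

    def validate(self, action):
        # Reject any target pose outside the declared workspace sphere.
        x, y, z = action["target"]
        return (x**2 + y**2 + z**2) ** 0.5 <= self.workspace_limit

class Runtime:
    def __init__(self, policy):
        self.policy = policy
        self.log = []

    def execute(self, action):
        if not self.policy.validate(action):
            self.log.append(("blocked", action["name"]))
            return False
        self.log.append(("executed", action["name"]))
        return True

rt = Runtime(PolicyLayer(workspace_limit=0.8))
print(rt.execute({"name": "pick", "target": (0.4, 0.2, 0.3)}))       # → True
print(rt.execute({"name": "reach_far", "target": (1.5, 0.0, 0.0)}))  # → False
```

Keeping validation in one place is what makes "zero false acceptances" auditable: every blocked action is logged by the runtime rather than handled ad hoc inside each capability module.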
Normalized Matching Transformer
We introduce the Normalized Matching Transformer (NMT), a deep learning approach for efficient and accurate sparse semantic keypoint matching between image pairs. NMT consists of a strong visual backbone, geometric feature refinement via SplineCNN, followed by a normalized Transformer for computing matching features. Central to NMT is our hyperspherical normalization strategy: we enforce unit-norm embeddings at every Transformer layer and train with a combined contrastive InfoNCE and hyperspherical uniformity loss to yield more discriminative keypoint representations. This novel architecture/loss combination encourages close alignment of matching image features and large distances between non-matching ones not only at the output level, but for each layer. Despite its architectural simplicity, NMT sets a new state-of-the-art performance on PascalVOC and SPair-71k, outperforming BBGM, ASAR, COMMON and GMTR by 5.1% and 2.2%, respectively, while converging in at least 1.7x fewer epochs compared to other state-of-the-art baselines. These results underscore the power of combining pervasive normalization with hyperspherical learning for matching tasks.
Updated: 2026-05-05 11:28:12
Categories: cs.CV,cs.LG
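The core idea above, unit-norm embeddings combined with a contrastive InfoNCE loss, can be sketched in a few lines. This is a minimal scalar-list version under illustrative assumptions (temperature 0.1, one positive, a list of negatives); it is not NMT's training code and omits the hyperspherical uniformity term.

```python
# Hedged sketch: L2-normalize embeddings onto the unit hypersphere, then
# score one positive against negatives with the InfoNCE objective, which
# pulls matching keypoints together and pushes non-matching ones apart.
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / n for x in v]

def info_nce(anchor, positive, negatives, tau=0.1):
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    a = l2_normalize(anchor)
    logits = [dot(a, l2_normalize(positive)) / tau]
    logits += [dot(a, l2_normalize(n)) / tau for n in negatives]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])  # near zero: correct match
```

Applying this at every Transformer layer, rather than only at the output, is the paper's "pervasive normalization" point.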
A Comprehensive Evaluation of Deep Learning Object Detection Models on Heterogeneous Edge Devices
Modern applications such as autonomous vehicles, intelligent surveillance, and smart city systems increasingly require object detection on resource-constrained edge devices. Yet, there is still limited understanding of how different object detection models behave across heterogeneous edge devices and under varying scene complexity. In this paper, we benchmark YOLOv8 (Nano, Small, Medium), EfficientDet Lite (Lite0, Lite1, Lite2), and SSD (SSD MobileNet V1, SSDLite MobileDet) on Raspberry Pi 3, 4, 5 with/without Coral TPU accelerators, Raspberry Pi 5 with AI HAT+, Jetson Nano, and Jetson Orin Nano. We evaluate energy consumption, inference time, and accuracy, and further examine how accuracy changes with the number of objects in the input image. The results reveal clear trade-offs among accuracy, latency, and energy efficiency across model-device combinations. SSD MobileNet V1 achieves the lowest latency and energy consumption but the lowest accuracy, whereas YOLOv8 Medium achieves the highest accuracy at higher computational cost. TPU-based Raspberry Pi devices improve the efficiency of SSD and EfficientDet Lite while reducing YOLOv8 accuracy. Orin Nano offers the most favorable overall balance across most model families. The object-count-based analysis further shows that models achieve more similar accuracy on simpler images, while the accuracy gap widens as scene complexity increases.
Updated: 2026-05-05 11:26:25
Categories: cs.CV,cs.AR,cs.DC,cs.LG,cs.SE
Agent-Based Modeling of Low-Emission Fertilizer Adoption for Dairy Farm Decarbonisation using Empirical Farm Data
To understand complex system dynamics in dairy farming, it is essential to use modeling tools that capture farm heterogeneity, social interactions, and cumulative environmental impacts. This study proposes an agent-based modeling (ABM) framework to simulate nitrogen management and the adoption of low-emission fertilizer across 295 Irish dairy farms over a 15-year period. Using empirical data, the model represents farm communication through a social network, capturing peer influence and discussion group dynamics, where adoption probabilities are driven by social contagion, farm-scale characteristics, and policy interventions such as subsidies and carbon taxes. The framework estimates sectoral greenhouse gas emissions, cumulative abatement, and private-social cost trade-offs, using Monte Carlo simulation and sensitivity analysis to quantify uncertainty. The model shows strong agreement with observed adoption trajectories ($R^2 = 0.979$, RMSE = 0.0274) and is validated against empirical data using a Kolmogorov-Smirnov test (D = 0.2407, p < 0.001), indicating its ability to reproduce structural patterns in adoption behavior. Adoption dynamics are further characterized using a logistic diffusion model consistent with Rogers' innovation diffusion theory, capturing progression from early adoption to a saturation level of approximately 91%. By framing decarbonization as a socio-technical diffusion process rather than a purely economic optimization problem, this study provides an in silico policy laboratory for evaluating the robustness and diffusion speed of climate mitigation strategies prior to implementation.
Updated: 2026-05-05 11:25:03
Categories: cs.AI
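The logistic diffusion model used above to characterize adoption, rising from early adopters to a saturation level of roughly 91%, has a simple closed form. A minimal sketch: the saturation K = 0.91 comes from the abstract, while the rate r and inflection time t0 below are illustrative assumptions, not the paper's fitted values.

```python
# Hedged sketch of a logistic (S-shaped) diffusion curve of the kind used
# to describe Rogers-style innovation adoption. K is the saturation share;
# r and t0 are illustrative, not the study's fitted parameters.
import math

def logistic_adoption(t, K=0.91, r=0.6, t0=7.0):
    """Adoption share at year t: K / (1 + exp(-r * (t - t0)))."""
    return K / (1.0 + math.exp(-r * (t - t0)))

# Adoption share over the 15-year simulation horizon.
trajectory = [round(logistic_adoption(t), 3) for t in range(0, 16)]
```

The curve starts near zero, passes its steepest growth at t0, and flattens toward K, which is the qualitative shape the study reports for low-emission fertilizer uptake.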
Super-fast Rates of Convergence for Neural Network Classifiers under the Hard Margin Condition
We study the classical binary classification problem for hypothesis spaces of Deep Neural Networks (DNNs) under Tsybakov's low-noise condition with exponent $q>0$, as well as its limit case $q=\infty$, which we refer to as the \emph{hard margin condition}. We demonstrate that, for a wide range of commonly used activation functions (including but not limited to ReLU, LeakyReLU, ELU, CELU, SELU, Softplus, GELU, SiLU, Swish, Mish, and Softmax), DNN solutions to the empirical risk minimization (ERM) problem with square loss surrogate and $\ell_p$ penalty on the weights $(0<p<\infty)$ can achieve excess risk bounds of order $\mathcal{O}\left(n^{-α}\right)$ for $α$ close to $1$ under the low-noise condition, and for arbitrarily large $α>1$ under the hard-margin condition, provided that the Bayes regression function $η$ satisfies a \emph{distribution-adapted smoothness} condition relative to the marginal data distribution $ρ_{X}$. Furthermore, when the activation function is chosen as $\tanh$ or sigmoid, we show that the same rates follow from the standard assumption that $η\in \mathcal{C}^s$. Finally, we establish minimax lower bounds, showing that these rates cannot be improved upon whenever $q\ge2$. Our proof relies on a novel decomposition of the excess risk for general ERM-based classifiers which might be of independent interest.
Updated: 2026-05-05 11:20:04
Categories: cs.LG,math.ST
Deep Interest Mining with Cross-Modal Alignment for SemanticID Generation in Generative Recommendation
Generative Recommendation (GR) has demonstrated remarkable performance in next-token prediction paradigms, which relies on Semantic IDs (SIDs) to compress trillion-scale data into learnable vocabulary sequences. However, existing methods suffer from three critical limitations: (1) Information Degradation: the two-stage compression pipeline causes semantic loss and information degradation, with no posterior mechanism to distinguish high-quality from low-quality SIDs; (2) Semantic Degradation: cascaded quantization discards key semantic information from original multimodal features, as the embedding generation and quantization stages are not jointly optimized toward a unified objective; (3) Modality Distortion: quantizers fail to properly align text and image modalities, causing feature misalignment even when upstream networks have aligned them. To address these challenges, we propose a novel framework integrating three key innovations: Deep Contextual Interest Mining (DCIM), Cross-Modal Semantic Alignment (CMSA), and Quality-Aware Reinforcement Mechanism (QARM). First, we leverage Vision-Language Models (VLMs) to align non-textual modalities into a unified text-based semantic space, mitigating modality distortion. Second, we introduce a deep interest mining mechanism that captures high-level semantic information implicitly present in advertising contexts, encouraging SIDs to preserve critical contextual information through reconstruction-based supervision. Third, we employ a reinforcement learning framework with quality-aware rewards to encourage semantically rich SIDs while suppressing low-quality ones in the posterior stage. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art SID generation methods, achieving superior performance on multiple benchmarks. Ablation studies further validate the effectiveness of each proposed component.
Updated: 2026-05-05 11:18:30
Categories: cs.IR,cs.AI
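The cascaded quantization that the abstract above critiques can be illustrated with a tiny two-stage residual quantizer, the standard way SID-style token sequences are produced from embeddings. A minimal sketch under illustrative assumptions: the 2-D vectors and hand-picked codebooks below are toys, not the paper's setup.

```python
# Hedged sketch of cascaded (residual) quantization: each stage picks the
# nearest codeword for the current residual and emits one discrete token,
# so an embedding becomes a short ID sequence. Codebooks are toy values.
def residual_quantize(vec, codebooks):
    ids = []
    residual = list(vec)
    for book in codebooks:
        j = min(range(len(book)),
                key=lambda k: sum((r - c) ** 2 for r, c in zip(residual, book[k])))
        ids.append(j)
        residual = [r - c for r, c in zip(residual, book[j])]
    return ids

books = [
    [[0.0, 0.0], [1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.2, -0.1]],   # fine stage, quantizes the residual
]
print(residual_quantize([1.15, 0.92], books))  # → [1, 1]
```

Each stage greedily minimizes its own residual error, which is exactly why, as the abstract argues, information lost at an early stage cannot be recovered later unless the stages are optimized jointly.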
Closed-Loop Vision-Language Planning for Multi-Agent Coordination
Cooperative multi-agent reinforcement learning (MARL) struggles with sample efficiency, interpretability, and generalization. While Large Language Models (LLMs) offer powerful planning capabilities, their application has been hampered by a reliance on text-only inputs and a failure to handle the non-Markovian, partially observable nature of multi-agent tasks. We introduce COMPASS, a multi-agent framework that overcomes these limitations by integrating Vision-Language Models (VLMs) for decentralized, closed-loop decision-making. COMPASS dynamically generates and refines interpretable, code-based strategies stored in a skill library that is bootstrapped from expert demonstrations. To ensure robust coordination, it propagates entity information through a structured multi-hop communication protocol, allowing teams to build a coherent understanding from partial observations. Evaluated on the challenging SMACv2 benchmark, COMPASS significantly outperforms state-of-the-art MARL baselines. Notably, in the symmetric Protoss 5v5 task, COMPASS achieved a 57% win rate, a 30 percentage point advantage over QMIX (27%). Project page can be found at https://stellar-entremet-1720bb.netlify.app/.
Updated: 2026-05-05 11:18:23
Categories: cs.AI,cs.MA
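The structured multi-hop communication described above, where entity information spreads across agent links so the team builds a coherent picture from partial observations, can be sketched as iterated neighbor merging. A minimal sketch: the agent names, entity labels, and link topology below are illustrative assumptions, not COMPASS's protocol.

```python
# Hedged sketch of multi-hop entity propagation: in each communication
# round, every agent merges the entity sets of its direct neighbors, so
# observations reach agents up to `hops` links away. Names are toys.
def propagate(observations, links, hops):
    known = {a: set(obs) for a, obs in observations.items()}
    for _ in range(hops):
        nxt = {a: set(s) for a, s in known.items()}
        for a, b in links:
            nxt[a] |= known[b]
            nxt[b] |= known[a]
        known = nxt
    return known

obs = {"a1": {"enemy_3"}, "a2": set(), "a3": {"enemy_7"}}
links = [("a1", "a2"), ("a2", "a3")]
# After two hops, a1 knows about enemy_7 even though only a3 observed it.
print(propagate(obs, links, hops=2)["a1"])
```

With one hop, a1 would still only know enemy_3; the second hop is what relays a3's observation through a2, which is the sense in which the protocol is "multi-hop".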
AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse
Many-Shot In-Context Learning (ICL) has emerged as a promising paradigm, leveraging extensive examples to unlock the reasoning potential of Large Language Models (LLMs). However, existing methods typically rely on a predetermined, fixed number of shots. This static approach often fails to adapt to the varying difficulty of different queries, leading to either insufficient context or interference from noise. Furthermore, the prohibitive computational and memory costs of long contexts severely limit Many-Shot's feasibility. To address the above limitations, we propose AdapShot, which dynamically optimizes shot counts and leverages KV cache reuse for efficient inference. Specifically, we design a probe-based evaluation mechanism that utilizes output entropy to determine the optimal number of shots. To bypass the redundant prefilling computation during both the probing and inference phases, we incorporate a semantics-aware KV cache reuse strategy. Within this reuse strategy, to address positional encoding incompatibilities, we introduce a decoupling and re-encoding method that enables the flexible reordering of cached key-value pairs. Extensive experiments demonstrate that AdapShot achieves an average performance gain of around 10% and a 4.64x speedup compared to state-of-the-art DBSA.
Updated: 2026-05-05 11:16:52
Categories: cs.AI
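The probe-based shot-selection mechanism above, which uses output entropy to decide how many in-context examples a query needs, can be sketched as a simple stopping rule. This is a hedged sketch: the entropy threshold, candidate shot counts, and toy probe below are illustrative assumptions, not AdapShot's actual probing procedure.

```python
# Hedged sketch of entropy-probed shot selection: try increasing shot
# counts and stop at the first one whose output distribution is confident
# (entropy below a threshold). Probe and threshold are illustrative.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_shot_count(probe, candidates, threshold=0.5):
    """`probe(n)` returns the model's output distribution with n shots."""
    for n in candidates:
        if entropy(probe(n)) < threshold:
            return n
    return candidates[-1]

# Toy probe: more shots yield a sharper (lower-entropy) distribution.
def toy_probe(n):
    p = min(0.99, 0.5 + 0.1 * n)
    return [p, 1.0 - p]

print(choose_shot_count(toy_probe, [1, 2, 4, 8, 16]))  # → 4
```

An easy query would clear the threshold at a small shot count while a hard one keeps requesting more context, which is the adaptivity the abstract contrasts with a fixed shot count.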
Can Blockchains Reliably Train Machine Learning Models?
Large proof of work (PoW) networks allow anyone to earn rewards by running computation-intensive hash puzzles for profit, yet they typically consume electricity comparable to that of medium-sized countries. Repurposing computing resources from hash puzzles to machine learning training can benefit the energy sector as a whole, since this computing power is no longer wasted on solving hash puzzles but is instead used to train machine learning models that provide value across different application domains. However, major technical gaps currently prevent this integration. To bridge these gaps, we introduce proof of training (PoT), a protocol that directs mining power toward verifiable training of machine learning models while preserving PoW's incentives for participation and growth. We study PoT by theoretically identifying the blockchain structure that best meets the goals of training reliability, security, and scalability, and we further evaluate it by implementing a decentralized training network. Our results indicate considerable potential, including high task throughput, strong robustness, and improved network security.
Updated: 2026-05-05 11:11:29
Categories: cs.CR,cs.CE,cs.DC,cs.LG
Information Plane Analysis of Binary Neural Networks
Information plane (IP) analysis has been suggested to study the training dynamics of deep neural networks through mutual information (MI) between inputs, representations, and targets. However, its statistical validity is often compromised by the difficulty of estimating MI from samples of high-dimensional, deterministic representations. In this work, we perform IP analyses on binary neural networks (BNNs) where activations are discrete and MI is finite. We characterise the finite-sample behaviour of the plug-in entropy estimator and identify regimes for sample size $N$ and representation dimensionality $D$ under which MI estimates are reliable. Outside these regimes, we show that empirical MI estimates saturate to $\log_2 N$, rendering IP trajectories uninformative. Restricting attention to the reliable regime, we train 375 BNNs to investigate the existence of late-stage compression phases and the relationship between compressed representations and generalisation performance. Our results show that while late-stage compression is frequently observed, compressed latent representations do not consistently correlate with improved generalization performance. Instead, the relationship between compression and generalisation is highly dependent on task, architecture, and regularisation.
Updated: 2026-05-05 11:08:18
Categories: cs.LG
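The saturation effect at the center of the argument above, that the plug-in MI estimate collapses to $\log_2 N$ when the representation dimensionality is large relative to the sample size, is easy to reproduce. A minimal sketch with illustrative sizes (N = 256 samples, D = 64 binary activations): with D this large, essentially every sample is unique, so the plug-in entropy hits its ceiling.

```python
# Hedged sketch: the plug-in (maximum-likelihood) entropy estimator on
# discrete samples. When nearly all samples are distinct, the estimate
# saturates at log2(N), making IP trajectories uninformative.
import math, random

def plugin_entropy_bits(samples):
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

random.seed(0)
N, D = 256, 64  # illustrative sizes, far outside the reliable regime
samples = [tuple(random.randint(0, 1) for _ in range(D)) for _ in range(N)]
print(plugin_entropy_bits(samples))  # saturates near log2(256) = 8 bits
```

In the reliable regime the paper identifies, the alphabet of observed representations is small relative to N, so the same estimator gives a meaningful value instead of the $\log_2 N$ ceiling.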
Free Decompression with Algebraic Spectral Curves
Tools from random matrix theory have become central to deep learning theory, using spectral information to provide mechanisms for modeling generalization, robustness, scaling, and failure modes. While often capable of modeling empirical behavior, practical computations are limited by matrix size, often imposing a restriction to models that are too small to be realistic. This motivates the inference of properties of larger models from the behavior of smaller ones. Free decompression (FD) is a recently proposed method for extrapolating spectral information across matrix sizes, but its utility is currently limited by strong assumptions that preclude its implementation on more realistic machine learning (ML) models. We use algebraic spectral curve theory to provide a general FD methodology for spectral densities whose Stieltjes transform satisfies an algebraic relation, a modeling assumption that is more likely to hold in practice. This recasts FD as an evolution along spectral curves which can be readily integrated. Our framework enables the expansion of spectral densities that have multiple or multi-modal bulks, that exist at multiple scales, and that contain atoms, all characteristic of real-world data and popular ML models. We demonstrate the efficacy of our framework on models of interest in modern ML, including Hessian and activation matrices associated with neural networks and large-scale diffusion models.
Updated: 2026-05-05 11:03:48
Categories: stat.ML,cs.LG,math.NA
AhaRobot: A Low-Cost Open-Source Bimanual Mobile Manipulator for Embodied AI
Scaling Vision-Language-Action models for embodied manipulation demands large volumes of diverse manipulation data, yet the high cost of commercial mobile manipulators and teleoperation interfaces that are difficult to deploy at scale remain key bottlenecks. We present AhaRobot, a low-cost, fully open-source bimanual mobile manipulator tailored for Embodied-AI. The system contributes: (1) a SCARA-like dual-arm hardware design that reduces motor torque demands while maintaining a large vertical reachable workspace, (2) an optimized control stack that improves precision via dual-motor backlash mitigation and static-friction compensation through dithering, and (3) RoboPilot, a teleoperation interface featuring a novel 26-faced marker handle for precise, long-horizon remote data collection. Experimental results show that our hardware-control co-design achieves 0.7 mm repeatability at a total hardware cost of only $1,000. The proposed 26-faced handle reduces tracking error by 80% over a 6-faced baseline and improves data-collection efficiency by 30%, while robustly handling singularities and supporting extremely long-horizon tasks in fully remote settings. Despite its low cost, AhaRobot enables imitation learning of complex household behaviors involving bimanual coordination, upper-body mobility, and contact-rich interaction, with data quality comparable to VR-based collection. All software, CAD files, and documentation are available at https://aha-robot.github.io.
Updated: 2026-05-05 11:02:02
Categories: cs.RO,cs.AI,cs.LG
Efficient Deconvolution in Populational Inverse Problems
This work is focussed on the inversion task of inferring the distribution over parameters of interest leading to multiple sets of observations. The potential to solve such distributional inversion problems is driven by increasing availability of data, but a major roadblock is blind deconvolution, arising when the observational noise distribution is unknown. However, when data originates from collections of physical systems, a population, it is possible to leverage this information to perform deconvolution. To this end, we propose a methodology leveraging large data sets of observations, collected from different instantiations of the same physical processes, to simultaneously deconvolve the data corrupting noise distribution, and to identify the distribution over model parameters defining the physical processes. A parameter-dependent mathematical model of the physical process is employed. A loss function characterizing the match between the observed data and the output of the mathematical model is defined; it is minimized as a function of the both the parameter inputs to the model of the physics and the parameterized observational noise. This coupled problem is addressed with a modified gradient descent algorithm that leverages specific structure in the noise model. Furthermore, a new active learning scheme is proposed, based on adaptive empirical measures, to train a surrogate model to be accurate in parameter regions of interest; this approach accelerates computation and enables automatic differentiation of black-box, potentially nondifferentiable, code computing parameter-to-solution maps. The proposed methodology is demonstrated on porous medium flow, damped elastodynamics, and simplified models of atmospheric dynamics.
Updated: 2026-05-05 11:01:06
Categories: stat.ML,cs.LG,physics.comp-ph
Triple-Identity Authentication: The Future of Secure Access
In password-based authentication systems, the username fields are essentially unprotected, while the password fields are susceptible to attacks. In this article, we shift our research focus from traditional authentication paradigm to the establishment of gatekeeping mechanisms for the systems. To this end, we introduce a Triple-Identity Authentication scheme. First, we combine each user credential (i.e., login name, login password, and authentication password) with the International Mobile Equipment Identity (IMEI) and International Mobile Subscriber Identity (IMSI) of a user's smartphone to create a combined identity represented as "credential+IMEI+IMSI", defined as a system attribute of the user. Then, we grant the password-based local systems autonomy to use the internal elements of our matrix-like hash algorithm. Following a credential input, the algorithm hashes it, and then the local system, rather than the algorithm, creates an identifier using a set of elements randomly selected from the algorithm, which is used to verify the user's combined identity. This decentralized authentication based on the identity-identifier handshake approach is implemented at the system's interaction points, such as login name field, login password field, and server's authentication point. Ultimately, this approach establishes effective security gates, empowering the password-based local systems to autonomously safeguard user identification and authentication processes.
Updated: 2026-05-05 10:55:49
Categories: cs.CR,cs.ET,cs.HC,eess.SY
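The combined identity described above, "credential+IMEI+IMSI", can be sketched as a straightforward hash-then-compare check. A minimal sketch under stated assumptions: the "+" separator, the SHA-256 digest, and the sample IMEI/IMSI values are illustrative choices, not the paper's matrix-like hash algorithm or its identifier-selection step.

```python
# Hedged sketch: bind a user credential to the device identifiers of the
# user's smartphone and verify the combined identity by digest comparison.
# Separator, digest, and sample values are illustrative assumptions.
import hashlib

def combined_identity(credential, imei, imsi):
    material = "+".join([credential, imei, imsi]).encode("utf-8")
    return hashlib.sha256(material).hexdigest()

def verify(stored_digest, credential, imei, imsi):
    return combined_identity(credential, imei, imsi) == stored_digest

stored = combined_identity("alice_login", "356938035643809", "310150123456789")
print(verify(stored, "alice_login", "356938035643809", "310150123456789"))  # → True
print(verify(stored, "alice_login", "000000000000000", "310150123456789"))  # → False
```

The effect shown is the scheme's key property: a correct credential alone no longer verifies, because the identity is bound to the device's IMEI and IMSI as well.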
Self-Improvement for Fast, High-Quality Plan Generation
Generative models trained on synthetic plan data are a promising approach to generalized planning. Recent work has focused on finding any valid plan, rather than a high-quality solution. We address the challenge of producing high-quality plans, a computationally hard problem, in sub-exponential time. First, we demonstrate that, given optimal data, a decoder-only transformer can generate high-quality plans for unseen problem instances. Second, we show how to self-improve an initial model trained on sub-optimal data. Each round of self-improvement combines multiple model calls with graph search to generate improved plans, used for model fine-tuning. An experimental study on four domains: Blocksworld, Logistics, Labyrinth, and Sokoban, shows on average a 30% reduction in plan length over the source symbolic planner, with over 80% of plans being optimal, where the optimum is known. Plan quality is further improved by inference-time search. The model's latency scales sub-exponentially in contrast to the satisficing and optimal symbolic planners to which we compare. Together, these results suggest that self-improvement with generative models offers a scalable approach for high-quality plan generation.
Updated: 2026-05-05 10:55:18
Categories: cs.AI
A Few-Step Generative Model on Cumulative Flow Maps
We propose a unified, few-step generative modeling framework based on \emph{cumulative flow maps} for long-range transport in probability space, inspired by flow-map techniques for physical transport and dynamics. At its core is a cumulative-flow abstraction that connects local, instantaneous updates with finite-time transport, enabling generative models to reason about global state transitions. This perspective yields a unified few-step framework built on cumulative transport and cumulative parameterization that applies broadly to existing diffusion- and flow-based models without being tied to a specific prediction instantiation. Our formulation supports few-step and even one-step generation while preserving synthesis quality, requiring only minimal changes to time embeddings and training objectives, and no increase in model capacity. We demonstrate its effectiveness across diverse tasks, including image generation, geometric distribution modeling, joint prediction, and SDF generation, with reduced inference cost.
Updated: 2026-05-05 10:51:40
Categories: cs.LG,cs.GR
Exact and Approximate Algorithms for Polytree Learning
Polytrees are a subclass of Bayesian networks that seek to capture the conditional dependencies between a set of $n$ variables as a directed forest and are motivated by their more efficient inference and improved interpretability. Since the problem of learning the best polytree is NP-hard, we study which restrictions make it more tractable by considering for example in-degree bounds, properties of score functions measuring the quality of a polytree, and approximation algorithms. We devise an algorithm that finds the optimal polytree in time $O((2+ε)^n)$ for arbitrarily small $ε> 0$ and any constant in-degree bound $k$, improving over the fastest previously known algorithm of time complexity $O(3^n)$. We further give polynomial-time algorithms for finding a polytree whose score is within a factor of $k$ from the optimal one for arbitrary scores and a factor of $2$ for additive ones. Many of the results are complemented by (nearly) tight lower bounds for either the time complexity or the approximation factors.
Updated: 2026-05-05 10:50:14
Categories: cs.DS,cs.CC,cs.LG
Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs
Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically aware evaluation metrics. We introduce a benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. For evaluation, we apply LLM-as-a-judge to assess semantic equivalence of parsed formulas, capturing mathematical meaning beyond surface-level notation differences. We validate this approach through a human study (250 formula pairs, 750 ratings from 30 evaluators), showing a Pearson correlation of r = 0.78 with human judgment, compared to r = 0.34 for character-level matching (CDM) and r ≈ 0 for text similarity. Our robust two-stage matching pipeline, combining LLM-based extraction with fuzzy validation, reliably aligns parsed formulas with ground truth despite format inconsistencies across parsers. Evaluating 20+ contemporary PDF parsers across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities, providing actionable guidance for practitioners selecting parsers for downstream applications. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench and https://github.com/phorn1/formula-metric-study
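The fuzzy-validation stage can be illustrated with a minimal sketch: normalize away notational noise in two LaTeX strings, then accept an alignment when a similarity ratio clears a threshold. The normalization rules and the 0.6 threshold are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch of fuzzy validation for aligning parsed formulas with
# ground truth; rules and threshold are illustrative, not the paper's.
import difflib
import re

def normalize_latex(s: str) -> str:
    """Strip whitespace and common notational noise before comparison."""
    s = re.sub(r"\s+", "", s)                            # drop all whitespace
    s = s.replace(r"\left", "").replace(r"\right", "")   # sizing commands
    return s

def fuzzy_match(parsed: str, truth: str, threshold: float = 0.6) -> bool:
    """Accept the alignment when the normalized strings are similar enough."""
    ratio = difflib.SequenceMatcher(
        None, normalize_latex(parsed), normalize_latex(truth)
    ).ratio()
    return ratio >= threshold

print(fuzzy_match(r"\frac{a}{b} + c", r"\frac{ a }{ b }+c"))  # True: identical after normalization
```

A real pipeline would layer this under the LLM-based extraction step, using the fuzzy score only to validate candidate pairings rather than to judge semantic equivalence.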
Updated: 2026-05-05 10:49:25
Categories: cs.CV,cs.AI,cs.IR
Leveraging Code Automorphisms for Improved Syndrome-Based Neural Decoding
Syndrome-based neural decoding (SBND) has emerged as a promising deep learning approach for soft-decision decoding of high-rate, short-length codes. However, this approach still has substantial room for improvement. In this paper, we show how to leverage code automorphisms to enhance the ability of existing SBND models to learn and generalize through data augmentation during training and inference. As a result, for the short high-rate codes considered, we obtain models that closely approach MLD performance using small datasets and proper training. Our findings also suggest that many prior results for SBND models in the literature underestimate their true correction capability due to undertraining. Code to reproduce all results is available at: https://github.com/lebidan/sbnd.
Updated: 2026-05-05 10:48:14
Categories: cs.IT,cs.LG
Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective
Shortcut learning causes deep learning models to rely on non-essential features within the data. However, its formation in deep neural network training still lacks theoretical understanding. In this paper, we provide a formal definition of core and shortcut features and employ evolutionary game theory to analyze the origins of shortcut bias by modeling data samples as players and their corresponding neural tangent features as strategies, assuming the existence of core and shortcut subnetworks. We find that gradient descent (GD) and stochastic gradient descent (SGD) lead to two distinct stochastically stable states, each corresponding to a different strategy. The former primarily optimizes the shortcut subnetwork, while the latter primarily optimizes the core subnetwork. We investigate the influence of these strategies on shortcut bias through a continuous stochastic differential equation, and reveal the impact of data noise and optimization noise on the formation of shortcut bias. In brief, our work employs evolutionary game theory to characterize the dynamics of shortcut bias formation and provides a theoretical view on its mitigation.
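As a generic illustration of the evolutionary-game machinery the paper builds on, a two-strategy replicator dynamic converges to one of two stable states depending on the initial population mix. The payoff matrix, strategy labels, and step size below are illustrative assumptions, not the paper's model of samples and neural tangent features.

```python
# Two-strategy replicator dynamics with two stable fixed points, echoing the
# abstract's "two distinct stochastically stable states" (toy payoffs only).
import numpy as np

A = np.array([[2.0, 0.0],     # payoff of "core" vs (core, shortcut)
              [1.5, 1.0]])    # payoff of "shortcut" vs (core, shortcut)

def replicator_step(x: np.ndarray, dt: float = 0.01) -> np.ndarray:
    """One Euler step of dx_i/dt = x_i * ((A x)_i - x^T A x)."""
    fitness = A @ x
    avg = x @ fitness
    return x + dt * x * (fitness - avg)

x = np.array([0.9, 0.1])       # start mostly on the "core" strategy
for _ in range(5000):
    x = replicator_step(x)
print(x.round(3))              # fixates on the core strategy
```

Starting instead from a shortcut-heavy mix (e.g. `[0.1, 0.9]`) drives the population to the other fixed point, the shortcut strategy; which basin is reached is exactly the kind of question the paper's stochastic analysis addresses.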
Updated: 2026-05-05 10:46:41
Categories: cs.AI
Mechanized Foundations of Structural Governance: Machine-Checked Proofs for Governed Intelligence
We present five results in the theory of structural governance for cognitive workflow systems. Three are mechanized in Coq 8.19 using the Interaction Trees library with parameterized coinduction; two are proved on paper with explicit reductions. The Coinductive Safety Predicate (gov_safe) is a coinductive property that captures governance safety for infinite program behaviors, indexed by a boolean permission flag that is provably false for ungoverned I/O and true for governed interpretations (mechanized). The Governance Invariance Theorem establishes that governance is uniform across the meta-recursive tower: governance at level n+1 reduces to governance at level n by definitional equality of the type (mechanized). The Sufficiency Theorem proves that four atomic primitives (code, reason, memory, call) are expressively complete for any discrete intelligent system, formalized as compositional closure of a Kleisli category (mechanized). The Alternating Normal Form provides a canonical decomposition of any machine into alternating code and effect layers, with a confluent rewriting system (paper proof). The Necessity Theorem proves via explicit reduction to Rice's theorem that an architecturally opaque component (the reason primitive) is mathematically necessary for problems requiring semantic judgment (paper proof). A sixth contribution connects the abstract model to the deployed runtime: the Verified Interpreter Specification formalizes the BEAM runtime's trust, capability, and hash chain logic in Coq, then tests the running system against this specification using property-based testing with over 70,000 randomly generated directive sequences and zero disagreements. The mechanization comprises approximately 12,000 lines across 36 modules with 454 theorems and zero admitted lemmas.
Updated: 2026-05-05 10:45:32
Categories: cs.AI
The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code
Malware authors have traditionally relied on polymorphic techniques to produce variants in the same malware family, complicating signature-based detection. Integrating generative AI into offensive toolchains enables attackers to synthesize structurally diverse payloads with identical behavior, raising the question of how much polymorphism LLMs provide. Recent work has assumed that LLMs can produce sufficiently polymorphic payloads, leaving unquantified the variation that emerges when an attacker repeatedly builds the same payload, or explicitly instructs the model to avoid prior implementations. In this work, we measure the polymorphic capacity of a commercial model (Claude Opus 4.6) as an automated malware generator. We build a dual-agent, four-stage pipeline that generates, tests, and refines a data-exfiltration payload comprising file traversal, encryption, exfiltration, and integration. We produce payloads in two settings: using prompts that specify only functional requirements, and using prompts that inject a structured history of prior outcomes to force divergence. We measure pairwise distances along structural (AST) and semantic (embedding) axes, finding that when polymorphism is not explicitly required, structural distances are high while semantic distances remain low; i.e., implementations diverge widely without changing high-level behavior. Explicit prompting substantially amplifies this structural diversity while preserving correctness, at the cost of roughly 5 times more tokens but only a small increase in LLM calls (from $4.2$ to $4.5$ per payload, with effective API costs of \$0.41 and \$0.73). These results show that a single commercial LLM can cheaply generate large populations of behaviorally equivalent yet structurally diverse payloads, facilitating the evasion of signature-based detection rules and similarity-based clustering.
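A structural (AST-level) distance of the kind measured pairwise in the paper can be sketched harmlessly on benign snippets: flatten each program into its sequence of AST node types and compare the sequences. The node-type flattening and the difflib ratio are illustrative choices, not the paper's exact metric.

```python
# Hedged sketch of an AST-level structural distance between two Python
# snippets; the node-type sequence and similarity ratio are illustrative.
import ast
import difflib

def node_types(src: str) -> list[str]:
    """Flatten a program into its sequence of AST node-type names."""
    return [type(n).__name__ for n in ast.walk(ast.parse(src))]

def ast_distance(a: str, b: str) -> float:
    """1 - similarity of the two node-type sequences (0 = identical shape)."""
    sim = difflib.SequenceMatcher(None, node_types(a), node_types(b)).ratio()
    return 1.0 - sim

# Two behaviorally equivalent but structurally different implementations
loop_impl = "total = 0\nfor x in data:\n    total += x"
builtin_impl = "total = sum(data)"
print(ast_distance(loop_impl, loop_impl))     # 0.0 for identical code
print(ast_distance(loop_impl, builtin_impl))  # > 0 for divergent structure
```

High values of such a distance across behaviorally equivalent payloads are what the abstract reports as structural polymorphism.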
Updated: 2026-05-05 10:44:49
Categories: cs.CR
The Two Boundaries: Why Behavioral AI Governance Fails Structurally
Every system that performs effects has two boundaries: what it can do (expressiveness) and what governance covers (governance). In nearly all deployed AI systems, these boundaries are defined independently, creating three regions: governed capabilities (the only useful region), ungoverned capabilities (risk), and governance policies that address non-existent capabilities (theater). Two of the three regions are failure modes. We focus on the governance of effects: actions that AI systems perform in the world (API calls, database writes, tool invocations). This is distinct from the governance of model outputs (content quality, bias, fairness), which operates at a different level and requires different mechanisms. We present a formal framework for analyzing this structural gap. Rice's theorem (1953) proves the gap is undecidable in the general case for any Turing-complete architecture that attempts to govern effects behaviorally: no algorithm can decide non-trivial semantic properties of arbitrary programs, including the property "this program's effects comply with the governance policy." We define coterminous governance: a system property where the expressiveness boundary equals the governance boundary. We show that coterminous governance requires an architectural decision (separating computation from effect) rather than a governance layer added after the fact. We show that structural governance under this separation subsumes separate governance infrastructure: governance checks become part of the execution pipeline rather than a second system running alongside it. We propose coterminous governance as the testable criterion for any AI governance system: either the two boundaries are provably identical, or risk and theater are structurally inevitable. Proofs are mechanized in Coq (454 theorems, 36 modules, 0 admitted).
Updated: 2026-05-05 10:43:57
Categories: cs.AI
Effect-Transparent Governance for AI Workflow Architectures: Semantic Preservation, Expressive Minimality, and Decidability Boundaries
We present a machine-checked formalization of structurally governed AI workflow architectures and prove that effect-level governance can be imposed without reducing internal computational expressivity. Using Interaction Trees in Rocq 8.19, we define a governance operator G that mediates all effectful directives, including memory access, external calls, and oracle (LLM) queries. Our development compiles with 0 admitted lemmas and consists of 36 modules, ~12,000 lines of Rocq, and 454 theorems. We establish seven properties: (P1) governed Turing completeness, (P2) governed oracle expressivity, (P3) a decidability boundary in which governance predicates are total and closed under Boolean composition while semantic program properties remain non-trivial and undecidable by governance, (P4) goal preservation for permitted executions, (P5) expressive minimality of primitive capabilities (compute, memory, reasoning, external call, observability), (P6) subsumption asymmetry showing structural governance strictly subsumes content-level filtering, and (P7) semantic transparency: on all executions where governance permits, the governed interpretation is observationally equivalent (modulo governance-only events) to the ungoverned interpretation. Together, these results show that governance and computational expressivity are orthogonal dimensions: governance constrains the effect boundary of programs while remaining semantically transparent to internal computation.
Updated: 2026-05-05 10:41:41
Categories: cs.AI,cs.LO,cs.PL
On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization
On-device fine-tuning is a critical capability for edge AI systems, which must support adaptation to different agentic tasks under stringent memory constraints. Conventional backpropagation (BP)-based training requires storing layer activations and optimizer states, a demand that can be only partially alleviated through checkpointing. In edge deployments in which the model weights must reside entirely in device memory, this overhead severely limits the maximum model size that can be deployed. Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.
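The memory trick behind MeZO-style training can be sketched in a few lines: estimate the directional derivative with two forward passes along a random perturbation that is regenerable from a seed, so no activations or optimizer states ever need to be stored. The toy quadratic loss, dimensions, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of MeZO-style zeroth-order optimization: two forward passes
# with a seeded perturbation, no backpropagation. Loss and constants are toys.
import numpy as np

def loss(theta: np.ndarray) -> float:
    """Toy objective standing in for a fine-tuning loss."""
    return float(np.sum((theta - 1.0) ** 2))

def mezo_step(theta, lr=0.01, eps=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)  # regenerable from the seed, so it
                                          # need not be stored in practice
    g = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
    return theta - lr * g * z             # SPSA-style update along z

theta = np.zeros(8)
for step in range(2000):
    theta = mezo_step(theta, seed=step)   # fresh perturbation each step
print(float(loss(theta)))                 # should be near 0
```

The wall-clock cost the abstract mentions shows up here as many more steps than backprop would need, traded for a memory footprint of just the weights.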
Updated: 2026-05-05 10:40:45
Categories: cs.LG,cs.CL
StackFeat RL: Reinforcement Learning over Iterative Dual Criterion Feature Selection for Stable Biomarker Discovery
Feature selection in high-dimensional genomic data ($d \gg n$) demands methods that are simultaneously accurate, sparse, and stable. Existing approaches either require manual threshold specification (mRMR, stability selection), produce unstable selections under data perturbation (Lasso, Boruta), or ignore biological structure entirely. We introduce StackFeat-RL, a meta-learning framework that optimises the hyperparameters of an iterative dual-criterion feature selection algorithm via REINFORCE policy gradients. The dual criterion, requiring both coefficient consistency and selection frequency, guards against two failure modes missed by single-criterion methods, while iterative accumulation provides convergence guarantees via the law of large numbers. On COVID-19 miRNA data (GSE240888, 332 features) and three Alzheimer's disease classification tasks (GSE84422, 13237 genes; Normal vs.\ Possible, Probable, and Definite AD), StackFeat-RL achieves the highest predictive accuracy among all evaluated methods, including ElasticNet, Boruta, mRMR, and stability selection, while requiring 3--4$\times$ fewer features. Keywords: feature selection, reinforcement learning, REINFORCE, elastic net, biomarker discovery, Alzheimer's disease, dual-criterion selection, protein interaction networks
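The REINFORCE outer loop can be sketched on a single scalar hyperparameter (e.g. a selection threshold) with a Gaussian policy. The reward function, policy form, and constants below are illustrative assumptions standing in for the paper's validation-driven objective.

```python
# Hedged sketch of REINFORCE over a scalar hyperparameter; the reward is a
# toy stand-in for validation accuracy, peaking at threshold = 0.7.
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.1

def reward(threshold: float) -> float:
    """Stand-in for validation accuracy of the downstream selector."""
    return -(threshold - 0.7) ** 2

mu = 0.2                        # policy mean: the tunable hyperparameter
lr = 0.05
baseline = 0.0
for _ in range(3000):
    t = rng.normal(mu, SIGMA)                # sample a candidate hyperparameter
    r = reward(t)
    grad = (t - mu) / SIGMA**2               # d log N(t; mu, sigma) / d mu
    mu += lr * (r - baseline) * grad         # REINFORCE with a running baseline
    baseline = 0.9 * baseline + 0.1 * r      # variance-reduction baseline
print(round(mu, 2))                          # drifts toward the reward peak
```

In the paper's setting the reward would come from running the full dual-criterion selection and scoring the resulting feature subset, which is non-differentiable in the hyperparameters, hence the policy-gradient route.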
Updated: 2026-05-05 10:40:40
Categories: cs.LG
InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy
As major progress in LLM-based long-form text generation enables paradigms such as retrieval-augmented generation (RAG) and inference-time scaling, safely incorporating private information into the generation remains a critical open question. We present InvisibleInk, a highly scalable long-form text generation framework satisfying rigorous differential privacy guarantees with respect to the sensitive reference texts. It interprets sampling from the LLM's next-token-distribution as the exponential mechanism over the LLM logits with two innovations. First, we reduce the privacy cost by isolating and clipping only the sensitive information in the model logits (relative to the public logits). Second, we improve text quality by sampling without any privacy cost from a small superset of the top-$k$ private tokens. Empirical evaluations demonstrate a consistent $8\times$ (or more) reduction in computation cost over state-of-the-art baselines to generate long-form private text of the same utility across privacy levels. InvisibleInk is able to generate, for the first time, high-quality private long-form text at less than $4$-$8\times$ times the computation cost of non-private generation, paving the way for its practical use. We open-source a pip-installable Python package (invink) for InvisibleInk at https://github.com/cerai-iitm/invisibleink.
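The two innovations can be illustrated with a simplified sampler over a toy vocabulary: clip only the difference between private and public logits (the sensitive part), then sample from a small superset of the top-k tokens. This sketch omits the actual DP noise calibration and privacy accounting; vocabulary size, clipping bound, and set sizes are illustrative assumptions.

```python
# Hedged sketch of clipped-logit-gap sampling over a restricted candidate set;
# NOT a complete DP mechanism (no calibrated noise or epsilon accounting).
import numpy as np

rng = np.random.default_rng(0)

def private_sample(pub_logits, priv_logits, clip=1.0, k=5, extra=3):
    # 1) isolate and clip only the sensitive information in the logits
    delta = np.clip(priv_logits - pub_logits, -clip, clip)
    clipped = pub_logits + delta
    # 2) sample from a small superset of the top-k tokens
    top = np.argsort(clipped)[-(k + extra):]
    probs = np.exp(clipped[top] - clipped[top].max())  # softmax over the set
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

pub = rng.standard_normal(50)                 # public logits, toy vocabulary
priv = pub + 2.0 * rng.standard_normal(50)    # private logits diverge somewhat
token = private_sample(pub, priv)
print(0 <= token < 50)
```

Clipping the gap rather than the logits themselves is what bounds the sensitivity of the exponential mechanism while leaving the public signal untouched.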
Updated: 2026-05-05 10:38:51
Categories: cs.LG,cs.CL,cs.CR
A note on the unique properties of the Kullback--Leibler divergence for sampling via gradient flows
We consider the problem of sampling from a probability distribution $π$ which admits a density w.r.t. a dominating measure. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise a divergence from $π$. The optimisation problem is normally solved through gradient flows in the space of probability distributions with an appropriate metric. We show that the Kullback--Leibler divergence is the only divergence in the family of Bregman divergences whose gradient flow w.r.t. many popular metrics does not require knowledge of the normalising constant of $π$.
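The mechanism behind this uniqueness can be sketched in one line (a standard computation in our own notation, not the paper's). Writing $π= \tilde{π} / Z$ with unnormalised density $\tilde{π}$ and normalising constant $Z$, the first variation of the KL objective is

```latex
\frac{\delta}{\delta \rho}\,\mathrm{KL}(\rho \,\|\, π)
  = \log \frac{\rho}{π} + 1
  = \log \rho - \log \tilde{π} + \log Z + 1,
```

so a gradient flow driven by $\nabla \log \rho - \nabla \log \tilde{π}$ (e.g. the Wasserstein gradient flow, whose drift is the gradient of the first variation) never sees the additive constant $\log Z + 1$, whereas for other Bregman divergences the analogous variation retains a genuinely $Z$-dependent, non-constant term.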
Updated: 2026-05-05 10:35:38
Categories: stat.ME,cs.LG,math.ST,stat.CO
Certified Purity for Cognitive Workflow Executors: From Static Analysis to Cryptographic Attestation
We present a certified purity architecture that converts governance enforcement in cognitive workflow systems from a runtime convention into a structural capability boundary. A prior three-layer governance architecture proves governance completeness, provenance completeness, and the impossibility of ungoverned effects, conditional on the pure module constraint: that step executors cannot perform effects. That constraint was enforced by module import graph analysis, which is insufficient against adversarial bypass on the BEAM virtual machine. This paper closes the gap through four mechanisms: (1) a restricted WebAssembly compilation target where effect-producing instructions are structurally absent; (2) purity certificates, cryptographically signed proofs binding executor binaries to their import classifications; (3) a runtime verification gate that rejects uncertified executors before they enter the governance pipeline; and (4) portable governance credentials via remote attestation for cross-organizational verification. We prove four theorems: structural purity by construction, bypass elimination for all five BEAM bypass classes, certificate integrity, and gate completeness. The guarantee holds relative to an explicit Trusted Computing Base. Evaluation on four implemented executors shows verification latency of 39--42 us, full plan cycle under 400 us, runtime overhead under 0.4% of a 100 ms HTTP request, and zero determinism divergences across repeated invocations.
Updated: 2026-05-05 10:31:31
Categories: cs.CR,cs.AI,cs.PL
Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models
Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer blocks where ethical-framework-related pathways first converge and then diverge. Gating non-target branches at these loci blocks the downstream propagation while leaving upstream computations intact. We find that this intervention alone increases targeted ethical-framework reasoning. To achieve fine-grained control, we adapt Common Spatial Patterns to the residual stream and extract, for each branch-point layer, a pair of directions that discriminate between utilitarian and deontological frameworks. We then introduce Dual Logit Calibration, a closed-form, minimum-$\ell_2$-norm update that moves the residual within this two-dimensional subspace so the resulting directional projections align with user-specified preference weights. Experiments on real-life moral dilemmas show that our method reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines while providing an interpretable mechanism.
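The least-norm algebra behind a calibration step of this kind can be made concrete: given two extracted directions and target projection weights, the minimum-$\ell_2$-norm residual update has a closed form. The dimensionality and direction vectors below are illustrative; this reproduces only the generic minimum-norm computation, not the paper's CSP extraction or full procedure.

```python
# Hedged sketch of a closed-form, minimum-l2-norm update that moves a residual
# vector so its projections onto two directions match target weights.
import numpy as np

rng = np.random.default_rng(0)
d = 16
U = rng.standard_normal((d, 2))    # two framework directions (columns)
r = rng.standard_normal(d)         # residual-stream vector at a branch point
w = np.array([0.8, 0.2])           # user-specified target projections

# Minimize ||delta||_2 subject to U.T @ (r + delta) == w:
#   delta = U (U^T U)^{-1} (w - U^T r)
delta = U @ np.linalg.solve(U.T @ U, w - U.T @ r)
r_new = r + delta

print(np.allclose(U.T @ r_new, w))   # constraint holds exactly
```

Because `delta` lies in the span of the two directions, the update leaves every component of the residual orthogonal to that subspace untouched, which is how general capabilities can be largely preserved.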
Updated: 2026-05-05 10:30:44
Categories: cs.AI,cs.LG
Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries
We present an algebraic semantics for governed execution in which governance is axiomatized, compositional, and coterminous with expressibility. The framework, mechanized in 32 Rocq modules (~12,000 lines, 454 theorems, 0 admitted), is built on interaction trees and parameterized coinduction. A three-axiom GovernanceAlgebra record (safety, transparency, properness) induces a symmetric monoidal category with verified pentagon, triangle, and hexagon coherence, where every tensor composition preserves governance. An algebraic effect system constrains the handler algebra so that only governance-preserving handlers can be constructed in the safe fragment; programs in the empty capability set provably emit only observability directives. Capability-indexed composition bundles programs with machine-checked capability bounds, and a dual guarantee theorem establishes that within_caps and gov_safe hold simultaneously under all composition operators. The capstone result is the coterminous boundary: within our formal model, every program expressible via the four primitive morphism constructors is governed under interpretation, and every governed program is the image of such a program. Turing completeness is preserved inside governance; unmediated I/O is excluded from the governed fragment. Governance denial is modeled as safe coinductive divergence. The governance algebra is parametric: any system instantiating the three axioms inherits all derived properties, including convergence, compositional closure, and goal preservation. Extracted OCaml runs as a NIF in the BEAM runtime, with property-based testing (70,000+ random inputs, zero disagreements) confirming behavioral equivalence between the specification and the runtime interpreter.
Updated: 2026-05-05 10:30:32
Categories: cs.AI,cs.LO,cs.PL
Multi-Agent Strategic Games with LLMs
This paper asks whether large language models (LLMs) can be used to study the strategic foundations of conflict and cooperation. I introduce LLMs as experimental subjects in a repeated security dilemma and evaluate whether they reproduce canonical mechanisms from international relations theory. The baseline game is extended along three theoretically central dimensions: multipolarity, finite time horizons, and the availability of communication. Across multiple models, the results exhibit systematic and consistent patterns: multipolarity increases the likelihood of conflict, finite horizons induce universal unraveling consistent with backward-induction logic, and communication reduces conflict by enabling signaling and reciprocity. Beyond observed behavior, the design provides access to agents' private reasoning and public messages, allowing choices to be linked to underlying strategic logics such as preemption, cooperation under uncertainty, and trust-building. The contribution is primarily methodological. LLM-based experiments offer a scalable, transparent, and replicable approach to probing theoretical mechanisms.
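The backward-induction logic behind finite-horizon unraveling is easy to demonstrate: in a repeated game whose stage game has a strictly dominant action, solving the last round first collapses every round to the stage game. The prisoner's-dilemma payoffs below are a standard textbook stage game, not the paper's exact security-dilemma parameterization.

```python
# Hedged illustration of finite-horizon unraveling via backward induction:
# with a strictly dominant stage action, every round reduces to defection.

# Stage payoffs to the row player: (my_action, their_action) -> payoff
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def best_response(their_action: str) -> str:
    """Stage-game best response; D dominates regardless of the opponent."""
    return max("CD", key=lambda a: PAYOFF[(a, their_action)])

def backward_induction(horizon: int) -> list[str]:
    """Solve round T, then T-1, ...: with no future leverage in the last
    round, cooperation cannot be sustained in any earlier round either."""
    plan = []
    for _ in range(horizon):
        plan.append(best_response("C"))  # dominant action, whatever they play
    return plan

print(backward_induction(5))   # ['D', 'D', 'D', 'D', 'D']
```

The abstract's finding that LLM agents defect universally under finite horizons is consistent with exactly this subgame-perfect prediction.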
Updated: 2026-05-05 10:28:17
Categories: cs.GT,cs.AI,cs.CY
Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks
Understanding how biological and artificial neural networks implement computation from connectivity is a central problem in neuroscience and machine learning. In neural systems, structural and functional connectivity are known to diverge, motivating approaches that move beyond direct connections alone. Here, we show that the spatial and temporal function of recurrent neural networks (RNNs) trained on hierarchically modular tasks can be recovered by modelling the network as a graph and analysing the multi-hop pathways between input and output units. In particular, decomposing these pathways by hop length reveals how the network temporally routes information. This perspective reframes regularisation: if function is implemented through multi-hop communication, then standard penalties such as L1 regularisation, which act only on individual weights, constrain single-hop structure rather than the multi-hop pathways that support computation. Motivated by this view, we introduce resolvent-RNNs (R-RNNs), which constrain multi-hop pathways and thereby induce temporal sparsity beyond that achieved by standard L1 regularisation. Compared with L1 regularisation, R-RNNs achieve improved performance by inducing temporal sparsity that matches the task structure, even when the task signal is sparse. Moreover, R-RNNs exhibit stronger sparsity-function alignment, reflected in their increased robustness under strong regularisation. Together, our results identify multi-hop communication as a key principle linking structure to function in recurrent networks, and suggest that sparsity should be defined over functional pathways rather than individual parameters.
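The hop-length decomposition of input-to-output communication can be sketched on a toy graph: powers of the weight matrix give k-hop pathway strength, and the resolvent $(I - \alpha W)^{-1}$ sums all hop lengths geometrically. The 3-node chain is a toy illustration under our own conventions, not one of the paper's trained RNNs.

```python
# Hedged sketch of multi-hop pathway decomposition: W^k gives k-hop strength,
# the resolvent (I - alpha*W)^{-1} = sum_k (alpha*W)^k sums over all hops.
import numpy as np

# Convention: W[i, j] is the weight of the edge from node j to node i.
W = np.array([[0.0, 0.0, 0.0],    # chain: input (0) -> hidden (1) -> output (2)
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
alpha = 1.0

# k-hop pathway strength from input node 0 to output node 2
hops = [float(np.linalg.matrix_power(W, k)[2, 0]) for k in range(4)]
print(hops)   # only the 2-hop path carries signal on this chain

# Resolvent sums all hop lengths (converges when the spectral radius < 1)
resolvent = np.linalg.inv(np.eye(3) - alpha * W)
print(np.isclose(resolvent[2, 0],
                 sum(np.linalg.matrix_power(W, k)[2, 0] for k in range(50))))
```

Penalizing entries of the resolvent rather than of W itself is, in this picture, the difference between constraining multi-hop pathways and constraining single-hop structure, which is the distinction the R-RNN regularizer targets.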
Updated: 2026-05-05 10:18:53
Categories: cs.NE,cs.AI
ZK-Value: A Practical Zero-Knowledge System for Verifiable Data Valuation
Data valuation is a foundational task in data marketplaces, where a Shapley-value attribution determines how a buyer's payment is distributed among data providers. Typically, the marketplace operator runs this attribution alone, requiring participants and external auditors to trust scores they cannot independently recompute on the underlying private data. While zero-knowledge proofs (ZKPs) can theoretically reconcile this conflict between privacy and verifiability, existing ZK valuation systems fail to scale to real-world marketplace demands due to prohibitive proving times or the requirement to disclose validation cohorts. We present ZK-Value, a practical, end-to-end ZK data-valuation system. Our solution bridges the scalability gap through a fully co-designed architecture: (1) LSH-Shapley, a locality-based valuation primitive that replaces expensive pairwise distance metrics with per-bucket collision counts; (2) ZK-LSH-Shapley, a tailored ZKP protocol that drastically reduces witness size by encoding these counts into bucket-level histograms rather than naive per-pair tensors; and (3) structural proof-system optimizations, specifically super-oracle batching and sparsity skipping. Evaluated across 12 standard datasets, ZK-Value delivers valuation quality on par with state-of-the-art baselines (within 0.033 AUROC of exact KNN-Shapley), while generating proofs in seconds to minutes and outperforming specialized ZK baselines by 12.6x to 68.1x in proving time, with verification in under 4.6 s.
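The core trick in LSH-Shapley, replacing pairwise distances with per-bucket collision counts, can be sketched with random-hyperplane (SimHash) bucketing. The snippet below is a hedged illustration only: the bucket score `vc[b] / tc[b]` is an invented stand-in, not the paper's valuation formula, and all names are hypothetical.

```python
import numpy as np
from collections import Counter

def lsh_buckets(X, planes):
    """SimHash: the sign pattern against random hyperplanes is the bucket id."""
    bits = (X @ planes.T > 0).astype(np.int64)
    return bits @ (1 << np.arange(planes.shape[0]))

def collision_scores(train_X, valid_X, n_planes=8, seed=0):
    """Score each training point by how often validation points collide with
    it in its LSH bucket, normalised by bucket occupancy. One pass over
    bucket histograms replaces all pairwise distance computations."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_planes, train_X.shape[1]))
    tb = lsh_buckets(train_X, planes)
    vb = lsh_buckets(valid_X, planes)
    tc, vc = Counter(tb.tolist()), Counter(vb.tolist())
    return np.array([vc.get(b, 0) / tc[b] for b in tb.tolist()])
```

Bucket-level histograms like `tc` and `vc` are also what shrink the ZK witness: the circuit only needs per-bucket counts, never a per-pair distance tensor.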
Updated: 2026-05-05 09:51:43
Categories: cs.CR
SPRINT: Robust Model Attribution of Generated Images via Secret Pixel Reconstruction
Detecting the source model of AI-generated images is a growing accountability problem. AI fingerprinting techniques address this by detecting imperceptible patterns in the images that are unique to each model, achieving high detection accuracy under ideal conditions. However, recent research has shown that image fingerprints are extremely brittle to adaptive attacks, where knowledge of the technique can be exploited to perturb the fingerprints and evade detection. We present SPRINT (Secret Pixel Reconstruction fingerprinting), a novel model attribution method specifically designed to provide robustness to adaptive attacks. As opposed to existing fingerprinting, which focuses on publicly discoverable patterns in the image, SPRINT relies on a secret to define hidden reconstruction targets, thus keeping the verification task itself private. As a result, the attacker can no longer see the task that the verifier solves at verification time, protecting the information exploited by the attacks. Our results show that SPRINT achieves high closed-world accuracy while remaining robust to adaptive attacks: on the FFHQ dataset, SPRINT reaches 99.17% clean accuracy on a diverse 12-model pool and 98.83% on a harder pool of 6 close checkpoints of the same model architecture, while reducing adaptive removal and forgery attack success rates to 1% or below. When the same pool of close model checkpoints is considered an open world, SPRINT maintains high accuracy with an AUROC of 99.30%. These findings show that the approach of privatizing the verification task can make adaptive evasion substantially harder while maintaining performance in the clean setting.
Updated: 2026-05-05 09:40:32
Categories: cs.CR,cs.AI,cs.LG
GuardSec: A Multi-Modal Web Platform for Real-Time Digital Fraud Detection, Entity Verification, and Connection Security Analysis in the African Context
Online fraud in Africa has reached epidemic scale, yet the few cybersecurity tools that exist are not available to ordinary citizens and are calibrated almost exclusively for SOCs and technically literate users operating on stable broadband connections. This mismatch is not incidental: it is the predictable outcome of a research culture that optimises for benchmark performance while systematically neglecting deployability, accessibility, and local threat context. This paper presents GuardSec, a production-deployed, openly accessible web platform for real-time multi-modal digital threat verification, designed from the ground up for the African user context. The system enables any user with a browser to assess the legitimacy of URLs, websites, phone numbers, email addresses, and business entities in under five seconds, without registration, without an API key, and without cybersecurity expertise. A distinctive original component of GuardSec is the Mon Empreinte (My Footprint) module, which performs a real-time security audit of the user's own connection and digital exposure: it analyses the visitor's IP address, geolocation, ISP identity, connection type, device fingerprint, browser configuration, and a set of twelve security indicators spanning network integrity, tracking exposure, and anonymisation status. This self-diagnostic capability transforms GuardSec from a passive verification tool into an active instrument of digital self-awareness, enabling users to understand not only whether an external entity is safe, but whether their own connection is compromised, tracked, or exposed. The platform additionally embeds Gilda, a context-aware conversational security assistant that answers user questions about digital threats in plain language and issues personalised security recommendations on demand.
Updated: 2026-05-05 08:54:08
Categories: cs.CR
Design of Memristive Lightweight Encryption For In-Memory Image Steganography
With the expansion of data-intensive applications and increasing data volumes, providing an efficient solution to address growing energy consumption and performance degradation caused by the transfer of large amounts of data between the processor and the main memory has become a severe challenge. The frequent transfer of large amounts of data between internal chip units, memories, and their interconnections exacerbates the vulnerability of the data being accessed. Employing a memristive Computation In-Memory-Array (CIM-A) architecture limits data transfer, thereby addressing both challenges. Furthermore, by integrating lightweight cryptography, developed to secure data in hardware-constrained devices, with CIM-A architectures, the security of data in transit, especially across interconnections, can be ensured. This paper implements two standard lightweight stream ciphers, Trivium and Grain-128a, for CIM using stateful material implication (IMPLY) logic to address these combined security and performance challenges. In addition to redesigning the cryptographic structures, we reduce the hardware complexity of conventional IMPLY-based implementations by proposing an efficient method for shifting data within the shift registers. Applying the proposed data-shifting method to the registers of these ciphers reduces the number of computational steps and decreases energy consumption by up to 42% and 44%, respectively, compared to conventional implementations. Finally, the performance of the proposed circuits is evaluated in a steganography application, demonstrating their practical efficiency.
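The IMPLY primitive the design builds on is easy to state in software. Below is a hedged bit-level sketch: it shows only that material implication plus a FALSE constant is functionally complete and what one clock of a shift register computes; it does not model memristor states, the paper's circuit topology, or its proposed data-shifting optimisation.

```python
def imply(p, q):
    """Material implication: p IMPLY q = (NOT p) OR q.
    Together with a FALSE constant, IMPLY is functionally complete."""
    return (not p) or q

def not_gate(p):
    return imply(p, False)          # NOT p = p IMPLY 0

def or_gate(p, q):
    return imply(not_gate(p), q)    # (NOT p) IMPLY q = p OR q

def nand_gate(p, q):
    return imply(p, imply(q, False))  # p IMPLY (NOT q) = NOT(p AND q)

def shift_register_step(reg, feedback_bit):
    """One clock of a shift register: drop the last bit and insert the
    feedback bit at the front, the operation the ciphers perform every step."""
    return [feedback_bit] + reg[:-1]
```

In an IMPLY-based CIM array, each of these logical compositions costs a sequence of in-memory steps, which is why an efficient register-shifting method translates directly into fewer steps and lower energy.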
Updated: 2026-05-05 08:33:25
Categories: cs.CR,cs.ET
From TinyGo to gc Compiler: Extending Zorya's Concolic Framework to Real-World Go Binaries
Zorya is a concolic execution framework that lifts compiled binaries to Ghidra's P-Code intermediate representation and uses the Z3 SMT solver to detect vulnerabilities by reasoning over both concrete and symbolic values. Previous versions supported only single-threaded TinyGo binaries. In this paper, we extend Zorya to multi-threaded binaries produced by Go's standard gc compiler. This is achieved by restoring OS thread states from gdb dumps, neutralizing runtime preemption, and introducing overlay path analysis with copy-on-write semantics to detect silent vulnerabilities on untaken branches. We rigorously assess Zorya on 11 real-world vulnerabilities from production Go projects such as Kubernetes, Go-Ethereum, and CoreDNS. Our evaluation shows that Zorya detects seven bugs at the binary level, including a silent integer overflow that no other evaluated tool finds without a manually written oracle.
Updated: 2026-05-05 08:31:22
Categories: cs.CR,cs.SC,cs.SE
Position: Mind the Gap-AI Security and the Limits of Current Reporting Standards
AI systems face a growing number of AI security threats that are increasingly exploited in the real world. Hence, shared AI incident reporting practices are emerging in industry as best practice and as mandated by regulatory requirements. Although non-AI cybersecurity and non-security AI reporting have progressed as industrial and policy norms, existing collections of practices do not meet the specific requirements posed by AI security reporting. We argue that established processes are not well aligned with AI security reporting due to fundamental shortcomings in handling the distinctive characteristics of AI systems. Some of these shortcomings are immediately addressable, while others remain unresolved technically or within social systems, like the treatment of IP or the ownership of a vulnerability. Based on this position, we examine the limitations of current AI security incident reporting proposals. We conclude that the advent of AI agents will further reinforce the need to advance specialized AI security incident reporting.
Updated: 2026-05-05 08:22:20
Categories: cs.CR
MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
Persistent external memory enables LLM agents to maintain context across sessions, yet its security properties remain formally uncharacterized. We formalize memory poisoning attacks on retrieval-augmented agents as a Stackelberg game with a unified evaluation framework spanning three attack classes with escalating access assumptions. Correcting an evaluation protocol inconsistency in the triggered-query specification of Chen et al. (2024), we show faithful evaluation increases measured attack success by $4\times$ (ASR-R: $0.25 \to 1.00$). Our primary contribution is MEMSAD (Semantic Anomaly Detection), a calibration-based defense grounded in a gradient coupling theorem: under encoder regularity, the anomaly score gradient and the retrieval objective gradient are provably identical, so any continuous perturbation that reduces detection risk necessarily degrades retrieval rank. This coupling yields a certified detection radius guaranteeing correct classification regardless of adversary strategy. We prove minimax optimality via Le Cam's method, showing any threshold detector requires $\Omega(1/\rho^2)$ calibration samples and MEMSAD achieves this up to $\log(1/\delta)$ factors. We further derive online regret bounds for rolling calibration at rate $O(\sigma^{2/3}\Delta^{1/3})$, and formally characterize a discrete synonym-invariance loophole that marks the boundary of what continuous-space defenses can guarantee. Experiments on a $3 \times 5$ attack-defense matrix with bootstrap confidence intervals, Bonferroni-corrected hypothesis tests, and Clopper-Pearson validation ($n=1{,}000$) confirm: composite defenses achieve TPR $= 1.00$, FPR $= 0.00$ across all attacks, while synonym substitution evades detection at $\Delta$ ASR-R $\approx 0$, exposing a gap existing embedding-based defenses cannot close.
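Stripped of the certified-radius and minimax machinery, the calibration-based core of such a defense fits in a few lines. The sketch below uses hypothetical names and a toy anomaly score (distance to the centroid of trusted memory embeddings, not MEMSAD's actual score function): fit a threshold on clean calibration scores, then flag memory entries above it.

```python
import numpy as np

def fit_threshold(calib_scores, fpr=0.01):
    """Threshold = the (1 - fpr) quantile of clean calibration scores,
    so roughly an fpr fraction of clean entries would be flagged."""
    return float(np.quantile(calib_scores, 1.0 - fpr))

def anomaly_scores(memory_vecs, center):
    """Toy anomaly score: embedding distance from the trusted centroid."""
    return np.linalg.norm(memory_vecs - center, axis=1)

def detect(memory_vecs, calib_vecs, fpr=0.01):
    """Flag memory entries whose score exceeds the calibrated threshold."""
    center = calib_vecs.mean(axis=0)
    tau = fit_threshold(anomaly_scores(calib_vecs, center), fpr)
    return anomaly_scores(memory_vecs, center) > tau
```

The paper's gradient-coupling argument concerns exactly this kind of score: if lowering it also lowers retrieval rank, a poisoned entry cannot evade detection and still be retrieved.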
Updated: 2026-05-05 08:15:41
Categories: cs.CR,cs.AI,cs.LG
Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using formalisms such as set theory, formal logic, and quantum mechanics -- bypasses these filters at high rates, achieving 46%--56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.
Updated: 2026-05-05 07:25:33
Categories: cs.CR,cs.AI,cs.CL,cs.LG
Graph Reconstruction from Differentially Private GNN Explanations
Regulatory frameworks such as GDPR increasingly require that ML predictions be accompanied by post-hoc explanations, even when raw data and trained models cannot be released. Differential privacy (DP) is the standard mitigation for the residual privacy risk of releasing these explanations. We show that DP is not sufficient: an adversary observing only DP-perturbed GNN explanations can reconstruct hidden graph structure with high accuracy. Our attack, PRIVX, exploits the fact that the Gaussian DP mechanism is a single DDPM forward step at known noise level $\sigma(\varepsilon)$, recasting reconstruction as reverse diffusion conditioned on the corrupted signal, a principled Bayesian denoiser under known DP corruption. We formalise a stratified adversary model parameterised by $(M, \hat{\varepsilon}, \hat{\delta}, S, \rho)$ that interpolates between oblivious and oracle attackers, and derive endpoint-matched two-sided bounds on reconstruction AUC. For practitioners, we provide regime-stratified guidance on explainer choice: on homophilic graphs, neighbourhood-aggregating explainers (GraphLIME, GNNExplainer) leak more structure than per-node gradient explainers under the same DP budget; on strongly heterophilic graphs the ordering reverses. We introduce PRIVF as an auxiliary diagnostic sharing the same diffusion backbone to decompose leakage into explainer-induced and intrinsic graph-distribution components. Experiments across seven benchmarks, three DP mechanisms, and three GNN backbones show PRIVX achieves AUC above 0.7 at $\varepsilon = 5$ on five of seven datasets, with the attack succeeding well within typically deployed privacy budgets.
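The observation the attack rests on is that the defender's own mechanism tells the adversary the corruption level: a Gaussian DP release is one additive-noise step at a noise scale computable from the privacy parameters. The sketch below uses the classical Gaussian-mechanism calibration $\sigma = \sqrt{2\ln(1.25/\delta)}\,\Delta_2/\varepsilon$, which is only valid for $\varepsilon < 1$ (a regime like the paper's $\varepsilon = 5$ would need the analytic Gaussian mechanism instead); the function names are illustrative.

```python
import math
import numpy as np

def gaussian_sigma(eps, delta, sensitivity=1.0):
    """Classical Gaussian-mechanism calibration (valid for eps < 1):
    sigma = sqrt(2 ln(1.25/delta)) * sensitivity / eps."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / eps

def dp_release(x, eps, delta, seed=0):
    """DP-perturbed release y = x + N(0, sigma^2 I): exactly one Gaussian
    corruption step at a noise level the adversary can compute from
    the published (eps, delta)."""
    sigma = gaussian_sigma(eps, delta)
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, sigma, size=x.shape), sigma
```

Because `sigma` is public, a denoiser can be conditioned on the exact corruption level, which is what makes the diffusion-based reconstruction principled rather than heuristic.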
Updated: 2026-05-05 05:50:06
Categories: cs.LG,cs.CR
DECKER: Domain-invariant Embedding for Cross-Keyboard Extraction and Recognition
Acoustic side-channel attacks (ASCA) on keyboards pose a significant security risk, as keystrokes can be inferred from typing acoustics, revealing sensitive information. Prior ASCA studies are limited by small-scale datasets with restricted diversity in users, keyboards, and environments, constraining analysis across devices, microphones, and noise conditions. We introduce HEAR, a dataset designed to study ASCA along three axes: keyboard generalization, noise adaptation, and user bias. HEAR contains recordings from 53 participants using 37 laptop keyboards, collected in three realistic settings: (1) external microphone capture, (2) device microphone capture without network noise, and (3) VoIP-based streaming capture. This enables controlled evaluation across users, keyboards, and environments. On HEAR, we establish an ASCA benchmark spanning conventional features and pre-trained representations from raw audio and spectrograms in unimodal and multimodal settings. We propose DECKER, a domain-invariant keystroke inference framework with four stages: (1) Keyboard Signature Normalization to reduce device coloration, (2) domain-adversarial disentanglement to suppress keyboard identity, (3) supervised cross-keyboard contrastive alignment to enforce key consistency, and (4) Acoustic Style Randomization to synthesize unseen keyboard responses. We further explore sentence-level inference using an LLM-based post-processing layer to refine keystroke sequences via linguistic context. Results on HEAR show DECKER improves keystroke identification over strong baselines, particularly in cross-keyboard and cross-user settings, with further gains from language-model rectification. These findings highlight that ASCA remains effective across diverse users, devices, and noisy environments, underscoring its practical security risk.
Updated: 2026-05-05 05:43:24
Categories: cs.CR,cs.SD
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
The rise of Large Language Model (LLM) agents, augmented with tool use, skills, and external knowledge, has introduced new security risks. Among them, prompt injection attacks, where adversaries embed malicious instructions into the agent workflow, have emerged as the primary threat. However, existing benchmarks and defenses are fundamentally limited as they assume context-insensitive settings in which the agent works under a fully specified user instruction, and the attacks are straightforward and context-independent. As a result, they fail to capture real-world deployments where agent behavior usually depends on dynamic context, not just the user prompt, and adversaries can adapt their attacks to different context. Similarly, existing defenses built on this narrow threat model overlook the nature of real-world agent delegation. In this paper, we present AgentLure, a benchmark that captures context-dependent tasks and context-aware prompt injection attacks. AgentLure spans four agentic domains and eight attack vectors across diverse attack surfaces. Our evaluation shows that existing defenses often struggle in this setting, yielding poor performance against such attacks in agentic systems. To address this limitation, we propose ARGUS, a defense mechanism that enforces provenance-aware decision auditing for LLM agents. ARGUS constructs an influence provenance graph to track how untrusted context propagates into agent decisions and verify whether a decision is justified by trustworthy evidence before execution. Our evaluation shows ARGUS reduces attack success rate to 3.8% while preserving 87.5% task utility, significantly outperforming existing defenses and remaining robust against adaptive white-box adversaries.
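The provenance idea is mechanically simple even though ARGUS's real influence graph is richer. A hedged sketch (the event format and the justification rule are invented for illustration, not ARGUS's actual algorithm): taint flows from untrusted context through derived nodes, and an action executes only if at least one supporting input is untainted.

```python
def tainted_nodes(events):
    """events: (node, parents, trusted) triples in execution order.
    A node is tainted if it is untrusted or derived from a tainted ancestor."""
    tainted = set()
    for node, parents, trusted in events:
        if not trusted or any(p in tainted for p in parents):
            tainted.add(node)
    return tainted

def justified(decision_parents, tainted):
    """Provenance-aware audit: execute a decision only if some piece of
    supporting evidence is untainted (trustworthy)."""
    return any(p not in tainted for p in decision_parents)
```

A decision driven solely by content derived from, say, a retrieved web page is blocked, while the same decision grounded in the user's own instruction goes through.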
Updated: 2026-05-05 05:37:00
Categories: cs.CR,cs.SE
The AI risk repository: A meta-review, database, and taxonomy of risks from artificial intelligence
Artificial intelligence (AI) is reshaping society, from video generation to medical diagnosis, coding agents to autonomous vehicles. Yet researchers, policymakers, and technology companies lack shared terminology for discussing AI risks. Consider "privacy": one framework uses this term to describe a model's ability to leak sensitive training data, while another uses it to mean freedom from government surveillance. Conversely, researchers have introduced "Goodhart's law," "specification gaming," "reward hacking," and "mesa-optimization" to describe the same phenomenon of AI systems optimizing for measured proxies rather than intended goals. This terminological diversity creates friction: comparing findings across studies requires mapping between frameworks, and comprehensive risk coverage requires consulting multiple taxonomies that use different organizing principles. This paper addresses this challenge by creating a comprehensive catalog of AI risks. We systematically analyzed every major AI risk framework published to date (74 frameworks containing 1,725 distinct risks) and organized them into a unified system. Our two classification systems reveal important patterns: contrary to common assumptions, human decisions cause nearly as many AI risks (38%) as the AI systems themselves (42%). The work provides practical tools for anyone working on AI safety, from developers conducting risk assessments to policymakers writing regulations to auditors evaluating AI systems. By establishing a common reference point, this repository creates the foundation for more coordinated and comprehensive approaches to managing AI's risks while realizing its benefits.
Updated: 2026-05-05 05:31:40
Categories: cs.AI,cs.CR,cs.ET,cs.LG,eess.SY
SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents
LLM-Agents have evolved into autonomous systems for complex task execution, with the SKILL.md specification emerging as a de facto standard for encapsulating agent capabilities. However, a critical bottleneck remains: different agent frameworks exhibit starkly different sensitivities to prompt formatting, causing up to 40% performance variation, yet nearly all skills exist as a single, format-agnostic Markdown version. Manual per-platform rewriting creates an unsustainable maintenance burden, while prior audits have found that over one third of community skills contain security vulnerabilities. To address this, we present SkCC, a compilation framework that introduces classical compiler design into agent skill development. At its core, SkIR, a strongly-typed intermediate representation, decouples skill semantics from platform-specific formatting, enabling portable deployment across heterogeneous agent frameworks. Around this IR, a compile-time Analyzer enforces security constraints via Anti-Skill Injection before deployment. Through a four-phase pipeline, SkCC reduces adaptation complexity from $O(m \times n)$ to $O(m + n)$. Experiments on SkillsBench demonstrate that compiled skills consistently outperform their original counterparts, improving pass rates from 21.1% to 33.3% on Claude Code and from 35.1% to 48.7% on Kimi CLI, while achieving sub-10ms compilation latency, a 94.8% proactive security trigger rate, and 10-46% runtime token savings across platforms.
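The $O(m \times n) \to O(m + n)$ claim is the classic compiler front-end/back-end split. A toy sketch (the dataclass fields and renderer names are invented for illustration and are not SkIR's real schema): each of m skills is parsed once into a shared typed IR, and each of n platforms gets one renderer over that IR, so adapters grow additively rather than multiplicatively.

```python
from dataclasses import dataclass, field

@dataclass
class SkillIR:
    """Toy stand-in for a typed IR: structured fields, not free-form Markdown."""
    name: str
    description: str
    steps: list = field(default_factory=list)

def parse_skill(name, description, steps):
    # front-end: one parser per skill source (m of them)
    return SkillIR(name, description, list(steps))

def render_xmlish(ir: SkillIR) -> str:
    # back-end: one renderer per platform (n of them),
    # never one per (skill, platform) pair
    body = "".join(f"<step>{s}</step>" for s in ir.steps)
    return f"<skill name='{ir.name}'>{body}</skill>"

def render_plain(ir: SkillIR) -> str:
    return ir.name + ": " + ir.description + "\n" + "\n".join(f"- {s}" for s in ir.steps)
```

A compile-time analyzer slots in naturally here too: it inspects the typed IR once, before any platform rendering happens.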
Updated: 2026-05-05 04:15:48
Categories: cs.CR,cs.AI
Cross-Paradigm Graph Backdoor Attacks with Promptable Subgraph Triggers
Graph Neural Networks (GNNs) are vulnerable to backdoor attacks, where adversaries implant malicious triggers to manipulate model predictions. Existing trigger generators are often simplistic in structure and overly reliant on specific features, confining them to a single graph learning paradigm, such as graph supervised learning, graph contrastive learning, or graph prompt learning. Such paradigm-specific designs lead to poor transferability across different learning frameworks, limiting attack success rates in general testing scenarios. To bridge this gap, we propose Cross-Paradigm Graph Backdoor Attacks with Promptable Subgraph Triggers (CP-GBA), which employs Graph Prompt Learning (GPL) to synthesize transferable subgraph triggers. Specifically, we first distill a compact yet expressive trigger set into a queryable repository, jointly optimizing for class-awareness, feature richness, and structural fidelity. Furthermore, we pioneer the theoretical exploration of GPL transferability under prompt-based objectives, ensuring robust generalization to diverse and unseen test-time paradigms. Extensive experiments across multiple real-world datasets and defense scenarios show that CP-GBA achieves state-of-the-art attack success rates.
Updated: 2026-05-05 03:00:23
Categories: cs.CR,cs.LG
Cryptographic Registry Provenance: Structural Defense Against Dependency Confusion in AI Package Ecosystems
Dependency confusion attacks exploit a structural gap in software distribution: once a package is installed, there is no cryptographic proof of which registry distributed it. Every existing defense is configuration-based and fails silently when misconfigured. We present a cryptographic distribution provenance system comprising three components: (1) cryptographic registry identity, where every registry holds an Ed25519 keypair and signs every artifact it distributes; (2) a dual-signature model, where the publisher signs at packaging time and the registry countersigns at publication time; and (3) authoritative namespace binding, where consumers pin registry fingerprints and the resolver cryptographically rejects artifacts from unauthorized registries. These create three defense layers requiring simultaneous compromise for a successful attack. A comparison across eight ecosystems (npm, Cargo, Hex.pm, PyPI, Go modules, Docker/OCI, NuGet, Maven) shows no existing ecosystem combines mandatory publisher signing, cryptographic registry identity, mandatory registry countersigning, and consumer-side cryptographic enforcement. The system extends to AI-generation provenance as a signed attribute and governance-enforced dependency resolution. A case study integrates distribution provenance with a three-layer runtime governance architecture, creating a four-phase lifecycle chain with no cryptographic gaps.
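The dual-signature chain can be sketched end to end. One loud caveat: the paper's primitive is asymmetric Ed25519, where consumers verify against pinned public-key fingerprints; the stdlib HMAC below is only a dependency-free stand-in for "sign" and "verify", so the key handling here does not reflect the real trust model, and all function names are hypothetical.

```python
import hashlib
import hmac

def sign(key: bytes, msg: bytes) -> bytes:
    """HMAC stands in for an Ed25519 signature; in the real design this is
    an asymmetric signature verified with the signer's public key."""
    return hmac.new(key, msg, hashlib.sha256).digest()

def publish(artifact: bytes, publisher_key: bytes, registry_key: bytes):
    pub_sig = sign(publisher_key, artifact)            # packaging time
    reg_sig = sign(registry_key, artifact + pub_sig)   # publication countersignature
    return {"artifact": artifact, "pub_sig": pub_sig, "reg_sig": reg_sig}

def resolve(pkg, publisher_key, pinned_registry_key):
    """Consumer-side enforcement: install only if BOTH signatures verify
    against pinned identities; otherwise the artifact is rejected."""
    ok_pub = hmac.compare_digest(pkg["pub_sig"], sign(publisher_key, pkg["artifact"]))
    ok_reg = hmac.compare_digest(
        pkg["reg_sig"], sign(pinned_registry_key, pkg["artifact"] + pkg["pub_sig"]))
    return ok_pub and ok_reg
```

An artifact published through an unauthorized registry carries the wrong countersignature and fails the pinned-fingerprint check, which is exactly the dependency-confusion path the design closes.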
Updated: 2026-05-05 02:56:31
Subjects: cs.CR,cs.AI,cs.SE
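The three components above (registry identity, dual signatures, pinned namespace binding) can be sketched as a resolver check. This is a minimal illustration, not the paper's implementation: it uses HMAC-SHA256 as a stand-in for Ed25519 signing (a real deployment would use Ed25519 keypairs and verify against public keys), and the fingerprint and wire formats are assumptions.

```python
# Dual-signature distribution provenance, sketched with HMAC in place of
# Ed25519. Publisher signs at packaging time; registry countersigns at
# publication time; the consumer pins the registry fingerprint.
import hashlib
import hmac

def sign(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

def publish(artifact: bytes, publisher_key: bytes, registry_key: bytes):
    pub_sig = sign(publisher_key, artifact)            # packaging-time signature
    reg_sig = sign(registry_key, artifact + pub_sig)   # registry countersignature
    return pub_sig, reg_sig

def resolve(artifact, pub_sig, reg_sig, publisher_key, registry_key,
            pinned_fingerprint) -> bool:
    # Layer 3: reject artifacts from any registry whose fingerprint
    # (here: hash of its key) is not pinned for this namespace.
    if hashlib.sha256(registry_key).hexdigest() != pinned_fingerprint:
        return False
    # Layers 1-2: both the publisher signature and the registry
    # countersignature must verify.
    ok_pub = hmac.compare_digest(sign(publisher_key, artifact), pub_sig)
    ok_reg = hmac.compare_digest(sign(registry_key, artifact + pub_sig), reg_sig)
    return ok_pub and ok_reg

# Usage: an artifact from the pinned registry resolves; the same artifact
# republished through a rogue registry is cryptographically rejected,
# which is exactly the dependency-confusion scenario.
pub_key, reg_key, rogue_key = b"publisher", b"registry", b"rogue"
pinned = hashlib.sha256(reg_key).hexdigest()
ps, rs = publish(b"pkg-1.0", pub_key, reg_key)
assert resolve(b"pkg-1.0", ps, rs, pub_key, reg_key, pinned)
ps2, rs2 = publish(b"pkg-1.0", pub_key, rogue_key)
assert not resolve(b"pkg-1.0", ps2, rs2, pub_key, rogue_key, pinned)
```

Note how the check fails closed: a misconfigured or missing pin yields rejection rather than the silent fallback that configuration-based defenses exhibit.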
"Security vs. Interoperability" Arguments: A Technical Framework
Concerns about big tech's monopoly power have featured prominently in recent media and policy discourse, as regulators across the European Union (EU), the United States (US) and beyond have ramped up efforts to promote healthier market competition. One favored approach is to require certain kinds of interoperation between platforms, to mitigate the current concentration of power in the biggest companies. Unsurprisingly, interoperability initiatives have generally been met with resistance by big tech companies. Perhaps more surprisingly, a significant part of that pushback has been in the name of security -- that is, arguing against interoperation on the basis that it will undermine security. We conduct a systematic examination of "security vs. interoperability" (SvI) discourse in the context of EU antitrust and competition proceedings. Our resulting contributions are threefold. First, we propose a taxonomy of SvI concerns in three categories: engineering, vetting, and hybrid. Second, we present an analytical framework for assessing real-world SvI concerns, and illustrate its utility by analyzing several case studies spanning our three taxonomy categories. Third, we undertake a comparative analysis that highlights key considerations around the interplay of economic incentives, market power, and security across our diverse case study contexts, identifying common patterns in each taxonomy category. Our contributions provide valuable analytical tools for experts and non-experts alike to critically assess SvI discourse in today's fast-paced regulatory landscape.
Updated: 2026-05-05 01:02:14
Subjects: cs.CR