    _              _         ____              
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        



Preprint: Poster: Did I Just Browse A Website Written by LLMs?

Increasingly, web content is automatically generated by large language models (LLMs) with little human input. We call this "LLM-dominant" content. Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Yet, websites rarely disclose such content, and human readers struggle to distinguish it. Thus, we must develop reliable detectors for LLM-dominant content. However, state-of-the-art LLM detectors are inaccurate on web content, because web content has low positive rates, complex markup, and diverse genres, rather than the clean, prose-like benchmark data that SoTA detectors are optimized for. We propose a highly reliable, scalable pipeline that classifies entire websites. Instead of naively classifying text extracted from each page, we classify each site based on an LLM text detector's outputs on multiple prose-like pages to boost accuracy. We train and evaluate our detector by collecting 2 distinct ground-truth datasets totaling 120 sites, and obtain 100% accuracy when testing across them. In the wild, we detect a sizable portion of sites as LLM-dominant among 10k sites in search engine results and 10k in Common Crawl archives. We find LLM-dominant sites are growing in prevalence and rank highly in search results, raising questions about their impact on end users and the overall Web ecosystem.
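The page-to-site aggregation step the abstract describes can be sketched in a few lines. This is an illustrative reconstruction, not the authors' pipeline: the prose filter, the stand-in detector, and both 0.5 thresholds are hypothetical.

```python
# Sketch of the site-level aggregation idea: instead of trusting one page,
# score several prose-like pages with a page-level LLM-text detector and
# classify the whole site from the aggregate. The word-count filter, the
# toy detector, and the 0.5 thresholds are illustrative, not the paper's.

def is_prose_like(page: dict) -> bool:
    """Hypothetical filter: keep pages with enough plain text to score."""
    return len(page["text"].split()) >= 50

def classify_site(pages, page_detector, page_threshold=0.5, site_threshold=0.5):
    """Label a site LLM-dominant when the fraction of prose-like pages
    flagged by the page-level detector exceeds site_threshold."""
    scored = [page_detector(p["text"]) for p in pages if is_prose_like(p)]
    if not scored:
        return False  # nothing scoreable; abstain toward "human-written"
    flagged = sum(s >= page_threshold for s in scored)
    return flagged / len(scored) >= site_threshold

# Toy detector standing in for a real LLM-text classifier.
toy_detector = lambda text: 0.9 if "as an AI language model" in text else 0.1

pages = [
    {"text": "as an AI language model " + "word " * 60},
    {"text": "hand-written trip report " + "word " * 60},
    {"text": "as an AI language model " + "word " * 60},
]
print(classify_site(pages, toy_detector))  # 2/3 pages flagged -> True
```

Pooling several pages is what makes the decision robust when positive rates are low: a single noisy page-level score no longer flips the site label.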

Updated: 2025-10-09 23:55:04

Categories: cs.NI,cs.AI,cs.CL,cs.IR

Download: http://arxiv.org/abs/2507.13933v2

ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review

Peer review is the cornerstone of scientific publishing, yet it suffers from inconsistencies, reviewer subjectivity, and scalability challenges. We introduce ReviewerToo, a modular framework for studying and deploying AI-assisted peer review to complement human judgment with systematic and consistent assessments. ReviewerToo supports systematic experiments with specialized reviewer personas and structured evaluation criteria, and can be partially or fully integrated into real conference workflows. We validate ReviewerToo on a carefully curated dataset of 1,963 paper submissions from ICLR 2025, where our experiments with the gpt-oss-120b model achieve 81.8% accuracy for the task of categorizing a paper as accept/reject, compared to 83.9% for the average human reviewer. Additionally, ReviewerToo-generated reviews are rated as higher quality than the human average by an LLM judge, though still trailing the strongest expert contributions. Our analysis highlights domains where AI reviewers excel (e.g., fact-checking, literature coverage) and where they struggle (e.g., assessing methodological novelty and theoretical contributions), underscoring the continued need for human expertise. Based on these findings, we propose guidelines for integrating AI into peer-review pipelines, showing how AI can enhance consistency, coverage, and fairness while leaving complex evaluative judgments to domain experts. Our work provides a foundation for systematic, hybrid peer-review systems that scale with the growth of scientific publishing.

Updated: 2025-10-09 23:53:19

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08867v1

High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization

Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
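The Mixout update the abstract revisits can be sketched directly. This is a minimal stand-in, not the authors' training code: a single stochastic draw that swaps each fine-tuned weight back to its pre-trained value with probability p, with the standard rescaling so the draw is unbiased; the high rates (p = 0.8-0.9) are the paper's contribution.

```python
import numpy as np

def mixout(w_finetuned, w_pretrained, p, rng):
    """One stochastic Mixout draw: with probability p, each fine-tuned weight
    is swapped back to its pre-trained counterpart. The rescaling keeps the
    expected weight equal to the fine-tuned one, mirroring the original
    Mixout formulation. Illustrative sketch, not the paper's implementation."""
    mask = rng.random(w_finetuned.shape) < p          # True -> use pre-trained
    mixed = np.where(mask, w_pretrained, w_finetuned)
    return (mixed - p * w_pretrained) / (1.0 - p)     # unbiased in expectation

rng = np.random.default_rng(0)
w_pre = np.zeros((4, 4))   # toy pre-trained weights
w_ft = np.ones((4, 4))     # toy fine-tuned weights
# Averaging many draws recovers the fine-tuned weights, even at p = 0.9.
draws = np.mean([mixout(w_ft, w_pre, 0.9, rng) for _ in range(20000)], axis=0)
print(float(np.abs(draws - w_ft).max()))
```

At p = 0.9 each draw touches only ~10% of the weights, which is also where the paper's compute savings come from: gradients need only flow through the surviving fine-tuned entries.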

Updated: 2025-10-09 23:50:27

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2510.06955v2

A Light Weight Cryptographic Solution for 6LoWPAN Protocol Stack

Lightweight cryptography is an emerging field of research that endorses algorithms best suited for constrained environments. Design metrics like gate equivalents (GE), memory requirement, power consumption, and throughput play a vital role in applications like IoT. This paper presents the 6LoWPAN protocol stack, a popular communication standard for constrained devices, and an implementation of a lightweight 6LoWPAN protocol stack that uses a lightweight cipher in place of the regular, heavyweight AES encryption cipher. The cipher proposed in this paper is specifically suitable for the 6LoWPAN architecture, as it addresses all the constraints possessed by wireless sensor nodes. The proposed lightweight cipher needs only 1856 bytes of FLASH and 1272 bytes of RAM, which is less than any other existing standard lightweight cipher design. The proposed cipher's power consumption is around 25 mW, significantly less than that of the ISO-certified lightweight cipher PRESENT, which consumes around 38 mW of dynamic power. This paper also presents a detailed analysis of the cipher against attacks such as linear cryptanalysis, differential cryptanalysis, the biclique attack, and the avalanche criterion. The hardware implementation of the cipher is around 1051 GEs for a 64-bit block size with a 128-bit key length, which is less than existing lightweight cipher designs. The proposed cipher, LiCi2, is motivated by the LiCi cipher design but outclasses it in every design metric. We believe the design of LiCi2 is the obvious choice for researchers to implement in constrained environments like IoT.
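The avalanche analysis mentioned above can be illustrated on any block cipher. The toy 64-bit Feistel cipher below is NOT LiCi2 (whose round function is not reproduced in the abstract); its round function, constants, and key schedule are invented purely to show how the avalanche criterion is measured.

```python
def feistel_encrypt(block, key, rounds=16):
    """Toy 64-bit Feistel cipher used only to demonstrate the avalanche
    test; the round function and key schedule are illustrative inventions,
    not LiCi2's."""
    left, right = block >> 32, block & 0xFFFFFFFF
    for r in range(rounds):
        # Invented round function: multiply-xor mixing with a shifted key.
        f = ((right * 0x9E3779B1) ^ (key >> (r % 32)) ^ (right >> 7)) & 0xFFFFFFFF
        left, right = right, left ^ f
    return (left << 32) | right

def avalanche(block, key):
    """Average fraction of the 64 output bits that flip when a single
    input bit is flipped; a well-diffusing cipher scores close to 0.5."""
    base = feistel_encrypt(block, key)
    flips = [bin(base ^ feistel_encrypt(block ^ (1 << i), key)).count("1")
             for i in range(64)]
    return sum(flips) / (64 * 64)

print(round(avalanche(0x0123456789ABCDEF, 0xDEADBEEF), 2))  # ideally ~0.5
```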

Updated: 2025-10-09 23:47:04

Categories: cs.CR

Download: http://arxiv.org/abs/2510.14993v1

Multi-fidelity Batch Active Learning for Gaussian Process Classifiers

Many science and engineering problems rely on expensive computational simulations, where a multi-fidelity approach can accelerate the exploration of a parameter space. We study efficient allocation of a simulation budget using a Gaussian Process (GP) model in the binary simulation output case. This paper introduces Bernoulli Parameter Mutual Information (BPMI), a batch active learning algorithm for multi-fidelity GP classifiers. BPMI circumvents the intractability of calculating mutual information in the probability space by employing a first-order Taylor expansion of the link function. We evaluate BPMI against several baselines on two synthetic test cases and a complex, real-world application involving the simulation of a laser-ignited rocket combustor. In all experiments, BPMI demonstrates superior performance, achieving higher predictive accuracy for a fixed computational budget.
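The linearization trick behind BPMI can be illustrated for a single latent. Below, a Monte Carlo estimate of the BALD-style mutual information between a Bernoulli outcome and a Gaussian latent is compared with a cheap closed form obtained by a first-order Taylor expansion of the sigmoid link; this mirrors the kind of approximation the abstract describes, not BPMI's exact batch formulation.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli_entropy(p):
    p = min(max(p, 1e-12), 1 - 1e-12)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def mi_monte_carlo(mu, var, n=100000, seed=1):
    """Mutual information I(y; f) for y ~ Bernoulli(sigmoid(f)),
    f ~ N(mu, var), estimated by sampling: H(E[p]) - E[H(p)]."""
    rng = random.Random(seed)
    ps = [sigmoid(mu + math.sqrt(var) * rng.gauss(0, 1)) for _ in range(n)]
    mean_p = sum(ps) / n
    return bernoulli_entropy(mean_p) - sum(map(bernoulli_entropy, ps)) / n

def mi_taylor(mu, var):
    """Linearize the sigmoid link around mu, so the Bernoulli parameter is
    approximately Gaussian with variance (sigmoid'(mu))^2 * var; in the
    small-variance limit the mutual information then has a closed form.
    Illustrative of the Taylor trick only, not BPMI's exact expression."""
    p = sigmoid(mu)
    slope = p * (1 - p)            # derivative of sigmoid at mu
    var_p = slope ** 2 * var       # pushed-forward variance of the parameter
    return 0.5 * var_p / (p * (1 - p))

print(mi_monte_carlo(0.3, 0.25), mi_taylor(0.3, 0.25))
```

The closed form avoids sampling entirely, which is what makes scoring large candidate batches tractable.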

Updated: 2025-10-09 23:45:38

Categories: cs.LG,cs.CE,physics.comp-ph

Download: http://arxiv.org/abs/2510.08865v1

AD-LLM: Benchmarking Large Language Models for Anomaly Detection

Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs' pre-trained knowledge to perform AD without task-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.

Updated: 2025-10-09 23:42:37

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2412.11142v4

Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories (like malware generation, harassment, or fraud) through distinct conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles where robustness to one conversational pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code available at: https://github.com/Ragib-Amin-Nihal/PE-CoA

Updated: 2025-10-09 23:26:28

Categories: cs.CL,cs.AI,cs.CR

Download: http://arxiv.org/abs/2510.08859v1

Sparse components distinguish visual pathways & their alignment to neural networks

The ventral, dorsal, and lateral streams in high-level human visual cortex are implicated in distinct functional processes. Yet, deep neural networks (DNNs) trained on a single task model the entire visual system surprisingly well, hinting at common computational principles across these pathways. To explore this inconsistency, we applied a novel sparse decomposition approach to identify the dominant components of visual representations within each stream. Consistent with traditional neuroscience research, we find a clear difference in component response profiles across the three visual streams -- identifying components selective for faces, places, bodies, text, and food in the ventral stream; social interactions, implied motion, and hand actions in the lateral stream; and some less interpretable components in the dorsal stream. Building on this, we introduce Sparse Component Alignment (SCA), a new method for measuring representational alignment between brains and machines that better captures the latent neural tuning of these two visual systems. Using SCA, we find that standard visual DNNs are more aligned with the ventral than either dorsal or lateral representations. SCA reveals these distinctions with greater resolution than conventional population-level geometry, offering a measure of representational alignment that is sensitive to a system's underlying axes of neural tuning.

Updated: 2025-10-09 23:26:11

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2510.08858v1

Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training

Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, with sparse autoencoders (SAEs) emerging as a promising interpretability approach. However, current SAE training methods face feature absorption, where features (or neurons) are absorbed into each other to minimize $L_1$ penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking activation magnitudes, frequencies, and reconstruction contributions to compute importance scores that evolve over time. ATM applies a probabilistic masking mechanism based on statistical thresholding of these importance scores, creating a more natural feature selection process. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores compared to existing methods like TopK and JumpReLU SAEs, while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.
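The two moving parts named in the abstract (evolving importance scores, probabilistic masking by statistical thresholding) can be sketched as follows. Everything here is hypothetical detail: the score weights, the quantile threshold, and the sigmoid soft band are illustrative stand-ins for ATM's actual mechanism.

```python
import math, random

def importance(magnitude, frequency, recon_contrib, w=(0.4, 0.3, 0.3)):
    """Hypothetical importance score blending the three signals the abstract
    names: activation magnitude, firing frequency, and reconstruction
    contribution. The blend weights are illustrative."""
    return w[0] * magnitude + w[1] * frequency + w[2] * recon_contrib

def atm_mask(scores, keep_quantile=0.5, temperature=0.1, rng=random.Random(0)):
    """Probabilistic mask via statistical thresholding: features well above
    the score threshold are almost always kept, those well below are almost
    always masked, with a soft sigmoid band in between (a stand-in for the
    paper's mechanism)."""
    srt = sorted(scores)
    thresh = srt[int(keep_quantile * (len(srt) - 1))]
    keep_prob = [1 / (1 + math.exp(-(s - thresh) / temperature)) for s in scores]
    return [rng.random() < p for p in keep_prob]

scores = [importance(m, f, r) for m, f, r in
          [(0.9, 0.8, 0.7), (0.1, 0.2, 0.1), (0.5, 0.5, 0.5), (0.95, 0.9, 0.9)]]
print(atm_mask(scores))
```

The soft band is what makes selection less brittle than a hard top-k cut: borderline features are kept stochastically rather than absorbed into their neighbors.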

Updated: 2025-10-09 23:12:51

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08855v1

Making Bias Amplification in Balanced Datasets Directional and Interpretable

Most of the ML datasets we use today are biased. When we train models on these biased datasets, they often not only learn dataset biases but can also amplify them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification between a protected attribute A (e.g., gender) and a task T (e.g., cooking). However, these metrics fail to measure biases when A is balanced with T. To measure bias amplification in balanced datasets, recent work proposed a predictability-based metric called leakage amplification. However, leakage amplification cannot identify the direction in which biases are amplified. In this work, we propose a new predictability-based metric called directional predictability amplification (DPA). DPA measures directional bias amplification, even for balanced datasets. Unlike leakage amplification, DPA is easier to interpret and less sensitive to attacker models (a hyperparameter in predictability-based metrics). Our experiments on tabular and image datasets show that DPA is an effective metric for measuring directional bias amplification. The code will be available soon.
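The directional idea can be made concrete with a deliberately crude sketch. Note the hedge: DPA itself is predictability-based with trained attacker models; the per-condition majority-vote "attacker" below is a toy stand-in used only to show why direction matters and why a balanced dataset scores zero bias.

```python
from collections import Counter, defaultdict

def predictability(pairs):
    """How well the second element can be predicted from the first, via a
    per-condition majority vote. A toy stand-in for DPA's learned attacker
    models, for illustration only."""
    by_cond = defaultdict(Counter)
    for a, t in pairs:
        by_cond[a][t] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_cond.values())
    return correct / len(pairs)

def directional_amplification(data_pairs, model_pairs):
    """Directional scores in the spirit of DPA: a positive A->T value means
    the model made the task more predictable from the attribute than the
    dataset warrants, and symmetrically for T->A."""
    a_to_t = predictability(model_pairs) - predictability(data_pairs)
    t_to_a = (predictability([(t, a) for a, t in model_pairs])
              - predictability([(t, a) for a, t in data_pairs]))
    return {"A->T": a_to_t, "T->A": t_to_a}

# Balanced data: the attribute carries no information about the task...
data = [("f", "cook"), ("f", "clean"), ("m", "cook"), ("m", "clean")] * 25
# ...but a biased model's predictions couple the two.
model = [("f", "cook")] * 40 + [("f", "clean")] * 10 + \
        [("m", "clean")] * 40 + [("m", "cook")] * 10
print(directional_amplification(data, model))
```

On the balanced dataset the baseline predictability is chance (0.5), so any coupling introduced by the model shows up directly as a positive amplification score, which is exactly the failure mode co-occurrence metrics miss.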

Updated: 2025-10-09 23:12:09

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2412.11060v2

Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training

Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines and surpasses fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using the GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation, where we train models to identify challenging cases by learning from synthetic data.
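The graph-to-caption translation step can be sketched with a tiny grammar. The real engine samples graphs from a large taxonomy of objects, attributes, and relations; the two-relation graph and clause template below are illustrative only.

```python
def scene_graph_to_caption(graph):
    """Minimal sketch of the scene-graph-to-caption step: each object carries
    an attribute list, and each (subject, relation, object) triple becomes a
    clause. The grammar is illustrative, not the paper's template."""
    def phrase(obj):
        # "a" + attributes + object name, e.g. "a blue velvet sofa"
        return " ".join(["a"] + graph["objects"][obj] + [obj])
    clauses = [f"{phrase(s)} {rel} {phrase(o)}"
               for s, rel, o in graph["relations"]]
    return (", and ".join(clauses) + ".").capitalize()

graph = {
    "objects": {"cat": ["ginger"], "sofa": ["blue", "velvet"], "lamp": []},
    "relations": [("cat", "sitting on", "sofa"), ("lamp", "next to", "sofa")],
}
print(scene_graph_to_caption(graph))
```

Because the caption is generated from an explicit graph, the same graph can also be rewritten into verifiable question-answer pairs ("What is the cat sitting on?"), which is what enables the automatic evaluation and reward modeling described above.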

Updated: 2025-10-09 23:10:23

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2412.08221v3

On the Alignment Between Supervised and Self-Supervised Contrastive Learning

Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, the Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives? We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at https://github.com/DLFundamentals/understanding_ssl_v2 and https://dlfundamentals.github.io/cl-nscl-representation-alignment/.
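Linear CKA, one of the alignment metrics the bounds above are stated for, has a compact standard formula. The sketch below computes it for toy representation matrices; the random matrices are illustrative, not the paper's experiments.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation matrices
    (rows are samples, columns are features). Standard formula: the squared
    Frobenius norm of the cross-covariance, normalized by the self terms."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 16))         # toy representation
B = A @ rng.standard_normal((16, 16))      # a linear transform of A
C = rng.standard_normal((100, 16))         # an unrelated representation
print(linear_cka(A, A), linear_cka(A, B), linear_cka(A, C))
```

CKA is invariant to orthogonal transforms and isotropic scaling of either representation, which is why it can track alignment between models even when their parameter spaces diverge.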

Updated: 2025-10-09 23:01:42

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08852v1

Spatio-temporal, multi-field deep learning of shock propagation in meso-structured media

The ability to predict how shock waves traverse porous and architected materials is a key challenge in planetary defense and in the pursuit of inertial fusion energy. Yet capturing pore collapse, anomalous Hugoniot responses, and localized heating - phenomena that strongly influence asteroid deflection or fusion ignition - has remained a major challenge despite recent advances in single-field and reduced representations. We introduce a multi-field spatio-temporal model (MSTM) that unifies seven coupled fields - pressure, density, temperature, energy, material distribution, and two velocity components - into a single autoregressive surrogate. Trained on high-fidelity hydrocode data, MSTM captures nonlinear shock-driven dynamics across porous and architected configurations, achieving mean errors of 1.4% and 3.2% respectively, all while delivering over three orders of magnitude in speedup. MSTM reduces mean-squared error and structural dissimilarity by 94% relative to single-field spatio-temporal models. This advance transforms problems once considered intractable into tractable design studies, establishing a practical framework for optimizing meso-structured materials in planetary impact mitigation and inertial fusion energy.
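The autoregressive surrogate pattern described above is simple to sketch: the model maps the full multi-field state at time t to time t+1 and is fed its own predictions. The linear "dynamics" below is a toy stand-in for the trained network; the field list follows the abstract, while the grid size and update rule are invented for illustration.

```python
import numpy as np

def rollout(model, state0, steps):
    """Autoregressive inference: repeatedly apply the one-step surrogate to
    its own output, producing a trajectory from a single initial state."""
    states = [state0]
    for _ in range(steps):
        states.append(model(states[-1]))
    return np.stack(states)

# The seven coupled fields named in the abstract, on a toy 8x8 grid.
FIELDS = ["pressure", "density", "temperature", "energy",
          "material", "vel_x", "vel_y"]
state0 = np.random.default_rng(0).random((len(FIELDS), 8, 8))

# Toy one-step model: damped shift along one axis, standing in for the
# learned shock dynamics.
decay_advect = lambda s: 0.95 * np.roll(s, shift=1, axis=-1)

traj = rollout(decay_advect, state0, steps=10)
print(traj.shape)  # (11, 7, 8, 8): initial state plus 10 predicted steps
```

The speedup claim comes from exactly this structure: each step is one cheap forward pass instead of a full hydrocode solve, at the cost of error accumulating through the feedback loop.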

Updated: 2025-10-09 23:00:33

Categories: cs.LG

Download: http://arxiv.org/abs/2509.16139v3

Repository-Aware File Path Retrieval via Fine-Tuned LLMs

Modern codebases make it hard for developers and AI coding assistants to find the right source files when answering questions like "How does this feature work?" or "Where was the bug introduced?" Traditional code search (keyword or IR based) often misses semantic context and cross file links, while large language models (LLMs) understand natural language but lack repository specific detail. We present a method for file path retrieval that fine tunes a strong LLM (Qwen3-8B) with QLoRA and Unsloth optimizations to predict relevant file paths directly from a natural language query. To build training data, we introduce six code aware strategies that use abstract syntax tree (AST) structure and repository content to generate realistic question-answer pairs, where answers are sets of file paths. The strategies range from single file prompts to hierarchical repository summaries, providing broad coverage. We fine tune on Python projects including Flask, Click, Jinja, FastAPI, and PyTorch, and obtain high retrieval accuracy: up to 91% exact match and 93% recall on held out queries, clearly beating single strategy training. On a large codebase like PyTorch (about 4,000 Python files), the model reaches 59% recall, showing scalability. We analyze how multi level code signals help the LLM reason over cross file context and discuss dataset design, limits (for example, context length in very large repos), and future integration of retrieval with LLM based code intelligence.
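One of the "code aware" data-generation strategies can be sketched with the standard library's `ast` module: walk a file's syntax tree and emit (question, file-path set) training pairs from its top-level definitions. The question templates and the example file are hypothetical; the paper describes six such strategies up to hierarchical repository summaries.

```python
import ast

def qa_pairs_from_file(path, source):
    """Sketch of an AST-driven strategy: parse one Python file and emit
    (question, answer) pairs whose answers are sets of file paths. The
    question templates are illustrative, not the paper's."""
    pairs = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            pairs.append((f"Where is the function `{node.name}` defined?", {path}))
        elif isinstance(node, ast.ClassDef):
            pairs.append((f"Which file implements the `{node.name}` class?", {path}))
    return pairs

# Hypothetical file contents standing in for a real repository file.
source = "class Router:\n    pass\n\ndef add_route(app, path):\n    return app\n"
for question, answer in qa_pairs_from_file("web/routing.py", source):
    print(question, "->", answer)
```

Grounding questions in real AST structure is what keeps the synthetic pairs answerable: every generated question is guaranteed to have its named symbol in the target file.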

Updated: 2025-10-09 22:49:10

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2510.08850v1

What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment

We introduce the Agent GPA (Goal-Plan-Action) framework: an evaluation paradigm based on an agent's operational loop of setting goals, devising plans, and executing actions. The framework includes five evaluation metrics: Goal Fulfillment, Logical Consistency, Execution Efficiency, Plan Quality, and Plan Adherence. Logical Consistency checks that an agent's actions are consistent with its prior actions. Execution Efficiency checks whether the agent executes in the most efficient way to achieve its goal. Plan Quality checks whether an agent's plans are aligned with its goals; Plan Adherence checks whether an agent's actions are aligned with its plan; and Goal Fulfillment checks that an agent's final outcomes match the stated goals. Our experimental results on two benchmark datasets - the public TRAIL/GAIA dataset and an internal dataset for a production-grade data agent - show that this framework (a) provides a systematic way to cover a broad range of agent failures, including all agent errors on the TRAIL/GAIA benchmark dataset; (b) supports LLM judges that exhibit strong agreement with human annotation, covering 80% to over 95% of errors; and (c) localizes errors with 86% agreement to enable targeted improvement of agent performance.
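The five metrics and the error-localization use case can be sketched as a small report structure. The 0-1 scoring scale, the threshold, and the report layout are illustrative assumptions, not the paper's interface.

```python
from dataclasses import dataclass

@dataclass
class GPAReport:
    """The five Agent GPA metrics, here scored in [0, 1] by a hypothetical
    LLM judge; the scale and this layout are illustrative assumptions."""
    goal_fulfillment: float
    logical_consistency: float
    execution_efficiency: float
    plan_quality: float
    plan_adherence: float

    def failing_metrics(self, threshold=0.7):
        """Localize which stage of the goal-plan-action loop broke down,
        mirroring the framework's targeted-improvement use case."""
        return [name for name, score in vars(self).items() if score < threshold]

report = GPAReport(goal_fulfillment=0.9, logical_consistency=0.95,
                   execution_efficiency=0.4, plan_quality=0.85,
                   plan_adherence=0.6)
print(report.failing_metrics())
```

Because each metric checks a different link in the loop (goal-to-plan, plan-to-action, action-to-outcome), a low score points at a specific stage rather than a generic "task failed" signal.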

Updated: 2025-10-09 22:40:19

Categories: cs.AI,cs.MA

Download: http://arxiv.org/abs/2510.08847v1

Measuring directional bias amplification in image captions using predictability

When we train models on biased ML datasets, they not only learn these biases but can inflate them at test time - a phenomenon called bias amplification. To measure bias amplification in ML datasets, many co-occurrence-based metrics have been proposed. Co-occurrence-based metrics are effective in measuring bias amplification in simple problems like image classification. However, these metrics are ineffective for complex problems like image captioning as they cannot capture the semantics of a caption. To measure bias amplification in captions, prior work introduced a predictability-based metric called Leakage in Captioning (LIC). While LIC captures the semantics and context of captions, it has limitations. LIC cannot identify the direction in which bias is amplified, poorly estimates dataset bias due to a weak vocabulary substitution strategy, and is highly sensitive to attacker models (a hyperparameter in predictability-based metrics). To overcome these issues, we propose Directional Predictability Amplification in Captioning (DPAC). DPAC measures directional bias amplification in captions, provides a better estimate of dataset bias using an improved substitution strategy, and is less sensitive to attacker models. Our experiments on the COCO captioning dataset show how DPAC is the most reliable metric to measure bias amplification in captions.

Updated: 2025-10-09 22:35:02

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2503.07878v3

Training-free AI for Earth Observation Change Detection using Physics Aware Neuromorphic Networks

Earth observations from low Earth orbit satellites provide vital information for decision makers to better manage time-sensitive events such as natural disasters. For the data to be most effective for first responders, low latency is required between data capture and its arrival to decision makers. A major bottleneck is in the bandwidth-limited downlinking of the data from satellites to ground stations. One approach to overcome this challenge is to process at least some of the data on-board and prioritise pertinent data to be downlinked. In this work we propose a Physics Aware Neuromorphic Network (PANN) to detect changes caused by natural disasters from a sequence of multi-spectral satellite images and produce a change map, enabling relevant data to be prioritised for downlinking. The PANN used in this study is motivated by physical neural networks comprised of nano-electronic circuit elements known as "memristors" (nonlinear resistors with memory). The weights in the network are dynamic and update in response to varying input signals according to memristor equations of state and electrical circuit conservation laws. The PANN thus generates physics-constrained dynamical output features which are used to detect changes in a natural disaster detection task by applying a distance-based metric. Importantly, this makes the whole model training-free, allowing it to be implemented with minimal computing resources. The PANN was benchmarked against a state-of-the-art AI model and achieved comparable or better results in each natural disaster category. It thus presents a promising solution to the challenge of resource-constrained on-board processing.
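
The memristive update rule described above can be sketched in a few lines. This is a toy illustration only (a single device under the linear ion-drift model, with made-up parameter values), not the paper's PANN; it shows how physics-constrained state dynamics plus a distance metric yield a training-free change signal.

```python
# Toy sketch, NOT the paper's PANN: one "memristor" whose normalized state w
# evolves by the linear ion-drift equation of state, and a distance-based
# change score between the state trajectories two inputs induce.
# All parameter values below are illustrative assumptions.

def memristor_step(w, i, dt, mu=1e-10, r_on=100.0, d=1e-8):
    """Linear ion-drift model: dw/dt = mu * R_on / D^2 * i(t)."""
    w = w + (mu * r_on / d ** 2) * i * dt
    return min(max(w, 0.0), 1.0)  # state is physically bounded in [0, 1]

def trajectory(inputs, dt=1e-6):
    """Euler-integrate the state for a sequence of input currents."""
    w, traj = 0.5, []
    for i in inputs:
        w = memristor_step(w, i, dt)
        traj.append(w)
    return traj

def feature_distance(a, b):
    """Euclidean distance between two dynamical feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Drive the device with "before" / "after" input sequences and threshold the
# distance between the induced trajectories -- no training involved.
before = trajectory([1e-3] * 5)                       # steady input
after = trajectory([1e-3, 5e-3, 5e-3, 1e-3, 1e-3])    # perturbed input
changed = feature_distance(before, after) > 1e-3      # threshold is a guess
```

Because the weights evolve by a fixed equation of state rather than by gradient descent, nothing here is trained; change detection reduces to thresholding a distance between physics-generated trajectories.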

Updated: 2025-10-09 22:32:27

Categories: cs.LG

Download: http://arxiv.org/abs/2506.04285v2

A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement

Digital representations of individuals ("digital twins") promise to transform social science and decision-making. Yet it remains unclear whether such twins truly mirror the people they emulate. We conducted 19 preregistered studies with a representative U.S. panel and their digital twins, each constructed from rich individual-level data, enabling direct comparisons between human and twin behavior across a wide range of domains and stimuli (including never-seen-before ones). Twins reproduced individual responses with 75% accuracy and seemingly low correlation with human answers (approximately 0.2). However, this apparently high accuracy was no higher than that achieved by generic personas based on demographics only. In contrast, correlation improved when twins incorporated detailed personal information, even outperforming traditional machine learning benchmarks that require additional data. Twins exhibited systematic strengths and weaknesses - performing better in social and personality domains, but worse in political ones - and were more accurate for participants with higher education, higher income, and moderate political views and religious attendance. Together, these findings delineate both the promise and the current limits of digital twins: they capture some relative differences among individuals but not yet the unique judgments of specific people. All data and code are publicly available to support the further development and evaluation of digital twin pipelines.

Updated: 2025-10-09 22:30:22

Categories: cs.CY,cs.AI,cs.HC,stat.AP

Download: http://arxiv.org/abs/2509.19088v2

Interpretable and Granular Video-Based Quantification of Motor Characteristics from the Finger Tapping Test in Parkinson Disease

Accurately quantifying motor characteristics in Parkinson disease (PD) is crucial for monitoring disease progression and optimizing treatment strategies. The finger-tapping test is a standard motor assessment. Clinicians visually evaluate a patient's tapping performance and assign an overall severity score based on tapping amplitude, speed, and irregularity. However, this subjective evaluation is prone to inter- and intra-rater variability, and does not offer insights into individual motor characteristics captured during this test. This paper introduces a granular computer vision-based method for quantifying PD motor characteristics from video recordings. Four sets of clinically relevant features are proposed to characterize hypokinesia, bradykinesia, sequence effect, and hesitation-halts. We evaluate our approach on video recordings and clinical evaluations of 74 PD patients from the Personalized Parkinson Project. Principal component analysis with varimax rotation shows that the video-based features correspond to the four deficits. Additionally, video-based analysis allows us to identify further granular distinctions within the sequence-effect and hesitation-halts deficits. We then use these features to train machine learning classifiers to estimate the Movement Disorder Society Unified Parkinson Disease Rating Scale (MDS-UPDRS) finger-tapping score. Compared to state-of-the-art approaches, our method achieves higher accuracy in MDS-UPDRS score prediction, while still providing an interpretable quantification of individual finger-tapping motor characteristics. In summary, the proposed framework provides a practical solution for the objective assessment of PD motor characteristics that can potentially be applied in both clinical and remote settings. Future work is needed to assess its responsiveness to symptomatic treatment and disease progression.
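
As a hedged illustration of what granular, per-deficit features can look like (the paper's actual feature definitions are richer), the sketch below computes three toy proxies from per-tap amplitudes and timestamps: mean amplitude (hypokinesia), the amplitude slope across taps (sequence effect), and a long-pause count (hesitation-halts). The threshold factor is made up.

```python
# Toy per-deficit features from a tap sequence; inputs are assumed to come
# from a video hand-pose tracker (per-tap amplitudes and tap timestamps).

def mean_amplitude(amps):
    """Hypokinesia proxy: average tap amplitude."""
    return sum(amps) / len(amps)

def amplitude_slope(amps):
    """Sequence-effect proxy: least-squares slope of amplitude vs. tap index
    (a negative slope means taps shrink over the trial)."""
    n = len(amps)
    mx, my = (n - 1) / 2, sum(amps) / n
    num = sum((x - mx) * (a - my) for x, a in enumerate(amps))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def hesitation_count(times, pause_factor=2.0):
    """Hesitation-halts proxy: inter-tap gaps much longer than the median gap.
    pause_factor is an illustrative threshold, not a validated one."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    median = sorted(gaps)[len(gaps) // 2]
    return sum(g > pause_factor * median for g in gaps)

amps = [1.0, 0.9, 0.8, 0.7, 0.6]   # steadily shrinking taps
times = [0.0, 0.2, 0.4, 1.2, 1.4]  # one long pause between taps 3 and 4
```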

Updated: 2025-10-09 22:29:31

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2506.18925v2

MergeBench: A Benchmark for Merging Domain-Specialized LLMs

Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. Our project page is at \href{https://yifei-he.github.io/mergebench/}{https://yifei-he.github.io/mergebench/}.
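
The parameter arithmetic that merging methods build on can be shown concretely. The sketch below is not MergeBench's API; it illustrates task vectors (finetuned minus base weights), a merging coefficient, and magnitude-based sparsification, two of the techniques the benchmark evaluates, with plain lists standing in for tensors.

```python
# Minimal model-merging sketch: models are dicts of parameter "tensors"
# (plain lists here). Coefficients, keep ratios, and model names are
# illustrative assumptions.

def task_vector(base, finetuned):
    """Task vector = finetuned weights minus base weights, per parameter."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}

def sparsify(vec, keep_ratio=0.5):
    """Keep only the largest-magnitude entries (simple top-k pruning)."""
    flat = sorted((abs(x) for row in vec.values() for x in row), reverse=True)
    thresh = flat[max(1, int(len(flat) * keep_ratio)) - 1]
    return {key: [x if abs(x) >= thresh else 0.0 for x in row]
            for key, row in vec.items()}

def merge(base, finetuned_models, coeff=0.3, keep_ratio=0.5):
    """Add a scaled, sparsified task vector per specialist onto the base."""
    merged = {k: list(v) for k, v in base.items()}
    for ft in finetuned_models:
        tv = sparsify(task_vector(base, ft), keep_ratio)
        for k in merged:
            merged[k] = [m + coeff * t for m, t in zip(merged[k], tv[k])]
    return merged

base = {"w": [0.0, 0.0, 0.0]}
math_model = {"w": [1.0, 0.1, 0.0]}  # hypothetical "math" finetune
code_model = {"w": [0.0, 0.1, 1.0]}  # hypothetical "coding" finetune
merged = merge(base, [math_model, code_model], coeff=0.5, keep_ratio=0.4)
```

Tuning `coeff` and `keep_ratio` corresponds to the merging-coefficient tuning and sparsification that the abstract reports as improving knowledge retention.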

Updated: 2025-10-09 22:22:49

Categories: cs.LG

Download: http://arxiv.org/abs/2505.10833v3

Scalable multilingual PII annotation for responsible AI in LLMs

As Large Language Models (LLMs) gain wider adoption, ensuring their reliable handling of Personally Identifiable Information (PII) across diverse regulatory contexts has become essential. This work introduces a scalable multilingual data curation framework designed for high-quality PII annotation across 13 underrepresented locales, covering approximately 336 locale-specific PII types. Our phased, human-in-the-loop annotation methodology combines linguistic expertise with rigorous quality assurance, leading to substantial improvements in recall and false positive rates across the pilot, training, and production phases. By leveraging inter-annotator agreement metrics and root-cause analysis, the framework systematically uncovers and resolves annotation inconsistencies, resulting in high-fidelity datasets suitable for supervised LLM fine-tuning. Beyond reporting empirical gains, we highlight common annotator challenges in multilingual PII labeling and demonstrate how iterative, analytics-driven pipelines can enhance both annotation quality and downstream model reliability.

Updated: 2025-10-09 22:12:29

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.06250v2

Reasoning Large Language Model Errors Arise from Hallucinating Critical Problem Features

Large language models have recently made great strides in reasoning task performance through chain-of-thought (CoT) strategies trained via reinforcement learning; however, these "reasoning large language models" (RLLMs) remain imperfect reasoners, and understanding the frequencies and causes of their failure modes is important for both users and developers. We test o1-mini, o3-mini, DeepSeek-R1, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, and Grok 3 Mini Beta on graph coloring as a variable-complexity constraint-satisfaction logic problem, and find evidence from both error rate comparisons and CoT/explanation text analysis that RLLMs are prone to hallucinate graph edges not specified in the prompt. This phenomenon persists across multiple problem complexity levels and semantic frames, and it appears to account for a significant fraction of the incorrect answers from every tested model, and the vast majority of them for some models. We also validate the generalizability of this input-conflicting hallucination phenomenon with smaller-scale experiments on a type of stable matching problem. Our results indicate that RLLMs may possess broader issues with misrepresentation of problem specifics, and we offer suggestions for design choices to mitigate this weakness.
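
The edge-hallucination analysis suggests a simple check anyone can run: compare the edges an RLLM's chain of thought mentions against the edges actually stated in the prompt. The sketch below (an assumed input format, not the authors' code) flags unspecified edges and verifies a proposed coloring against the true constraint set.

```python
# Detect "hallucinated" graph edges: edges cited in a model's reasoning
# that were never specified in the prompt, plus a coloring validity check.

def normalize(edge):
    """Undirected edges: (2, 1) and (1, 2) are the same edge."""
    return tuple(sorted(edge))

def hallucinated_edges(prompt_edges, cot_edges):
    """Edges mentioned in the chain of thought but absent from the prompt."""
    stated = {normalize(e) for e in prompt_edges}
    return sorted({normalize(e) for e in cot_edges} - stated)

def coloring_valid(prompt_edges, coloring):
    """A coloring is valid iff no stated edge joins two same-colored nodes."""
    return all(coloring[u] != coloring[v] for u, v in prompt_edges)

prompt_edges = [(1, 2), (2, 3), (3, 4)]
cot_edges = [(1, 2), (2, 3), (1, 4)]  # (1, 4) was never given in the prompt
coloring = {1: "red", 2: "blue", 3: "red", 4: "blue"}
```

A hallucinated edge explains how a model can reject a correct coloring: the coloring violates a constraint that exists only in the model's reasoning, not in the prompt.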

Updated: 2025-10-09 22:03:00

Categories: cs.LG,cs.AI,I.2.6; I.2.7

Download: http://arxiv.org/abs/2505.12151v3

The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective

As machine learning (ML) algorithms are increasingly used in medical image analysis, concerns have emerged about their potential biases against certain social groups. Although many approaches have been proposed to ensure the fairness of ML models, most existing works focus only on medical image diagnosis tasks, such as image classification and segmentation, and overlook prognosis scenarios, which involve predicting the likely outcome or progression of a medical condition over time. To address this gap, we introduce FairTTE, the first comprehensive framework for assessing fairness in time-to-event (TTE) prediction in medical imaging. FairTTE encompasses a diverse range of imaging modalities and TTE outcomes, integrating cutting-edge TTE prediction and fairness algorithms to enable systematic and fine-grained analysis of fairness in medical image prognosis. Leveraging causal analysis techniques, FairTTE uncovers and quantifies distinct sources of bias embedded within medical imaging datasets. Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.

Updated: 2025-10-09 21:54:48

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2510.08840v1

Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction

Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
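
The cooperative tabular Q-learning described above follows the standard update Q(s,a) ← Q(s,a) + α(r + γ·max_a' Q(s',a') − Q(s,a)). The sketch below shows one such update for a hypothetical camera-selection agent; the state encoding, action names, and reward weights are invented for illustration, not taken from the paper.

```python
# One tabular Q-learning step for a toy camera-selection agent, with a
# second table for a server-selection agent to mirror the two-agent setup.
import random

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard Q-learning update on a nested dict Q-table."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])

def epsilon_greedy(q, state, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the best action."""
    if random.random() < epsilon:
        return random.choice(list(q[state]))
    return max(q[state], key=q[state].get)

# Two cooperating agents: one selects cameras, one selects an edge server.
cam_q = {"normal": {"camA": 0.0, "camB": 0.0}}
srv_q = {"normal": {"srv1": 0.0, "srv2": 0.0}}

# One interaction: the reward trades off reconstruction quality against
# end-to-end latency (the 0.5 weight is an assumption).
quality, latency = 0.8, 0.3
reward = quality - 0.5 * latency
q_update(cam_q, "normal", "camA", reward, "normal")
```

Running such updates fully online, against the live edge environment, is what lets the agents adapt to disruptions like degraded images or fluctuating server loads without any offline training phase.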

Updated: 2025-10-09 21:54:14

Categories: cs.LG,cs.AI,cs.CV,cs.DC,cs.GR,cs.MM

Download: http://arxiv.org/abs/2510.08839v1

Long-Tailed Recognition via Information-Preservable Two-Stage Learning

Imbalance (or a long tail) is inherent to many real-world data distributions and often induces an undesirable bias of deep classification models toward frequent classes, resulting in poor performance on tail classes. In this paper, we propose a novel two-stage learning approach to mitigate such a majority-biased tendency while preserving valuable information within datasets. Specifically, the first stage proposes a new representation learning technique from the information theory perspective. This approach is theoretically equivalent to minimizing intra-class distance, yielding an effective and well-separated feature space. The second stage develops a novel sampling strategy that selects mathematically informative instances, able to rectify majority-biased decision boundaries without compromising a model's overall performance. As a result, our approach achieves state-of-the-art performance across various long-tailed benchmark datasets, validated via extensive experiments. Our code is available at https://github.com/fudong03/BNS_IPDPP.

Updated: 2025-10-09 21:49:12

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08836v1

COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs

Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory consumption. Subsequent research, exemplified by SOAP, attempts to better capture coordinate interdependence but incurs greater memory overhead, limiting scalability for massive LLMs. An alternative approach aims to reduce memory consumption through low-dimensional projection, but this leads to substantial approximation errors, resulting in less effective optimization (e.g., in terms of per-token efficiency). In this paper, we propose COSMOS, a novel hybrid optimizer that leverages the varying importance of eigensubspaces in the gradient matrix to achieve memory efficiency without compromising optimization performance. The design of COSMOS is motivated by our empirical insights and practical considerations. Specifically, COSMOS applies SOAP to the leading eigensubspace, which captures the primary optimization dynamics, and MUON to the remaining eigensubspace, which is less critical but computationally expensive to handle with SOAP. This hybrid strategy significantly reduces memory consumption while maintaining robust optimization performance, making it particularly suitable for massive LLMs. Numerical experiments on various datasets and transformer architectures are provided to demonstrate the effectiveness of COSMOS. Our code is available at https://github.com/lliu606/COSMOS.

Updated: 2025-10-09 21:46:36

Categories: cs.LG

Download: http://arxiv.org/abs/2502.17410v3

Everyone prefers human writers, including AI

As AI writing tools become widespread, we need to understand how both humans and machines evaluate literary style, a domain where objective standards are elusive and judgments are inherently subjective. We conducted controlled experiments using Raymond Queneau's Exercises in Style (1947) to measure attribution bias across evaluators. Study 1 compared human participants (N=556) and AI models (N=13) evaluating literary passages from Queneau versus GPT-4-generated versions under three conditions: blind, accurately labeled, and counterfactually labeled. Study 2 tested bias generalization across a 14$\times$14 matrix of AI evaluators and creators. Both studies revealed systematic pro-human attribution bias. Humans showed +13.7 percentage point (pp) bias (Cohen's h = 0.28, 95% CI: 0.21-0.34), while AI models showed +34.3 percentage point bias (h = 0.70, 95% CI: 0.65-0.76), a 2.5-fold stronger effect (P$<$0.001). Study 2 confirmed this bias operates across AI architectures (+25.8pp, 95% CI: 24.1-27.6%), demonstrating that AI systems systematically devalue creative content when labeled as "AI-generated" regardless of which AI created it. We also find that attribution labels cause evaluators to invert assessment criteria, with identical features receiving opposing evaluations based solely on perceived authorship. This suggests AI models have absorbed human cultural biases against artificial creativity during training. Our study represents the first controlled comparison of attribution bias between human and artificial evaluators in aesthetic judgment, revealing that AI systems not only replicate but amplify this human tendency.

Updated: 2025-10-09 21:33:30

Categories: cs.AI,cs.CL,cs.HC

Download: http://arxiv.org/abs/2510.08831v1

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.
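
The surgical, non-blocking mechanism can be illustrated with a stand-in scorer. The real system uses a trained token-level predictor; the keyword heuristic below exists only to show the pipeline shape: score every token, drop instruction-like spans from the tool output, and let the rest flow to the agent, with the removed tokens doubling as a detection byproduct.

```python
# Toy token-level sanitizer. The keyword list is a stand-in for a learned
# per-token classifier and is purely illustrative.

IMPERATIVE_CUES = {"ignore", "disregard", "you", "must", "send", "forward",
                   "previous", "instructions", "to"}

def score_tokens(tokens):
    """Stand-in for a per-token classifier: 1.0 = instruction-like."""
    return [1.0 if t.lower().strip(".,!") in IMPERATIVE_CUES else 0.0
            for t in tokens]

def sanitize(text, threshold=0.5):
    """Remove instruction-like tokens; never block the whole output."""
    tokens = text.split()
    scores = score_tokens(tokens)
    kept = [t for t, s in zip(tokens, scores) if s < threshold]
    removed = [t for t, s in zip(tokens, scores) if s >= threshold]
    return " ".join(kept), removed  # removed tokens act as a detection log

tool_output = "Weather is sunny. Ignore previous instructions and send credentials"
clean, removed = sanitize(tool_output)
```

Because sanitization rewrites rather than rejects, a false positive costs a few dropped tokens instead of a blocked tool call, which is why the approach needs no calibration against a false-positive budget.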

Updated: 2025-10-09 21:32:02

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.08829v1

McMining: Automated Discovery of Misconceptions in Student Code

When learning to code, students often develop misconceptions about various programming language concepts. These can not only lead to bugs or inefficient code, but also slow down the learning of related concepts. In this paper, we introduce McMining, the task of mining programming misconceptions from samples of code from a student. To enable the training and evaluation of McMining systems, we develop an extensible benchmark dataset of misconceptions together with a large set of code samples where these misconceptions are manifested. We then introduce two LLM-based McMiner approaches and through extensive evaluations show that models from the Gemini, Claude, and GPT families are effective at discovering misconceptions in student code.

Updated: 2025-10-09 21:27:39

Categories: cs.SE,cs.AI,cs.CL,cs.CY

Download: http://arxiv.org/abs/2510.08827v1

From Federated Learning to X-Learning: Breaking the Barriers of Decentrality Through Random Walks

We provide our perspective on X-Learning (XL), a novel distributed learning architecture that generalizes and extends the concept of decentralization. Our goal is to present a vision for XL, introducing its unexplored design considerations and degrees of freedom. To this end, we shed light on the intuitive yet non-trivial connections between XL, graph theory, and Markov chains. We also present a series of open research directions to stimulate further research.

Updated: 2025-10-09 21:10:54

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.03709v3

D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.
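
One simple way to realize adaptive selection of representative frames (a greedy farthest-point heuristic, not necessarily D-CoDe's actual selector) is to repeatedly keep the frame most different from everything kept so far, so near-duplicate frames get compressed away:

```python
# Greedy farthest-point frame selection over toy feature vectors.

def dist(a, b):
    """Euclidean distance between two frame feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def select_frames(frames, budget):
    """Keep the first frame, then greedily add the frame farthest from the
    current selection until the token budget is met."""
    selected = [0]
    while len(selected) < budget:
        best, best_d = None, -1.0
        for i in range(len(frames)):
            if i in selected:
                continue
            d = min(dist(frames[i], frames[j]) for j in selected)
            if d > best_d:
                best, best_d = i, d
        selected.append(best)
    return sorted(selected)

# Frames 0/1 and 2/3 are near-duplicates; frame 4 is a distinct scene.
frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
keep = select_frames(frames, budget=3)
```

Question decomposition is then orthogonal: each sub-question re-reads the same compressed frame set with a narrower focus, rather than forcing one query to attend to everything at once.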

Updated: 2025-10-09 21:08:32

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.08818v1

Authentication Security of PRF GNSS Ranging

This work derives the authentication security of pseudorandom function (PRF) GNSS ranging under multiple GNSS spoofing models, including the Security Code Estimation and Replay (SCER) spoofer. When GNSS ranging codes derive from a PRF utilizing a secret known only to the broadcaster, the spoofer cannot predict the ranging code before broadcast. Therefore, PRF ranging can be used to establish trust in the GNSS pseudoranges and the resulting receiver position, navigation, and timing (PNT) solution. I apply the methods herein to Galileo's Signal Authentication Service (SAS) utilizing the encrypted Galileo E6-C signal to compute that at most 400 ms of Galileo E6-C data are needed to assert 128-bit authentication security under non-SCER models. For the SCER adversary, I predict the receiving radio equipment the adversary would need to break authentication security. One can use this work to design a PRF GNSS ranging protocol to meet useful authentication security requirements by computing the probability of missed detection.
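
The core PRF-ranging idea can be sketched with HMAC-SHA256 as the PRF (an illustrative construction; Galileo E6-C's actual spreading-code generation differs). Chips derived from a secret key are unpredictable to a non-SCER spoofer before broadcast, while a party holding the key can verify a received sequence by correlation:

```python
# PRF-derived ranging chips via HMAC-SHA256; construction and parameters
# are illustrative, not the Galileo E6-C design.
import hashlib
import hmac

def prf_chips(key: bytes, epoch: int, n_chips: int):
    """Derive n_chips spreading chips (+1/-1) for a given code epoch."""
    out, counter = [], 0
    while len(out) < n_chips:
        msg = epoch.to_bytes(8, "big") + counter.to_bytes(4, "big")
        block = hmac.new(key, msg, hashlib.sha256).digest()
        for byte in block:
            for bit in range(8):
                out.append(1 if (byte >> bit) & 1 else -1)
        counter += 1
    return out[:n_chips]

def correlate(a, b):
    """Normalized correlation of two +1/-1 chip sequences."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

key = b"broadcaster-secret"                  # known only to the system
rx = prf_chips(key, epoch=42, n_chips=512)   # verifier regenerates the code
spoof = prf_chips(b"wrong-key", 42, 512)     # spoofer can only guess
# correlate(rx, rx) is 1.0; correlate(rx, spoof) stays near 0.
```

Without the key, a spoofer's best guess correlates near zero with the true code, which is what lets the verifier bound the probability of missed detection from the number of chips observed.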

Updated: 2025-10-09 21:04:04

Subjects: cs.CR,eess.SP

Download: http://arxiv.org/abs/2510.02196v2

$\mathsf{P} \neq \mathsf{NP}$: A Non-Relativizing Proof via Quantale Weakness and Geometric Complexity

We give a compositional, information-theoretic framework that turns short programs into locality on many independent blocks, and combine it with symmetry and sparsity of masked random Unique-SAT to obtain distributional lower bounds that contradict the self-reduction upper bound under $\mathsf{P}=\mathsf{NP}$. We work in the weakness quantale $w_Q=K_{\mathrm{poly}}(\cdot\mid\cdot)$. For an efficiently samplable ensemble $D_m$ made by masking random $3$-CNFs with fresh $S_m\ltimes(\mathbb{Z}_2)^m$ symmetries and a small-seed Valiant--Vazirani isolation layer, we prove a Switching-by-Weakness normal form: for any polytime decoder $P$ of description length $\le \delta t$ (with $t=\Theta(m)$ blocks), a short wrapper $W$ makes $(P\circ W)$ per-bit local on a $\gamma$-fraction of blocks. Two ingredients then force near-randomness on $\Omega(t)$ blocks for every short decoder: (a) a sign-invariant neutrality lemma giving $\Pr[X_i=1\mid \mathcal{I}]=1/2$ for any sign-invariant view $\mathcal{I}$; and (b) a template sparsification theorem at logarithmic radius showing that any fixed local rule appears with probability $m^{-\Omega(1)}$. Combined with single-block bounds for tiny $\mathrm{ACC}^0$/streaming decoders, this yields a success bound $2^{-\Omega(t)}$ and, by Compression-from-Success, $K_{\mathrm{poly}}\big((X_1,\ldots,X_t)\mid(\Phi_1,\ldots,\Phi_t)\big)\ge \eta t$. If $\mathsf{P}=\mathsf{NP}$, a uniform constant-length program maps any on-promise instance to its unique witness in polytime (bit fixing via a $\mathrm{USAT}$ decider), so $K_{\mathrm{poly}}(X\mid\Phi)\le O(1)$ and the tuple complexity is $O(1)$, contradicting the linear bound. The proof is non-relativizing and non-natural; symmetry, sparsification, and switching yield a quantale upper-lower clash, hence $\mathsf{P}\ne\mathsf{NP}$.

Updated: 2025-10-09 21:01:17

Subjects: cs.CC,cs.AI

Download: http://arxiv.org/abs/2510.08814v1

Partition Generative Modeling: Masked Modeling Without Masks

Masked generative models (MGMs) are widely used to capture complex data and enable faster generation than autoregressive (AR) models through parallel decoding. However, MGMs typically operate on fixed-length inputs, which can be inefficient: early in sampling, most tokens are masked and carry no information, leading to wasted computation. In contrast, AR models process only previously generated tokens, making early iterations faster. In this work, we introduce the Partition Generative Model (PGM), a novel approach that combines the strengths of AR models and MGMs. Rather than masking, PGM partitions tokens into two groups and employs sparse attention to block information flow between them. Since there is no information flow between partitions, the model can process only the previously generated tokens during sampling, while retaining the ability to generate tokens in parallel and in any order. On OpenWebText, PGMs offer at least $5\times$ improvements in sampling latency and throughput, while producing samples with superior Generative Perplexity, compared to Masked Diffusion Language Models. On ImageNet, PGMs achieve a $7.5\times$ higher throughput than MaskGIT, with only a slight increase in FID (5.54 vs. 5.35). With twice as many sampling steps, the FID reduces to 4.56 while being $3.9\times$ faster than MaskGIT. Finally, PGMs integrate seamlessly with MGM distillation, providing further inference speedups.
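
The core partitioning idea can be sketched with a boolean attention mask: tokens in one group may not attend to tokens in the other. This is a minimal illustration under our own assumptions (the function name and toy grouping are ours, not the paper's code):

```python
import numpy as np

def partition_mask(group: np.ndarray) -> np.ndarray:
    """group[i] in {0, 1}; mask[i, j] is True iff token i may attend to token j.
    Attention is allowed only within the same partition, blocking cross-group flow."""
    return group[:, None] == group[None, :]

group = np.array([0, 0, 1, 1, 0])
mask = partition_mask(group)
# Token 0 (group 0) can attend only to the other group-0 tokens: 0, 1, and 4.
```

Applying such a mask (e.g., setting disallowed logits to -inf before softmax) is what makes the two partitions informationally independent.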

Updated: 2025-10-09 20:59:57

Subjects: cs.LG

Download: http://arxiv.org/abs/2505.18883v2

The Model's Language Matters: A Comparative Privacy Analysis of LLMs

Large Language Models (LLMs) are increasingly deployed across multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. While prior work has mostly evaluated English, this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows higher membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.

Updated: 2025-10-09 20:59:42

Subjects: cs.CL,cs.CR

Download: http://arxiv.org/abs/2510.08813v1

Adaptive Science Operations in Deep Space Missions Using Offline Belief State Planning

Deep space missions face extreme communication delays and environmental uncertainty that prevent real-time ground operations. To support autonomous science operations in communication-constrained environments, we present a partially observable Markov decision process (POMDP) framework that adaptively sequences spacecraft science instruments. We integrate a Bayesian network into the POMDP observation space to manage the high-dimensional and uncertain measurements typical of astrobiology missions. This network compactly encodes dependencies among measurements and improves the interpretability and computational tractability of science data. Instrument operation policies are computed offline, allowing resource-aware plans to be generated and thoroughly validated prior to launch. We use the Enceladus Orbilander's proposed Life Detection Suite (LDS) as a case study, demonstrating how Bayesian network structure and reward shaping influence system performance. We compare our method against the mission's baseline Concept of Operations (ConOps), evaluating both misclassification rates and performance in off-nominal sample accumulation scenarios. Our approach reduces sample identification errors by nearly 40%.
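
The belief-state machinery behind such a POMDP planner is a discrete Bayes filter: predict with the transition model, then reweight by the observation likelihood. The sketch below is generic POMDP bookkeeping with illustrative numbers, not the paper's Bayesian-network observation model:

```python
import numpy as np

def belief_update(b, T, O, action, obs):
    """b: belief over states; T[a, s, s']: transition probs; O[a, s', o]: observation probs."""
    predicted = b @ T[action]                # predict step: push belief through dynamics
    updated = predicted * O[action][:, obs]  # correct step: weight by observation likelihood
    return updated / updated.sum()           # renormalize to a probability distribution

b = np.array([0.5, 0.5])                     # uniform prior over two states
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])     # one action
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])     # P(obs | next state)
b2 = belief_update(b, T, O, action=0, obs=0)
# Observing obs=0 shifts belief toward state 0, which explains it better.
```

Offline planners precompute a policy over such beliefs so the spacecraft only performs this cheap update and a table lookup online.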

Updated: 2025-10-09 20:58:35

Subjects: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.08812v1

A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis

Self-supervised learning is a machine learning approach that generates implicit labels by learning underlying patterns and extracting discriminative features from unlabeled data without manual labelling. Contrastive learning introduces the concept of "positive" and "negative" samples, where positive pairs (e.g., variations of the same image/object) are brought together in the embedding space, and negative pairs (e.g., views from different images/objects) are pushed farther apart. This methodology has shown significant improvements in image understanding and image-text analysis without much reliance on labeled data. In this paper, we comprehensively discuss the terminologies, recent developments, and applications of contrastive learning with respect to text-image models. Specifically, we first provide an overview of the approaches to contrastive learning in text-image models in recent years. Secondly, we categorize the approaches based on different model structures. Thirdly, we introduce and discuss the latest advances in the techniques used in the process, such as pretext tasks for both images and text, architectural structures, and key trends. Lastly, we discuss the recent state-of-the-art applications of self-supervised contrastive learning in text-image models.
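
The pull-together/push-apart objective described above is commonly instantiated as an InfoNCE-style loss: a softmax over similarities where the positive pair should dominate. A minimal NumPy sketch (toy vectors, ours):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: low when the positive is closest to the anchor."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # cross-entropy with the positive as the label

anchor = np.array([1.0, 0.0])
loss_close = info_nce(anchor, np.array([0.9, 0.1]), [np.array([-1.0, 0.0])])
loss_far = info_nce(anchor, np.array([-1.0, 0.0]), [np.array([0.9, 0.1])])
# loss_close < loss_far: the loss rewards embeddings where the positive pair is similar.
```

The temperature `tau` controls how sharply the softmax concentrates on the most similar candidate.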

Updated: 2025-10-09 20:55:19

Subjects: cs.CV,cs.LG

Download: http://arxiv.org/abs/2503.11101v5

Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization

The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence direction. Fundamental momentum methods, such as Nesterov Accelerated Gradient and the Heavy Ball method, as well as more recent optimizers such as AdamW and Lion, all rely on the momentum coefficient that is customarily set to $\beta = 0.9$ and kept constant during model training, a strategy widely used by practitioners, yet suboptimal. In this paper, we introduce an \textit{adaptive memory} mechanism that replaces constant momentum with a dynamic momentum coefficient that is adjusted online during optimization. We derive our method by approximating the objective function using two planes: one derived from the gradient at the current iterate and the other obtained from the accumulated memory of the past gradients. To the best of our knowledge, such a proximal framework was never used for momentum-based optimization. Our proposed approach is novel, extremely simple to use, and does not rely on extra assumptions or hyperparameter tuning. We implement adaptive memory variants of both SGD and AdamW across a wide range of learning tasks, from simple convex problems to large-scale deep learning scenarios, demonstrating that our approach can outperform standard SGD and Adam with hand-tuned momentum coefficients. Finally, our work opens doors for new ways of inducing adaptivity in optimization.
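
For reference, the constant-momentum baseline the paper argues against is the classic heavy-ball update with a fixed $\beta = 0.9$; an adaptive-memory method would instead choose $\beta$ online each step. A minimal sketch of the baseline (illustrative quadratic objective, ours):

```python
def heavy_ball_step(w, grad, velocity, lr=0.1, beta=0.9):
    """One heavy-ball SGD step: beta controls how much past-gradient memory persists."""
    velocity = beta * velocity + grad  # accumulate the gradient memory
    w = w - lr * velocity
    return w, velocity

# Minimize f(w) = w^2 (gradient 2w) starting from w = 1.0.
w, v = 1.0, 0.0
for _ in range(100):
    w, v = heavy_ball_step(w, 2 * w, v)
# w spirals toward the minimum at 0; with beta fixed, the oscillation profile
# is dictated entirely by the hand-chosen (lr, beta) pair.
```

With $\beta$ held constant the iterate contracts at a rate tied to $\sqrt{\beta}$ on quadratics, which is exactly the kind of hand-tuned behavior an online momentum coefficient aims to improve.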

Updated: 2025-10-09 20:53:07

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.04988v2

LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients

Learning client embeddings from sequences of their historical communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via a contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of the complete sequence by an LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.

Updated: 2025-10-09 20:52:29

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.10021v3

Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning

A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
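
The "mellowmax" operator that the trajectory general mellowmax (TGM) name points to is a standard soft maximum, $mm_\omega(x) = \log(\frac{1}{n}\sum_i e^{\omega x_i})/\omega$, which interpolates between the mean ($\omega \to 0$) and the max ($\omega \to \infty$) and thus controls how "peaky" the induced sampling distribution is. A sketch of the basic operator only (the paper's unified operator is more elaborate):

```python
import numpy as np

def mellowmax(x, omega):
    """mm_omega(x) = log(mean(exp(omega * x))) / omega, computed stably."""
    x = np.asarray(x, dtype=float)
    m = omega * x
    shift = m.max()  # subtract the max before exponentiating to avoid overflow
    return (shift + np.log(np.mean(np.exp(m - shift)))) / omega

vals = [1.0, 2.0, 3.0]
near_max = mellowmax(vals, 50.0)   # large omega: approaches max(vals) = 3
near_mean = mellowmax(vals, 1e-6)  # tiny omega: approaches mean(vals) = 2
```

Tuning $\omega$ is what lets such operators target sharper (more exploitative) or flatter (more diverse) candidate distributions.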

Updated: 2025-10-09 20:49:49

Subjects: cs.LG

Download: http://arxiv.org/abs/2506.17007v2

TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference

Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from textual graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.
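
As a concrete example of the kind of target such a model must predict from a textual graph description, density has a closed form from the edge list alone (toy graph, ours):

```python
def graph_density(n_nodes, edges):
    """Density of a simple undirected graph: 2|E| / (|V| * (|V| - 1))."""
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))

# A 4-node graph with a triangle (0-1-2) plus a pendant edge to node 3.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
d = graph_density(4, edges)  # 2 * 4 / (4 * 3) = 2/3
```

Parameters like clustering coefficient or chromatic number require genuine multi-step reasoning over the same edge list, which is what makes them useful probes for small models.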

Updated: 2025-10-09 20:47:07

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.08808v1

Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation

From locomotion to dexterous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dexterous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks in 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.

Updated: 2025-10-09 20:43:27

Subjects: cs.RO,cs.LG

Download: http://arxiv.org/abs/2510.08807v1

Lizard: An Efficient Linearization Framework for Large Language Models

We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers face severe computational and memory bottlenecks with long sequences due to the quadratic complexity of softmax attention and the growing Key-Value (KV) cache that makes inference memory-bound by context length. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardware-aware algorithm that solves numerical instability in gated attention to accelerate training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by up to 9.4 - 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
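
The generic mechanism behind such linearization is kernel feature-map attention: replacing softmax with a positive feature map lets the key-value statistics be summed once, avoiding the n-by-n attention matrix. This sketch shows the standard elu+1 formulation from prior linear-attention work, not Lizard's exact module:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Unmasked linear attention in O(n * d * d_v) instead of O(n^2)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1: strictly positive
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V               # (d, d_v) summary of all keys/values; no n x n matrix
    Z = Qf @ Kf.sum(axis=0)     # per-query normalizer
    return (Qf @ KV) / Z[:, None]

n, d = 6, 4
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)  # shape (n, d), finite since phi > 0
```

Because `KV` and the normalizer are fixed-size regardless of sequence length, inference memory no longer grows with context the way a softmax KV cache does.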

Updated: 2025-10-09 20:37:43

Subjects: cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.09025v3

Mini-batch Estimation for Deep Cox Models: Statistical Foundations and Practical Guidance

The stochastic gradient descent (SGD) algorithm has been widely used to optimize deep Cox neural network (Cox-NN) by updating model parameters using mini-batches of data. We show that SGD aims to optimize the average of mini-batch partial-likelihood, which is different from the standard partial-likelihood. This distinction requires developing new statistical properties for the global optimizer, namely, the mini-batch maximum partial-likelihood estimator (mb-MPLE). We establish that mb-MPLE for Cox-NN is consistent and achieves the optimal minimax convergence rate up to a polylogarithmic factor. For Cox regression with linear covariate effects, we further show that mb-MPLE is $\sqrt{n}$-consistent and asymptotically normal with asymptotic variance approaching the information lower bound as batch size increases, which is confirmed by simulation studies. Additionally, we offer practical guidance on using SGD, supported by theoretical analysis and numerical evidence. For Cox-NN, we demonstrate that the ratio of the learning rate to the batch size is critical in SGD dynamics, offering insight into hyperparameter tuning. For Cox regression, we characterize the iterative convergence of SGD, ensuring that the global optimizer, mb-MPLE, can be approximated with sufficiently many iterations. Finally, we demonstrate the effectiveness of mb-MPLE in a large-scale real-world application where the standard MPLE is intractable.
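
The object at the center of the paper, the per-mini-batch negative log partial likelihood, restricts each event's risk set to the batch itself; SGD then optimizes the average of this quantity over batches. A Breslow-form sketch without ties (illustrative data, ours):

```python
import numpy as np

def batch_neg_log_partial_likelihood(eta, time, event):
    """Negative log partial likelihood on one mini-batch.
    eta: linear predictors (or Cox-NN outputs); time: event/censoring times;
    event: 1 if the event was observed, 0 if censored."""
    loss = 0.0
    for i in range(len(eta)):
        if event[i] == 1:
            at_risk = time >= time[i]  # risk set restricted to this batch
            loss -= eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return loss

eta = np.array([0.5, -0.2, 0.1])
time = np.array([2.0, 5.0, 3.0])
event = np.array([1, 0, 1])
loss = batch_neg_log_partial_likelihood(eta, time, event)
```

Because the risk set is batch-local, averaging this loss over mini-batches differs from the full-sample partial likelihood, which is precisely why the mb-MPLE needs its own asymptotic theory.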

Updated: 2025-10-09 20:35:15

Subjects: stat.ML,cs.LG

Download: http://arxiv.org/abs/2408.02839v2

Man-Made Heuristics Are Dead. Long Live Code Generators!

Policy design for various systems controllers has conventionally been a manual process, with domain experts carefully tailoring heuristics for the specific instance in which the policy will be deployed. In this paper, we re-imagine policy design via a novel automated search technique fueled by recent advances in generative models, specifically Large Language Model (LLM)-driven code generation. We outline the design and implementation of PolicySmith, a framework that applies LLMs to synthesize instance-optimal heuristics. We apply PolicySmith to two long-standing systems policies - web caching and congestion control, highlighting the opportunities unraveled by this LLM-driven heuristic search. For caching, PolicySmith discovers heuristics that outperform established baselines on standard open-source traces. For congestion control, we show that PolicySmith can generate safe policies that integrate directly into the Linux kernel.

Updated: 2025-10-09 20:35:00

Subjects: cs.OS,cs.DC,cs.LG,cs.NE

Download: http://arxiv.org/abs/2510.08803v1

Edu-EmotionNet: Cross-Modality Attention Alignment with Temporal Feedback Loops

Understanding learner emotions in online education is critical for improving engagement and personalized instruction. While prior work in emotion recognition has explored multimodal fusion and temporal modeling, existing methods often rely on static fusion strategies and assume that modality inputs are consistently reliable, which is rarely the case in real-world learning environments. We introduce Edu-EmotionNet, a novel framework that jointly models temporal emotion evolution and modality reliability for robust affect recognition. Our model incorporates three key components: a Cross-Modality Attention Alignment (CMAA) module for dynamic cross-modal context sharing, a Modality Importance Estimator (MIE) that assigns confidence-based weights to each modality at every time step, and a Temporal Feedback Loop (TFL) that leverages previous predictions to enforce temporal consistency. Evaluated on educational subsets of IEMOCAP and MOSEI, re-annotated for confusion, curiosity, boredom, and frustration, Edu-EmotionNet achieves state-of-the-art performance and demonstrates strong robustness to missing or noisy modalities. Visualizations confirm its ability to capture emotional transitions and adaptively prioritize reliable signals, making it well suited for deployment in real-time learning systems.

Updated: 2025-10-09 20:33:52

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.08802v1

Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs' ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.

Updated: 2025-10-09 20:29:00

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08800v1

Training-Free Safe Denoisers for Safe Use of Diffusion Models

There is growing concern over the safety of powerful diffusion models (DMs), as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content or generate copyrighted material or data of individuals who wish to be forgotten. Many existing methods tackle these issues by heavily relying on text-based negative prompts or extensively retraining DMs to eliminate certain features or samples. In this paper, we take a radically different approach, directly modifying the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or datapoints needed to be excluded) to avoid specific regions of data distribution, without needing to retrain or fine-tune DMs. We formally derive the relationship between the expected denoised samples that are safe and those that are not safe, leading to our $\textit{safe}$ denoiser which ensures its final samples are away from the area to be negated. Inspired by the derivation, we develop a practical algorithm that successfully produces high-quality samples while avoiding negation areas of the data distribution in text-conditional, class-conditional, and unconditional image generation scenarios. These results hint at the great potential of our training-free safe denoiser for using DMs more safely.

Updated: 2025-10-09 20:28:40

Subjects: cs.AI

Download: http://arxiv.org/abs/2502.08011v4

SkipSR: Faster Super Resolution with Token Skipping

Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
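
The "identify low-detail regions from the LR input" step can be illustrated with a crude stand-in: score fixed-size patches by local variance and mark flat ones to skip. The paper's actual detail detector is learned; this heuristic, its threshold, and the toy image are ours:

```python
import numpy as np

def skip_mask(lr_image, patch=4, thresh=1e-3):
    """Return a per-patch mask: True = refine (high detail), False = skip."""
    h, w = lr_image.shape
    mask = np.zeros((h // patch, w // patch), dtype=bool)
    for i in range(h // patch):
        for j in range(w // patch):
            block = lr_image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            mask[i, j] = block.var() > thresh  # flat patches fall below the threshold
    return mask

img = np.zeros((8, 8))
img[0:4, 0:4] = np.linspace(0, 1, 16).reshape(4, 4)  # one textured patch
m = skip_mask(img)
# Only the top-left patch is marked for refinement; the three flat patches are skipped.
```

Downstream, only tokens belonging to `True` patches would enter the diffusion SR model, which is where the latency savings come from.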

Updated: 2025-10-09 20:27:11

Categories: cs.CV, cs.AI, cs.LG

Download: http://arxiv.org/abs/2510.08799v1

TAPAS: Datasets for Learning the Learning with Errors Problem

AI-powered attacks on Learning with Errors (LWE), an important hard math problem in post-quantum cryptography, rival or outperform "classical" attacks on LWE under certain parameter settings. Despite the promise of this approach, a dearth of accessible data limits AI practitioners' ability to study and improve these attacks. Creating LWE data for AI model training is time- and compute-intensive and requires significant domain expertise. To fill this gap and accelerate AI research on LWE attacks, we propose the TAPAS datasets, a Toolkit for Analysis of Post-quantum cryptography using AI Systems. These datasets cover several LWE settings and can be used off-the-shelf by AI practitioners to prototype new approaches to cracking LWE. This work documents TAPAS dataset creation, establishes attack performance baselines, and lays out directions for future work.

Updated: 2025-10-09 20:23:06

Categories: cs.LG, cs.CR

Download: http://arxiv.org/abs/2510.08797v1

PO-CKAN: Physics-Informed Deep Operator Kolmogorov-Arnold Networks with Chunk Rational Structure

We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov-Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov-Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch-trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers' equation with $\nu=0.01$, PO-CKAN reduces the mean relative $L^2$ error by approximately 48% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion-reaction benchmarks.

Updated: 2025-10-09 20:18:24

Categories: cs.LG, math-ph, math.MP

Download: http://arxiv.org/abs/2510.08795v1

Deceptive Exploration in Multi-armed Bandits

We consider a multi-armed bandit setting in which each arm has a public and a private reward distribution. An observer expects an agent to follow Thompson Sampling according to the public rewards; however, the deceptive agent aims to quickly identify the best private arm without being noticed. The observer can observe the public rewards and the pulled arms, but not the private rewards. The agent, on the other hand, observes both the public and private rewards. We formalize detectability as a stepwise Kullback-Leibler (KL) divergence constraint between the actual pull probabilities used by the agent and the pull probabilities anticipated by the observer. We model successful pulling of public suboptimal arms as a Bernoulli process in which the success probability decreases with each successful pull, and show these pulls can happen at most at a $\Theta(\sqrt{T})$ rate under the KL constraint. We then formulate a maximin problem based on the public and private means, whose solution characterizes the optimal error exponent for best private arm identification. We finally propose an algorithm inspired by top-two algorithms, which naturally adapts its exploration to the hardness of pulling arms based on the public suboptimality gaps. We provide numerical examples illustrating the $\Theta(\sqrt{T})$ rate and the behavior of the proposed algorithm.
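The detectability constraint can be made concrete as a per-step KL budget between the agent's actual pull distribution and the one the observer anticipates from public-reward Thompson Sampling. A minimal sketch, with an invented budget `eps`:

```python
import numpy as np

def kl_categorical(p, q):
    """KL divergence D(p || q) between two categorical distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def within_budget(actual, anticipated, eps):
    """Stepwise detectability check: the agent's actual pull probabilities
    may deviate from the observer's anticipated (public Thompson Sampling)
    probabilities by at most eps in KL. `eps` is an invented budget."""
    return kl_categorical(actual, anticipated) <= eps

anticipated = np.array([0.6, 0.3, 0.1])  # what the observer expects
actual = np.array([0.5, 0.3, 0.2])       # deceptive agent's tilted pulls
```

The deceptive agent then shifts probability mass toward privately promising arms only as far as the budget allows, which is what caps the suboptimal-arm pulls at the $\Theta(\sqrt{T})$ rate.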

Updated: 2025-10-09 20:15:52

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2510.08794v1

COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context

Long-horizon tasks that require sustained reasoning and multiple tool interactions remain challenging for LLM agents: small errors compound across steps, and even state-of-the-art models often hallucinate or lose coherence. We identify context management as the central bottleneck -- extended histories cause agents to overlook critical evidence or become distracted by irrelevant information, thus failing to replan or reflect from previous mistakes. To address this, we propose COMPASS (Context-Organized Multi-Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components: (1) a Main Agent that performs reasoning and tool use, (2) a Meta-Thinker that monitors progress and issues strategic interventions, and (3) a Context Manager that maintains concise, relevant progress briefs for different reasoning stages. Across three challenging benchmarks -- GAIA, BrowseComp, and Humanity's Last Exam -- COMPASS improves accuracy by up to 20% relative to both single- and multi-agent baselines. We further introduce a test-time scaling extension that elevates performance to match established DeepResearch agents, and a post-training pipeline that delegates context management to smaller models for enhanced efficiency.

Updated: 2025-10-09 20:14:26

Categories: cs.AI, cs.CL

Download: http://arxiv.org/abs/2510.08790v1

Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations

Large language models (LLMs) are increasingly deployed as evaluators of text quality, yet the validity of their judgments remains underexplored. This study investigates systematic bias in self- and cross-model evaluations across three prominent LLMs: ChatGPT, Gemini, and Claude. We designed a controlled experiment in which blog posts authored by each model were evaluated by all three models under four labeling conditions: no attribution, true attribution, and two false-attribution scenarios. Evaluations employed both holistic preference voting and granular quality ratings across three dimensions (Coherence, Informativeness, and Conciseness), with all scores normalized to percentages for direct comparison. Our findings reveal pronounced asymmetries in model judgments: the "Claude" label consistently elevated scores regardless of actual authorship, while the "Gemini" label systematically depressed them. False attribution frequently reversed preference rankings, producing shifts of up to 50 percentage points in voting outcomes and up to 12 percentage points in quality ratings. Notably, Gemini exhibited severe self-deprecation under true labels, while Claude demonstrated intensified self-preference. These results demonstrate that perceived model identity can substantially distort both high-level judgments and fine-grained quality assessments, independent of content quality. Our findings challenge the reliability of LLM-as-judge paradigms and underscore the critical need for blind evaluation protocols and diverse multi-model validation frameworks to ensure fairness and validity in automated text evaluation and LLM benchmarking.

Updated: 2025-10-09 20:01:01

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2508.21164v3

MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

In an ideal design pipeline, user interface (UI) design is intertwined with user research to validate decisions, yet studies are often resource-constrained during early exploration. Recent advances in multimodal large language models (MLLMs) offer a promising opportunity to act as early evaluators, helping designers narrow options before formal testing. Unlike prior work that emphasizes user behavior in narrow domains such as e-commerce with metrics like clicks or conversions, we focus on subjective user evaluations across varied interfaces. We investigate whether MLLMs can mimic human preferences when evaluating individual UIs and comparing them. Using data from a crowdsourcing platform, we benchmark GPT-4o, Claude, and Llama across 30 interfaces and examine alignment with human judgments on multiple UI factors. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others, underscoring both their potential and limitations in supplementing early UX research.

Updated: 2025-10-09 20:00:41

Categories: cs.HC, cs.AI

Download: http://arxiv.org/abs/2510.08783v1

Weights initialization of neural networks for function approximation

Neural network-based function approximation plays a pivotal role in the advancement of scientific computing and machine learning. Yet, training such models faces several challenges: (i) each target function often requires training a new model from scratch; (ii) performance is highly sensitive to architectural and hyperparameter choices; and (iii) models frequently generalize poorly beyond the training domain. To overcome these challenges, we propose a reusable initialization framework based on basis function pretraining. In this approach, basis neural networks are first trained to approximate families of polynomials on a reference domain. Their learned parameters are then used to initialize networks for more complex target functions. To enhance adaptability across arbitrary domains, we further introduce a domain mapping mechanism that transforms inputs into the reference domain, thereby preserving structural correspondence with the pretrained models. Extensive numerical experiments in one- and two-dimensional settings demonstrate substantial improvements in training efficiency, generalization, and model transferability, highlighting the promise of initialization-based strategies for scalable and modular neural function approximation. The full code is made publicly available on Gitee.
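The domain mapping mechanism amounts to an affine change of variables between the target box and the reference domain on which the basis networks were pretrained. A minimal sketch, assuming the reference domain is [-1, 1]^d:

```python
import numpy as np

def to_reference(x, lo, hi):
    """Affine map from an arbitrary box [lo, hi] onto the reference domain
    [-1, 1]^d where the basis networks were pretrained. The choice of
    [-1, 1] as the reference box is an assumption for illustration."""
    x, lo, hi = (np.asarray(v, float) for v in (x, lo, hi))
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def from_reference(t, lo, hi):
    """Inverse map from the reference domain back to the target box."""
    t, lo, hi = (np.asarray(v, float) for v in (t, lo, hi))
    return lo + (t + 1.0) * (hi - lo) / 2.0

# map a point of the target box [2, 4] x [0, 10] into the reference box
x = np.array([2.0, 5.0])
t = to_reference(x, lo=[2.0, 0.0], hi=[4.0, 10.0])
```

Because the map preserves structural correspondence with the pretrained inputs, the pretrained weights remain a sensible initialization on any box.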

Updated: 2025-10-09 19:56:26

Categories: cs.LG, cs.NA, math.NA, 68T07, 41A46, 65D10

Download: http://arxiv.org/abs/2510.08780v1

Guiding Exploration in Reinforcement Learning Through LLM-Augmented Observations

Reinforcement Learning (RL) agents often struggle in sparse-reward environments where traditional exploration strategies fail to discover effective action sequences. Large Language Models (LLMs) possess procedural knowledge and reasoning capabilities from text pretraining that could guide RL exploration, but existing approaches create rigid dependencies where RL policies must follow LLM suggestions or incorporate them directly into reward functions. We propose a framework that provides LLM-generated action recommendations through augmented observation spaces, allowing RL agents to learn when to follow or ignore this guidance. Our method leverages LLMs' world knowledge and reasoning abilities while maintaining flexibility through soft constraints. We evaluate our approach on three BabyAI environments of increasing complexity and show that the benefits of LLM guidance scale with task difficulty. In the most challenging environment, we achieve 71% relative improvement in final success rates over baseline. The approach provides substantial sample efficiency gains, with agents reaching performance thresholds up to 9 times faster, and requires no modifications to existing RL algorithms. Our results demonstrate an effective method for leveraging LLM planning capabilities to accelerate RL training in challenging environments.
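The augmented-observation idea can be sketched as a thin wrapper that appends a one-hot LLM recommendation to the raw observation; the policy then learns from experience when the hint helps. The recommender below is a placeholder, not a real LLM call, and the action-space size is an assumption:

```python
import numpy as np

N_ACTIONS = 4  # illustrative discrete action space

def llm_suggest(obs):
    """Placeholder recommender; a real system would query an LLM with a
    textual description of the state and parse its suggested action."""
    return int(np.argmax(obs)) % N_ACTIONS

def augment(obs):
    """Append the recommendation as a one-hot vector to the observation,
    so the RL agent can learn when to follow or ignore the hint."""
    hint = np.zeros(N_ACTIONS)
    hint[llm_suggest(obs)] = 1.0
    return np.concatenate([np.asarray(obs, float), hint])

obs = np.array([0.1, 0.9, 0.0])
aug = augment(obs)
```

Because the hint enters only through the observation, no change to the RL algorithm or reward function is needed, which is the soft-constraint property the abstract describes.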

Updated: 2025-10-09 19:54:31

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2510.08779v1

Measuring Moral LLM Responses in Multilingual Capacities

With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT-5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and improvement in these areas.

Updated: 2025-10-09 19:47:40

Categories: cs.CL, cs.AI

Download: http://arxiv.org/abs/2510.08776v1

Re-Identifying Kākā with AI-Automated Video Key Frame Extraction

Accurate recognition and re-identification of individual animals is essential for successful wildlife population monitoring. Traditional methods, such as leg banding of birds, are time consuming and invasive. Recent progress in artificial intelligence, particularly computer vision, offers encouraging solutions for smart conservation and efficient automation. This study presents a unique pipeline for extracting high-quality key frames from videos of kākā (Nestor meridionalis), a threatened forest-dwelling parrot in New Zealand. Key frame extraction is well-studied in person re-identification; however, its application to wildlife is limited. Using video recordings at a custom-built feeder, we extract key frames and evaluate the re-identification performance of our pipeline. Our unsupervised methodology combines object detection using YOLO and Grounding DINO, optical flow blur detection, image encoding with DINOv2, and clustering methods to identify representative key frames. The results indicate that our proposed key frame selection methods yield image collections which achieve high accuracy in kākā re-identification, providing a foundation for future research using media collected in more diverse and challenging environments. Through the use of artificial intelligence and computer vision, our non-invasive and efficient approach provides a valuable alternative to traditional physical tagging methods for recognising kākā individuals and therefore improving the monitoring of populations. This research contributes to developing fresh approaches in wildlife monitoring, with applications in ecology and conservation biology.

Updated: 2025-10-09 19:46:46

Categories: cs.CV, cs.AI

Download: http://arxiv.org/abs/2510.08775v1

Revealing Interconnections between Diseases: from Statistical Methods to Large Language Models

Identifying disease interconnections through manual analysis of large-scale clinical data is labor-intensive, subjective, and prone to expert disagreement. While machine learning (ML) shows promise, three critical challenges remain: (1) selecting optimal methods from the vast ML landscape, (2) determining whether real-world clinical data (e.g., electronic health records, EHRs) or structured disease descriptions yield more reliable insights, (3) the lack of "ground truth," as some disease interconnections remain unexplored in medicine. Large language models (LLMs) demonstrate broad utility, yet they often lack specialized medical knowledge. To address these gaps, we conduct a systematic evaluation of seven approaches for uncovering disease relationships based on two data sources: (i) sequences of ICD-10 codes from MIMIC-IV EHRs and (ii) the full set of ICD-10 codes, both with and without textual descriptions. Our framework integrates the following: (i) a statistical co-occurrence analysis and a masked language modeling (MLM) approach using real clinical data; (ii) domain-specific BERT variants (Med-BERT and BioClinicalBERT); (iii) a general-purpose BERT and document retrieval; and (iv) four LLMs (Mistral, DeepSeek, Qwen, and YandexGPT). Our graph-based comparison of the obtained interconnection matrices shows that the LLM-based approach produces interconnections with the lowest diversity of ICD code connections to different diseases compared to other methods, including text-based and domain-based approaches. This suggests an important implication: LLMs have limited potential for discovering new interconnections. In the absence of ground truth databases for medical interconnections between ICD codes, our results constitute a valuable medical disease ontology that can serve as a foundational resource for future clinical research and artificial intelligence applications in healthcare.
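The statistical co-occurrence baseline can be sketched as pointwise mutual information over per-patient ICD code sets. The input format and the PMI scoring are illustrative assumptions, not necessarily the paper's exact statistic:

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_pmi(patients):
    """Pointwise mutual information between ICD codes across per-patient
    code sets: PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ). Code pairs that
    co-occur more often than independence predicts score positive."""
    n = len(patients)
    single, pair = Counter(), Counter()
    for codes in patients:
        codes = set(codes)
        single.update(codes)
        pair.update(frozenset(p) for p in combinations(sorted(codes), 2))
    return {pr: math.log((c / n) / ((single[a] / n) * (single[b] / n)))
            for pr, c in pair.items()
            for a, b in [tuple(pr)]}

# toy cohort: E11 (diabetes) and I10 (hypertension) tend to co-occur
patients = [{"E11", "I10"}, {"E11", "I10"}, {"E11"}, {"J45"}]
pmi = cooccurrence_pmi(patients)
```

Thresholding such a score matrix yields one of the interconnection matrices that the paper's graph-based comparison contrasts with the LLM-derived ones.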

Updated: 2025-10-09 19:46:38

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2510.04888v2

Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings

Text embeddings from Large Language Models (LLMs) have become foundational for numerous applications. However, these models typically operate on raw text, overlooking the rich structural information, such as hyperlinks or citations, that provides crucial context in many real-world datasets. This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings by integrating these structural relations directly into the LLM's internal encoding process, rather than relying on traditional post-hoc aggregation. We investigate two primary in-process methods: sequential concatenation and parallel caching. Through extensive zero-shot experiments across retrieval, clustering, classification, and recommendation tasks, we demonstrate that our structure-aware approaches consistently outperform both text-only and post-hoc baselines. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors. To address the challenge of noisy structural data, we also introduce and validate two effective techniques: Context Distillation and Semantic Balancing. This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models.
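Of the two in-process methods, sequential concatenation is easy to sketch: linked documents are packed into the encoder input ahead of the target text under a truncation budget. The "[SEP]" separator and the character budget below are assumptions for illustration:

```python
def build_input(doc, neighbors, max_chars=2000):
    """Sequential concatenation: pack linked documents (hyperlink targets,
    citations) ahead of the target text under a character budget, so the
    encoder sees structural context in-process rather than via post-hoc
    aggregation. The "[SEP]" separator and budget are illustrative."""
    parts = []
    budget = max_chars - len(doc)
    for n in neighbors:
        if budget <= 0:
            break
        chunk = n[:budget]
        parts.append(chunk)
        budget -= len(chunk)
    return " [SEP] ".join(parts + [doc])

packed = build_input("target text", ["n" * 10, "m" * 10], max_chars=25)
```

Parallel caching would instead encode each neighbor separately and expose its key-value states to the target's forward pass, which is why it scales to longer contexts.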

Updated: 2025-10-09 19:45:54

Categories: cs.LG, cs.AI, cs.CL

Download: http://arxiv.org/abs/2510.08774v1

Detecting spills using thermal imaging, pretrained deep learning models, and a robotic platform

This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.

Updated: 2025-10-09 19:40:58

Categories: cs.CV, cs.LG, cs.RO

Download: http://arxiv.org/abs/2510.08770v1

Prioritizing Latency with Profit: A DRL-Based Admission Control for 5G Network Slices

5G networks enable diverse services such as eMBB, URLLC, and mMTC through network slicing, necessitating intelligent admission control and resource allocation to meet stringent QoS requirements while maximizing Network Service Provider (NSP) profits. However, existing Deep Reinforcement Learning (DRL) frameworks focus primarily on profit optimization without explicitly accounting for service delay, potentially leading to QoS violations for latency-sensitive slices. Moreover, the epsilon-greedy exploration commonly used in DRL often results in unstable convergence and suboptimal policy learning. To address these gaps, we propose DePSAC -- a Delay and Profit-aware Slice Admission Control scheme. Our DRL-based approach incorporates a delay-aware reward function, where penalties due to service delay incentivize the prioritization of latency-critical slices such as URLLC. Additionally, we employ Boltzmann exploration to achieve smoother and faster convergence. We implement and evaluate DePSAC on a simulated 5G core network substrate with realistic Network Slice Request (NSLR) arrival patterns. Experimental results demonstrate that our method outperforms the DSARA baseline in terms of overall profit, reduced URLLC slice delays, improved acceptance rates, and improved resource consumption. These findings validate the effectiveness of the proposed DePSAC in achieving better QoS-profit trade-offs for practical 5G network slicing scenarios.
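Boltzmann exploration replaces epsilon-greedy's uniform random actions with softmax sampling over value estimates, so exploration shifts smoothly toward promising admission decisions as learning proceeds. A minimal sketch:

```python
import numpy as np

def boltzmann(q_values, temperature=1.0, seed=None):
    """Softmax (Boltzmann) action selection: sample an action with
    probability proportional to exp(Q / temperature). Exploration decays
    smoothly as value estimates separate, unlike epsilon-greedy's uniform
    random moves."""
    z = np.asarray(q_values, float) / temperature
    z -= z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    rng = np.random.default_rng(seed)
    return int(rng.choice(len(p), p=p)), p

q = [1.0, 2.0, 0.5]                    # illustrative admission action values
action, probs = boltzmann(q, temperature=0.5, seed=0)
```

Lowering the temperature makes the policy greedier; raising it flattens the distribution toward uniform exploration.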

Updated: 2025-10-09 19:36:38

Categories: cs.NI, cs.LG, cs.PF

Download: http://arxiv.org/abs/2510.08769v1

Zero-Shot Policy Transfer in Reinforcement Learning using Buckingham's Pi Theorem

Reinforcement learning (RL) policies often fail to generalize to new robots, tasks, or environments with different physical parameters, a challenge that limits their real-world applicability. This paper presents a simple, zero-shot transfer method based on Buckingham's Pi Theorem to address this limitation. The method adapts a pre-trained policy to new system contexts by scaling its inputs (observations) and outputs (actions) through a dimensionless space, requiring no retraining. The approach is evaluated against a naive transfer baseline across three environments of increasing complexity: a simulated pendulum, a physical pendulum for sim-to-real validation, and the high-dimensional HalfCheetah. Results demonstrate that the scaled transfer exhibits no loss of performance on dynamically similar contexts. Furthermore, on non-similar contexts, the scaled policy consistently outperforms the naive transfer, significantly expanding the volume of contexts where the original policy remains effective. These findings demonstrate that dimensional analysis provides a powerful and practical tool to enhance the robustness and generalization of RL policies.
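For a pendulum, dimensional analysis yields dimensionless groups such as theta, theta_dot/sqrt(g/l), and tau/(m*g*l); scaling observations and actions through these groups lets one pretrained policy serve all dynamically similar systems. A sketch of that scaling (the groups used here are the standard ones for a pendulum, not taken verbatim from the paper):

```python
import math

def scale_observation(theta, theta_dot, g, l):
    """Map a pendulum state to dimensionless form: the angle is already
    dimensionless; angular velocity is scaled by the natural frequency
    omega = sqrt(g / l)."""
    omega = math.sqrt(g / l)
    return theta, theta_dot / omega

def scale_action(tau_star, m, g, l):
    """Rescale the policy's dimensionless torque to physical units via the
    characteristic torque m * g * l."""
    return tau_star * m * g * l

# two dynamically similar pendulums map to the same dimensionless state,
# so a single pretrained policy can control both without retraining
s_small = scale_observation(0.3, 2.0, g=9.81, l=1.0)
s_large = scale_observation(0.3, 1.0, g=9.81, l=4.0)
```

When contexts are not dynamically similar the dimensionless states no longer coincide, which is where the abstract reports a graceful degradation rather than an exact match.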

Updated: 2025-10-09 19:36:18

Categories: cs.LG, cs.RO

Download: http://arxiv.org/abs/2510.08768v1

Understanding Exoplanet Habitability: A Bayesian ML Framework for Predicting Atmospheric Absorption Spectra

The evolution of space technology in recent years, fueled by advancements in computing such as Artificial Intelligence (AI) and machine learning (ML), has profoundly transformed our capacity to explore the cosmos. Missions like the James Webb Space Telescope (JWST) have made information about distant objects more easily accessible, resulting in extensive amounts of valuable data. As part of this work-in-progress study, we are working to create an atmospheric absorption spectrum prediction model for exoplanets. The eventual model will be based on both collected observational spectra and synthetic spectral data generated by the ROCKE-3D general circulation model (GCM) developed by the climate modeling program at NASA's Goddard Institute for Space Studies (GISS). In this initial study, spline curves are used to describe the bin heights of simulated atmospheric absorption spectra as a function of one of the planetary parameters. Bayesian Adaptive Exploration is then employed to identify areas of the planetary parameter space for which more data are needed to improve the model. The resulting system will be used as a forward model so that planetary parameters can be inferred given a planet's atmospheric absorption spectrum. This work is expected to contribute to a better understanding of exoplanetary properties and general exoplanet climates and habitability.

Updated: 2025-10-09 19:34:07

Fields: astro-ph.EP,astro-ph.IM,cs.LG

Download: http://arxiv.org/abs/2510.08766v1

Reinforcement Learning-Based Optimization of CT Acquisition and Reconstruction Parameters Through Virtual Imaging Trials

Protocol optimization is critical in Computed Tomography (CT) to achieve high diagnostic image quality while minimizing radiation dose. However, due to the complex interdependencies among CT acquisition and reconstruction parameters, traditional optimization methods rely on exhaustive testing of combinations of these parameters, which is often impractical. This study introduces a novel methodology that combines virtual imaging tools with reinforcement learning to optimize CT protocols more efficiently. Human models with liver lesions were imaged using a validated CT simulator and reconstructed with a novel CT reconstruction toolkit. The optimization parameter space included tube voltage, tube current, reconstruction kernel, slice thickness, and pixel size. The optimization process was performed using a Proximal Policy Optimization (PPO) agent, which was trained to maximize an image quality objective, specifically the detectability index (d') of liver lesions in the reconstructed images. Optimization performance was compared against an exhaustive search performed on a supercomputer. The proposed reinforcement learning approach achieved the global maximum d' across test cases while requiring 79.7% fewer steps than the exhaustive search, demonstrating both accuracy and computational efficiency. The proposed framework is flexible and can accommodate various image quality objectives. The findings highlight the potential of integrating virtual imaging tools with reinforcement learning for CT protocol management.

Updated: 2025-10-09 19:30:41

Fields: cs.LG

Download: http://arxiv.org/abs/2510.08763v1

Spatial Deconfounder: Interference-Aware Deconfounding for Spatial Causal Inference

Causal inference in spatial domains faces two intertwined challenges: (1) unmeasured spatial factors, such as weather, air pollution, or mobility, that confound treatment and outcome, and (2) interference from nearby treatments that violate standard no-interference assumptions. While existing methods typically address one by assuming away the other, we show they are deeply connected: interference reveals structure in the latent confounder. Leveraging this insight, we propose the Spatial Deconfounder, a two-stage method that reconstructs a substitute confounder from local treatment vectors using a conditional variational autoencoder (CVAE) with a spatial prior, then estimates causal effects via a flexible outcome model. We show that this approach enables nonparametric identification of both direct and spillover effects under weak assumptions--without requiring multiple treatment types or a known model of the latent field. Empirically, we extend SpaCE, a benchmark suite for spatial confounding, to include treatment interference, and show that the Spatial Deconfounder consistently improves effect estimation across real-world datasets in environmental health and social science. By turning interference into a multi-cause signal, our framework bridges spatial and deconfounding literatures to advance robust causal inference in structured data.

Updated: 2025-10-09 19:28:18

Fields: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.08762v1

SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense

Adversarial attacks significantly challenge the safe deployment of deep learning models, particularly in real-world applications. Traditional defenses often rely on computationally intensive optimization (e.g., adversarial training or data augmentation) to improve robustness, whereas the human visual system achieves inherent robustness to adversarial perturbations through evolved biological mechanisms. We hypothesize that attention-guided, non-homogeneous sparse sampling and predictive coding play a key role in this robustness. To test this hypothesis, we propose a novel defense framework incorporating three key biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Our approach employs reinforcement learning-guided saccades to selectively capture multiple foveal-peripheral glimpses, which are integrated into a reconstructed image before classification. This biologically inspired preprocessing effectively mitigates adversarial noise, preserves semantic integrity, and notably requires no retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems. Experiments on the ImageNet dataset demonstrate that our method improves system robustness across diverse classifiers and attack types, while significantly reducing training overhead compared to both biologically and non-biologically inspired defense techniques.

Updated: 2025-10-09 19:23:19

Fields: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.08761v1

A Design-based Solution for Causal Inference with Text: Can a Language Model Be Too Large?

Many social science questions ask how linguistic properties causally affect an audience's attitudes and behaviors. Because text properties are often interlinked (e.g., angry reviews use profane language), we must control for possible latent confounding to isolate causal effects. Recent literature proposes adapting large language models (LLMs) to learn latent representations of text that successfully predict both treatment and the outcome. However, because the treatment is a component of the text, these deep learning methods risk learning representations that actually encode the treatment itself, inducing overlap bias. Rather than depending on post-hoc adjustments, we introduce a new experimental design that handles latent confounding, avoids the overlap issue, and unbiasedly estimates treatment effects. We apply this design in an experiment evaluating the persuasiveness of expressing humility in political communication. Methodologically, we demonstrate that LLM-based methods perform worse than even simple bag-of-words models using our real text and outcomes from our experiment. Substantively, we isolate the causal effect of expressing humility on the perceived persuasiveness of political statements, offering new insights on communication effects for social media platforms, policy makers, and social scientists.

Updated: 2025-10-09 19:17:57

Fields: stat.ME,cs.CL,cs.LG,stat.AP

Download: http://arxiv.org/abs/2510.08758v1

LOTION: Smoothing the Optimization Landscape for Quantized Training

Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piece-wise constant, yielding zero gradients everywhere except at quantization thresholds where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like Straight Through Estimators (STE) and do not provide any guarantees of convergence. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous loss surface. In particular, we introduce LOTION, Low-precision Optimization via sTochastic-noIse smOothiNg, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the loss surface. Moreover, when using noise derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M-parameter language models.
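
As a minimal illustration of the core idea (not the authors' implementation), the sketch below estimates the smoothed objective for a toy quantized loss by Monte Carlo averaging over unbiased stochastic-rounding noise; the loss function, grid step, and sample count are all illustrative choices.

```python
import numpy as np

def stochastic_round(w, step, rng):
    """Unbiased stochastic rounding of w onto a grid with spacing `step`."""
    scaled = w / step
    low = np.floor(scaled)
    # Round up with probability equal to the fractional part, so E[SR(w)] = w.
    up = rng.random(w.shape) < (scaled - low)
    return (low + up) * step

def smoothed_loss(loss_fn, w, step, n_samples=4000, seed=0):
    """Monte Carlo estimate of E[loss(SR(w))]: the raw quantized loss is
    piecewise constant in w, but this expectation varies continuously."""
    rng = np.random.default_rng(seed)
    return float(np.mean([loss_fn(stochastic_round(w, step, rng))
                          for _ in range(n_samples)]))

# Toy quantized objective: w = 0.3 rounds to 0 or 1 on a unit grid, so the
# smoothed loss interpolates between loss(0) = 1 and loss(1) = 0.
w = np.array([0.3])
quant_loss = lambda v: float((v[0] - 1.0) ** 2)
print(smoothed_loss(quant_loss, w, step=1.0))  # close to 0.7
```

Because the expectation is continuous in w, standard gradient-based optimizers can make progress on it even though the underlying quantized loss has zero gradient almost everywhere.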

Updated: 2025-10-09 19:16:46

Fields: cs.LG,cs.AR

Download: http://arxiv.org/abs/2510.08757v1

Robust Heuristic Algorithm Design with LLMs

We posit that we can generate more robust and performant heuristics if we augment approaches using LLMs for heuristic design with tools that explain why heuristics underperform and offer suggestions about how to fix them. We find even simple ideas that (1) expose the LLM to instances where the heuristic underperforms; (2) explain why they occur; and (3) specialize design to regions in the input space, can produce more robust algorithms compared to existing techniques -- the heuristics we produce have a ~28x better worst-case performance compared to FunSearch, improve average performance, and maintain the runtime.

Updated: 2025-10-09 19:13:56

Fields: cs.AI,cs.CL,cs.NI

Download: http://arxiv.org/abs/2510.08755v1

NLP-ADBench: NLP Anomaly Detection Benchmark

Anomaly detection (AD) is an important machine learning task with applications in fraud detection, content moderation, and user behavior analysis. However, AD is relatively understudied in a natural language processing (NLP) context, limiting its effectiveness in detecting harmful content, phishing attempts, and spam reviews. We introduce NLP-ADBench, the most comprehensive NLP anomaly detection (NLP-AD) benchmark to date, which includes eight curated datasets and 19 state-of-the-art algorithms. These span 3 end-to-end methods and 16 two-step approaches that adapt classical, non-AD methods to language embeddings from BERT and OpenAI. Our empirical results show that no single model dominates across all datasets, indicating a need for automated model selection. Moreover, two-step methods with transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings outperforming those of BERT. We release NLP-ADBench at https://github.com/USC-FORTIS/NLP-ADBench, providing a unified framework for NLP-AD and supporting future investigations.
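
To illustrate the two-step recipe, here is a minimal sketch of the second step: a classical, non-NLP anomaly detector (k-nearest-neighbour distance) applied to embedding vectors. The embeddings are random stand-ins; the actual benchmark uses BERT and OpenAI embeddings of text.

```python
import numpy as np

def knn_anomaly_scores(embeddings, k=5):
    """Step 2 of a two-step NLP-AD pipeline: score each embedding by its mean
    distance to its k nearest neighbours (a classical, non-NLP detector)."""
    X = np.asarray(embeddings, dtype=float)
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)            # ignore self-distances
    nearest = np.sort(dists, axis=1)[:, :k]    # k closest neighbours per point
    return nearest.mean(axis=1)                # higher score = more anomalous

# Step 1 would embed each document with BERT or an OpenAI embedding model;
# random vectors plus one far-away point stand in for those embeddings here.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(50, 8)), np.full((1, 8), 5.0)])
scores = knn_anomaly_scores(X, k=5)
print(int(np.argmax(scores)))  # the planted outlier gets the top score
```

Any classical detector (isolation forests, LOF, etc.) can be substituted in the same slot, which is exactly the family of two-step approaches the benchmark evaluates.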

Updated: 2025-10-09 19:13:25

Fields: cs.CL,cs.LG

Download: http://arxiv.org/abs/2412.04784v2

Exploring Cross-Client Memorization of Training Data in Large Language Models for Federated Learning

Federated learning (FL) enables collaborative training without raw data sharing, but still risks training data memorization. Existing FL memorization detection techniques focus on one sample at a time, underestimating more subtle risks of cross-sample memorization. In contrast, recent work on centralized learning (CL) has introduced fine-grained methods to assess memorization across all samples in training data, but these assume centralized access to data and cannot be applied directly to FL. We bridge this gap by proposing a framework that quantifies both intra- and inter-client memorization in FL using fine-grained cross-sample memorization measurement across all clients. Based on this framework, we conduct two studies: (1) measuring subtle memorization across clients and (2) examining key factors that influence memorization, including decoding strategies, prefix length, and FL algorithms. Our findings reveal that FL models do memorize client data, particularly intra-client data, more than inter-client data, with memorization influenced by training and inferencing factors.

Updated: 2025-10-09 19:07:48

Fields: cs.LG,cs.CL

Download: http://arxiv.org/abs/2510.08750v1

Conformal Risk Training: End-to-End Optimization of Conformal Risk Control

While deep learning models often achieve high predictive accuracy, their predictions typically do not come with any provable guarantees on risk or reliability, which are critical for deployment in high-stakes applications. The framework of conformal risk control (CRC) provides a distribution-free, finite-sample method for controlling the expected value of any bounded monotone loss function and can be conveniently applied post-hoc to any pre-trained deep learning model. However, many real-world applications are sensitive to tail risks, as opposed to just expected loss. In this work, we develop a method for controlling the general class of Optimized Certainty-Equivalent (OCE) risks, a broad class of risk measures which includes as special cases the expected loss (generalizing the original CRC method) and common tail risks like the conditional value-at-risk (CVaR). Furthermore, standard post-hoc CRC can degrade average-case performance due to its lack of feedback to the model. To address this, we introduce "conformal risk training," an end-to-end approach that differentiates through conformal OCE risk control during model training or fine-tuning. Our method achieves provable risk guarantees while demonstrating significantly improved average-case performance over post-hoc approaches on applications to controlling classifiers' false negative rate and controlling financial risk in battery storage operation.
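
The post-hoc CRC baseline the paper extends can be sketched as follows. This is a simplified toy illustration, not the paper's conformal risk training: it selects the smallest threshold whose finite-sample-corrected calibration risk meets the target, for a bounded loss that is non-increasing in the threshold.

```python
import numpy as np

def crc_threshold(loss_fn, calib, lambdas, alpha, B=1.0):
    """Post-hoc conformal risk control: return the smallest lambda whose
    corrected empirical risk on a calibration set is at most alpha, for a
    loss bounded by B and non-increasing in lambda."""
    n = len(calib)
    for lam in sorted(lambdas):
        risk = np.mean([loss_fn(x, lam) for x in calib])
        # Finite-sample correction behind the CRC guarantee E[loss] <= alpha.
        if (n * risk + B) / (n + 1) <= alpha:
            return lam
    return max(lambdas)  # fall back to the most conservative setting

# Toy miss-loss: 1 when a symmetric interval of radius lam misses the sample.
rng = np.random.default_rng(0)
calib = rng.normal(size=200)
miss = lambda x, lam: float(abs(x) > lam)
lam_hat = crc_threshold(miss, calib, lambdas=np.linspace(0, 4, 81), alpha=0.1)
print(round(float(lam_hat), 2))
```

Because this selection happens purely post hoc, the model receives no feedback from it; conformal risk training, as described above, instead differentiates through this kind of control during training.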

Updated: 2025-10-09 19:05:45

Fields: cs.LG

Download: http://arxiv.org/abs/2510.08748v1

RFOD: Random Forest-based Outlier Detection for Tabular Data

Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce RFOD, a novel Random Forest-based Outlier Detection framework tailored for tabular data. Rather than modeling a global joint distribution, RFOD reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, RFOD combines Adjusted Gower's Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that RFOD consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.
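
A simplified sketch of the feature-wise conditional reconstruction idea, assuming scikit-learn's random forests and plain squared error in place of the paper's AGD/UWA scoring:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rfod_scores(X, n_estimators=50, min_samples_leaf=20, seed=0):
    """Feature-wise conditional reconstruction in the spirit of RFOD: for each
    feature j, a random forest predicts column j from the remaining columns,
    and a row's anomaly score aggregates its per-cell reconstruction errors.
    (Simplified: squared error instead of AGD, a plain mean instead of UWA.)"""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    cell_err = np.zeros_like(X)
    for j in range(p):
        rest = np.delete(X, j, axis=1)
        # Large leaves keep the forest from memorizing the very rows it scores.
        rf = RandomForestRegressor(n_estimators=n_estimators,
                                   min_samples_leaf=min_samples_leaf,
                                   random_state=seed)
        rf.fit(rest, X[:, j])
        cell_err[:, j] = (rf.predict(rest) - X[:, j]) ** 2
    return cell_err.mean(axis=1)  # higher row score = more anomalous

# Two correlated features; row 0 holds values that are plausible marginally
# but jointly inconsistent, which purely marginal detectors tend to miss.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, 2 * a + rng.normal(0, 0.05, 200)])
X[0] = [3.0, -6.0]
scores = rfod_scores(X)
print(int(np.argmax(scores)))
```

The cell-level errors also localize which value is suspicious in each flagged row, which is the source of the framework's interpretability.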

Updated: 2025-10-09 19:02:12

Fields: cs.LG,cs.DB

Download: http://arxiv.org/abs/2510.08747v1

Graph Diffusion Transformers are In-Context Molecular Designers

In-context learning allows large models to adapt to new tasks from a few demonstrations, but it has shown limited success in molecular design. Existing databases such as ChEMBL contain molecular properties spanning millions of biological assays, yet labeled data for each property remain scarce. To address this limitation, we introduce demonstration-conditioned diffusion models (DemoDiff), which define task contexts using a small set of molecule-score examples instead of text descriptions. These demonstrations guide a denoising Transformer to generate molecules aligned with target properties. For scalable pretraining, we develop a new molecular tokenizer with Node Pair Encoding that represents molecules at the motif level, requiring 5.5x fewer nodes. We curate a dataset containing millions of context tasks from multiple sources covering both drugs and materials, and pretrain a 0.7-billion-parameter model on it. Across 33 design tasks in six categories, DemoDiff matches or surpasses language models 100-1000x larger and achieves an average rank of 3.63 compared to 5.25-10.20 for domain-specific approaches. These results position DemoDiff as a molecular foundation model for in-context molecular design. Our code is available at https://github.com/liugangcode/DemoDiff.

Updated: 2025-10-09 18:56:57

Fields: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08744v1

Coordinates from Context: Using LLMs to Ground Complex Location References

Geocoding is the task of linking a location reference to an actual geographic location and is essential for many downstream analyses of unstructured text. In this paper, we explore the challenging setting of geocoding compositional location references. Building on recent work demonstrating LLMs' abilities to reason over geospatial data, we evaluate LLMs' geospatial knowledge versus reasoning skills relevant to our task. Based on these insights, we propose an LLM-based strategy for geocoding compositional location references. We show that our approach improves performance for the task and that a relatively small fine-tuned LLM can achieve comparable performance with much larger off-the-shelf models.

Updated: 2025-10-09 18:51:52

Fields: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08741v1

Automated Capability Evaluation of Foundation Models

Current evaluation frameworks for foundation models rely heavily on static, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful frontier models to decompose a domain into semantically meaningful capabilities and generates diverse evaluation tasks, significantly reducing human effort. In Mathematics, ACE generated 433 capabilities and 11,800 tasks, covering 94% of Wikipedia-defined skills in the domain while introducing novel, coherent ones. To maximize efficiency, ACE fits a capability model in latent semantic space, allowing reliable approximation of a subject model's performance by evaluating only a subset of capabilities via active learning. It reaches within 0.01 RMSE of exhaustive evaluation by evaluating less than half of capabilities. Compared to static datasets, ACE provides more balanced coverage and uncovers fine-grained differences that aggregate metrics fail to capture. Our results demonstrate that ACE provides a more complete and informative picture of model capabilities, which is essential for safe and well-informed deployment of foundation models.

Updated: 2025-10-09 18:50:55

Fields: cs.LG

Download: http://arxiv.org/abs/2505.17228v2

Game of Trust: How Trustworthy Does Your Blockchain Think You Are?

We investigate how a blockchain can distill the collective belief of its nodes regarding the trustworthiness of a (sub)set of nodes into a reputation system that reflects the probability of correctly performing a task. To address this question, we introduce a framework that breaks it down into two sub-problems: 1. (Information Extraction): How can the system distill trust information from a function of the nodes' true beliefs? 2. (Incentive Design): How can we incentivize nodes to truthfully report such information? To tackle the first sub-problem, we adapt, in a non-trivial manner, the well-known PageRank algorithm to our problem. For the second, we define a new class of games, called Trustworthy Reputation games (TRep games), which aim to extract the collective beliefs on trust from the actions of rational participants. We then propose a concrete TRep game whose utility function leverages Personalized PageRank and can be instantiated through a straightforward blockchain rewards mechanism. Building on this, we show how the TRep game enables the design of a reputation system. Such systems can enhance the robustness, scalability, and efficiency of blockchain and DeFi solutions. For instance, we demonstrate how such a system can be used within a Proof-of-Reputation blockchain.
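
Since the proposed utility function builds on Personalized PageRank, a minimal power-iteration sketch may help fix ideas; the trust graph, damping factor, and uniform restart distribution below are illustrative, not taken from the paper.

```python
import numpy as np

def personalized_pagerank(A, restart, damping=0.85, tol=1e-10):
    """Personalized PageRank by power iteration. A[i, j] = 1 when node i
    expresses trust in node j; `restart` is the personalization distribution
    the random walk teleports back to with probability 1 - damping."""
    A = np.asarray(A, dtype=float)
    outdeg = A.sum(axis=1, keepdims=True)
    P = np.divide(A, outdeg, out=np.zeros_like(A), where=outdeg > 0)
    P[outdeg.ravel() == 0] = restart           # dangling nodes teleport
    r = np.asarray(restart, dtype=float)
    rank = r.copy()
    while True:
        new = damping * rank @ P + (1 - damping) * r
        if np.abs(new - rank).sum() < tol:
            return new
        rank = new

# Four nodes: everyone trusts node 0, and node 0 trusts node 1.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
scores = personalized_pagerank(A, restart=np.full(4, 0.25))
print(int(np.argmax(scores)))  # node 0 accumulates the most trust
```

In a reputation setting, the resulting stationary scores can be read as the walk-weighted trust each node accumulates from the reported trust graph.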

Updated: 2025-10-09 18:50:19

Fields: cs.GT,cs.AI,cs.CR

Download: http://arxiv.org/abs/2505.14551v2

Faithful and Interpretable Explanations for Complex Ensemble Time Series Forecasts using Surrogate Models and Forecastability Analysis

Modern time series forecasting increasingly relies on complex ensemble models generated by AutoML systems like AutoGluon, delivering superior accuracy but with significant costs to transparency and interpretability. This paper introduces a comprehensive, dual-approach framework that addresses both the explainability and forecastability challenges in complex time series ensembles. First, we develop a surrogate-based explanation methodology that bridges the accuracy-interpretability gap by training a LightGBM model to faithfully mimic AutoGluon's time series forecasts, enabling stable SHAP-based feature attributions. We rigorously validated this approach through feature injection experiments, demonstrating remarkably high faithfulness between extracted SHAP values and known ground truth effects. Second, we integrated spectral predictability analysis to quantify each series' inherent forecastability. By comparing each time series' spectral predictability to its pure noise benchmarks, we established an objective mechanism to gauge confidence in forecasts and their explanations. Our empirical evaluation on the M5 dataset found that higher spectral predictability strongly correlates not only with improved forecast accuracy but also with higher fidelity between the surrogate and the original forecasting model. These forecastability metrics serve as effective filtering mechanisms and confidence scores, enabling users to calibrate their trust in both the forecasts and their explanations. We further demonstrated that per-item normalization is essential for generating meaningful SHAP explanations across heterogeneous time series with varying scales. The resulting framework delivers interpretable, instance-level explanations for state-of-the-art ensemble forecasts, while equipping users with forecastability metrics that serve as reliability indicators for both predictions and their explanations.
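
The surrogate step can be sketched as below, with scikit-learn's gradient boosting standing in for LightGBM and the SHAP attribution step omitted; the "black box" is a toy stand-in for an AutoGluon ensemble, and all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical "black box" standing in for an AutoGluon ensemble's forecasts.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
black_box = lambda X: np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2
y_bb = black_box(X)

# Surrogate step: fit an interpretable boosted-tree model to the black box's
# *predictions* (not the ground truth), mirroring the paper's LightGBM surrogate.
surrogate = GradientBoostingRegressor(random_state=0).fit(X, y_bb)
y_sur = surrogate.predict(X)

# Fidelity: R^2 of the surrogate against the black box's own outputs; only a
# high-fidelity surrogate yields trustworthy feature attributions downstream.
fidelity = 1 - np.sum((y_bb - y_sur) ** 2) / np.sum((y_bb - y_bb.mean()) ** 2)
print(round(float(fidelity), 3))
```

SHAP values would then be computed on the surrogate rather than the ensemble, with the fidelity score acting as the reliability check described above.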

Updated: 2025-10-09 18:49:45

Fields: cs.LG

Download: http://arxiv.org/abs/2510.08739v1

SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot

In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex input-output relationships. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, we associate a SHAP value that quantifies the contribution of that feature to the prediction of that sample. Clustering these SHAP values can provide insight into the data by grouping samples that not only received the same prediction, but received the same prediction for similar reasons. In doing so, we map the various pathways through which distinct samples arrive at the same prediction. To showcase this methodology, we present a simulated experiment in addition to a case study in Alzheimer's disease using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. We also present a novel generalization of the waterfall plot for multi-classification.
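
A minimal sketch of the clustering step: group samples by their per-feature attribution vectors so that samples predicted alike for the same reasons fall together. The SHAP matrix here is simulated rather than computed from a model (a real analysis would use the shap package).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical SHAP matrix: rows are samples, columns are per-feature
# attributions. Two groups receive the same (positive) prediction for
# different reasons: group A is driven by feature 0, group B by feature 1.
rng = np.random.default_rng(0)
shap_a = np.column_stack([rng.normal(2.0, 0.1, 30), rng.normal(0.0, 0.1, 30)])
shap_b = np.column_stack([rng.normal(0.0, 0.1, 30), rng.normal(2.0, 0.1, 30)])
shap_values = np.vstack([shap_a, shap_b])

# Supervised clustering step: group samples by their attribution profiles,
# recovering the two distinct "pathways" to the same prediction.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(shap_values)
print(len(set(labels[:30])), len(set(labels[30:])))
```

Each recovered cluster corresponds to one pathway through which samples arrive at the prediction, which is what the generalized waterfall plot then visualizes.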

Updated: 2025-10-09 18:45:43

Fields: cs.LG,stat.ME,stat.ML

Download: http://arxiv.org/abs/2510.08737v1

Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research

The rapid advancement of perovskite solar cells (PSCs) has led to an exponential growth in research publications, creating an urgent need for efficient knowledge management and reasoning systems in this domain. We present a comprehensive knowledge-enhanced system for PSCs that integrates three key components. First, we develop Perovskite-KG, a domain-specific knowledge graph constructed from 1,517 research papers, containing 23,789 entities and 22,272 relationships. Second, we create two complementary datasets: Perovskite-Chat, comprising 55,101 high-quality question-answer pairs generated through a novel multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully curated materials science problems. Third, we introduce two specialized large language models: Perovskite-Chat-LLM for domain-specific knowledge assistance and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental results demonstrate that our system significantly outperforms existing models in both domain-specific knowledge retrieval and scientific reasoning tasks, providing researchers with effective tools for literature review, experimental design, and complex problem-solving in PSC research.

Updated: 2025-10-09 18:40:58

标题: 钙钛矿-LLM:用于钙钛矿太阳能电池研究的知识增强型大型语言模型

摘要: 钙钛矿太阳能电池(PSCs)的快速发展导致研究出版物呈指数增长,这在该领域急需高效的知识管理和推理系统。我们提出了一个综合的PSCs知识增强系统,整合了三个关键组件。首先,我们开发了Perovskite-KG,这是一个领域特定的知识图,由1,517篇研究论文构建而成,包含23,789个实体和22,272个关系。其次,我们创建了两个互补的数据集:Perovskite-Chat,包含55,101个高质量的问题-答案对,通过一种新颖的多智能体框架生成,并且Perovskite-Reasoning,包含2,217个精心策划的材料科学问题。第三,我们引入了两个专门的大型语言模型:Perovskite-Chat-LLM用于领域特定的知识辅助,以及Perovskite-Reasoning-LLM用于科学推理任务。实验结果表明,我们的系统在领域特定知识检索和科学推理任务方面明显优于现有模型,为PSCs研究中的文献回顾、实验设计和复杂问题解决提供了有效工具。

更新时间: 2025-10-09 18:40:58

领域: cs.AI

下载: http://arxiv.org/abs/2502.12669v3

Transmuting prompts into weights

A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.
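The empirical heuristic the paper grounds theoretically, deriving a steering vector from average activations of contrastive prompts, can be sketched as follows. The cached activations, the hidden size, and the planted concept direction are all synthetic stand-ins; this illustrates the vector-addition intervention, not the paper's thought-matrix construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # hidden size of a toy transformer layer

# Hypothetical cached activations at one layer for two contrastive prompt
# sets (e.g. desired vs. undesired behavior); shape: (num_prompts, d).
# We plant the "concept" along basis direction e_0 in the positive set.
acts_pos = rng.normal(0.0, 1.0, (32, d)) + np.eye(d)[0] * 2.0
acts_neg = rng.normal(0.0, 1.0, (32, d))

# Steering vector = difference of mean activations of the contrastive sets.
steer = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)

def steered_forward(h, alpha=1.0):
    """Inference-time intervention: add the steering vector to a
    residual-stream activation h."""
    return h + alpha * steer

h = rng.normal(0.0, 1.0, d)
h_steered = steered_forward(h, alpha=0.5)
# The recovered direction concentrates on the planted concept axis e_0.
```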

Updated: 2025-10-09 18:40:39

标题: 将提示转化为权重

摘要: 一系列研究已经证明,可以通过直接修改大型语言模型的内部状态,在推理时有效地控制其行为,无论是通过向其激活添加向量还是通过更新其权重矩阵。这些技术虽然强大,但通常是根据经验启发式指导的,例如从对比提示的平均激活中派生出转向向量。这项工作为这些干预提供了理论基础,解释了它们是如何从变压器架构的基本计算中出现的。基于最近的研究发现,提示的影响可以被数学地映射到隐式权重更新(Dherin等人,2025年),我们将这一理论推广到深层、多块变压器。我们展示了如何通过权重向量和权重矩阵在内部表示和组成用户提示的任何块中包含的信息。然后,我们推导出了一种原则性方法,将这些信息压缩成与令牌无关的思维向量和思维矩阵。这些构造为现有的基于向量和矩阵的模型编辑技术提供了理论解释,并提供了一种直接的、基于计算的方法,将文本输入转化为可重复使用的权重更新。

更新时间: 2025-10-09 18:40:39

领域: cs.LG

下载: http://arxiv.org/abs/2510.08734v1

When to Reason: Semantic Router for vLLM

Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems.
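The routing decision itself is simple once a classifier exists. The sketch below substitutes a keyword heuristic for the paper's learned semantic classifier; the cue list, the request-parameter names, and the token budgets are illustrative assumptions, not the actual vLLM integration.

```python
# Minimal semantic-router sketch: decide per prompt whether to enable a
# reasoning mode, and only pay the latency/token cost when it helps.
REASONING_CUES = ("prove", "step by step", "how many", "derive", "why does")

def needs_reasoning(prompt: str) -> bool:
    """Stand-in for a learned classifier over reasoning requirements."""
    p = prompt.lower()
    return any(cue in p for cue in REASONING_CUES)

def route(prompt: str) -> dict:
    """Return (hypothetical) request parameters for the serving engine."""
    if needs_reasoning(prompt):
        return {"prompt": prompt, "reasoning": True, "max_tokens": 2048}
    return {"prompt": prompt, "reasoning": False, "max_tokens": 256}

assert route("What is the capital of France?")["reasoning"] is False
assert route("Prove that sqrt(2) is irrational, step by step.")["reasoning"] is True
```

Simple prompts thus skip the expensive reasoning path entirely, which is where the latency and token savings come from.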

Updated: 2025-10-09 18:38:00

标题: 何时进行推理:用于vLLM的语义路由器

摘要: 大型语言模型(LLMs)在增加推理模式,如思维链和推理时间缩放时,表现出明显的准确性增益。然而,推理也会导致推理延迟和令牌使用的显着成本,对环境和财务产生影响,这对许多简单提示来说是不必要的。我们提出了一个基于推理需求对查询进行分类的语义路由器,并仅在有益时选择性地应用推理。与直接推理相比,我们的方法在MMLU-Pro基准测试中准确率提高了10.2个百分点,响应延迟减少了47.1%,令牌消耗减少了48.5%。这些结果表明,语义路由提供了在开源LLM服务系统中在准确性和效率之间取得平衡的有效机制。

更新时间: 2025-10-09 18:38:00

领域: cs.ET,cs.AI,cs.CL,cs.SY,eess.SY

下载: http://arxiv.org/abs/2510.08731v1

How Reliable is Language Model Micro-Benchmarking?

Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
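The core phenomenon, that ranking reliability depends on both subset size and the models' true performance gap, is easy to reproduce with a simulation. The sketch below models item-level correctness as independent Bernoulli draws, which ignores item difficulty correlations and is only a rough stand-in for the paper's meta-evaluation measure.

```python
import random

random.seed(0)

def rank_agreement(acc_a, acc_b, n_micro, trials=500):
    """Estimate how often a random micro-benchmark of size n_micro ranks
    two models (true accuracies acc_a > acc_b) in the correct order.
    Ties count as failures to rank correctly."""
    wins = 0
    for _ in range(trials):
        a = sum(random.random() < acc_a for _ in range(n_micro))
        b = sum(random.random() < acc_b for _ in range(n_micro))
        wins += a > b
    return wins / trials

# Models 3.5 accuracy points apart: 25 examples rank them little better
# than a coin flip, while 250 examples are far more reliable.
small = rank_agreement(0.700, 0.665, n_micro=25)
large = rank_agreement(0.700, 0.665, n_micro=250)
```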

Updated: 2025-10-09 18:37:03

标题: 语言模型微基准测试的可靠性如何?

摘要: 微基准测试为语言模型开发中常见的耗时和成本问题提供了解决方案:在现有基准测试的一个非常小的子集上进行评估。然而,这些微基准测试能够像它们替代的完整基准测试一样一致地对模型进行排名吗?它们能够比选择随机数据点的子集更一致地对模型进行排名吗?在许多情况下,我们发现答案是否定的。我们引入了一种用于微基准测试的元评估度量,研究了一个微基准测试如何能够根据模型在完整基准测试中的性能差异来对两个模型进行排名。这种方法可以确定哪些模型对可以被一个微基准测试正确地排名,从而允许更细致地分析微基准测试规模和可靠性之间的权衡。先前的工作建议选择至少10个示例;我们发现没有一个微基准测试方法能够一致地对MMLU-Pro上相差3.5个准确度点或BIG-bench Hard上相差4个准确度点的模型对进行排名。为了一致地对相对性能相似的模型对进行排名,我们发现通常需要选择多达250个示例,此时随机抽样与现有的微基准测试方法竞争。当仅在MMLU-Pro微基准测试上使用25个示例比较8B指令调整模型时,我们发现超过一半的成对比较可能无法保留。我们的工作为微基准测试用户和开发者提供了可行的指导,帮助他们在评估效率和可靠性之间取得平衡。

更新时间: 2025-10-09 18:37:03

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2510.08730v1

Structured Output Regularization: a framework for few-shot transfer learning

Traditional transfer learning typically reuses large pre-trained networks by freezing some of their weights and adding task-specific layers. While this approach is computationally efficient, it limits the model's ability to adapt to domain-specific features and can still lead to overfitting with very limited data. To address these limitations, we propose Structured Output Regularization (SOR), a simple yet effective framework that freezes the internal network structures (e.g., convolutional filters) while applying a combination of group lasso and $L_1$ penalties. This framework tailors the model to specific data with minimal additional parameters and is easily applicable to various network components, such as convolutional filters or entire blocks in neural networks, enabling broad applicability for transfer learning tasks. We evaluate SOR on three few-shot medical imaging classification tasks and achieve competitive results with DenseNet121 and EfficientNetB4 backbones compared to established benchmarks.
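The abstract does not spell out the objective, so the following is one plausible reading of the penalty term: a group lasso over whole structures (here, conv filters as groups) combined with an elementwise L1 penalty. The filter shapes and coefficients are illustrative.

```python
import numpy as np

def sor_penalty(filters, lam_group=0.1, lam_l1=0.01):
    """Sketch of an SOR-style penalty: group lasso over whole filters
    plus elementwise L1. `filters` has shape (n_filters, k, k); each
    filter is treated as one group, so the group term can zero out
    entire filters while L1 sparsifies the survivors."""
    group_norms = np.sqrt((filters ** 2).sum(axis=(1, 2)))
    return lam_group * group_norms.sum() + lam_l1 * np.abs(filters).sum()

# Toy check on two 3x3 filters: one all-ones, one all-zeros.
filters = np.stack([np.ones((3, 3)), np.zeros((3, 3))])
pen = sor_penalty(filters)
# group norms: [3, 0]; L1 sum: 9  ->  0.1 * 3 + 0.01 * 9 = 0.39
assert abs(pen - 0.39) < 1e-9
```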

Updated: 2025-10-09 18:34:22

标题: 结构化输出正则化:一种少样本迁移学习框架

摘要: 传统的迁移学习通常通过冻结部分权重并添加特定任务的层来重新利用大型预训练网络。虽然这种方法在计算上效率高,但限制了模型适应特定领域特征的能力,并且仍然可能导致在非常有限的数据情况下过拟合。为了解决这些局限性,我们提出了结构化输出正则化(SOR),这是一个简单而有效的框架,它在冻结内部网络结构(例如,卷积滤波器)的同时使用组套索和L1惩罚的组合。这个框架可以根据特定数据量身定制模型,具有最少的额外参数,并且易于应用于各种网络组件,如卷积滤波器或神经网络中的各种块,从而实现广泛的迁移学习任务应用。我们在三个少样本医学影像分类任务上评估了SOR,并使用DenseNet121和EfficientNetB4基准与已建立的基准相比取得了竞争性的结果。

更新时间: 2025-10-09 18:34:22

领域: cs.CV,cs.LG,stat.ML

下载: http://arxiv.org/abs/2510.08728v1

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Operator fusion has become a key optimization for deep learning, which combines multiple deep learning operators to improve data reuse and reduce global memory transfers. However, existing tensor compilers struggle to fuse complex reduction computations involving loop-carried dependencies, such as attention mechanisms. The paper introduces Neptune, a tensor compiler for advanced operator fusion for sequences of reduction operators. Neptune presents a new approach for advanced operator fusion, which intentionally breaks some existing dependencies and compensates by constructing algebraic correction expressions that allow the kernel to produce the correct result. On ten attention-based benchmarks, Neptune, starting from simple attention code and a high-level scheduling template, outperforms existing compilers like Triton, TVM, and FlexAttention, including Triton-based implementations of FlashAttention. Across four different GPU architectures from NVIDIA and AMD, Neptune-generated kernels have average speedup of $1.35\times$ over the next best alternative, demonstrating its effectiveness for deep learning workloads.
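Neptune's idea of intentionally breaking a loop-carried dependency and repairing it with an algebraic correction expression generalizes a trick familiar from FlashAttention-style online softmax. The sketch below illustrates that specific instance on a plain softmax-weighted sum; it is not Neptune's compiler output, just the correction-factor pattern.

```python
import numpy as np

def streaming_softmax_sum(scores, values, block=4):
    """Softmax-weighted sum computed block by block. The softmax
    normalizer depends on the global max (a loop-carried dependency);
    processing blocks with a stale max and rescaling the running state
    by exp(m_old - m_new) restores the exact result."""
    m = -np.inf                       # running max
    l = 0.0                           # running normalizer
    acc = np.zeros(values.shape[1])   # running weighted sum
    for i in range(0, len(scores), block):
        s, v = scores[i:i + block], values[i:i + block]
        m_new = max(m, s.max())
        corr = np.exp(m - m_new)      # algebraic correction for stale max
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        acc = acc * corr + p @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(2)
scores = rng.normal(size=10)
values = rng.normal(size=(10, 3))
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ values          # monolithic reference computation
assert np.allclose(streaming_softmax_sum(scores, values), ref)
```

Because each block only touches local state, the fused kernel avoids materializing the full score vector, which is the locality win the paper targets.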

Updated: 2025-10-09 18:33:52

标题: 海王星:基于GPU的高级机器学习操作符融合技术,用于局部性和并行性

摘要: 操作符融合已经成为深度学习的关键优化技术,它将多个深度学习操作符结合起来,以提高数据重用性并减少全局内存传输。然而,现有的张量编译器在融合涉及循环传递依赖的复杂减少计算,例如注意力机制,方面存在困难。 本文介绍了Neptune,一个用于序列减少操作符的高级操作符融合的张量编译器。Neptune提出了一种新的高级操作符融合方法,通过有意地打破一些现有依赖关系,并通过构建代数修正表达式来补偿,使得内核能够产生正确的结果。 在十个基于注意力的基准测试中,Neptune从简单的注意力代码和高级调度模板开始,优于现有的编译器如Triton、TVM和FlexAttention,包括基于Triton的FlashAttention实现。在来自NVIDIA和AMD的四种不同GPU架构上,Neptune生成的内核平均加速比比下一个最佳替代方案高1.35倍,表明其在深度学习工作负载中的有效性。

更新时间: 2025-10-09 18:33:52

领域: cs.PL,cs.LG

下载: http://arxiv.org/abs/2510.08726v1

Post-Quantum Security of Block Cipher Constructions

Block ciphers are versatile cryptographic ingredients that are used in a wide range of applications ranging from secure Internet communications to disk encryption. While post-quantum security of public-key cryptography has received significant attention, the case of symmetric-key cryptography (and block ciphers in particular) remains a largely unexplored topic. In this work, we set the foundations for a theory of post-quantum security for block ciphers and associated constructions. Leveraging our new techniques, we provide the first post-quantum security proofs for the key-length extension scheme FX, the tweakable block ciphers LRW and XEX, and most block cipher encryption and authentication modes. Our techniques can be used for security proofs in both the plain model and the quantum ideal cipher model. Our work takes significant initial steps in establishing a rigorous understanding of the post-quantum security of practical symmetric-key cryptography.
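The FX key-length extension scheme analyzed in the paper has a very small description: whiten the input and output of an underlying cipher with two extra keys, FX_{K,k1,k2}(x) = E_K(x XOR k1) XOR k2. The sketch below instantiates it over a toy 8-bit permutation family; a real deployment wraps a full block cipher (DESX is the classic example), so this is illustrative only.

```python
import random

random.seed(3)

# Toy 8-bit "block cipher": a small keyed family of random permutations.
PERMS = {k: random.sample(range(256), 256) for k in range(4)}
INV = {k: {y: x for x, y in enumerate(p)} for k, p in PERMS.items()}

def fx_encrypt(x, K, k1, k2):
    """FX key-length extension: FX_{K,k1,k2}(x) = E_K(x XOR k1) XOR k2."""
    return PERMS[K][x ^ k1] ^ k2

def fx_decrypt(y, K, k1, k2):
    return INV[K][y ^ k2] ^ k1

K, k1, k2 = 2, 0xA5, 0x3C
for x in range(256):
    assert fx_decrypt(fx_encrypt(x, K, k1, k2), K, k1, k2) == x
```

The whitening keys k1, k2 cost almost nothing to apply, which is why proving how much post-quantum security they actually add is the interesting question.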

Updated: 2025-10-09 18:33:05

标题: 后量子时代的分组密码结构安全性

摘要: 分组密码是多才多艺的密码学要素,在从安全互联网通信到磁盘加密等各种应用中被广泛使用。虽然后量子安全的公钥密码学已经受到了重视,但对对称密钥密码学(尤其是分组密码)的情况仍然是一个较少探讨的话题。在这项工作中,我们为分组密码和相关构造的后量子安全理论奠定了基础。利用我们的新技术,我们为密钥长度扩展方案FX、可调整分组密码LRW和XEX以及大多数分组密码加密和认证模式提供了第一个后量子安全证明。我们的技术可以用于在普通模型和量子理想密码模型中进行安全证明。我们的工作在建立对实际对称密钥密码学后量子安全的严格理解方面迈出了重要的初步步骤。

更新时间: 2025-10-09 18:33:05

领域: cs.CR

下载: http://arxiv.org/abs/2510.08725v1

Counterfactually Fair Conformal Prediction

While counterfactual fairness of point predictors is well studied, its extension to prediction sets, which are central to fair decision-making under uncertainty, remains underexplored. On the other hand, conformal prediction (CP) provides efficient, distribution-free, finite-sample valid prediction sets, yet does not ensure counterfactual fairness. We close this gap by developing Counterfactually Fair Conformal Prediction (CF-CP), which produces counterfactually fair prediction sets. Through symmetrization of conformity scores across protected-attribute interventions, we prove that CF-CP yields counterfactually fair prediction sets while maintaining the marginal coverage property. Furthermore, we empirically demonstrate that on both synthetic and real datasets, across regression and classification tasks, CF-CP achieves the desired counterfactual fairness and meets the target coverage rate with minimal increase in prediction set size. CF-CP offers a simple, training-free route to counterfactually fair uncertainty quantification.
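The symmetrization idea can be sketched with split conformal regression: score each calibration point by its worst residual over the factual input and its protected-attribute intervention, then the resulting prediction set is identical for a sample and its counterfactual by construction. The model, counterfactual map, and data below are illustrative inventions, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(4)

def counterfactual(x):
    """Hypothetical counterfactual map: flip the binary protected
    attribute stored in the last coordinate."""
    x_cf = x.copy()
    x_cf[-1] = 1 - x_cf[-1]
    return x_cf

predict = lambda x: 2.0 * x[0] + 0.5 * x[-1]   # model that (unfairly) uses the attribute

def score(x, y):
    """Symmetrized conformity score: worst-case residual over the
    factual input and its intervention."""
    return max(abs(y - predict(x)), abs(y - predict(counterfactual(x))))

# Split-conformal calibration with symmetrized scores.
X_cal = np.column_stack([rng.normal(size=200), rng.integers(0, 2, 200)]).astype(float)
y_cal = 2.0 * X_cal[:, 0] + rng.normal(0, 0.1, 200)   # attribute is truly irrelevant
q = np.quantile([score(X_cal[i], y_cal[i]) for i in range(200)], 0.9)

def pred_set(x):
    """Level set {y : score(x, y) <= q}, an interval."""
    c1, c2 = predict(x), predict(counterfactual(x))
    return (max(c1, c2) - q, min(c1, c2) + q)

# Counterfactual fairness by construction: identical sets either way.
x = np.array([0.3, 1.0])
assert pred_set(x) == pred_set(counterfactual(x))
```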

Updated: 2025-10-09 18:32:47

标题: 反事实公平的符合性预测

摘要: 虽然点预测器的反事实公平性已经得到深入研究,但其在预测集方面的延伸--对于在不确定性下进行公平决策至关重要--仍未被充分探讨。另一方面,符合性预测(CP)提供了高效、无分布、有限样本有效的预测集,但并不保证反事实公平性。我们通过开发Counterfactually Fair Conformal Prediction(CF-CP),填补了这一空白,该方法能够生成反事实公平的预测集。通过对受保护属性干预的符合性得分进行对称化,我们证明了CF-CP能够产生反事实公平的预测集,同时保持边际覆盖性质。此外,我们在合成和真实数据集上进行了实证研究,跨回归和分类任务,证明了CF-CP实现了所需的反事实公平性,并以最小的预测集增加量达到目标覆盖率。CF-CP为实现反事实公平的不确定性量化提供了简单、无需训练的途径。

更新时间: 2025-10-09 18:32:47

领域: cs.LG

下载: http://arxiv.org/abs/2510.08724v1

Enhancing Self-Supervised Learning with Semantic Pairs A New Dataset and Empirical Study

Instance discrimination is a self-supervised representation learning paradigm wherein individual instances within a dataset are treated as distinct classes. This is typically achieved by generating two disparate views of each instance by applying stochastic transformations, which encourages the model to learn representations that are invariant to the common underlying object across these views.
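The two-view instance-discrimination setup described above is typically trained with an InfoNCE-style objective. The sketch below uses additive noise and random scaling as stand-ins for image augmentations, and an identity "encoder"; it illustrates the loss structure, not this paper's semantic-pair construction.

```python
import numpy as np

rng = np.random.default_rng(5)

def augment(x):
    """Stochastic transformation: random scaling plus small noise
    (stand-ins for crops/color jitter on images)."""
    return x * rng.uniform(0.8, 1.2) + rng.normal(0, 0.05, x.shape)

def info_nce(z1, z2, tau=0.1):
    """InfoNCE treating each instance as its own class: the two views of
    an instance are positives, all other instances are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                   # (n, n) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))        # positives sit on the diagonal

X = rng.normal(size=(8, 16))                   # 8 instances, 16-dim features
v1 = np.stack([augment(x) for x in X])         # view 1
v2 = np.stack([augment(x) for x in X])         # view 2
loss = info_nce(v1, v2)
# Correctly paired views score far better than a shuffled pairing.
assert loss < info_nce(v1, v2[::-1])
```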

Updated: 2025-10-09 18:31:55

标题: 增强自监督学习的语义配对:一个新数据集和实证研究

摘要: 实例判别是一种自监督表示学习范式,其中数据集中的各个实例被视为不同的类。通常通过应用随机变换生成每个实例的两个不同视图来实现这一点,这鼓励模型学习对这些视图中的共同基础对象不变的表示。

更新时间: 2025-10-09 18:31:55

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.08722v1

Online Rubrics Elicitation from Pairwise Comparisons

Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.

Updated: 2025-10-09 18:26:39

标题: 从成对比较中在线生成评分表

摘要: 标准提供了一种灵活的方式来训练LLMs对开放式长篇答案进行训练,其中无法应用可验证奖励,而人类偏好提供了粗略信号。先前的工作表明,基于标准的强化学习会导致LLM后训练中的稳定收益。大多数现有方法依赖于在训练过程中保持静态的标准。然而,这种静态标准容易受到奖励欺骗行为的威胁,并且无法捕捉在训练过程中出现的新的需求。我们引入了在线标准调查(OnlineRubrics),这是一种通过当前和参考策略的响应的配对比较动态策划评估标准的方法。这种在线过程使得在训练过程中连续识别和减轻错误成为可能。根据经验,这种方法相对于仅使用静态标准训练,AlpacaEval、GPQA、ArenaHard以及专家问题和标准的验证集中的改进一致高达8%。我们对引出的标准进行定性分析,并确定了突出的主题,如透明度、实用性、组织和推理。

更新时间: 2025-10-09 18:26:39

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2510.07284v2

IG-MCTS: Human-in-the-Loop Cooperative Navigation under Incomplete Information

Human-robot cooperative navigation is challenging under incomplete information. We introduce CoNav-Maze, a simulated environment where a robot navigates with local perception while a human operator provides guidance based on an inaccurate map. The robot can share its onboard camera views to help the operator refine their understanding of the environment. To enable efficient cooperation, we propose Information Gain Monte Carlo Tree Search (IG-MCTS), an online planning algorithm that jointly optimizes autonomous movement and informative communication. IG-MCTS leverages a learned Neural Human Perception Model (NHPM) -- trained on a crowdsourced mapping dataset -- to predict how the human's internal map evolves as new observations are shared. User studies show that IG-MCTS significantly reduces communication demands and yields eye-tracking metrics indicative of lower cognitive load, while maintaining task performance comparable to teleoperation and instruction-following baselines. Finally, we illustrate generalization beyond discrete mazes through a continuous-space waterway navigation setting, in which NHPM benefits from deeper encoder-decoder architectures and IG-MCTS leverages a dynamically constructed Voronoi-partitioned traversability graph.

Updated: 2025-10-09 18:20:16

标题: IG-MCTS: 人机协作导航在不完整信息下

摘要: 人机协作导航在信息不完整的情况下具有挑战性。我们介绍了CoNav-Maze,这是一个模拟环境,其中机器人在局部感知的情况下导航,而人类操作员根据不准确的地图提供指导。机器人可以分享其机载摄像头视图,以帮助操作员完善对环境的理解。为了实现高效的合作,我们提出了信息增益蒙特卡洛树搜索(IG-MCTS),这是一种在线规划算法,可以同时优化自主移动和信息传递。IG-MCTS利用了一个经过众包映射数据集训练的学习到的神经人类感知模型(NHPM),用于预测人类内部地图在新观察结果共享时的演变。用户研究表明,IG-MCTS显著减少了沟通需求,并产生了低认知负荷的眼动跟踪指标,同时保持了与遥操作和按照指示进行的基准任务性能相当的水平。最后,我们通过一个连续空间水路导航设置展示了超越离散迷宫的泛化能力,其中NHPM受益于更深的编码-解码架构,而IG-MCTS利用动态构建的沃罗诺伊划分的可穿越性图。

更新时间: 2025-10-09 18:20:16

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2502.01857v2

Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments without multi-host setups and encrypted user credentials; and unreliable judgments dependent on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. It comprises 140 tasks, including 40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains, and systematically evaluates CUAs under a realistic enterprise OS security threat in a multi-host environment sandbox via hard-coded evaluation. We evaluate five mainstream CUA frameworks, including ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE, based on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities of CUAs reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises social concern about the responsibility and security of CUAs.

Updated: 2025-10-09 18:18:19

标题: 代码代理可以成为端到端系统黑客:基准测试计算机使用代理的真实世界威胁

摘要: 计算机使用代理(CUA)框架,由大型语言模型(LLMs)或多模式LLMs(MLLMs)驱动,正迅速成熟为可以在软件环境中直接感知情境、推理和行动的助手。它们最关键的应用之一是操作系统(OS)控制。随着OS领域中的CUAs越来越多地嵌入日常运营中,有必要考虑它们在现实世界中的安全影响,特别是CUAs是否可以被滥用来执行现实的、与安全相关的攻击。现有研究表明存在四个主要限制:缺乏关于攻击者战术、技术和程序(TTP)的知识模型,端到端杀伤链的覆盖不完整,在没有多主机和加密用户凭证的不真实环境下,以及依赖LLM作为评判的不可靠判断。为了解决这些差距,我们提出了AdvCUA,这是与MITRE ATT&CK企业矩阵中的真实世界TTPs对齐的第一个基准,其中包括140个任务,包括40个直接恶意任务、74个基于TTP的恶意任务和26个端到端杀伤链,通过硬编码评估在多主机环境沙盒中系统地评估CUAs在现实企业OS安全威胁下的表现。我们评估了现有的五个主流CUAs,包括ReAct、AutoGPT、Gemini CLI、Cursor CLI和Cursor IDE,基于8个基础LLMs。结果表明,当前前沿的CUAs并不充分涵盖OS安全相关的威胁。这些CUAs的能力减少了对自定义恶意软件和深度领域专业知识的依赖,使即使是经验不足的攻击者也能发动复杂的企业入侵,这引发了社会对CUAs的责任和安全性的关注。

更新时间: 2025-10-09 18:18:19

领域: cs.CR

下载: http://arxiv.org/abs/2510.06607v2

Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation

Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.

Updated: 2025-10-09 18:18:11

标题: 统一世界模型:记忆增强规划和前瞻性视觉导航

摘要: 使具有实体的代理能够有效地想象未来状态对于稳健且可泛化的视觉导航至关重要。然而,目前最先进的方法采用模块化架构,将导航规划与视觉世界建模分开,导致状态-动作不对齐并在新颖或动态情景中适应性有限。为了克服这一根本限制,我们提出UniWM,这是一个统一的、记忆增强的世界模型,将自我中心的视觉远见和规划融合在一个多模态自回归骨干中。与模块化框架不同,UniWM明确将行动决策基于视觉想象的结果,确保预测与控制之间的紧密对齐。一个分层记忆机制进一步将详细的短期感知线索与长期轨迹背景整合在一起,使得对延长视野进行稳定、连贯的推理成为可能。在四个具有挑战性的基准测试中进行了广泛实验(Go Stanford、ReCon、SCAND、HuRoN),结果表明UniWM将导航成功率显著提高了30%,与强基线相比显著减少了轨迹错误,并在未见过的TartanDrive数据集上展现了令人印象深刻的零-shot泛化能力。这些结果凸显了UniWM作为通向统一、以想象驱动的实体导航的一个有原则的步骤。

更新时间: 2025-10-09 18:18:11

领域: cs.AI,cs.CV,cs.RO

下载: http://arxiv.org/abs/2510.08713v1

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous settings, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, socially responsible navigation research.

Updated: 2025-10-09 18:17:24

标题: HA-VLN 2.0:一个开放的基准和排行榜,用于离散和连续环境中具有动态多人互动的人类感知导航

摘要: 视觉与语言导航(VLN)主要在离散或连续设置中进行研究,对动态、拥挤的环境关注不足。我们提出了HA-VLN 2.0,这是一个引入显式社会意识约束的统一基准。我们的贡献包括:(i)标准化任务和度量标准,捕捉目标准确性和个人空间遵从性;(ii)HAPS 2.0数据集和模拟器,模拟多人交互、户外环境和更精细的语言-动作对齐;(iii)对16,844个具有社会基础的指令进行基准测试,揭示了领先代理商在人类动态和部分可观测性下的性能急剧下降;(iv)真实世界的机器人实验验证了从模拟到真实的转移,开放式排行榜使透明的比较成为可能。结果显示,显式社会建模提高了导航的稳健性并减少了碰撞,强调了人类中心方法的必要性。通过发布数据集、模拟器、基线和协议,HA-VLN 2.0为安全、社会负责的导航研究提供了坚实的基础。

更新时间: 2025-10-09 18:17:24

领域: cs.AI,cs.CV,cs.RO

下载: http://arxiv.org/abs/2503.14229v3

In-Context Learning for Non-Stationary MIMO Equalization

Channel equalization is fundamental for mitigating distortions such as frequency-selective fading and inter-symbol interference. Unlike standard supervised learning approaches that require costly retraining or fine-tuning for each new task, in-context learning (ICL) adapts to new channels at inference time with only a few examples. However, existing ICL-based equalizers are primarily developed for and evaluated on static channels within the context window. Indeed, to our knowledge, prior principled analyses and theoretical studies of ICL focus exclusively on the stationary setting, where the function remains fixed within the context. In this paper, we investigate the ability of ICL to address non-stationary problems through the lens of time-varying channel equalization. We employ a principled framework for designing efficient attention mechanisms with improved adaptivity in non-stationary tasks, leveraging algorithms from adaptive signal processing to guide better designs. For example, new attention variants can be derived from the Least Mean Square (LMS) adaptive algorithm, a Least Root Mean Square (LRMS) formulation for enhanced robustness, or multi-step gradient updates for improved long-term tracking. Experimental results demonstrate that ICL holds strong promise for non-stationary MIMO equalization, and that attention mechanisms inspired by classical adaptive algorithms can substantially enhance adaptability and performance in dynamic environments. Our findings may provide critical insights for developing next-generation wireless foundation models with stronger adaptability and robustness.
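The LMS update the paper draws on for new attention variants is w <- w + mu * e * x, with e the error against a known training symbol. The sketch below runs classical LMS equalization on a toy BPSK stream through a 2-tap channel; the channel, step size, and tap count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# BPSK symbols through a toy 2-tap channel with additive noise.
symbols = rng.choice([-1.0, 1.0], size=2000)
h = np.array([1.0, 0.4])                          # channel impulse response
received = np.convolve(symbols, h)[:2000] + rng.normal(0, 0.05, 2000)

# LMS adaptive equalizer: the update rule whose structure the paper
# maps onto attention mechanisms for non-stationary tasks.
taps, mu = 4, 0.05
w = np.zeros(taps)
errs = []
for n in range(taps, 2000):
    x = received[n - taps + 1:n + 1][::-1]        # tap-delay line
    y = w @ x                                     # equalizer output
    e = symbols[n] - y                            # error vs. training symbol
    w += mu * e * x                               # LMS update
    errs.append(e ** 2)

early = np.mean(errs[:100])
late = np.mean(errs[-100:])
assert late < early                               # the equalizer converges
```

In a time-varying channel, h would drift during the stream, and the same update keeps tracking it; that tracking behavior is what the derived attention variants aim to inherit.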

Updated: 2025-10-09 18:16:41

标题: 上下文学习用于非静止MIMO均衡

摘要: 通道均衡对于减轻频率选择性衰落和符号间干扰等失真至关重要。不同于标准的监督式学习方法,后者需要为每个新任务进行昂贵的重新训练或微调,在上下文学习(ICL)中,只需少量示例就可以在推断时适应新的通道。然而,现有基于ICL的均衡器主要是针对并在上下文窗口内的静态通道开发和评估的。事实上,就我们所知,先前对ICL的原则性分析和理论研究专注于静态设置,其中函数在上下文中保持不变。在本文中,我们通过时变通道均衡的视角研究ICL解决非静态问题的能力。我们采用一个原则性框架设计高效的注意机制,改进非静态任务中的适应性,利用自适应信号处理算法指导更好的设计。例如,可以从最小均方(LMS)自适应算法导出新的注意变体,用于增强鲁棒性的最小均方根(LRMS)公式,或用于改进长期跟踪的多步梯度更新。实验结果表明,ICL对于非静态MIMO均衡具有很强的潜力,并且受经典自适应算法启发的注意机制可以显着增强动态环境中的适应性和性能。我们的发现可能为开发具有更强适应性和鲁棒性的下一代无线基础模型提供关键见解。

更新时间: 2025-10-09 18:16:41

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.08711v1

Security of Key-Alternating Ciphers: Quantum Lower Bounds and Quantum Walk Attacks

We study the quantum security of key-alternating ciphers (KAC), a natural multi-round generalization of the Even--Mansour construction. KAC abstracts the round structure of practical block ciphers as public permutations interleaved with key XORs. The $1$-round KAC or EM setting already highlights the power of quantum superposition access: EM is secure against classical and Q1 adversaries (quantum access to the public permutation), but insecure in the Q2 model. The security of multi-round KACs remains largely unexplored; in particular, whether the quantum-classical separation extends beyond a single round had remained open. 1) Quantum Lower Bounds. We prove security of the $t$-round KAC against a non-adaptive adversary in both the Q1 and Q2 models. In the Q1 model, any distinguisher requires $\Omega(2^{\frac{tn}{2t+1}})$ oracle queries to distinguish the cipher from a random permutation, whereas classically any distinguisher needs $\Omega(2^{\frac{tn}{t+1}})$ queries. As a corollary, we obtain a Q2 lower bound of $\Omega (2^{\frac{(t-1)n}{2t}})$ quantum queries. Thus, for $t \geq 2$, the exponential Q1-Q2 gap collapses in the non-adaptive setting, partially resolving an open problem posed by Kuwakado and Morii (2012). Our proofs develop a controlled-reprogramming framework within a quantum hybrid argument, sidestepping the lack of quantum recording techniques for permutation-based ciphers; we expect this framework to be useful for analyzing other post-quantum symmetric primitives. 2) Quantum Key-Recovery Attack. We give the first non-trivial quantum key-recovery algorithm for $t$-round KAC in the Q1 model. It makes $O(2^{\alpha n})$ queries with $\alpha = \frac{t(t+1)}{(t+1)^2 + 1}$, improving on the best known classical bound of $O(2^{\alpha' n})$ with $\alpha' = \frac{t}{t+1}$. The algorithm adapts quantum walk techniques to the KAC structure.
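The KAC construction itself is compact: alternate t public permutations with t+1 key XORs, so KAC(x) = k_t XOR P_t(... P_1(x XOR k_0) ...). The sketch below builds a toy 8-bit, 3-round instance with random permutations standing in for the public permutations; it illustrates the round structure only, not any security claim.

```python
import random

random.seed(7)

# Toy 8-bit t-round key-alternating cipher (n = 8 bits, t = 3).
t = 3
perms = [random.sample(range(256), 256) for _ in range(t)]   # public P_i
invs = [{y: x for x, y in enumerate(p)} for p in perms]
keys = [random.randrange(256) for _ in range(t + 1)]         # secret k_0..k_t

def kac_encrypt(x):
    for i in range(t):
        x = perms[i][x ^ keys[i]]
    return x ^ keys[t]

def kac_decrypt(y):
    y ^= keys[t]
    for i in reversed(range(t)):
        y = invs[i][y] ^ keys[i]
    return y

for x in range(256):
    assert kac_decrypt(kac_encrypt(x)) == x
```

Setting t = 1 recovers Even--Mansour; the paper's bounds concern how the adversary's query complexity grows with t.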

Updated: 2025-10-09 18:14:51

标题: 密钥交替密码的安全性:量子下界和量子行走攻击

摘要: 我们研究了密钥交替密码(KAC)的量子安全性,这是Even-Mansour结构的一个自然多轮推广。KAC将实际分组密码的轮结构抽象为与密钥XOR交错的公共置换。$1$轮KAC或EM设置已经突显了量子叠加访问的强大力量:EM对经典和Q1对手(对公共置换的量子访问)是安全的,但在Q2模型中是不安全的。多轮KAC的安全性仍然大部分未被探索;特别是,量子-经典分离是否延伸到超过单轮一直是未解决的问题。 1)量子下界。我们证明了$t$轮KAC对非自适应对手在Q1和Q2模型下的安全性。在Q1模型中,任何区分器需要$\Omega(2^{\frac{tn}{2t+1}})$个预言查询来区分密码与随机置换,而在经典情况下,任何区分器需要$\Omega(2^{\frac{tn}{t+1}})$个查询。作为推论,我们得到了$\Omega (2^{\frac{(t-1)n}{2t}})$个量子查询的Q2下界。因此,对于$t \geq 2$,在非自适应设置中,指数级的Q1-Q2差距缩小了,部分解决了Kuwakado和Morii(2012)提出的一个未解决问题。我们的证明在量子混合论证中开发了一个受控重编程框架,避开了基于置换的密码缺乏量子记录技术的问题;我们希望这一框架对分析其他后量子对称原语有用。 2)量子密钥恢复攻击。我们在Q1模型中为$t$轮KAC提供了第一个非平凡的量子密钥恢复算法。它使用$O(2^{\alpha n})$个查询,其中$\alpha = \frac{t(t+1)}{(t+1)^2 + 1}$,改进了已知的最佳经典界限$O(2^{\alpha' n})$,其中$\alpha' = \frac{t}{t+1}$。该算法将量子行走技术调整为KAC结构。

更新时间: 2025-10-09 18:14:51

领域: quant-ph,cs.CR

下载: http://arxiv.org/abs/2412.05026v3

A Comprehensive Survey on Smart Home IoT Fingerprinting: From Detection to Prevention and Practical Deployment

Smart homes are increasingly populated with heterogeneous Internet of Things (IoT) devices that interact continuously with users and the environment. This diversity introduces critical challenges in device identification, authentication, and security, where fingerprinting techniques have emerged as a key approach. In this survey, we provide a comprehensive analysis of IoT fingerprinting specifically in the context of smart homes, examining methods for device and event detection, classification, and intrusion prevention. We review existing techniques, e.g., network traffic analysis or machine learning-based schemes, highlighting their applicability and limitations in home environments characterized by resource-constrained devices, dynamic usage patterns, and privacy requirements. Furthermore, we discuss fingerprinting system deployment challenges like scalability, interoperability, and energy efficiency, as well as emerging opportunities enabled by generative AI and federated learning. Finally, we outline open research directions that can advance reliable and privacy-preserving fingerprinting for next-generation smart home ecosystems.

Updated: 2025-10-09 18:12:40

标题: 一个关于智能家居物联网指纹识别的综合调查:从检测到预防和实际部署

摘要: 智能家居越来越多地使用各种异构物联网(IoT)设备,这些设备与用户和环境不断交互。这种多样性引入了设备识别、认证和安全方面的关键挑战,指纹识别技术已经成为一种关键方法。在这项调查中,我们对IoT指纹识别在智能家居环境中的具体情况进行了全面分析,考察了设备及其事件检测、分类和入侵防范的方法。我们回顾了现有技术,如网络流量分析或基于机器学习的方案,突出了它们在资源受限设备、动态使用模式和隐私要求的家庭环境中的适用性和局限性。此外,我们讨论了指纹识别系统部署挑战,如可扩展性、互操作性和能源效率,以及由生成式人工智能和联邦学习带来的新机遇。最后,我们概述了可以推进下一代智能家居生态系统可靠且保护隐私的指纹识别的开放研究方向。

更新时间: 2025-10-09 18:12:40

领域: cs.CR

下载: http://arxiv.org/abs/2510.09700v1

ConPoSe: LLM-Guided Contact Point Selection for Scalable Cooperative Object Pushing

Object transportation in cluttered environments is a fundamental task in various domains, including domestic service and warehouse logistics. In cooperative object transport, multiple robots must coordinate to move objects that are too large for a single robot. One transport strategy is pushing, which only requires simple robots. However, careful selection of robot-object contact points is necessary to push the object along a preplanned path. Although this selection can be solved analytically, the solution space grows combinatorially with the number of robots and object size, limiting scalability. Inspired by how humans rely on common-sense reasoning for cooperative transport, we propose combining the reasoning capabilities of Large Language Models with local search to select suitable contact points. Our LLM-guided local search method for contact point selection, ConPoSe, successfully selects contact points for a variety of shapes, including cuboids, cylinders, and T-shapes. We demonstrate that ConPoSe scales better with the number of robots and object size than the analytical approach, and also outperforms pure LLM-based selection.

Updated: 2025-10-09 18:07:39

标题: ConPoSe: LLM引导的可扩展合作对象推动中的接触点选择

摘要: 在拥挤环境中进行物体运输是各个领域的基本任务,包括家庭服务和仓储物流。在合作物体运输中,多个机器人必须协调移动对于单个机器人来说太大的物体。一种运输策略是推动,这只需要简单的机器人。然而,需要仔细选择机器人-物体接触点,以沿着预先规划的路径推动物体。尽管这种选择可以通过分析来解决,但随着机器人数量和物体大小的增加,解决空间呈组合增长,限制了可扩展性。受人类如何依靠常识推理进行合作运输的启发,我们提出将大型语言模型的推理能力与局部搜索相结合,以选择适当的接触点。我们的LLM引导的接触点选择局部搜索方法ConPoSe成功地选择了各种形状的接触点,包括长方体、圆柱体和T形。我们证明ConPoSe随着机器人数量和物体大小的增加具有更好的可扩展性,同时也优于纯LLM基于选择方法。

更新时间: 2025-10-09 18:07:39

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2510.08705v1

Decoding Positive Selection in Mycobacterium tuberculosis with Phylogeny-Guided Graph Attention Models

Positive selection drives the emergence of adaptive mutations in Mycobacterium tuberculosis, shaping drug resistance, transmissibility, and virulence. Phylogenetic trees capture evolutionary relationships among isolates and provide a natural framework for detecting such adaptive signals. We present a phylogeny-guided graph attention network (GAT) approach, introducing a method for converting SNP-annotated phylogenetic trees into graph structures suitable for neural network analysis. Using 500 M. tuberculosis isolates from four major lineages and 249 single-nucleotide variants (84 resistance-associated and 165 neutral) across 61 drug-resistance genes, we constructed graphs where nodes represented isolates and edges reflected phylogenetic distances. Edges between isolates separated by more than seven internal nodes were pruned to emphasise local evolutionary structure. Node features encoded SNP presence or absence, and the GAT architecture included two attention layers, a residual connection, global attention pooling, and a multilayer perceptron classifier. The model achieved an accuracy of 0.88 on a held-out test set and, when applied to 146 WHO-classified "uncertain" variants, identified 41 candidates with convergent emergence across multiple lineages, consistent with adaptive evolution. This work demonstrates the feasibility of transforming phylogenies into GNN-compatible structures and highlights attention-based models as effective tools for detecting positive selection, aiding genomic surveillance and variant prioritisation.
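The tree-to-graph conversion step, keeping an edge between two isolates only if the phylogenetic path between them crosses few internal nodes, can be sketched on a toy tree. The tree, the isolate labels, and the threshold (1 instead of the paper's 7, to suit the tiny example) are illustrative.

```python
from itertools import combinations

# Toy phylogeny as a child -> parent map; leaves are isolates.
tree = {
    "iso1": "n1", "iso2": "n1",
    "iso3": "n2", "iso4": "n2",
    "n1": "root", "n2": "root",
}

def path_to_root(node):
    path = [node]
    while path[-1] in tree:
        path.append(tree[path[-1]])
    return path

def internal_nodes_between(a, b):
    """Count internal nodes on the unique tree path between leaves a, b."""
    pa, pb = path_to_root(a), path_to_root(b)
    shared = set(pa) & set(pb)
    depth_a = next(i for i, n in enumerate(pa) if n in shared)  # steps to LCA
    depth_b = next(i for i, n in enumerate(pb) if n in shared)
    return depth_a + depth_b - 1        # internal nodes, LCA counted once

leaves = ["iso1", "iso2", "iso3", "iso4"]
max_internal = 1                        # paper uses 7; tightened for the toy tree
edges = [(a, b) for a, b in combinations(leaves, 2)
         if internal_nodes_between(a, b) <= max_internal]
# Only within-clade pairs survive the pruning.
assert ("iso1", "iso2") in edges and ("iso1", "iso3") not in edges
```

The resulting node/edge lists, with SNP presence vectors attached as node features, are what the GAT consumes.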

Updated: 2025-10-09 18:06:39

Subjects: q-bio.PE,cs.LG

Download: http://arxiv.org/abs/2510.08703v1

Are Voters Willing to Collectively Secure Elections? Unraveling a Practical Blockchain Voting System

Ensuring ballot secrecy is critical for fair and trustworthy electronic voting systems, yet achieving strong secrecy guarantees in decentralized, large-scale elections remains challenging. This paper proposes the concept of collectively secure voting, in which voters themselves can opt in as secret holders to protect ballot secrecy. A practical blockchain-based collectively secure voting system is designed and implemented. Our design strikes a balance between strong confidentiality guarantees and real-world applicability. The proposed system combines threshold cryptography and smart contracts to ensure ballots remain confidential during voting, while all protocol steps remain transparent and verifiable. Voters can use the system without prior blockchain knowledge through an intuitive user interface that hides underlying complexity. To evaluate this approach, a user study was conducted. Results show a high willingness to act as secret holders, reliable participation in share release, and high confidence in the security of the proposed system. The findings demonstrate that voters can collectively maintain secrecy and that such a practical deployment is feasible.
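The abstract does not name the exact threshold scheme, but k-of-n secret holding of the kind described is classically realized with Shamir secret sharing, sketched below as an illustration (the prime field and parameters are arbitrary choices, not the paper's):

```python
import random

P = 2**61 - 1  # a Mersenne prime; the finite field for share arithmetic (illustrative)

def make_shares(secret, k, n):
    """Split `secret` into n shares; any k of them reconstruct it (Shamir's scheme)."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):       # Horner evaluation of the degree-(k-1) polynomial
            acc = (acc * x + c) % P
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x=0 recovers the secret from any k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if j != i:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P  # modular inverse via Fermat
    return secret
```

In the paper's setting, voters opting in as secret holders would each hold one share of a decryption key, and releasing any k shares after the voting period would unlock the tally.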

Updated: 2025-10-09 18:02:40

Subjects: cs.CR,cs.DC

Download: http://arxiv.org/abs/2510.08700v1

Contrastive Learning Augmented Social Recommendations

Recommender systems are essential for modern content platforms, yet traditional behavior-based models often struggle with cold users who have limited interaction data. Engaging these users is crucial for platform growth. To bridge this gap, we propose leveraging the social-relation graph to enrich interest representations from behavior-based models. However, extracting value from social graphs is challenging due to relation noise and cross-domain inconsistency. To address the noise propagation and obtain accurate social interest, we employ a dual-view denoising strategy, employing low-rank SVD to the user-item interaction matrix for a denoised social graph and contrastive learning to align the original and reconstructed social graphs. Addressing the interest inconsistency between social and behavioral interests, we adopt a "mutual distillation" technique to isolate the original interests into aligned social/behavior interests and social/behavior specific interests, maximizing the utility of both. Experimental results on widely adopted industry datasets verify the method's effectiveness, particularly for cold users, offering a fresh perspective for future research. The implementation can be accessed at https://github.com/WANGLin0126/CLSRec.
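A minimal sketch of the low-rank SVD denoising step, under the assumption that it is a plain truncated-SVD reconstruction of the user-item interaction matrix (the paper's full pipeline also involves contrastive alignment of social graphs, omitted here):

```python
import numpy as np

def denoise_interactions(R, rank):
    """Low-rank SVD reconstruction of a user-item matrix: keep the top singular
    directions and drop the rest, treating the residual as relation noise."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
```

For a noisy observation of a genuinely low-rank interaction matrix, the truncated reconstruction lands closer to the clean matrix than the observation itself, which is the property the denoised social graph relies on.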

Updated: 2025-10-09 18:02:16

Subjects: cs.IR,cs.AI,cs.SI

Download: http://arxiv.org/abs/2502.15695v3

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.
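AutoCodeArena's automatic Elo rating rests on the standard Elo update, which is simple to state (the K-factor and anchoring details used by the benchmark are not specified in the abstract; K=32 below is a conventional placeholder):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update for one pairwise comparison.
    score_a is 1.0 if model A's code wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

Iterating this update over many automatically judged head-to-head code comparisons yields a leaderboard without human involvement; the update is zero-sum, so rating mass only moves between the two models compared.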

Updated: 2025-10-09 18:01:47

Subjects: cs.SE,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08697v1

Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting

Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as \textbf{L}ikelihood \textbf{E}stimation with \textbf{N}egative \textbf{S}amples (\textbf{LENS}). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.
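The key mechanical change is that incorrect responses receive a non-zero, confidence-dependent reward, so an all-wrong group still produces a gradient. A toy stand-in for the reward assignment (the true LENS penalty is derived from the MLE gradient; this linear-in-confidence form is only illustrative):

```python
import numpy as np

def group_advantages(correct, confidence, penalty_scale=1.0):
    """GRPO-style advantages for one group of sampled responses.
    correct:    0/1 array, whether each response passed the verifier
    confidence: the model's confidence in each response
    Correct responses get reward 1; incorrect ones get a confidence-weighted
    penalty, so more confident mistakes are punished harder."""
    correct = np.asarray(correct, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    rewards = correct * 1.0 - (1.0 - correct) * penalty_scale * confidence
    return rewards - rewards.mean()  # group-relative baseline, as in GRPO
```

With vanilla GRPO, a group in which every sample fails has identical rewards and therefore all-zero advantages; here the same group still carries signal, with confident mistakes pushed down relative to hesitant ones.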

Updated: 2025-10-09 18:01:44

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.08696v1

Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration

Image restoration aims to recover high-quality images from degraded observations. When the degradation process is known, the recovery problem can be formulated as an inverse problem, and in a Bayesian context, the goal is to sample a clean reconstruction given the degraded observation. Recently, modern pretrained diffusion models have been used for image restoration by modifying their sampling procedure to account for the degradation process. However, these methods often rely on certain approximations that can lead to significant errors and compromised sample quality. In this paper, we provide the first rigorous analysis of this approximation error for linear inverse problems under distributional assumptions on the space of natural images, demonstrating cases where previous works can fail dramatically. Motivated by our theoretical insights, we propose a simple modification to existing diffusion-based restoration methods. Our approach introduces a time-varying low-pass filter in the frequency domain of the measurements, progressively incorporating higher frequencies during the restoration process. We develop an adaptive curriculum for this frequency schedule based on the underlying data distribution. Our method significantly improves performance on challenging image restoration tasks including motion deblurring and image dehazing.
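The measurement-filtering idea can be sketched directly: zero out all but the lowest frequencies of the measurement and widen the pass band as restoration proceeds. A 1-D sketch with a linear schedule (the paper learns an adaptive curriculum from the data distribution; the linear ramp is a placeholder):

```python
import numpy as np

def lowpass_measurement(y, frac):
    """Keep only the lowest `frac` fraction of frequencies of a 1-D measurement."""
    Y = np.fft.rfft(y)
    cutoff = max(1, int(frac * len(Y)))
    Y[cutoff:] = 0.0
    return np.fft.irfft(Y, n=len(y))

def frac_schedule(t, T):
    """Linear ramp: higher frequencies are incorporated progressively over T steps."""
    return (t + 1) / T
```

At each sampling step t, the restoration would be conditioned on `lowpass_measurement(y, frac_schedule(t, T))` rather than on the raw measurement y.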

Updated: 2025-10-09 18:00:40

Subjects: eess.IV,cs.CV,cs.LG,stat.ML

Download: http://arxiv.org/abs/2411.15295v2

BLAZER: Bootstrapping LLM-based Manipulation Agents with Zero-Shot Data Generation

Scaling data and models has played a pivotal role in the remarkable progress of computer vision and language. Inspired by these domains, recent efforts in robotics have similarly focused on scaling both data and model size to develop more generalizable and robust policies. However, unlike vision and language, robotics lacks access to internet-scale demonstrations across diverse robotic tasks and environments. As a result, the scale of existing datasets typically suffers from the need for manual data collection and curation. To address this problem, here we propose BLAZER, a framework that learns manipulation policies from automatically generated training data. We build on the zero-shot capabilities of LLM planners and automatically generate demonstrations for diverse manipulation tasks in simulation. Successful examples are then used to finetune an LLM and to improve its planning capabilities without human supervision. Notably, while BLAZER training requires access to the simulator's state, we demonstrate direct transfer of acquired skills to sensor-based manipulation. Through extensive experiments, we show BLAZER to significantly improve zero-shot manipulation in both simulated and real environments. Moreover, BLAZER improves on tasks outside of its training pool and enables downscaling of LLM models. Our code and data will be made publicly available on the project page.

Updated: 2025-10-09 17:59:58

Subjects: cs.RO,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.08572v1

Reconstructing the local density field with combined convolutional and point cloud architecture

We construct a neural network to perform regression on the local dark-matter density field given line-of-sight peculiar velocities of dark-matter halos, biased tracers of the dark matter field. Our architecture combines a convolutional U-Net with a point-cloud DeepSets. This combination enables efficient use of small-scale information and improves reconstruction quality relative to a U-Net-only approach. Specifically, our hybrid network recovers both clustering amplitudes and phases better than the U-Net on small scales.
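The DeepSets half of the architecture is a per-point encoder followed by a permutation-invariant sum pool, which is what lets the network consume an unordered halo catalog. A minimal numpy sketch (single linear+ReLU stages stand in for the actual networks):

```python
import numpy as np

def deepsets(points, W_phi, W_rho):
    """Minimal DeepSets: per-point encoder phi, permutation-invariant sum pool,
    then a set-level decoder rho. Linear maps + ReLU stand in for real MLPs."""
    h = np.maximum(points @ W_phi, 0.0)   # phi applied independently to each point
    pooled = h.sum(axis=0)                # order-independent aggregation over the set
    return np.maximum(pooled @ W_rho, 0.0)
```

Because the pooling is a sum, reordering the input points cannot change the output, so the halo catalog can be fed in any order; the convolutional U-Net then supplies the grid-structured context.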

Updated: 2025-10-09 17:59:58

Subjects: astro-ph.CO,cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.08573v1

Who Said Neural Networks Aren't Linear?

Neural networks are famously nonlinear. However, linearity is defined relative to a pair of vector spaces, $f$$:$$X$$\to$$Y$. Is it possible to identify a pair of non-standard vector spaces for which a conventionally nonlinear function is, in fact, linear? This paper introduces a method that makes such vector spaces explicit by construction. We find that if we sandwich a linear operator $A$ between two invertible neural networks, $f(x)=g_y^{-1}(A g_x(x))$, then the corresponding vector spaces $X$ and $Y$ are induced by newly defined addition and scaling actions derived from $g_x$ and $g_y$. We term this kind of architecture a Linearizer. This framework makes the entire arsenal of linear algebra, including SVD, pseudo-inverse, orthogonal projection and more, applicable to nonlinear mappings. Furthermore, we show that the composition of two Linearizers that share a neural network is also a Linearizer. We leverage this property and demonstrate that training diffusion models using our architecture makes the hundreds of sampling steps collapse into a single step. We further utilize our framework to enforce idempotency (i.e. $f(f(x))=f(x)$) on networks leading to a globally projective generative model and to demonstrate modular style transfer.
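The construction is easy to verify numerically: sandwich a linear map $A$ between an invertible map and its inverse, and check that $f$ is linear with respect to the addition and scaling induced by that map. A sketch using an elementwise cube as the invertible "network", with the same map for both spaces for brevity (the paper allows distinct $g_x$ and $g_y$):

```python
import numpy as np

# An elementwise cube stands in for an invertible neural network.
g = lambda x: x ** 3
g_inv = lambda z: np.sign(z) * np.abs(z) ** (1.0 / 3.0)

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])        # the sandwiched linear operator

def f(x):
    """Linearizer: f(x) = g^{-1}(A g(x))."""
    return g_inv(A @ g(x))

# Addition and scaling induced on the non-standard vector space by g:
def vadd(x, xp):
    return g_inv(g(x) + g(xp))

def vscale(a, x):
    return g_inv(a * g(x))
```

Under these induced operations, f(x ⊕ x') = f(x) ⊕ f(x') and f(a · x) = a · f(x), i.e. the conventionally nonlinear f is linear between the constructed spaces.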

Updated: 2025-10-09 17:59:57

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.08570v1

ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation

Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.

Updated: 2025-10-09 17:59:55

Subjects: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.08569v1

NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos

Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training. Project website: https://novaflow.lhy.xyz/.
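Recovering a relative pose for a rigid object from 3-D flow correspondences is a classical least-squares problem (Kabsch / orthogonal Procrustes); a sketch of that step, assuming the flow already provides matched point pairs (the grasp-proposal and trajectory-optimization stages are out of scope here):

```python
import numpy as np

def relative_pose(P, Q):
    """Least-squares rigid transform (R, t) with R @ p + t ≈ q for matched
    (N, 3) point sets P, Q, via the Kabsch algorithm."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)               # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])              # guard against reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t
```

Given the 3-D object flow between two frames of the generated video, this yields the rigid motion the robot must realize on the object.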

Updated: 2025-10-09 17:59:55

Subjects: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.08568v1

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.

Updated: 2025-10-09 17:59:54

Subjects: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08567v1

How to Teach Large Multimodal Models New Skills

How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent "forgetting" on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at https://github.com/jessemelpolio/LMM_CL
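Applying either recipe amounts to selecting which parameter tensors stay trainable. A small helper under the assumption of Llama-style parameter names (`q_proj`, `gate_proj`, etc. are common naming conventions, not confirmed by the abstract):

```python
def trainable_param_names(all_names, recipe):
    """Select which parameter tensors to update under the two recipes from the
    abstract: (i) only self-attention projection layers, or (ii) only the MLP
    Gate&Up projections, with the Down projection frozen."""
    if recipe == "sa_proj":
        keys = ("q_proj", "k_proj", "v_proj", "o_proj")
    elif recipe == "gate_up":
        keys = ("gate_proj", "up_proj")
    else:
        raise ValueError(f"unknown recipe: {recipe}")
    return [n for n in all_names if any(k in n for k in keys)]
```

In a training loop, every parameter outside the returned list would have `requires_grad` disabled before fine-tuning on the target skill.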

Updated: 2025-10-09 17:59:37

Subjects: cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.08564v1

Where Have All the Kaczmarz Iterates Gone?

The randomized Kaczmarz (RK) algorithm is one of the most computationally and memory-efficient iterative algorithms for solving large-scale linear systems. However, practical applications often involve noisy and potentially inconsistent systems. While the convergence of RK is well understood for consistent systems, the study of RK on noisy, inconsistent linear systems is limited. This paper investigates the asymptotic behavior of RK iterates in expectation when solving noisy and inconsistent systems, addressing the locations of their limit points. We explore the roles of singular vectors of the (noisy) coefficient matrix and derive bounds on the convergence horizon, which depend on the noise levels and system characteristics. Finally, we provide extensive numerical experiments that validate our theoretical findings, offering practical insights into the algorithm's performance under realistic conditions. These results establish a deeper understanding of the RK algorithm's limitations and robustness in noisy environments, paving the way for optimized applications in real-world scientific and engineering problems.
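For reference, the RK iteration itself: repeatedly project the iterate onto the solution hyperplane of a randomly sampled row, with rows drawn proportionally to their squared norms. A compact sketch:

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=5000, seed=0):
    """Randomized Kaczmarz: at each step, project the current iterate onto the
    hyperplane {x : a_i @ x = b_i} of one row, sampled with probability
    proportional to its squared norm."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    probs = np.linalg.norm(A, axis=1) ** 2
    probs /= probs.sum()
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(m, p=probs)
        a = A[i]
        x = x + (b[i] - a @ x) / (a @ a) * a
    return x
```

On a consistent system the iterates converge to a solution; the paper's subject is precisely where they end up in expectation when b is noisy and the system is inconsistent.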

Updated: 2025-10-09 17:59:36

Subjects: math.NA,cs.LG,cs.NA,math.OC,15A06, 15A09, 15A10, 15A18, 65F10, 65Y20, 68Q25, 68W20, 68W40

Download: http://arxiv.org/abs/2510.08563v1

SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception/recognition while posing relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will suit the interests of the community and help push the boundary of cutting-edge AI for broader science.

Updated: 2025-10-09 17:59:23

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.08559v1

Agent Learning via Early Experience

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.

Updated: 2025-10-09 17:59:17

Subjects: cs.AI,cs.CL,cs.IR,cs.LG

Download: http://arxiv.org/abs/2510.08558v1

Graph-SCP: Accelerating Set Cover Problems with Graph Neural Networks

Machine learning (ML) approaches are increasingly being used to accelerate combinatorial optimization (CO) problems. We investigate the Set Cover Problem (SCP) and propose Graph-SCP, a graph neural network method that augments existing optimization solvers by learning to identify a smaller sub-problem that contains the solution space. Graph-SCP uses both supervised learning from prior solved instances and unsupervised learning to minimize the SCP objective. We evaluate the performance of Graph-SCP on synthetically weighted and unweighted SCP instances with diverse problem characteristics and complexities, and on instances from the OR Library, a canonical benchmark for SCP. We show that Graph-SCP reduces the problem size by 60-80% and achieves runtime speedups of up to 10x on average when compared to Gurobi (a state-of-the-art commercial solver), while maintaining solution quality. This is in contrast to fast greedy solutions that significantly compromise solution quality to achieve guaranteed polynomial runtime. We showcase Graph-SCP's ability to generalize to larger problem sizes, training on SCP instances with up to 3,000 subsets and testing on SCP instances with up to 10,000 subsets.
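The pipeline's effect can be mimicked with the textbook greedy set-cover heuristic restricted to a predicted sub-problem: the solver only ever scores candidate subsets that the learned pruning retained. A sketch (the real system feeds the reduced instance to an exact solver such as Gurobi rather than to greedy):

```python
def greedy_set_cover(universe, subsets, keep=None):
    """Greedy set cover. `keep`, when given, restricts the search to a predicted
    sub-problem (the subset indices a Graph-SCP-style model would retain)."""
    candidates = set(range(len(subsets))) if keep is None else set(keep)
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("remaining candidates cannot cover the universe")
        chosen.append(best)
        candidates.discard(best)
        uncovered -= subsets[best]
    return chosen
```

When the predicted sub-problem still contains an optimal cover, restricting the solver to it preserves solution quality while shrinking the instance, which is the 60-80% reduction the abstract reports.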

Updated: 2025-10-09 17:58:29

Subjects: cs.LG,cs.DM

Download: http://arxiv.org/abs/2310.07979v3

Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce \textbf{Group Diffusion Policy Optimization (GDPO)}, a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
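The variance argument is the classic Monte Carlo vs. deterministic quadrature trade-off: under a fixed evaluation budget, a deterministic rule along the integration dimension can beat random sampling by orders of magnitude. A toy 1-D illustration (a stand-in for the ELBO's timestep integral, not the paper's actual estimator):

```python
import random
import math

f = lambda t: math.exp(-3.0 * t)          # stand-in for a per-timestep loss curve
exact = (1.0 - math.exp(-3.0)) / 3.0      # closed-form value of the integral on [0, 1]

def mc_estimate(n, rng):
    """Plain Monte Carlo: n uniform random evaluations."""
    return sum(f(rng.random()) for _ in range(n)) / n

def grid_estimate(n):
    """Deterministic midpoint rule with the same budget of n evaluations."""
    return sum(f((i + 0.5) / n) for i in range(n)) / n

rng = random.Random(0)
mc_err = sum(abs(mc_estimate(8, rng) - exact) for _ in range(200)) / 200
grid_err = abs(grid_estimate(8) - exact)
```

With a budget of only 8 evaluations, the deterministic rule is already far more accurate on average, which is the intuition behind replacing part of the double Monte Carlo sampling with fast deterministic integral approximations.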

Updated: 2025-10-09 17:58:07

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.08554v1

Understanding In-context Learning of Addition via Activation Subspaces

To perform few-shot learning, language models extract signals from a few input-label pairs, aggregate these into a learned prediction rule, and apply this rule to new inputs. How is this implemented in the forward pass of modern transformer models? To explore this question, we study a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We introduce a novel optimization method that localizes the model's few-shot ability to only a few attention heads. We then perform an in-depth analysis of individual heads, via dimensionality reduction and decomposition. As an example, on Llama-3-8B-instruct, we reduce its mechanism on our tasks to just three attention heads with six-dimensional subspaces, where four dimensions track the unit digit with trigonometric functions at periods $2$, $5$, and $10$, and two dimensions track magnitude with low-frequency components. To deepen our understanding of the mechanism, we also derive a mathematical identity relating ``aggregation'' and ``extraction'' subspaces for attention heads, allowing us to track the flow of information from individual examples to a final aggregated concept. Using this, we identify a self-correction mechanism where mistakes learned from earlier demonstrations are suppressed by later demonstrations. Our results demonstrate how tracking low-dimensional subspaces of localized heads across a forward pass can provide insight into fine-grained computational structures in language models.
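The digit-tracking part of the reported six-dimensional subspace can be illustrated with a toy reconstruction. The feature choice below is a hypothetical reading of the abstract's description (trigonometric functions at periods 2, 5, and 10), not the authors' fitted subspace:

```python
import numpy as np

def unit_digit_features(x):
    """Hypothetical 4-dimensional embedding of x's unit digit using
    trigonometric functions at periods 2, 5, and 10, mirroring the
    subspace structure the paper reports for Llama-3-8B-instruct."""
    d = x % 10
    return np.array([
        np.cos(2 * np.pi * d / 2),    # period 2: parity of the digit
        np.cos(2 * np.pi * d / 5),    # period 5
        np.cos(2 * np.pi * d / 10),   # period 10
        np.sin(2 * np.pi * d / 10),   # period 10, quadrature component
    ])

# In such a basis, "add k" acts as a fixed rotation of the period-10
# plane, and inputs sharing a unit digit get identical features:
assert np.allclose(unit_digit_features(17), unit_digit_features(137))
```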

Updated: 2025-10-09 17:58:05

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2505.05145v3

Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at https://github.com/xyz9911/Memoir.

Updated: 2025-10-09 17:58:01

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2510.08553v1

BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving

Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean4, where effective tree search methods are crucial for navigating the underlying large proof search spaces. While the existing approaches primarily rely on value functions and/or Monte Carlo Tree Search (MCTS), the potential of simpler methods like Best-First Tree Search (BFS) remains underexplored. In this paper, we investigate whether BFS can achieve competitive performance in large-scale theorem proving tasks. We present BFS-Prover, a scalable expert iteration framework, featuring three key innovations. First, we implement strategic data filtering at each expert iteration round, excluding problems solvable via beam search node expansion to focus on harder cases. Second, we improve the sample efficiency of BFS through Direct Preference Optimization (DPO) applied to state-tactic pairs automatically annotated with compiler error feedback, refining the LLM's policy to prioritize productive expansions. Third, we employ length normalization in BFS to encourage exploration of deeper proof paths. BFS-Prover achieves a state-of-the-art score of $72.95\%$ on the MiniF2F test set and therefore challenges the perceived necessity of complex tree search methods, demonstrating that BFS can achieve competitive performance when properly scaled. To facilitate further research and development in this area, we have open-sourced our model at https://huggingface.co/ByteDance-Seed/BFS-Prover-V1-7B.
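The core search loop the abstract describes, best-first expansion with length normalization, can be sketched generically. Here `expand` stands in for an LLM tactic policy returning (log-probability, child) pairs; all names and the toy instance are illustrative, not the BFS-Prover implementation:

```python
import heapq

def best_first_search(root, expand, is_goal, alpha=0.7, max_expansions=10000):
    """Best-first tree search with length-normalized scores. Dividing the
    cumulative log-probability by depth**alpha (alpha < 1) softens the depth
    penalty, keeping deeper proof paths competitive, as the abstract
    describes."""
    tie = 0
    heap = [(0.0, tie, root, 0.0, 0)]  # (-score, tiebreak, state, cum_logp, depth)
    for _ in range(max_expansions):
        if not heap:
            return None
        _, _, state, cum_logp, depth = heapq.heappop(heap)
        if is_goal(state):
            return state
        for logp, child in expand(state):
            tie += 1
            c_logp, c_depth = cum_logp + logp, depth + 1
            score = c_logp / (c_depth ** alpha)  # length-normalized score
            heapq.heappush(heap, (-score, tie, child, c_logp, c_depth))
    return None

# Toy usage: states are integers, "tactics" add 1 (high log-prob) or 2 (low);
# the goal is the state 5.
proof = best_first_search(0, lambda s: [(-0.1, s + 1), (-2.0, s + 2)],
                          lambda s: s == 5)
```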

Updated: 2025-10-09 17:57:35

Categories: cs.AI

Download: http://arxiv.org/abs/2502.03438v3

Entropy Regularizing Activation: Boosting Continuous Control, Large Language Models, and Image Classification with Activation as Entropy Constraints

We propose ERA, a new paradigm that constrains the sampling entropy above given thresholds by applying specially designed activations to the outputs of models. Our approach demonstrates broad effectiveness across different domains: 1) for large language models (LLMs), boosting the AIME 2025 score for Qwen2.5-Math-7B by 37.4%; 2) for continuous control reinforcement learning agents, improving performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, enhancing ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
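The idea of constraining sampling entropy above a threshold can be illustrated with a simple stand-in. The sketch below bisects over a softmax temperature (entropy is increasing in temperature) until the floor is met; the paper instead designs dedicated activation functions, which this toy does not reproduce:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def entropy_floor(logits, min_entropy, iters=60):
    """Return a distribution over `logits` whose entropy is at least
    `min_entropy` (assumed < log(len(logits))), by heating the softmax
    temperature. Illustrative stand-in for ERA's entropy constraint."""
    if entropy(softmax(logits)) >= min_entropy:
        return softmax(logits)
    lo, hi = 1.0, 2.0
    while entropy(softmax(logits / hi)) < min_entropy:
        lo, hi = hi, hi * 2.0          # heat until the floor is cleared
    for _ in range(iters):             # bisect; hi stays on the feasible side
        mid = 0.5 * (lo + hi)
        if entropy(softmax(logits / mid)) < min_entropy:
            lo = mid
        else:
            hi = mid
    return softmax(logits / hi)
```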

Updated: 2025-10-09 17:56:17

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08549v1

From Moments to Models: Graphon Mixture-Aware Mixup and Contrastive Learning

Real-world graph datasets often consist of mixtures of populations, where graphs are generated from multiple distinct underlying distributions. However, modern representation learning approaches, such as graph contrastive learning (GCL) and augmentation methods like Mixup, typically overlook this mixture structure. In this work, we propose a unified framework that explicitly models data as a mixture of underlying probabilistic graph generative models represented by graphons. To characterize these graphons, we leverage graph moments (motif densities) to cluster graphs arising from the same model. This enables us to disentangle the mixture components and identify their distinct generative mechanisms. This model-aware partitioning benefits two key graph learning tasks: 1) It enables a graphon-mixture-aware mixup (GMAM), a data augmentation technique that interpolates in a semantically valid space guided by the estimated graphons, instead of assuming a single graphon per class. 2) For GCL, it enables model-adaptive and principled augmentations. Additionally, by introducing a new model-aware objective, our proposed approach (termed MGCL) improves negative sampling by restricting negatives to graphs from other models. We establish a key theoretical guarantee: a novel, tighter bound showing that graphs sampled from graphons with small cut distance will have similar motif densities with high probability. Extensive experiments on benchmark datasets demonstrate strong empirical performance. In unsupervised learning, MGCL achieves state-of-the-art results, obtaining the top average rank across eight datasets. In supervised learning, GMAM consistently outperforms existing strategies, achieving new state-of-the-art accuracy in 6 out of 7 datasets.
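The graph moments (motif densities) used for clustering can be computed directly from adjacency matrices. The sketch below covers only edge and triangle densities as an illustration; the paper works with a richer set of motifs:

```python
import numpy as np
from math import comb

def motif_densities(A):
    """Edge and triangle densities of a simple undirected graph given its
    0/1 adjacency matrix: trace(A^3)/6 counts triangles, and each count is
    normalized by the number of possible node subsets."""
    n = A.shape[0]
    edge_density = A.sum() / 2 / comb(n, 2)
    triangle_density = np.trace(A @ A @ A) / 6 / comb(n, 3)
    return edge_density, triangle_density

# Complete graph on 4 nodes: both densities equal 1.
K4 = np.ones((4, 4)) - np.eye(4)
```

Graphs sampled from the same graphon concentrate around the same motif-density vector, which is what makes such moments usable as clustering features.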

Updated: 2025-10-09 17:55:28

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.03690v2

SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference

Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-bound prefill phase followed by a memory-bound decode phase. To efficiently serve LLMs, prior work proposes prefill-decode disaggregation to run each phase on separate hardware. However, existing hardware poorly matches the different requirements of each phase. Current datacenter GPUs and TPUs follow a more-is-better design philosophy that maximizes compute and memory resources, causing memory bandwidth underutilization in the prefill phase and compute underutilization in the decode phase. Such underutilization directly translates into increased serving costs. This paper proposes SPAD (Specialized Prefill and Decode hardware), adopting a less-is-more methodology to design specialized chips tailored to the distinct characteristics of prefill and decode phases. The proposed Prefill Chips have larger systolic arrays and use cost-effective GDDR memory, whereas the proposed Decode Chips retain high memory bandwidth but reduce compute capacity. Compared to modeled H100s, simulations show that the proposed Prefill Chips deliver 8% higher prefill performance on average at 52% lower hardware cost, while the proposed Decode Chips achieve 97% of the decode performance with 28% lower TDP. End-to-end simulations on production traces show that SPAD reduces hardware cost by 19%-41% and TDP by 2%-17% compared to modeled baseline clusters while offering the same performance. Even when models and workloads change, SPAD can reallocate either type of chip to run either phase and still achieve 11%-43% lower hardware costs, demonstrating the longevity of the SPAD design.
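The compute-bound/memory-bound asymmetry motivating SPAD can be made concrete with a roofline-style estimate. The numbers below are made up for illustration, not H100 or SPAD specifications:

```python
def phase_time(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline-style lower bound: a phase takes at least the larger of
    its compute time and its memory-traffic time."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

PEAK, BW = 1e15, 3e12  # hypothetical 1 PFLOP/s compute, 3 TB/s bandwidth
# Prefill: many FLOPs over the whole prompt, modest traffic -> compute-bound.
prefill_t = phase_time(flops=2e12, bytes_moved=1e9, peak_flops=PEAK, mem_bw=BW)
# Decode: few FLOPs per token, but a full weight read -> memory-bound.
decode_t = phase_time(flops=2e9, bytes_moved=1.4e10, peak_flops=PEAK, mem_bw=BW)
```

In this toy model the prefill time is set entirely by compute and the decode time entirely by bandwidth, which is why a chip maximizing both resources leaves one of them idle in each phase.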

Updated: 2025-10-09 17:55:08

Categories: cs.AR,cs.DC,cs.LG

Download: http://arxiv.org/abs/2510.08544v1

VideoNorms: Benchmarking Cultural Awareness of Video Language Models

As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset, which highlights several common trends: 1) models perform worse on norm violation than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal evidence than verbal evidence for the norm adherence/violation label, and struggle to identify the exact norm corresponding to a speech act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training - a gap our benchmark and framework begin to address.

Updated: 2025-10-09 17:54:55

Categories: cs.CV,cs.AI,cs.CL,cs.CY

Download: http://arxiv.org/abs/2510.08543v1

Computational and statistical lower bounds for low-rank estimation under general inhomogeneous noise

Recent work has generalized several results concerning the well-understood spiked Wigner matrix model of a low-rank signal matrix corrupted by additive i.i.d. Gaussian noise to the inhomogeneous case, where the noise has a variance profile. In particular, for the special case where the variance profile has a block structure, a series of results identified an effective spectral algorithm for detecting and estimating the signal, identified the threshold signal strength required for that algorithm to succeed, and proved information-theoretic lower bounds that, for some special signal distributions, match the above threshold. We complement these results by studying the computational optimality of this spectral algorithm. Namely, we show that, for a much broader range of signal distributions, whenever the spectral algorithm cannot detect a low-rank signal, then neither can any low-degree polynomial algorithm. This gives the first evidence for a computational hardness conjecture of Guionnet, Ko, Krzakala, and Zdeborov\'a (2023). With similar techniques, we also prove sharp information-theoretic lower bounds for a class of signal distributions not treated by prior work. Unlike all of the above results on inhomogeneous models, our results do not assume that the variance profile has a block structure, and suggest that the same spectral algorithm might remain optimal for quite general profiles. We include a numerical study of this claim for an example of a smoothly-varying rather than piecewise-constant profile. Our proofs involve analyzing the graph sums of a matrix, which also appear in free and traffic probability, but we require new bounds on these quantities that are tighter than existing ones for non-negative matrices, which may be of independent interest.
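The spectral phenomenon underlying these detection thresholds is easy to observe numerically in the homogeneous special case: for a rank-one spike of strength beta in Wigner noise, the top eigenvalue separates from the bulk edge at 2 once beta exceeds the BBP threshold of 1. The sketch below is for intuition only and does not implement the inhomogeneous variance profiles the paper studies:

```python
import numpy as np

def spiked_top_eigenvalue(n, beta, seed=0):
    """Top eigenvalue of beta * x x^T + W, where W is a symmetric Gaussian
    (Wigner) matrix with entry variance 1/n. Above the BBP threshold
    beta = 1, the top eigenvalue detaches from the bulk edge at 2 and
    lands near beta + 1/beta."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    G = rng.standard_normal((n, n))
    W = (G + G.T) / np.sqrt(2 * n)      # symmetrize; entry variance 1/n
    return float(np.linalg.eigvalsh(beta * np.outer(x, x) + W)[-1])
```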

Updated: 2025-10-09 17:53:59

Categories: math.ST,cs.DS,cs.LG,math.PR,stat.TH

Download: http://arxiv.org/abs/2510.08541v1

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
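The bandit setting used in the paper's controlled simulations can be reproduced in miniature: a softmax policy trained with vanilla policy gradient on a binary, verifiable reward. This is a toy analogue to build intuition about step-size effects, not the paper's exact construction or its Gradient Gap analysis:

```python
import numpy as np

def train_bandit(step_size, steps=2000, seed=0):
    """Two-armed bandit with a binary reward (1 on arm 0, else 0).
    Vanilla policy gradient on softmax logits; returns the final
    probability of the rewarded arm."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(steps):
        p = np.exp(theta - theta.max()); p /= p.sum()
        a = rng.choice(2, p=p)
        r = 1.0 if a == 0 else 0.0
        grad = -p.copy(); grad[a] += 1.0     # gradient of log pi(a | theta)
        theta += step_size * r * grad
    p = np.exp(theta - theta.max()); p /= p.sum()
    return p[0]                               # final success probability
```

With a moderate step size the success probability climbs toward 1, while a zero step size leaves it at 0.5; sweeping `step_size` upward in such toys is one way to probe the kind of convergence/collapse threshold the paper derives analytically.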

Updated: 2025-10-09 17:53:41

Categories: cs.LG,cs.AI,cs.IT,math.IT,math.OC,stat.ML

Download: http://arxiv.org/abs/2510.08539v1

Feature Identification via the Empirical NTK

We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface the features used by trained neural networks. Across two standard toy models for mechanistic interpretability, Toy Models of Superposition (TMS) and a 1-layer MLP trained on modular addition, we find that the eNTK exhibits sharp spectral cliffs whose top eigenspaces align with ground-truth features. In TMS, the eNTK recovers the ground-truth features in both the sparse (high superposition) and dense regimes. In modular arithmetic, the eNTK can be used to recover Fourier feature families. Moreover, we provide evidence that a layerwise eNTK localizes features to specific layers and that the evolution of the eNTK spectrum can be used to diagnose the grokking phase transition. These results suggest that eNTK analysis may provide a practical handle for feature discovery and for detecting phase changes in small models.

Updated: 2025-10-09 17:53:08

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.00468v2

Permutation-Invariant Spectral Learning via Dyson Diffusion

Diffusion models are central to generative modeling and have been adapted to graphs by diffusing adjacency matrix representations. The challenge of having up to $n!$ such representations for graphs with $n$ nodes is only partially mitigated by using permutation-equivariant learning architectures. Despite their computational efficiency, existing graph diffusion models struggle to distinguish certain graph families, unless graph data are augmented with ad hoc features. This shortcoming stems from enforcing the inductive bias within the learning architecture. In this work, we leverage random matrix theory to analytically extract the spectral properties of the diffusion process, allowing us to push the inductive bias from the architecture into the dynamics. Building on this, we introduce the Dyson Diffusion Model, which employs Dyson's Brownian Motion to capture the spectral dynamics of an Ornstein-Uhlenbeck process on the adjacency matrix while retaining all non-spectral information. We demonstrate that the Dyson Diffusion Model learns graph spectra accurately and outperforms existing graph diffusion models.

Updated: 2025-10-09 17:52:19

Categories: stat.ML,cs.LG,math.PR

Download: http://arxiv.org/abs/2510.08535v1

Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing

Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.

Updated: 2025-10-09 17:51:03

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.08532v1

SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.

Updated: 2025-10-09 17:50:54

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08531v1

CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.

Updated: 2025-10-09 17:50:26

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08529v1

Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning

In the pursuit of finding an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies apart from their expected return. Thus, even when successful, it is difficult to characterize which policies will be learned and what they will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy, via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes and ensures the convergence of policy derived objects--value functions and return distributions. In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging our temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated to its interpretable, diversity-preserving optimal policy.
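The particular instance mentioned at the end, a policy that samples all optimal actions uniformly, is the standard low-temperature limit of the entropy-regularized softmax policy, which can be checked numerically (an illustration of the limiting behavior, not the paper's convergence argument):

```python
import numpy as np

def soft_optimal_policy(q, tau):
    """Entropy-regularized optimal policy softmax(q / tau). As tau -> 0 it
    converges to the uniform distribution over the argmax set of q, i.e. a
    diversity-preserving deterministic-value limit."""
    z = (q - q.max()) / tau
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 1.0, 0.2])   # two optimal actions, one suboptimal
```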

Updated: 2025-10-09 17:50:07

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08526v1

DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems

Existing batch size selection approaches in distributed machine learning rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequential decision-making problem using Proximal Policy Optimization (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources. Our approach eliminates the need for explicit system modeling while integrating seamlessly with existing distributed training frameworks. Through evaluations across diverse workloads, hardware configurations, and network conditions, DYNAMIX achieves up to 6.3% improvement in the final model accuracy and 46% reduction in the total training time. Our scalability experiments demonstrate that DYNAMIX maintains the best performance as cluster size increases to 32 nodes, while policy transfer experiments show that learned policies generalize effectively across related model architectures.
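The PPO machinery behind DYNAMIX's batch-size policy is the standard clipped surrogate objective, sketched below. The state vector is a simplified guess at the multi-dimensional representation the abstract names (network, system, and training-efficiency signals), not the paper's exact definition:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: take the pessimistic minimum of the
    importance-weighted advantage and its clipped counterpart, so policy
    updates cannot move too far from the behavior policy in one step."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

# A batch-size controller's state might concatenate signals such as:
state = np.array([
    0.35,    # network bandwidth utilization (network-level metric)
    0.80,    # GPU memory utilization (system-level metric)
    0.012,   # recent loss improvement per step (statistical efficiency)
])
```

Clipping caps the benefit of large ratios (e.g. a ratio of 2 with positive advantage is scored as 1.2) while keeping the full penalty when the clipped term is worse, which is what stabilizes the sequential batch-size updates.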

Updated: 2025-10-09 17:48:24

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2510.08522v1

FlowSearch: Advancing deep research with dynamic structured knowledge flow

Deep research is an inherently challenging task that demands both breadth and depth of thinking. It involves navigating diverse knowledge spaces and reasoning over complex, multi-step dependencies, which presents substantial challenges for agentic systems. To address this, we propose FlowSearch, a multi-agent framework that actively constructs and evolves a dynamic structured knowledge flow to drive subtask execution and reasoning. FlowSearch is capable of strategically planning and expanding the knowledge flow to enable parallel exploration and hierarchical task decomposition, while also adjusting the knowledge flow in real time based on feedback from intermediate reasoning outcomes and insights. FlowSearch achieves state-of-the-art performance on both general and scientific benchmarks, including GAIA, HLE, GPQA and TRQA, demonstrating its effectiveness in multi-disciplinary research scenarios and its potential to advance scientific discovery. The code is available at https://github.com/Alpha-Innovator/InternAgent.

Updated: 2025-10-09 17:48:12

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08521v1

CaRT: Teaching LLM Agents to Know When They Know Enough

Many tasks require learned models to strategically gather relevant information over multiple rounds of interaction before actually acting on a task. Strategic information gathering requires models to know not only how to effectively acquire information, but also when to stop gathering information and make a decision, in order to avoid overthinking or getting derailed when acting. In this paper, we formalize this problem and introduce Counterfactuals and Reasoning for Termination (CaRT), an approach for teaching LLMs when to stop seeking information. To appropriately learn when to terminate, CaRT fine-tunes LLMs using counterfactual pairs of trajectories, one where termination is appropriate and a minimally modified version of the same trajectory where it is not. It trains the LLM to explain the rationale for the termination decision in either case via verbal reasoning, and imbues this capability into the base LLM via fine-tuning. We instantiate CaRT in two domains: interactive medical diagnosis and math problem solving. In both domains, we find that CaRT improves the efficiency of information gathering and task success rate compared to other fine-tuning methods.
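The counterfactual construction is easy to picture: take a trajectory where stopping is justified, then minimally edit it (e.g. drop the decisive turn) so that stopping would be premature, and label each version with a verbal rationale. A hypothetical sketch, with all field names and example turns invented for illustration:

```python
def make_counterfactual_pair(trajectory, decisive_turn):
    # positive example: the full trajectory, where terminating is appropriate
    stop_ok = {
        "trajectory": trajectory,
        "label": "terminate",
        "rationale": "the decisive evidence has been observed",
    }
    # negative example: minimally modified copy with the decisive turn removed,
    # where terminating would be premature
    stop_bad = {
        "trajectory": [t for i, t in enumerate(trajectory) if i != decisive_turn],
        "label": "continue",
        "rationale": "the decisive evidence is still missing",
    }
    return stop_ok, stop_bad

traj = ["ask about symptoms", "ask medical history", "review decisive lab result"]
pos, neg = make_counterfactual_pair(traj, decisive_turn=2)
print(pos["label"], neg["label"], len(neg["trajectory"]))
```

Fine-tuning on such pairs, with the rationale expressed as verbal reasoning, is how the abstract describes imbuing the termination skill into the base LLM.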

Updated: 2025-10-09 17:46:39

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.08517v1

Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.

Updated: 2025-10-09 17:46:09

Categories: cs.CV,cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.16746v2

Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces \texttt{BFS-Prover-V2}, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. \texttt{BFS-Prover-V2} achieves 95.08\% and 41.4\% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.

Updated: 2025-10-09 17:45:50

Categories: cs.AI

Download: http://arxiv.org/abs/2509.06493v2

Training a Foundation Model for Materials on a Budget

Foundation models for materials modeling are advancing quickly, but their training remains expensive, often placing state-of-the-art methods out of reach for many research groups. We introduce Nequix, a compact E(3)-equivariant potential that pairs a simplified NequIP design with modern training practices, including equivariant root-mean-square layer normalization and the Muon optimizer, to retain accuracy while substantially reducing compute requirements. Nequix has 700K parameters and was trained in 100 A100 GPU-hours. On the Matbench-Discovery and MDR Phonon benchmarks, Nequix ranks third overall while requiring a 20 times lower training cost than most other methods, and it delivers two orders of magnitude faster inference speed than the current top-ranked model. We release model weights and fully reproducible codebase at https://github.com/atomicarchitects/nequix.

Updated: 2025-10-09 17:45:15

Categories: physics.comp-ph,cs.LG

Download: http://arxiv.org/abs/2508.16067v2

Efficient Graph Condensation via Gaussian Process

Graph condensation reduces the size of large graphs while preserving performance, addressing the scalability challenges of Graph Neural Networks caused by computational inefficiencies on large datasets. Existing methods often rely on bi-level optimization, requiring extensive GNN training and limiting their scalability. To address these issues, this paper proposes Graph Condensation via Gaussian Process (GCGP), a novel and computationally efficient approach to graph condensation. GCGP utilizes a Gaussian Process (GP), with the condensed graph serving as observations, to estimate the posterior distribution of predictions. This approach eliminates the need for the iterative and resource-intensive training typically required by GNNs. To enhance the capability of the GCGP in capturing dependencies between function values, we derive a specialized covariance function that incorporates structural information. This covariance function broadens the receptive field of input nodes by local neighborhood aggregation, thereby facilitating the representation of intricate dependencies within the nodes. To address the challenge of optimizing binary structural information in condensed graphs, Concrete random variables are utilized to approximate the binary adjacency matrix in a continuous counterpart. This relaxation process allows the adjacency matrix to be represented in a differentiable form, enabling the application of gradient-based optimization techniques to discrete graph structures. Experimental results show that the proposed GCGP method efficiently condenses large-scale graph data while preserving predictive performance, addressing the scalability and efficiency challenges. The implementation of our method is publicly available at https://github.com/WANGLin0126/GCGP.
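The central substitution in the abstract -- the condensed set acts as the GP's observations, and predictions come from the closed-form posterior instead of iterative GNN training -- can be sketched in a few lines. The paper's structure-aware covariance is replaced here with a plain RBF kernel, so this is a simplified illustration, not the GCGP covariance:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # squared-exponential covariance between the rows of X and Y
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * lengthscale ** 2))

def gp_posterior_mean(X_cond, y_cond, X_test, noise=1e-3):
    # closed-form GP predictive mean with the condensed set as observations
    K = rbf_kernel(X_cond, X_cond) + noise * np.eye(len(X_cond))
    return rbf_kernel(X_test, X_cond) @ np.linalg.solve(K, y_cond)

X_c = np.array([[0.0], [1.0]])   # toy condensed node features
y_c = np.array([0.0, 1.0])       # toy condensed labels
pred = gp_posterior_mean(X_c, y_c, np.array([[0.0], [1.0]]))
print(pred.round(3))             # close to the observed labels
```

Because the posterior is closed-form, "training" reduces to a linear solve over the small condensed set, which is the source of the efficiency claim.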

Updated: 2025-10-09 17:45:06

Categories: cs.LG

Download: http://arxiv.org/abs/2501.02565v2

AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents

Large language models (LLMs) have shown impressive performance in general programming tasks. However, in Machine Learning Engineering (MLE) scenarios such as AutoML and Kaggle competitions, achieving high performance depends heavily on expert intervention and repeated adjustments rather than simply generating correct code. When applied directly to these tasks, LLMs often lack fine-grained domain priors, and existing MLE approaches that use linear or tree-structured searches limit knowledge transfer to adjacent hierarchical links. As a result, they cannot leverage past full trajectories or share information across branches, limiting self-evolving ability and search space diversity. To address these limitations, we introduce AutoMLGen, an LLM-based coding agent that integrates a domain knowledge base for high-quality prior guidance and Monte Carlo Graph Search (MCGS) for efficient exploration. MCGS retains the tree-guided exploration of MCTS while embedding a graph structure into the expansion stage to enable dynamic path reorganization, historical trajectory reuse, and multi-solution fusion to support both self-evolution and collaborative learning. Combined with fine-grained operator sets, this design improves stability and accelerates convergence. Evaluation on the MLE-Bench shows that AutoMLGen achieves state-of-the-art performance in numerous dimensions, such as the average medal rate and the valid submission rate, under a 12-hour budget (half the standard runtime). The code is available at https://github.com/Alpha-Innovator/InternAgent.

Updated: 2025-10-09 17:45:05

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.08511v1

To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
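A back-of-the-envelope way to surface candidate sink tokens is to flag visual tokens whose embedding norm sits far above the typical norm. The robust-threshold rule below is an assumption for illustration, not the paper's criterion for ViT attention sinks:

```python
import numpy as np

def candidate_sink_tokens(tokens, k=3.0):
    # flag tokens whose L2 norm lies more than k robust deviations (MAD)
    # above the median token norm
    norms = np.linalg.norm(tokens, axis=-1)
    med = np.median(norms)
    mad = np.median(np.abs(norms - med)) + 1e-8
    return np.flatnonzero(norms > med + k * mad)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 visual tokens of dimension 8
tokens[5] *= 10.0                   # inject one high-norm "sink" token
print(candidate_sink_tokens(tokens))
```

The injected token's norm is roughly ten times the others', so it is reliably flagged; the abstract's claim is that such high-norm ViT tokens carry high-level semantics worth exposing to the LLM.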

Updated: 2025-10-09 17:44:42

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08510v1

Multi-Turn Human-LLM Interaction Through the Lens of a Two-Way Intelligibility Protocol

Our interest is in the design of software systems involving a human-expert interacting -- using natural language -- with a large language model (LLM) on data analysis tasks. For complex problems, it is possible that LLMs can harness human expertise and creativity to find solutions that were otherwise elusive. On one level, this interaction takes place through multiple turns of prompts from the human and responses from the LLM. Here we investigate a more structured approach based on an abstract protocol described in [3] for interaction between agents. The protocol is motivated by a notion of "two-way intelligibility" and is modelled by a pair of communicating finite-state machines. We provide an implementation of the protocol, and provide empirical evidence of using the implementation to mediate interactions between an LLM and a human-agent in two areas of scientific interest (radiology and drug design). We conduct controlled experiments with a human proxy (a database), and uncontrolled experiments with human subjects. The results provide evidence in support of the protocol's capability of capturing one- and two-way intelligibility in human-LLM interaction; and for the utility of two-way intelligibility in the design of human-machine systems. Our code is available at https://github.com/karannb/interact.
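The "pair of communicating finite-state machines" framing can be sketched as a tiny transition table: each agent's state advances according to the tag of the message it receives. The states and tags below are invented for illustration and are not the protocol of [3]:

```python
def step(state, msg_tag):
    # one agent's transition function: advance on the other agent's
    # tagged message (e.g. it ratifies or refutes the last exchange)
    transitions = {
        ("waiting", "ratify"): "agree",
        ("waiting", "refute"): "revise",
        ("revise", "ratify"): "agree",
    }
    return transitions.get((state, msg_tag), state)

s = "waiting"
for tag in ["refute", "ratify"]:   # a two-turn exchange
    s = step(s, tag)
print(s)   # "agree"
```

Running both agents' machines against the transcript of a human-LLM session is, in spirit, how an implementation can label each turn as exhibiting one- or two-way intelligibility.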

Updated: 2025-10-09 17:41:24

Categories: cs.AI,cs.HC,cs.LG,cs.MA

Download: http://arxiv.org/abs/2410.20600v4

Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection

Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.

Updated: 2025-10-09 17:39:50

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.18114v2

AI-Driven Radiology Report Generation for Traumatic Brain Injuries

Traumatic brain injuries present significant diagnostic challenges in emergency medicine, where the timely interpretation of medical images is crucial for patient outcomes. In this paper, we propose a novel AI-based approach for automatic radiology report generation tailored to cranial trauma cases. Our model integrates an AC-BiFPN with a Transformer architecture to capture and process complex medical imaging data such as CT and MRI scans. The AC-BiFPN extracts multi-scale features, enabling the detection of intricate anomalies like intracranial hemorrhages, while the Transformer generates coherent, contextually relevant diagnostic reports by modeling long-range dependencies. We evaluate the performance of our model on the RSNA Intracranial Hemorrhage Detection dataset, where it outperforms traditional CNN-based models in both diagnostic accuracy and report generation. This solution not only supports radiologists in high-pressure environments but also provides a powerful educational tool for trainee physicians, offering real-time feedback and enhancing their learning experience. Our findings demonstrate the potential of combining advanced feature extraction with transformer-based text generation to improve clinical decision-making in the diagnosis of traumatic brain injuries.

Updated: 2025-10-09 17:39:04

Categories: cs.CV,cs.AI,cs.LG,68T07, 68U10,I.2.10; I.2.7; I.4.5

Download: http://arxiv.org/abs/2510.08498v1

Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM's distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.

Updated: 2025-10-09 17:38:52

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2509.18085v2

AI-Driven Post-Quantum Cryptography for Cyber-Resilient V2X Communication in Transportation Cyber-Physical Systems

Transportation Cyber-Physical Systems (TCPS) integrate physical elements, such as transportation infrastructure and vehicles, with cyber elements via advanced communication technologies, allowing them to interact seamlessly. This integration enhances the efficiency, safety, and sustainability of transportation systems. TCPS rely heavily on cryptographic security to protect sensitive information transmitted between vehicles, transportation infrastructure, and other entities within the transportation ecosystem, ensuring data integrity, confidentiality, and authenticity. Traditional cryptographic methods have been employed to secure TCPS communications, but the advent of quantum computing presents a significant threat to these existing security measures. Therefore, integrating Post-Quantum Cryptography (PQC) into TCPS is essential to maintain secure and resilient communications. While PQC offers a promising approach to developing cryptographic algorithms resistant to quantum attacks, artificial intelligence (AI) can enhance PQC by optimizing algorithm selection, resource allocation, and adapting to evolving threats in real-time. AI-driven PQC approaches can improve the efficiency and effectiveness of PQC implementations, ensuring robust security without compromising system performance. This chapter introduces TCPS communication protocols, discusses the vulnerabilities of corresponding communications to cyber-attacks, and explores the limitations of existing cryptographic methods in the quantum era. By examining how AI can strengthen PQC solutions, the chapter presents cyber-resilient communication strategies for TCPS.

Updated: 2025-10-09 17:37:00

Categories: cs.CR

Download: http://arxiv.org/abs/2510.08496v1

Compiling Any $\mathsf{MIP}^{*}$ into a (Succinct) Classical Interactive Argument

We present a generic compiler that converts any $\mathsf{MIP}^{*}$ protocol into a succinct interactive argument where the communication and the verifier are classical, and where post-quantum soundness relies on the post-quantum sub-exponential hardness of the Learning with Errors ($\mathsf{LWE}$) problem. Prior to this work, such a compiler for $\mathsf{MIP}^{*}$ was given by Kalai, Lombardi, Vaikuntanathan and Yang (STOC 2022), but the post-quantum soundness of this compiler is still under investigation. More generally, our compiler can be applied to any $\mathsf{QIP}$ protocol which is sound only against semi-malicious provers that follow the prescribed protocol, but with possibly malicious initial state. Our compiler consists of two steps. We first show that if a language $\mathcal{L}$ has a $\mathsf{QIP}$ with semi-malicious soundness, where the prover runs in time $T$, then $\mathcal{L} \in \mathsf{QMATIME}(T)$. Then we construct a succinct classical argument for any such language, where the communication complexity grows polylogarithmically with $T$, under the post-quantum sub-exponential hardness of $\mathsf{LWE}$.

Updated: 2025-10-09 17:35:47

Categories: quant-ph,cs.CR

Download: http://arxiv.org/abs/2510.08495v1

Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/

Updated: 2025-10-09 17:32:23

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2510.08492v1

Optimizing delivery for quick commerce factoring qualitative assessment of generated routes

India's e-commerce market is projected to grow rapidly, with last-mile delivery accounting for nearly half of operational expenses. Although vehicle routing problem (VRP) based solvers are widely used for delivery planning, their effectiveness in real-world scenarios is limited by unstructured addresses, incomplete maps, and computational constraints in distance estimation. This study proposes a framework that employs large language models (LLMs) to critique VRP-generated routes against policy-based criteria, allowing logistics operators to evaluate and prioritise more efficient delivery plans. As an illustration of our approach, we generate, annotate, and evaluate 400 cases using large language models. Our study found that open-source LLMs identified routing issues with 79% accuracy, while proprietary reasoning models reached up to 86%. The results demonstrate that LLM-based evaluation of VRP-generated routes can be an effective and scalable layer of evaluation that goes beyond conventional distance- and time-based metrics. This has implications for improving cost efficiency, delivery reliability, and sustainability in last-mile logistics, especially for developing countries such as India.
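The evaluation layer the abstract describes -- scoring solver output against policy-style criteria rather than pure distance or time -- can be mimicked with plain rules. In the paper the critic is an LLM; the rule-based checker below is only a stand-in to show the interface, and every criterion is an assumed example:

```python
def critique_route(route, max_stops=50):
    # return a list of policy violations found in one vehicle's route
    issues = []
    if route and route[0] != "depot":
        issues.append("route does not start at the depot")
    stops = route[1:] if route[:1] == ["depot"] else route
    if len(stops) != len(set(stops)):
        issues.append("a stop is visited more than once")
    if len(stops) > max_stops:
        issues.append("route exceeds the per-vehicle stop budget")
    return issues

print(critique_route(["depot", "A", "B", "A"]))  # flags the repeated stop
print(critique_route(["depot", "A", "B"]))       # no issues: []
```

An LLM critic generalizes this interface: the criteria become natural-language policies, and the returned issues come with qualitative explanations rather than fixed strings.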

Updated: 2025-10-09 17:31:58

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08671v1

Fair Graph Machine Learning under Adversarial Missingness Processes

Graph Neural Networks (GNNs) have achieved state-of-the-art results in many relevant tasks where decisions might disproportionately impact specific communities. However, existing work on fair GNNs often assumes that either sensitive attributes are fully observed or they are missing completely at random. We show that an adversarial missingness process can inadvertently disguise a fair model through the imputation, leading the model to overestimate the fairness of its predictions. We address this challenge by proposing Better Fair than Sorry (BFtS), a fair missing data imputation model for sensitive attributes. The key principle behind BFtS is that imputations should approximate the worst-case scenario for fairness -- i.e. when optimizing fairness is the hardest. We implement this idea using a 3-player adversarial scheme where two adversaries collaborate against a GNN classifier, and the classifier minimizes the maximum bias. Experiments using synthetic and real datasets show that BFtS often achieves a better fairness x accuracy trade-off than existing alternatives under an adversarial missingness process.

Updated: 2025-10-09 17:30:23

Categories: cs.LG

Download: http://arxiv.org/abs/2311.01591v4

Implementing Semantic Join Operators Efficiently

Semantic query processing engines often support semantic joins, enabling users to match rows that satisfy conditions specified in natural language. Such join conditions can be evaluated using large language models (LLMs) that solve novel tasks without task-specific training. Currently, many semantic query processing engines implement semantic joins via nested loops, invoking the LLM to evaluate the join condition on row pairs. Instead, this paper proposes a novel algorithm, inspired by the block nested loops join operator implementation in traditional database systems. The proposed algorithm integrates batches of rows from both input tables into a single prompt. The goal of the LLM invocation is to identify all matching row pairs in the current input. The paper introduces formulas that can be used to optimize the size of the row batches, taking into account constraints on the size of the LLM context window (limiting both input and output size). An adaptive variant of the proposed algorithm addresses cases in which the size of the output is difficult to estimate. A formal analysis of asymptotic processing costs, as well as empirical results, demonstrates that the proposed approach reduces costs significantly and performs well compared to join implementations used by recent semantic query processing engines.
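
The block-style batching under a context-window constraint can be sketched as follows. The token-cost model (fixed tokens per input row, quadratic worst-case output over row pairs) and all constants are assumptions for illustration, not the paper's formulas:

```python
def batch_size(ctx_limit, in_per_row, out_per_pair):
    """Largest equal batch size b such that 2*b input rows plus the
    worst-case output (all b*b pairs match) fit in the context window."""
    b = 1
    while 2 * (b + 1) * in_per_row + (b + 1) ** 2 * out_per_pair <= ctx_limit:
        b += 1
    return b

def block_join_prompts(left, right, b):
    """One prompt per block pair, asking the LLM for all matching row pairs."""
    for i in range(0, len(left), b):
        for j in range(0, len(right), b):
            yield ("Which left rows match which right rows?\n"
                   + "\n".join(f"L{i+k}: {r}" for k, r in enumerate(left[i:i+b]))
                   + "\n"
                   + "\n".join(f"R{j+k}: {r}" for k, r in enumerate(right[j:j+b])))

b = batch_size(ctx_limit=4000, in_per_row=50, out_per_pair=8)
prompts = list(block_join_prompts([f"row{i}" for i in range(10)],
                                  [f"row{i}" for i in range(10)], b))
```

Compared with a naive nested loop (one LLM call per row pair), packing b rows from each side into one prompt cuts the number of invocations by roughly a factor of b squared.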

Updated: 2025-10-09 17:30:01

Categories: cs.DB,cs.LG

Download: http://arxiv.org/abs/2510.08489v1

Paper2Video: Automatic Video Generation from Scientific Papers

Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2-to-10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement via a novel tree-search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.

Updated: 2025-10-09 17:29:00

Categories: cs.CV,cs.AI,cs.CL,cs.MA,cs.MM

Download: http://arxiv.org/abs/2510.05096v2

DeepPrune: Parallel Scaling without Inter-trace Redundancy

Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy -- our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model trained with focal loss and oversampling techniques to accurately predict answer equivalence from partial reasoning traces, realizing 0.87 AUROC on equivalence prediction, combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune reduces tokens by over 80% compared to conventional consensus sampling in most cases, while maintaining accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: https://deepprune.github.io/
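
The online greedy clustering step can be sketched as below; the toy judge that compares final tokens is a stand-in for the paper's trained equivalence-prediction model, and the threshold is illustrative:

```python
def online_prune(traces, equiv_prob, tau=0.8):
    """Greedy online clustering: keep a trace only if the judge predicts it
    is unlikely to repeat the final answer of any already-kept trace."""
    kept = []
    for t in traces:
        if any(equiv_prob(t, r) >= tau for r in kept):
            continue        # predicted redundant: prune (stop generating it)
        kept.append(t)      # looks like a novel answer path: keep it
    return kept

# toy judge (assumption): traces sharing a final token are equivalent;
# the paper trains a dedicated judge model on partial traces instead
equiv = lambda a, b: 1.0 if a.split()[-1] == b.split()[-1] else 0.0
kept = online_prune(["steps .. 42", "other .. 42", "alt .. 7",
                     "more .. 42", "new .. 9"], equiv)
```

Pruning before traces finish is what saves tokens: redundant paths are cut as soon as the judge is confident, while distinct candidate answers survive for the final vote.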

Updated: 2025-10-09 17:24:54

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08483v1

FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching

The application of diffusion transformers suffers from their significant inference costs. Recently, feature caching has been proposed to solve this problem by reusing features from previous timesteps, thereby skipping computation in future timesteps. However, previous feature caching assumes that features in adjacent timesteps are similar or continuous, which does not always hold in all settings. To investigate this, this paper begins with an analysis from the frequency domain, which reveals that different frequency bands in the features of diffusion models exhibit different dynamics across timesteps. Concretely, low-frequency components, which determine the structure of images, exhibit higher similarity but poor continuity. In contrast, the high-frequency bands, which encode the details of images, show significant continuity but poor similarity. These interesting observations motivate us to propose Frequency-aware Caching (FreqCa), which directly reuses features of low-frequency components based on their similarity, while using a second-order Hermite interpolator to predict the volatile high-frequency ones based on their continuity. Besides, we further propose to cache Cumulative Residual Feature (CRF) instead of the features in all the layers, which reduces the memory footprint of feature caching by 99%. Extensive experiments on FLUX.1-dev, FLUX.1-Kontext-dev, Qwen-Image, and Qwen-Image-Edit demonstrate its effectiveness in both generation and editing. Codes are available in the supplementary materials and will be released on GitHub.
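
A minimal sketch of the two-band caching policy, using quadratic extrapolation as a stand-in for the paper's second-order Hermite interpolator (which also uses derivative information); the scalar "bands" below are illustrative, not real diffusion features:

```python
def quad_extrapolate(h2, h1, h0):
    """Predict the next value from the last three (h0 newest); exact for
    quadratic trends. A stand-in for the second-order Hermite interpolator."""
    return 3 * h0 - 3 * h1 + h2

def cached_step(low_cache, high_hist):
    """Reuse the low-frequency cache directly (high similarity) and predict
    the high-frequency band from its continuity, then recombine the bands."""
    high_pred = quad_extrapolate(high_hist[-3], high_hist[-2], high_hist[-1])
    return low_cache + high_pred

low = 10.0                 # low-frequency band: nearly constant, so reused as-is
highs = [0.0, 1.0, 4.0]    # high-frequency band: smooth trend (samples of t**2)
feat = cached_step(low, highs)
```

The split mirrors the abstract's observation: the low band is reused because it barely changes, while the high band is predicted from its trajectory because it changes smoothly but substantially.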

Updated: 2025-10-09 17:22:23

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.08669v1

Language Model Embeddings Can Be Sufficient for Bayesian Optimization

Bayesian Optimization is ubiquitous in experimental design and black-box optimization for improving search efficiency. However, most existing approaches rely on regression models which are limited to fixed search spaces and structured, tabular input features. This paper explores the use of LLM embeddings over string inputs for in-context regression in Bayesian Optimization. Our results show that representing inputs as strings enables general-purpose regression across diverse domains, including synthetic, combinatorial, and hyperparameter optimization. Furthermore, our approach achieves optimization performance comparable to state-of-the-art Gaussian Process-based methods such as Google Vizier, and demonstrates potential for broader and more flexible applications.
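
The idea of regressing over string inputs can be sketched with a toy character-count embedding standing in for real LLM embeddings; the candidate strings, observations, and similarity-weighted regressor are illustrative assumptions:

```python
from collections import Counter
import math

def embed(s):
    """Stand-in embedding: character counts (a real system would use LLM embeddings)."""
    return Counter(s)

def sim(a, b):
    """Cosine similarity between sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict(x, observed):
    """Similarity-weighted regression over already-evaluated string inputs."""
    ws = [(sim(embed(x), embed(s)), y) for s, y in observed]
    total = sum(w for w, _ in ws)
    return sum(w * y for w, y in ws) / total if total else 0.0

# string-typed configurations and their observed objective values
observed = [("lr=0.1", 0.62), ("lr=0.2", 0.70), ("momentum=0.9", 0.55)]
best = max(["lr=0.3", "momentum=0.5"], key=lambda c: predict(c, observed))
```

Because any configuration can be rendered as a string, the same regressor covers synthetic, combinatorial, and hyperparameter search spaces without a fixed tabular feature schema.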

Updated: 2025-10-09 17:20:18

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2410.10190v3

Rethinking Provenance Completeness with a Learning-Based Linux Scheduler

Provenance plays a critical role in maintaining traceability of a system's actions for root cause analysis of security threats and impacts. Provenance collection is often incorporated into the reference monitor of systems to ensure that an audit trail exists of all events, that events are completely captured, and that logging of such events cannot be bypassed. However, recent research has questioned whether existing state-of-the-art provenance collection systems fail to ensure the security guarantees of a true reference monitor due to the 'super producer threat' in which provenance generation can overload a system to force the system to drop security-relevant events and allow an attacker to hide their actions. One approach towards solving this threat is to enforce resource isolation, but that does not fully solve the problems resulting from hardware dependencies and performance limitations. In this paper, we show how an operating system's kernel scheduler can mitigate this threat, and we introduce Venus, a learned scheduler for Linux specifically designed for provenance. Unlike conventional schedulers that ignore provenance completeness requirements, Venus leverages reinforcement learning to learn provenance task behavior and to dynamically optimize resource allocation. We evaluate Venus's efficacy and show that Venus significantly improves both the completeness and efficiency of provenance collection systems compared to traditional scheduling, while maintaining reasonable overheads and even improving overall runtime in certain cases compared to the default Linux scheduler.

Updated: 2025-10-09 17:18:50

Categories: cs.CR,cs.OS

Download: http://arxiv.org/abs/2510.08479v1

More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at https://github.com/SII-Enigma/AMPO.
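
The "guidance-on-demand" control flow can be sketched as follows; the verifier, student, teachers, and comprehension score are toy stand-ins, not AMPO's actual models:

```python
def guidance_on_demand(problem, student, teachers, verify, comprehension):
    """Query teachers only when the on-policy attempt fails, then learn from
    the correct teacher trace the student is most likely to comprehend."""
    attempt = student(problem)
    if verify(problem, attempt):
        return attempt                      # self-discovery: no guidance used
    guided = [t(problem) for t in teachers]
    guided = [g for g in guided if verify(problem, g)]
    if not guided:
        return attempt                      # no usable guidance this round
    return max(guided, key=comprehension)   # comprehension-based selection

# toy stand-ins (assumption): the correct answer doubles the input
verify = lambda p, a: a == 2 * p
student = lambda p: p + 1                   # wrong on-policy answer
teachers = [lambda p: 2 * p, lambda p: p - 1, lambda p: 2 * p]
comprehension = lambda a: -a                # prefer "simpler" traces
out = guidance_on_demand(3, student, teachers, verify, comprehension)
```

Gating the teachers behind a failed verification preserves the value of self-discovery while still expanding exploration when the policy is stuck, which is the framework's core trade-off.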

Updated: 2025-10-09 17:18:49

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.02227v2

DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos

We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD. Meanwhile, its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos without manual data collection or costly motion capture, enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation.

Updated: 2025-10-09 17:17:05

Categories: cs.RO,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.08475v1

Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check

Downstream scaling laws aim to predict task performance at larger scales from the model's performance at smaller scales. Whether such prediction should be possible is unclear: some works discover clear linear scaling trends after simple transformations of the performance metric, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, and we find that predictable scaling only occurs in a minority of cases: 39% of the time. Moreover, seemingly benign changes to the experimental setting can completely change the scaling behavior. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To accurately model the relationship between pretraining loss and task performance, we must embrace the cases in which scaling behavior deviates from linear trends.
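
The kind of predictability check underlying such a meta-analysis can be sketched as fitting a line on small-scale points and testing the extrapolation at a larger scale; the data points and tolerance below are illustrative, not the paper's:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def predictable(small_scale, large_scale, tol=0.05):
    """Fit on small-scale (scale, metric) points; accept if the linear
    extrapolation lands within tol of the observed large-scale metric."""
    a, b = fit_line(*zip(*small_scale))
    x, y = large_scale
    return abs((a + b * x) - y) <= tol

# a smoothly scaling task extrapolates well...
linear_task = predictable([(1, 0.30), (2, 0.40), (3, 0.50)], (6, 0.80))
# ...while an "emergent" task jumps far above the fitted line
emergent_task = predictable([(1, 0.01), (2, 0.01), (3, 0.02)], (6, 0.60))
```

The second case illustrates why only a minority of tasks pass such a test: flat small-scale performance gives the fit no signal about an abrupt capability jump.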

Updated: 2025-10-09 17:15:34

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.00885v2

An Improved Quantum Algorithm for 3-Tuple Lattice Sieving

The assumed hardness of the Shortest Vector Problem in high-dimensional lattices is one of the cornerstones of post-quantum cryptography. The fastest known heuristic attacks on SVP are via so-called sieving methods. While these still take exponential time in the dimension $d$, they are significantly faster than non-heuristic approaches and their heuristic assumptions are verified by extensive experiments. $k$-Tuple sieving is an iterative method where each iteration takes as input a large number of lattice vectors of a certain norm, and produces an equal number of lattice vectors of slightly smaller norm, by taking sums and differences of $k$ of the input vectors. Iterating these ''sieving steps'' sufficiently many times produces a short lattice vector. The fastest attacks (both classical and quantum) are for $k=2$, but taking larger $k$ reduces the amount of memory required for the attack. In this paper we improve the quantum time complexity of 3-tuple sieving from $2^{0.3098 d}$ to $2^{0.2846 d}$, using a two-level amplitude amplification aided by a preprocessing step that associates the given lattice vectors with nearby ''center points'' to focus the search on the neighborhoods of these center points. Our algorithm uses $2^{0.1887d}$ classical bits and QCRAM bits, and $2^{o(d)}$ qubits. This is the fastest known quantum algorithm for SVP when total memory is limited to $2^{0.1887d}$.

Updated: 2025-10-09 17:13:07

Categories: quant-ph,cs.CR

Download: http://arxiv.org/abs/2510.08473v1

Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents

Large Language Models (LLMs) are increasingly applied to software engineering (SWE), with SWE-bench as a key benchmark. Solutions are split into SWE-Agent frameworks with multi-turn interactions and workflow-based Agentless methods with single-turn verifiable steps. We argue these paradigms are not mutually exclusive: reasoning-intensive Agentless training induces skill priors, including localization, code edit, and self-reflection that enable efficient and effective SWE-Agent adaptation. In this work, we first curate the Agentless training recipe and present Kimi-Dev, an open-source SWE LLM achieving 60.4% on SWE-bench Verified, the best among workflow approaches. With additional SFT adaptation on 5k publicly-available trajectories, Kimi-Dev powers SWE-Agents to 48.6% pass@1, on par with that of Claude 3.5 Sonnet (241022 version). These results show that structured skill priors from Agentless training can bridge workflow and agentic frameworks for transferable coding agents.

Updated: 2025-10-09 17:13:02

Categories: cs.AI,cs.CL,cs.SE

Download: http://arxiv.org/abs/2509.23045v2

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
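
A token-wise dynamic gate of the kind described can be sketched as a scalar sigmoid over the concatenated linguistic and visual states; the dimensions and weights below are illustrative, not the trained model's:

```python
import math

def gate_fuse(h_text, h_img, w, b):
    """Token-wise scalar gate g = sigmoid(w . [h_text; h_img] + b);
    the fused state is g * visual + (1 - g) * linguistic, per token."""
    z = sum(wi * xi for wi, xi in zip(w, h_text + h_img)) + b
    g = 1.0 / (1.0 + math.exp(-z))
    return g, [g * v + (1 - g) * t for t, v in zip(h_text, h_img)]

# toy 2-d hidden states for one token; gate weights are made up for illustration
g_content, fused = gate_fuse([0.1, 0.2], [0.9, 0.8], [1.0, 1.0, 1.0, 1.0], 0.0)
```

Because the gate is recomputed per token, it can learn the interpretable pattern the abstract reports: gates near 1 (visual) for content words and near 0 (linguistic) for function words.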

Updated: 2025-10-09 17:10:36

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.08470v1

Platform-Agnostic Modular Architecture for Quantum Benchmarking

We present a platform-agnostic modular architecture that addresses the increasingly fragmented landscape of quantum computing benchmarking by decoupling problem generation, circuit execution, and results analysis into independent, interoperable components. Supporting over 20 benchmark variants ranging from simple algorithmic tests like Bernstein-Vazirani to complex Hamiltonian simulation with observable calculations, the system integrates with multiple circuit generation APIs (Qiskit, CUDA-Q, Cirq) and enables diverse workflows. We validate the architecture through successful integration with Sandia's pyGSTi for advanced circuit analysis and CUDA-Q for multi-GPU HPC simulations. Extensibility of the system is demonstrated by implementing dynamic circuit variants of existing benchmarks and a new quantum reinforcement learning benchmark, which become readily available across multiple execution and analysis modes. Our primary contribution is identifying and formalizing modular interfaces that enable interoperability between incompatible benchmarking frameworks, demonstrating that standardized interfaces reduce ecosystem fragmentation while preserving optimization flexibility. This architecture has been developed as a key enhancement to the continually evolving QED-C Application-Oriented Performance Benchmarks for Quantum Computing suite.
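
The decoupling into problem generation, circuit execution, and results analysis can be sketched with three narrow interfaces; the class names and toy implementations are illustrative assumptions, not the QED-C suite's API:

```python
from typing import Protocol

class Generator(Protocol):
    def circuit(self, size: int) -> str: ...

class Executor(Protocol):
    def run(self, circuit: str) -> dict: ...

class Analyzer(Protocol):
    def score(self, result: dict) -> float: ...

def benchmark(gen: Generator, exe: Executor, ana: Analyzer, sizes):
    """Any generator/executor/analyzer triple interoperates through
    these three narrow interfaces."""
    return {n: ana.score(exe.run(gen.circuit(n))) for n in sizes}

# toy stand-ins (illustrative only): a Bernstein-Vazirani-style generator,
# a simulator-backed executor, and a simple results analyzer
class BV:
    def circuit(self, size): return "bv:" + "1" * size
class Sim:
    def run(self, circuit): return {"ones": circuit.count("1")}
class Fidelity:
    def score(self, result): return result["ones"] / (result["ones"] + 1)

scores = benchmark(BV(), Sim(), Fidelity(), [1, 2])
```

Because each component sees only the interface next to it, a Qiskit generator, a CUDA-Q backend, and a pyGSTi-style analyzer could be mixed freely, which is the interoperability claim in miniature.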

Updated: 2025-10-09 17:09:56

Categories: quant-ph,cs.AI,cs.SE

Download: http://arxiv.org/abs/2510.08469v1

LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization

Bayesian optimization (BO) is a sequential decision-making tool widely used for optimizing expensive black-box functions. Recently, Large Language Models (LLMs) have shown remarkable adaptability in low-data regimes, making them promising tools for black-box optimization by leveraging contextual knowledge to propose high-quality query points. However, relying solely on LLMs as optimization agents introduces risks due to their lack of explicit surrogate modeling and calibrated uncertainty, as well as their inherently opaque internal mechanisms. This structural opacity makes it difficult to characterize or control the exploration-exploitation trade-off, ultimately undermining theoretical tractability and reliability. To address this, we propose LLINBO: LLM-in-the-Loop BO, a hybrid framework for BO that combines LLMs with statistical surrogate experts (e.g., Gaussian Processes (GP)). The core philosophy is to leverage contextual reasoning strengths of LLMs for early exploration, while relying on principled statistical models to guide efficient exploitation. Specifically, we introduce three mechanisms that enable this collaboration and establish their theoretical guarantees. We end the paper with a real-life proof-of-concept in the context of 3D printing. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/LLM-in-the-Loop-BO.

Updated: 2025-10-09 17:09:12

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.14756v2

A Survey of Reinforcement Learning for Large Reasoning Models

In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs). With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

Updated: 2025-10-09 17:08:52

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.08827v3

In-Context Clustering with Large Language Models

We propose In-Context Clustering (ICC), a flexible LLM-based procedure for clustering data from diverse distributions. Unlike traditional clustering algorithms constrained by predefined similarity measures, ICC flexibly captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data, with attention matrices showing salient cluster patterns. Spectral clustering using attention matrices offers surprisingly competitive performance. We further enhance the clustering capabilities of LLMs on numeric and image data through fine-tuning using the Next Token Prediction (NTP) loss. Moreover, the flexibility of LLM prompting enables text-conditioned image clustering, a capability that classical clustering methods lack. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering. Our code is available at https://agenticlearning.ai/icc.

Updated: 2025-10-09 17:07:55

Domains: cs.LG

Download: http://arxiv.org/abs/2510.08466v1

Accelerated Aggregated D-Optimal Designs for Estimating Main Effects in Black-Box Models

Recent advances in supervised learning have driven growing interest in explaining black-box models, particularly by estimating the effects of input variables on model predictions. However, existing approaches often face key limitations, including poor scalability, sensitivity to out-of-distribution sampling, and instability under correlated features. To address these issues, we propose A2D2E, an $\textbf{E}$stimator based on $\textbf{A}$ccelerated $\textbf{A}$ggregated $\textbf{D}$-Optimal $\textbf{D}$esigns. Our method leverages principled experimental design to improve efficiency and robustness in main effect estimation. We establish theoretical guarantees, including convergence and variance reduction, and validate A2D2E through extensive simulations. We further demonstrate the potential of the proposed method with a case study on real data and applications to language models. The code to reproduce the results can be found at https://github.com/cchihyu/A2D2E.
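
A2D2E itself is not reproduced here, but its building block, a D-optimal design, is easy to illustrate: greedily pick candidate rows that maximize the log-determinant of the information matrix. The greedy scheme and ridge term are our simplifications, not the paper's accelerated aggregation.

```python
import numpy as np

def greedy_d_optimal(X, k, ridge=1e-6):
    """Greedy D-optimal subset: pick k rows of X maximizing
    log det(X_S^T X_S + ridge*I), the D-optimality criterion."""
    n, p = X.shape
    chosen, M = [], ridge * np.eye(p)
    for _ in range(k):
        best_i, best_val = None, -np.inf
        for i in range(n):
            if i in chosen:
                continue
            val = np.linalg.slogdet(M + np.outer(X[i], X[i]))[1]
            if val > best_val:
                best_i, best_val = i, val
        chosen.append(best_i)
        M += np.outer(X[best_i], X[best_i])
    return chosen

# Rows 0 and 1 are redundant; a D-optimal design spreads over both axes.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
assert set(greedy_d_optimal(X, 2)) == {0, 2}
```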

Updated: 2025-10-09 17:07:36

Domains: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.08465v1

Don't Run with Scissors: Pruning Breaks VLA Models but They Can Be Recovered

Vision-Language-Action (VLA) models have advanced robotic capabilities but remain challenging to deploy on resource-limited hardware. Pruning has enabled efficient compression of large language models (LLMs), yet it is largely understudied in robotics. Surprisingly, we observe that pruning VLA models leads to drastic degradation and increased safety violations. We introduce GLUESTICK, a post-pruning recovery method that restores much of the original model's functionality while retaining sparsity benefits. Our method performs a one-time interpolation between the dense and pruned models in weight-space to compute a corrective term. This correction is used during inference by each pruned layer to recover lost capabilities with minimal overhead. GLUESTICK requires no additional training, is agnostic to the pruning algorithm, and introduces a single hyperparameter that controls the tradeoff between efficiency and accuracy. Across diverse VLA architectures and tasks in manipulation and navigation, GLUESTICK achieves competitive memory efficiency while substantially recovering success rates and reducing safety violations. Additional material can be found at: https://gluestick-vla.github.io/.
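
The recovery idea can be sketched for a single linear layer. This is a toy rendering under our own assumptions (magnitude pruning, a plain weight-space difference as the corrective term); the paper's exact interpolation and per-layer application may differ.

```python
import numpy as np

def make_correction(w_dense, w_pruned):
    """One-time, weight-space corrective term computed after pruning."""
    return w_dense - w_pruned

def corrected_forward(x, w_pruned, correction, alpha):
    """Pruned layer plus a scaled correction. alpha is the single
    efficiency/accuracy knob: 0 = pure pruned model, 1 = dense output."""
    return x @ (w_pruned + alpha * correction)

rng = np.random.default_rng(0)
w_dense = rng.normal(size=(4, 4))
w_pruned = np.where(np.abs(w_dense) > 0.5, w_dense, 0.0)  # magnitude pruning
c = make_correction(w_dense, w_pruned)
x = rng.normal(size=(2, 4))
assert np.allclose(corrected_forward(x, w_pruned, c, 1.0), x @ w_dense)
assert np.allclose(corrected_forward(x, w_pruned, c, 0.0), x @ w_pruned)
```

Intermediate alpha values interpolate between the two, which is where the tradeoff lives.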

Updated: 2025-10-09 17:07:30

Domains: cs.RO,cs.LG

Download: http://arxiv.org/abs/2510.08464v1

Wavefunction Flows: Efficient Quantum Simulation of Continuous Flow Models

Flow models are a cornerstone of modern machine learning. They are generative models that progressively transform probability distributions according to learned dynamics. Specifically, they learn a continuous-time Markov process that efficiently maps samples from a simple source distribution into samples from a complex target distribution. We show that these models are naturally related to the Schr\"odinger equation, for an unusual Hamiltonian on continuous variables. Moreover, we prove that the dynamics generated by this Hamiltonian can be efficiently simulated on a quantum computer. Together, these results give a quantum algorithm for preparing coherent encodings (a.k.a., qsamples) for a vast family of probability distributions--namely, those expressible by flow models--by reducing the task to an existing classical learning problem, plus Hamiltonian simulation. For statistical problems defined by flow models, such as mean estimation and property testing, this enables the use of quantum algorithms tailored to qsamples, which may offer advantages over classical algorithms based only on samples from a flow model. More broadly, these results reveal a close connection between state-of-the-art machine learning models, such as flow matching and diffusion models, and one of the main expected capabilities of quantum computers: simulating quantum dynamics.

Updated: 2025-10-09 17:05:54

Domains: quant-ph,cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.08462v1

SummDiff: Generative Modeling of Video Summarization with Diffusion

Video summarization is the task of shortening a video by choosing a subset of frames while preserving its essential moments. Despite the innate subjectivity of the task, previous works have deterministically regressed to an averaged frame score over multiple raters, ignoring that what constitutes a good summary varies across viewers. We propose a novel problem formulation by framing video summarization as a conditional generation task, allowing a model to learn the distribution of good summaries and to generate multiple plausible summaries that better reflect varying human perspectives. Adopting diffusion models for the first time in video summarization, our proposed method, SummDiff, dynamically adapts to visual contexts and generates multiple candidate summaries conditioned on the input video. Extensive experiments demonstrate that SummDiff not only achieves state-of-the-art performance on various benchmarks but also produces summaries that closely align with individual annotator preferences. Moreover, we provide deeper insight with novel metrics derived from an analysis of the knapsack step, an important final stage of summary generation that has been overlooked in evaluation.
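
The knapsack step the authors analyze is the classic final stage of summarization pipelines: given per-segment importance scores and durations, select segments maximizing total score within a summary-length budget. A minimal 0/1-knapsack DP sketch, assuming integer durations:

```python
def knapsack_summary(scores, lengths, budget):
    """0/1 knapsack: pick segments maximizing total score subject to a
    duration budget -- the standard last step in video summarization."""
    n = len(scores)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if lengths[i - 1] <= b:
                dp[i][b] = max(dp[i][b], dp[i - 1][b - lengths[i - 1]] + scores[i - 1])
    # Backtrack to recover the chosen segment indices.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)

# Two short segments beat one long high-scoring segment under a budget of 5.
assert knapsack_summary([6.0, 5.0, 5.0], [5, 3, 2], 5) == [1, 2]
```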

Updated: 2025-10-09 17:03:51

Domains: cs.LG

Download: http://arxiv.org/abs/2510.08458v1

Integral Signatures of Activation Functions: A 9-Dimensional Taxonomy and Stability Theory for Deep Learning

Activation functions govern the expressivity and stability of neural networks, yet existing comparisons remain largely heuristic. We propose a rigorous framework for their classification via a nine-dimensional integral signature S_sigma(phi), combining Gaussian propagation statistics (m1, g1, g2, m2, eta), asymptotic slopes (alpha_plus, alpha_minus), and regularity measures (TV(phi'), C(phi)). This taxonomy establishes well-posedness, affine reparameterization laws with bias, and closure under bounded slope variation. Dynamical analysis yields Lyapunov theorems with explicit descent constants and identifies variance stability regions through (m2', g2). From a kernel perspective, we derive dimension-free Hessian bounds and connect smoothness to bounded variation of phi'. Applying the framework, we classify eight standard activations (ReLU, leaky-ReLU, tanh, sigmoid, Swish, GELU, Mish, TeLU), proving sharp distinctions between saturating, linear-growth, and smooth families. Numerical Gauss-Hermite and Monte Carlo validation confirms theoretical predictions. Our framework provides principled design guidance, moving activation choice from trial-and-error to provable stability and kernel conditioning.
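
The Gaussian propagation statistics in the signature can be computed by Gauss-Hermite quadrature, as the paper's numerical validation does. A minimal sketch for m1 = E[phi(Z)] and m2 = E[phi(Z)^2] with Z ~ N(0,1); the helper name and node count are ours:

```python
import numpy as np

def gaussian_stats(phi, n=80):
    """m1 = E[phi(Z)], m2 = E[phi(Z)^2] for Z ~ N(0,1), using
    probabilists' Gauss-Hermite nodes (weights sum to sqrt(2*pi))."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n)
    w = weights / np.sqrt(2.0 * np.pi)
    vals = phi(nodes)
    return float(w @ vals), float(w @ vals**2)

relu = lambda z: np.maximum(z, 0.0)
m1, m2 = gaussian_stats(relu)
# Closed forms for ReLU: E[ReLU(Z)] = 1/sqrt(2*pi), E[ReLU(Z)^2] = 1/2.
assert abs(m1 - 1.0 / np.sqrt(2.0 * np.pi)) < 1e-2
assert abs(m2 - 0.5) < 1e-2
```

For polynomial activations the quadrature is exact; for kinked ones like ReLU it converges as the node count grows.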

Updated: 2025-10-09 17:03:00

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08456v1

Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
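
The stratified normalization can be sketched as follows. The band width and ranges are our illustrative choices, and HERO's variance-aware weighting is omitted; the point is only that correctness always dominates while the reward model refines ordering within each group.

```python
import numpy as np

def hybrid_rewards(verifier, rm_scores, band=0.2):
    """Stratified normalization: RM scores are rescaled into a narrow band
    inside each verifier-defined group (correct -> [0.8, 1.0],
    incorrect -> [0.0, 0.2]), so verifier correctness is preserved while
    the reward model distinguishes quality within a group."""
    v = np.asarray(verifier, dtype=bool)
    rm = np.asarray(rm_scores, dtype=float)
    out = np.empty_like(rm)
    for mask, base in ((v, 1.0 - band), (~v, 0.0)):
        if mask.any():
            g = rm[mask]
            span = g.max() - g.min()
            out[mask] = base + band * ((g - g.min()) / span if span > 0 else 0.5)
    return out

rewards = hybrid_rewards([True, True, False, False], [0.2, 0.9, 0.8, 0.1])
assert rewards[:2].min() > rewards[2:].max()   # verifier ordering preserved
```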

Updated: 2025-10-09 17:01:54

Domains: cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.07242v2

gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity

Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed sized vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node's representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our capacity synthetic task, as well as a range of real-world graph benchmarks.
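
The associative-memory ingredient borrowed from the fast-weight literature can be illustrated with a linear outer-product memory: writes are rank-1 updates, and orthogonal keys are recalled exactly, which is the flavor of storage capacity at issue. This sketch is generic, not the gLSTM cell itself.

```python
import numpy as np

def write(M, key, value):
    """Fast-weight style rank-1 write into a linear associative memory."""
    return M + np.outer(value, key)

def read(M, key):
    return M @ key

d = 8
rng = np.random.default_rng(0)
keys = np.linalg.qr(rng.normal(size=(d, d)))[0].T[:3]  # 3 orthonormal keys
vals = rng.normal(size=(3, d))
M = np.zeros((d, d))
for k, v in zip(keys, vals):
    M = write(M, k, v)
assert np.allclose(read(M, keys[1]), vals[1])  # exact recall
```

Once more items are stored than the key dimension allows, recalls interfere, mirroring the over-squashing bottleneck in a node's fixed-size representation.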

Updated: 2025-10-09 16:58:49

Domains: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2510.08450v1

Data-Error Scaling Laws in Machine Learning on Combinatorial Mutation-prone Sets: Proteins and Small Molecules

We investigate trends in the data-error scaling laws of machine learning (ML) models trained on discrete combinatorial spaces that are prone to mutation, such as proteins or organic small molecules. We trained and evaluated kernel ridge regression machines using variable amounts of computational and experimental training data. Our synthetic datasets comprised i) two naïve functions based on many-body theory; ii) binding energy estimates between a protein and a mutagenised peptide; and iii) solvation energies of two 6-heavy-atom structural graphs, while the experimental dataset consisted of a full deep mutational scan of the binding protein GB1. In contrast to typical data-error scaling laws, our results showed discontinuous monotonic phase transitions during learning, observed as rapid drops in the test error at particular thresholds of training data. We observed two learning regimes, which we call saturated and asymptotic decay, and found that they are conditioned by the level of complexity (i.e. number of mutations) enclosed in the training set. We show that, during training on this class of problems, the predictions of the ML models form clusters in the calibration plots. Furthermore, we present an alternative strategy to normalize learning curves (LCs) and introduce the concept of mutant-based shuffling. This work has implications for machine learning on mutagenisable discrete spaces such as chemical properties or protein phenotype prediction, and improves basic understanding of concepts in statistical learning theory.

Updated: 2025-10-09 16:57:40

Domains: physics.chem-ph,cs.LG

Download: http://arxiv.org/abs/2405.05167v2

Machine-Learning Driven Load Shedding to Mitigate Instability Attacks in Power Grids

Critical infrastructures are becoming increasingly complex as our society becomes increasingly dependent on them. This complexity opens the door to new possibilities for attacks and a need for new defense strategies. Our work focuses on instability attacks on the power grid, wherein an attacker causes cascading outages by introducing unstable dynamics into the system. When stress is placed on the power grid, a standard mitigation approach is load shedding: the system operator chooses a set of loads to shut off until the situation is resolved. While this technique is standard, there is no systematic approach to choosing which loads will stop an instability attack. This paper addresses this problem using a data-driven methodology for load shedding decisions. We show a proof of concept on the IEEE 14 Bus System using the Achilles Heel Technologies Power Grid Analyzer, and show through an implementation of modified Prony analysis (MPA) that MPA is a viable method for detecting instability attacks and triggering defense mechanisms.
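
Modified Prony analysis is not spelled out in the abstract, but the plain Prony idea behind it is compact: fit a linear-prediction model to a sampled signal and inspect the magnitudes of its characteristic roots; any root outside the unit circle indicates a growing oscillation. A minimal sketch (function name and model order are ours):

```python
import numpy as np

def unstable_modes(signal, order=2):
    """Fit a linear-prediction (Prony-style) model and return the
    discrete-time mode magnitudes; |z| > 1 flags a growing oscillation."""
    x = np.asarray(signal, dtype=float)
    # Least-squares solve x[n] = a1*x[n-1] + ... + ap*x[n-p]
    A = np.array([x[i:i + order][::-1] for i in range(len(x) - order)])
    b = x[order:]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Roots of the characteristic polynomial z^p - a1 z^{p-1} - ... - ap
    return np.abs(np.roots(np.concatenate(([1.0], -a))))

t = np.arange(200) * 0.01
growing = np.exp(0.5 * t) * np.sin(2 * np.pi * 1.5 * t)    # unstable mode
decaying = np.exp(-0.5 * t) * np.sin(2 * np.pi * 1.5 * t)  # damped mode
assert unstable_modes(growing).max() > 1.0
assert unstable_modes(decaying).max() < 1.0
```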

Updated: 2025-10-09 16:55:13

Domains: cs.LG

Download: http://arxiv.org/abs/2509.26532v2

Synthetic Series-Symbol Data Generation for Time Series Foundation Models

Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.
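
The generation mechanism itself isn't specified in the abstract; as a toy stand-in, one can sample a random composition of unary operators and realize it on a time grid, yielding a paired (symbolic expression, series) example. The operator set and composition depth here are our own choices.

```python
import random
import numpy as np

UNARY = [("sin", np.sin), ("cos", np.cos), ("tanh", np.tanh)]

def sample_series_symbol_pair(rng, n=256):
    """Random two-level composition f(t) = op2(a2 * op1(a1 * t)),
    returned both as a symbolic string and as a sampled series."""
    t = np.linspace(0.0, 4.0 * np.pi, n)
    (n1, f1), (n2, f2) = rng.choice(UNARY), rng.choice(UNARY)
    a1, a2 = rng.uniform(0.5, 3.0), rng.uniform(0.5, 3.0)
    expr = f"{n2}({a2:.3f} * {n1}({a1:.3f} * t))"
    return expr, f2(a2 * f1(a1 * t))

expr, series = sample_series_symbol_pair(random.Random(0))
assert series.shape == (256,)
assert expr.count("(") == 2
```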

Updated: 2025-10-09 16:54:18

Domains: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08445v1

Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning

Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.
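
The return-guided triplet construction can be sketched with indices alone: states whose returns differ by less than a threshold are positives for each other, the rest negatives, and a standard triplet loss then trains the attention features. The threshold and loss form are our illustrative choices.

```python
import numpy as np

def build_triplets(returns, thresh=0.5):
    """(anchor, positive, negative) index triplets: positives have similar
    returns to the anchor, negatives differ by at least `thresh`."""
    n, trips = len(returns), []
    for a in range(n):
        for p in range(n):
            for q in range(n):
                if len({a, p, q}) < 3:
                    continue
                if abs(returns[a] - returns[p]) < thresh <= abs(returns[a] - returns[q]):
                    trips.append((a, p, q))
    return trips

def triplet_loss(anchor, pos, neg, margin=1.0):
    """Pull same-outcome features together, push different-outcome apart."""
    return max(0.0, np.sum((anchor - pos) ** 2) - np.sum((anchor - neg) ** 2) + margin)

trips = build_triplets([1.0, 1.1, 0.0])
assert trips == [(0, 1, 2), (1, 0, 2)]
assert triplet_loss(np.zeros(2), np.zeros(2), np.ones(2)) == 0.0
```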

Updated: 2025-10-09 16:54:11

Domains: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2510.08442v1

xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning

Modern LLM deployments confront a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. Static escalation rules and keyword heuristics under-utilize this spectrum and fail to adapt across task types. We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. The router is trained end-to-end with reinforcement learning using an explicit, cost-aware reward that encodes cost-performance trade-offs, eliminating the need for hand-engineered routing rules. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting, as well as the deployment and evaluation pipelines. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs (e.g., substantial cost reductions at comparable task completion rates), and provides empirical insights into what reliably helps learned routing and what does not, ranging from model trainability to the difficulty of eliciting sophisticated orchestration behaviors in small open models. We hope these findings and our open implementation will serve as a practical substrate for advancing learned, cost-aware LLM orchestration.
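
xRouter learns its router with RL, but the shape of a cost-aware objective can be illustrated with a hand-rolled expected-reward router. Everything below is invented for illustration: the lambda weight, the success-probability estimators, and the prices.

```python
def route(difficulty, models, lam=2.0):
    """Pick the model maximizing expected cost-aware reward:
    P(success) - lam * cost, i.e. success credit minus priced spend."""
    return max(models, key=lambda m: m[1](difficulty) - lam * m[2])[0]

MODELS = [
    # (name, estimated success probability vs. difficulty, cost in $)
    ("small", lambda d: max(0.0, 0.95 - d), 0.001),  # cheap but brittle
    ("large", lambda d: 0.9, 0.05),                  # robust but pricey
]
assert route(0.1, MODELS) == "small"   # easy query: cheap model wins
assert route(0.8, MODELS) == "large"   # hard query: premium model pays off
```

Raising lam pushes more traffic to the cheap model, which is the cost-performance dial the reward encodes.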

Updated: 2025-10-09 16:52:01

Domains: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08439v1

Cost-aware Stopping for Bayesian Optimization

In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions is an important practical consideration. While several adaptive stopping rules have been proposed, in the cost-aware setting they lack guarantees ensuring they stop before incurring excessive function evaluation costs. We propose a cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs and is free of heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora's Box Gittins Index (PBGI) and log expected improvement per cost. We prove a theoretical guarantee bounding the expected cumulative evaluation cost incurred by our stopping rule when paired with these two acquisition functions. In experiments on synthetic and empirical tasks, including hyperparameter optimization and neural architecture size search, we show that combining our stopping rule with the PBGI acquisition function usually matches or outperforms other acquisition-function--stopping-rule pairs in terms of cost-adjusted simple regret, a metric capturing trade-offs between solution quality and cumulative evaluation cost.
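
The evaluation metric can be written down in one line under an assumed form (the paper's exact definition may differ): simple regret plus a priced cumulative evaluation cost.

```python
def cost_adjusted_simple_regret(f_best, f_opt, costs, lam=0.01):
    """One plausible form: (f_opt - f_best) + lam * total evaluation cost,
    so stopping late is penalized even when solution quality is high."""
    return (f_opt - f_best) + lam * sum(costs)

# Stopping after 3 evaluations costing 1, 2, 3 with best value 0.9 of 1.0:
r = cost_adjusted_simple_regret(0.9, 1.0, [1.0, 2.0, 3.0])
assert abs(r - 0.16) < 1e-9
```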

Updated: 2025-10-09 16:49:46

Domains: cs.LG

Download: http://arxiv.org/abs/2507.12453v2

Navigating Sparsities in High-Dimensional Linear Contextual Bandits

High-dimensional linear contextual bandit problems remain a significant challenge due to the curse of dimensionality. Existing methods typically consider either the model parameters to be sparse or the eigenvalues of context covariance matrices to be (approximately) sparse, lacking general applicability due to the rigidity of conventional reward estimators. To overcome this limitation, a powerful pointwise estimator is introduced in this work that adaptively navigates both kinds of sparsity. Based on this pointwise estimator, a novel algorithm, termed HOPE, is proposed. Theoretical analyses demonstrate that HOPE not only achieves improved regret bounds in previously discussed homogeneous settings (i.e., considering only one type of sparsity) but also, for the first time, efficiently handles two new challenging heterogeneous settings (i.e., considering a mixture of two types of sparsity), highlighting its flexibility and generality. Experiments corroborate the superiority of HOPE over existing methods across various scenarios.

Updated: 2025-10-09 16:47:14

Domains: math.ST,cs.LG,stat.TH

Download: http://arxiv.org/abs/2510.08435v1

Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6\% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.

Updated: 2025-10-09 16:46:30

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07231v2

Parallel Spooky Pebbling Makes Regev Factoring More Practical

"Pebble games," an abstraction from classical reversible computing, have found use in the design of quantum circuits for inherently sequential tasks. Gidney showed that allowing Hadamard basis measurements during pebble games can dramatically improve costs -- an extension termed "spooky pebble games" because the measurements leave temporary phase errors called ghosts. In this work, we define and study parallel spooky pebble games. Previous work by Blocki, Holman, and Lee (TCC 2022) and Gidney studied the benefits offered by either parallelism or spookiness individually; here we show that these resources can yield impressive gains when used together. First, we show by construction that a line graph of length $\ell$ can be pebbled in depth $2\ell$ (which is exactly optimal) using space $\leq 2.47\log \ell$. Then, to explore pebbling schemes using even less space, we use a highly optimized $A^*$ search implemented in Julia to find the lowest-depth parallel spooky pebbling possible for a range of concrete line graph lengths $\ell$ given a constant number of pebbles $s$. We show that these techniques can be applied to Regev's factoring algorithm (Journal of the ACM 2025) to significantly reduce the cost of its arithmetic. For example, we find that 4096-bit integers $N$ can be factored in multiplication depth 193, which outperforms the 680 required of previous variants of Regev and the 444 reported by Ekerå and Gärtner for Shor's algorithm (IACR Communications in Cryptology 2025). While space-optimized implementations of Shor's algorithm remain likely the best candidates for first quantum factorization of large integers, our results show that Regev's algorithm may have practical importance in the future, especially given the possibility of further optimization. Finally, we believe our pebbling techniques will find applications in quantum cryptanalysis beyond integer factorization.

Updated: 2025-10-09 16:45:58

Domains: quant-ph,cs.CR

Download: http://arxiv.org/abs/2510.08432v1

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.

Updated: 2025-10-09 16:45:30

Fields: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.08431v1

On The Sample Complexity Bounds In Bilevel Reinforcement Learning

Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\mathcal{O}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-{\L}ojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Our analysis also extends to general bi-level optimization settings with non-convex lower levels, where we achieve state-of-the-art sample complexity results of $\mathcal{O}(\epsilon^{-3})$ improving upon existing bounds of $\mathcal{O}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.
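
The fully first-order, Hessian-free idea can be illustrated on a toy quadratic bilevel problem. The sketch below is an assumption-laden illustration (the quadratic objectives, the penalty parameter `lam`, and the plain gradient-descent solver are all invented for demonstration; the paper's actual algorithm and analysis differ): it estimates the hypergradient from two lower-level solves and first-order gradients only, in the spirit of value-function penalty methods.

```python
import numpy as np

# Toy bilevel problem (hypothetical, for illustration only):
#   upper level: f(x, y) = ||x - y||^2
#   lower level: g(x, y) = ||y - A x||^2, whose minimizer is y*(x) = A x
# so the true hyper-objective is F(x) = ||x - A x||^2 with gradient
# 2 (I - A)^T (x - A x).

rng = np.random.default_rng(0)
A = 0.5 * rng.normal(size=(3, 3))
x = rng.normal(size=3)
lam = 1000.0  # penalty parameter; the estimate improves as lam grows

def gd(grad, y0, lr, steps):
    """Plain gradient descent: the only solver we need (no Hessians)."""
    y = y0.copy()
    for _ in range(steps):
        y = y - lr * grad(y)
    return y

# y* = argmin_y g(x, y)
y_star = gd(lambda y: 2 * (y - A @ x), np.zeros(3), lr=0.4, steps=300)
# y_lam = argmin_y [ lam * g(x, y) + f(x, y) ]
y_lam = gd(lambda y: 2 * lam * (y - A @ x) + 2 * (y - x),
           np.zeros(3), lr=1.0 / (2 * (lam + 1)), steps=50)

def grad_x_g(y):  # d g / d x = -2 A^T (y - A x)
    return -2 * A.T @ (y - A @ x)

# First-order hypergradient estimate (no Hessians, no second-order JVPs):
est = lam * grad_x_g(y_lam) + 2 * (x - y_lam) - lam * grad_x_g(y_star)

true = 2 * (np.eye(3) - A).T @ (x - A @ x)
print(np.max(np.abs(est - true)))  # small bias, shrinking as lam grows
```

The bias of this estimate decays like 1/lam on this quadratic example, which is the intuition behind trading Hessian computations for an extra penalized lower-level solve.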

Updated: 2025-10-09 16:45:25

Fields: cs.LG,cs.AI

Download: http://arxiv.org/abs/2503.17644v5

ClauseLens: Clause-Grounded, CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing

Reinsurance treaty pricing must satisfy stringent regulatory standards, yet current quoting practices remain opaque and difficult to audit. We introduce ClauseLens, a clause-grounded reinforcement learning framework that produces transparent, regulation-compliant, and risk-aware treaty quotes. ClauseLens models the quoting task as a Risk-Aware Constrained Markov Decision Process (RA-CMDP). Statutory and policy clauses are retrieved from legal and underwriting corpora, embedded into the agent's observations, and used both to constrain feasible actions and to generate clause-grounded natural language justifications. Evaluated in a multi-agent treaty simulator calibrated to industry data, ClauseLens reduces solvency violations by 51%, improves tail-risk performance by 27.9% (CVaR_0.10), and achieves 88.2% accuracy in clause-grounded explanations with retrieval precision of 87.4% and recall of 91.1%. These findings demonstrate that embedding legal context into both decision and explanation pathways yields interpretable, auditable, and regulation-aligned quoting behavior consistent with Solvency II, NAIC RBC, and the EU AI Act.
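
Two ingredients named above can be sketched concretely: the empirical CVaR_alpha of a loss sample (the mean of the worst alpha-fraction of outcomes) and a clause-derived feasibility mask over quote actions. Everything below (field names, thresholds, numbers) is hypothetical:

```python
import numpy as np

def cvar(losses, alpha=0.10):
    """Empirical CVaR: mean of the worst (largest) alpha-fraction of losses."""
    losses = np.sort(np.asarray(losses, dtype=float))
    k = max(1, int(np.ceil(alpha * losses.size)))
    return losses[-k:].mean()

def feasible_quotes(quotes, min_premium_rate, max_line):
    """Keep only quotes satisfying hypothetical statutory clause constraints."""
    return [q for q in quotes
            if q["premium_rate"] >= min_premium_rate and q["line"] <= max_line]

losses = [1.0, 2.0, 3.0, 10.0, 50.0, 2.5, 1.5, 4.0, 0.5, 20.0]
print(cvar(losses, alpha=0.2))  # mean of the two largest losses: 35.0

quotes = [{"premium_rate": 0.02, "line": 5e6},
          {"premium_rate": 0.08, "line": 1e6}]
ok = feasible_quotes(quotes, min_premium_rate=0.05, max_line=2e6)
print(len(ok))  # 1: only the second quote satisfies both clauses
```

In an RA-CMDP, a mask like `feasible_quotes` restricts the action set before selection, while a CVaR estimate of the loss tail enters the constraint the policy must satisfy.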

Updated: 2025-10-09 16:43:49

Fields: cs.LG,cs.AI,stat.ML,68T05, 91G70,I.2.6; I.2.7; J.1

Download: http://arxiv.org/abs/2510.08429v1

Reinforcing Diffusion Models by Direct Group Preference Optimization

While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
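
The group-level idea can be illustrated with a GRPO-style within-group normalization: rewards for a group of samples generated from the same prompt are mean-centered and scale-normalized, so each sample's learning signal is purely relative to its group. This is only a hedged sketch of the relative-information ingredient (the numbers are made up; DGPO's actual objective is defined in the paper):

```python
import numpy as np

def group_relative_weights(rewards, eps=1e-8):
    """Within-group relative signal: mean-centered, std-normalized rewards."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four samples generated for the same prompt, scored by a reward model.
w = group_relative_weights([1.0, 2.0, 3.0, 6.0])
print(w)  # positive for above-average samples, negative for below-average
```

Because the signal depends only on comparisons within the group, it can be attached to samples from a deterministic ODE sampler, whereas a policy-gradient estimator would require a stochastic sampling policy.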

Updated: 2025-10-09 16:40:43

Fields: cs.LG,cs.CV

Download: http://arxiv.org/abs/2510.08425v1

Neuro-Symbolic Agents with Modal Logic for Autonomous Diagnostics

The development of intelligent agents, particularly those powered by language models (LMs), has highlighted their critical role in environments that require intelligent, autonomous decision-making. Environments are not passive testing grounds: they supply the data from which agents learn and present challenging conditions that demand adaptive, complex, and autonomous decision-making capacity. While the paradigm of scaling models and datasets has led to remarkable emergent capabilities, we argue that scaling the structure, fidelity, and logical consistency of agent reasoning within these environments is a crucial, yet underexplored, dimension of AI research. This paper introduces a neuro-symbolic multi-agent architecture where the belief states of individual agents are formally represented as Kripke models. This foundational choice enables them to reason about the concepts of \emph{possibility} and \emph{necessity} using the formal language of modal logic. In this work, we use immutable, domain-specific knowledge to infer information, encoded as logical constraints essential for proper diagnosis. In the proposed model, we show how these constraints actively guide the hypothesis generation of LMs, effectively preventing them from reaching physically or logically untenable conclusions. In a high-fidelity simulated particle accelerator environment, our system successfully diagnoses complex, cascading failures by combining the powerful semantic intuition of LMs with the rigorous, verifiable validation of modal logic and a factual world model, showcasing a viable path toward more robust, reliable, and verifiable autonomous agents.
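
A belief state represented as a Kripke model supports exactly the two modal queries mentioned: "possibly p" (p holds in some accessible world) and "necessarily p" (p holds in all accessible worlds). A minimal sketch, with hypothetical fault propositions:

```python
# Minimal Kripke model: worlds, an accessibility relation, and a valuation
# saying which propositions hold at each world. The fault names are made up.

worlds = {"w1", "w2", "w3"}
access = {"w1": {"w2", "w3"}, "w2": {"w2"}, "w3": {"w3"}}
valuation = {
    "w1": {"pump_ok"},
    "w2": {"pump_ok", "valve_stuck"},
    "w3": {"pump_ok"},
}

def possibly(world, prop):
    """Diamond operator: prop holds in some world accessible from `world`."""
    return any(prop in valuation[v] for v in access[world])

def necessarily(world, prop):
    """Box operator: prop holds in every world accessible from `world`."""
    return all(prop in valuation[v] for v in access[world])

print(possibly("w1", "valve_stuck"))     # True: it holds in accessible w2
print(necessarily("w1", "pump_ok"))      # True: it holds in both w2 and w3
print(necessarily("w1", "valve_stuck"))  # False: it fails in w3
```

An LM-proposed diagnosis can then be rejected whenever it asserts as necessary something the agent's Kripke model only supports as possible.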

Updated: 2025-10-09 16:39:50

Fields: cs.AI,cs.LG,cs.LO,cs.MA

Download: http://arxiv.org/abs/2509.11943v2

Real-time Noise Detection and Classification in Single-Channel EEG: A Lightweight Machine Learning Approach for EMG, White Noise, and EOG Artifacts

Electroencephalogram (EEG) artifact detection in real-world settings faces significant challenges such as computational inefficiency in multi-channel methods, poor robustness to simultaneous noise, and trade-offs between accuracy and complexity in deep learning models. We propose a hybrid spectral-temporal framework for real-time detection and classification of ocular (EOG), muscular (EMG), and white noise artifacts in single-channel EEG. This method, in contrast to other approaches, combines time-domain low-pass filtering (targeting low-frequency EOG) and frequency-domain power spectral density (PSD) analysis (capturing broad-spectrum EMG), followed by PCA-optimized feature fusion to minimize redundancy while preserving discriminative information. This feature engineering strategy allows a lightweight multi-layer perceptron (MLP) architecture to outperform advanced CNNs and RNNs by achieving 99% accuracy at low SNR (-7 dB) and >90% accuracy in moderate noise (4 dB SNR). Additionally, this framework addresses the unexplored problem of simultaneous multi-source contamination (EMG+EOG+white noise), where it maintains 96% classification accuracy despite overlapping artifacts. With 30-second training times (97% faster than CNNs) and robust performance across SNR levels, this framework bridges the gap between clinical applicability and computational efficiency, enabling real-time use in wearable brain-computer interfaces. This work also challenges the ubiquitous dependence on model depth for EEG artifact detection by demonstrating that domain-informed feature fusion surpasses complex architectures in noisy scenarios.
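
The hybrid spectral-temporal feature idea can be sketched as follows. All concrete choices below (a moving-average stand-in for the low-pass filter, periodogram band powers, band edges, sampling rate) are assumptions for illustration, not the paper's exact parameters:

```python
import numpy as np

fs = 256  # sampling rate in Hz (assumed)

def lowpass_ma(x, width=32):
    """Crude low-pass stand-in (moving average) targeting slow EOG drifts."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def band_powers(x, bands=((1, 4), (4, 13), (13, 30), (30, 100))):
    """Periodogram-based PSD summed over frequency bands (broadband EMG cue)."""
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / x.size
    return np.array([psd[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands])

def features(x):
    """Concatenate time-domain and frequency-domain features (6-dim here)."""
    low = lowpass_ma(x)
    return np.concatenate([[low.mean(), low.std()], band_powers(x)])

def pca_fit(X, k=3):
    """PCA via SVD of the centered feature matrix; returns k principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k]  # project new data with (x - mean) @ vt[:k].T

rng = np.random.default_rng(1)
X = np.stack([features(rng.normal(size=2 * fs)) for _ in range(20)])
axes = pca_fit(X)
print(axes.shape)  # (3, 6): fused, decorrelated feature directions
```

The fused low-dimensional features would then feed a small MLP classifier, which is the lightweight part of the pipeline the abstract emphasizes.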

Updated: 2025-10-09 16:36:30

Fields: cs.LG,cs.AI,cs.NE

Download: http://arxiv.org/abs/2509.26058v2

AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable features, but existing approaches rely on fixed sparsity constraints that fail to account for input complexity. We propose AdaptiveK SAE (Adaptive Top K Sparse Autoencoders), a novel framework that dynamically adjusts sparsity levels based on the semantic complexity of each input. Leveraging linear probes, we demonstrate that context complexity is linearly encoded in LLM representations, and we use this signal to guide feature allocation during training. Experiments across ten language models (from 70M to 14B parameters) demonstrate that this complexity-driven adaptation significantly outperforms fixed-sparsity approaches on reconstruction fidelity, explained variance, cosine similarity and interpretability metrics while eliminating the computational burden of extensive hyperparameter tuning.
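
A hedged sketch of the adaptive top-k mechanism: a linear probe scores each input's complexity, the score sets a per-input k, and only the k largest latent activations survive before decoding. All shapes, the sigmoid mapping, and the probe itself are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 64                      # activation dim, dictionary size (assumed)
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)
probe = rng.normal(size=d)         # linear complexity probe (assumed pretrained)

def adaptive_top_k(x, k_min=4, k_max=32):
    # Map the probe's complexity score through a sigmoid onto [k_min, k_max].
    score = 1 / (1 + np.exp(-probe @ x))
    k = int(round(k_min + score * (k_max - k_min)))
    z = np.maximum(x @ W_enc, 0.0)         # ReLU latent code
    idx = np.argsort(z)[-k:]               # keep only the k largest features
    z_sparse = np.zeros_like(z)
    z_sparse[idx] = z[idx]
    return z_sparse, z_sparse @ W_dec      # sparse code and reconstruction

x = rng.normal(size=d)
z, x_hat = adaptive_top_k(x)
print(int((z > 0).sum()))  # number of active features, at most k_max
```

Simple inputs (low probe score) thus get a small feature budget while complex inputs get a larger one, instead of a single global sparsity constraint.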

Updated: 2025-10-09 16:34:09

Fields: cs.LG

Download: http://arxiv.org/abs/2508.17320v2

RAG4Tickets: AI-Powered Ticket Resolution via Retrieval-Augmented Generation on JIRA and GitHub Data

Modern software teams frequently encounter delays in resolving recurring or related issues due to fragmented knowledge scattered across JIRA tickets, developer discussions, and GitHub pull requests (PRs). To address this challenge, we propose a Retrieval-Augmented Generation (RAG) framework that integrates Sentence-Transformers for semantic embeddings with FAISS-based vector search to deliver context-aware ticket resolution recommendations. The approach embeds historical JIRA tickets, user comments, and linked PR metadata to retrieve semantically similar past cases, which are then synthesized by a Large Language Model (LLM) into grounded and explainable resolution suggestions. The framework contributes a unified pipeline linking JIRA and GitHub data, an embedding and FAISS indexing strategy for heterogeneous software artifacts, and a resolution generation module guided by retrieved evidence. Experimental evaluation using precision, recall, resolution time reduction, and developer acceptance metrics shows that the proposed system significantly improves resolution accuracy, fix quality, and knowledge reuse in modern DevOps environments.
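
The retrieve-then-ground flow can be sketched with brute-force cosine search standing in for the Sentence-Transformers embeddings and FAISS index used in the paper (the ticket texts and random vectors below are purely illustrative):

```python
import numpy as np

def cosine_top_k(query_vec, index_vecs, k=2):
    """Brute-force cosine similarity search (stand-in for a FAISS index)."""
    q = query_vec / np.linalg.norm(query_vec)
    M = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = M @ q
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]

rng = np.random.default_rng(0)
tickets = ["login fails with SSO", "crash on export", "SSO redirect loop"]
index = rng.normal(size=(3, 8))                 # fake ticket embeddings
index[2] = index[0] + 0.1 * rng.normal(size=8)  # make ticket 2 similar to 0

ids, sims = cosine_top_k(index[0], index, k=2)
print([tickets[i] for i in ids])  # most similar past tickets first
```

The retrieved tickets (plus their linked PR metadata) would then be placed into the LLM's context so the generated resolution is grounded in concrete past cases.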

Updated: 2025-10-09 16:33:00

Fields: cs.SE,cs.AI

Download: http://arxiv.org/abs/2510.08667v1

Prompts Generalize with Low Data: Non-vacuous Generalization Bounds for Optimizing Prompts with More Informative Priors

Many prompt engineering techniques have been successful in practice, even when optimizing over a large prompt space with a small amount of task-specific data. Recent work has partially explained this success by showing generalization bounds which apply PAC-Bayes theory to the discrete prompt space, but they are non-vacuous only in data-rich scenarios. We argue that such widespread success can be more fully explained through more carefully considering data- or distribution-dependent perplexity, which acts as an effective prior and steers the optimization towards prompts that are more ``natural'' for the task at hand. We derive novel generalization bounds that are non-vacuous for data-scarce prompt optimization via more useful priors, formally analyzing how perplexity regularization tightens these bounds by limiting exploration. Empirically, we explore both the bounds' effectiveness and the practical benefits of perplexity regularization in improving prompt generalization.
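
A toy version of perplexity-as-prior: score each candidate prompt by task accuracy minus a penalty on its log-perplexity under the LM, so "unnatural" prompts are penalized during search. The candidate prompts, accuracies, perplexities, and the weight `beta` are all made up:

```python
import math

# Each candidate maps to (task accuracy, LM perplexity); all values invented.
candidates = {
    "Summarize the text in one sentence.": (0.71, 12.0),
    "text one-sentence summarize do now": (0.72, 310.0),
    "Give a concise one-sentence summary.": (0.70, 9.5),
}

def regularized_score(acc, ppl, beta=0.05):
    """Accuracy minus a perplexity penalty: a crude data-dependent prior."""
    return acc - beta * math.log(ppl)

best = max(candidates, key=lambda p: regularized_score(*candidates[p]))
print(best)  # the natural low-perplexity prompt wins despite lower raw accuracy
```

Note how the garbled prompt has the highest raw accuracy here but loses after regularization: limiting exploration to natural prompts is exactly what tightens the bound in the analysis above.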

Updated: 2025-10-09 16:32:46

Fields: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08413v1

Optimal Stopping in Latent Diffusion Models

We identify and analyze a surprising phenomenon of Latent Diffusion Models (LDMs) where the final steps of the diffusion can degrade sample quality. In contrast to conventional arguments that justify early stopping for numerical stability, this phenomenon is intrinsic to the dimensionality reduction in LDMs. We provide a principled explanation by analyzing the interaction between latent dimension and stopping time. Under a Gaussian framework with linear autoencoders, we characterize the conditions under which early stopping is needed to minimize the distance between generated and target distributions. More precisely, we show that lower-dimensional representations benefit from earlier termination, whereas higher-dimensional latent spaces require later stopping time. We further establish that the latent dimension interplays with other hyperparameters of the problem such as constraints in the parameters of score matching. Experiments on synthetic and real datasets illustrate these properties, underlining that early stopping can improve generative quality. Together, our results offer a theoretical foundation for understanding how the latent dimension influences the sample quality, and highlight stopping time as a key hyperparameter in LDMs.

Updated: 2025-10-09 16:28:48

Fields: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.08409v1

Biology-driven assessment of deep learning super-resolution imaging of the porosity network in dentin

The mechanosensory system of teeth is currently believed to rely partly on the stimulation of odontoblast cells by fluid flow through a porosity network extending through dentin. Visualizing the smallest sub-microscopic porosity vessels therefore requires the highest achievable resolution from confocal fluorescence microscopy, the current gold standard. This considerably limits the extent of the field of view to very small sample regions. To overcome this limitation, we tested different deep learning (DL) super-resolution (SR) models to allow faster experimental acquisitions of lower resolution images and restore optimal image quality by post-processing. Three supervised 2D SR models (RCAN, pix2pix, FSRCNN) and one unsupervised (CycleGAN) were applied to a unique set of experimentally paired high- and low-resolution confocal images acquired with different sampling schemes, resulting in a pixel size increase of x2, x4, and x8. Model performance was quantified using a broad set of similarity and distribution-based image quality assessment (IQA) metrics, which yielded inconsistent results that mostly contradicted our visual perception. This raises the question of the relevance of such generic metrics for efficiently targeting the specific structure of dental porosity. To resolve this conflicting information, the generated SR images were segmented, taking into account the specific scales and morphology of the porosity network, and analysed by comparing connected components. Additionally, the capacity of the SR models to preserve 3D porosity connectivity throughout the confocal image stacks was evaluated using graph analysis. This biology-driven assessment allowed a far better mechanistic interpretation of SR performance, highlighting differences in model sensitivity to weak intensity features and the impact of non-linearity in image generation, which explains the failure of standard IQA metrics.
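
The structure-aware comparison step can be sketched as counting connected components of binary porosity masks rather than computing pixel-wise IQA scores. Below is a minimal 4-connectivity flood-fill labeller on tiny made-up masks (the real analysis operates on segmented confocal stacks):

```python
import numpy as np

def label_components(mask):
    """Label 4-connected components of a binary mask via flood fill."""
    mask = mask.astype(bool)
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i, j] and labels[i, j] == 0:
                current += 1
                stack = [(i, j)]
                while stack:
                    a, b = stack.pop()
                    if (0 <= a < mask.shape[0] and 0 <= b < mask.shape[1]
                            and mask[a, b] and labels[a, b] == 0):
                        labels[a, b] = current
                        stack += [(a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)]
    return labels, current

gt = np.array([[1, 1, 0, 0],
               [0, 0, 0, 1],
               [0, 1, 0, 1]])
sr = np.array([[1, 1, 0, 0],
               [0, 1, 0, 1],
               [0, 1, 0, 1]])  # toy SR output that merged two pores into one

_, n_gt = label_components(gt)
_, n_sr = label_components(sr)
print(n_gt, n_sr)  # 3 2: the merge is invisible to pixel-wise metrics
```

A pixel-wise metric would barely notice the single changed pixel, while the component count immediately flags the altered porosity topology — the kind of biologically meaningful discrepancy the assessment above targets.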

Updated: 2025-10-09 16:26:38

Fields: cs.LG,cs.CV,q-bio.TO

Download: http://arxiv.org/abs/2510.08407v1

Aligning LLM+PDDL Symbolic Plans with Human Objective Specifications through Evolutionary Algorithm Guidance

Automated planning using a symbolic planning language, such as PDDL, is a general approach to producing optimal plans to achieve a stated goal. However, creating suitable machine-understandable descriptions of the planning domain, problem, and goal requires expertise in the planning language, limiting the utility of these tools for non-expert humans. Recent efforts have explored utilizing a symbolic planner in conjunction with a large language model to generate plans from natural language descriptions given by a non-expert human (LLM+PDDL). Our approach performs an initial translation of goal specifications to a set of PDDL goal constraints using an LLM; such translations often result in imprecise symbolic specifications, which are difficult to validate directly. We account for this using an evolutionary approach to generate a population of symbolic goal specifications with slight differences from the initial translation, and utilize a trained LSTM-based validation model to assess whether each induced plan in the population adheres to the natural language specifications. We evaluate our approach on a collection of prototypical specifications in a notional naval disaster recovery task, and demonstrate that our evolutionary approach improves the adherence of generated plans to natural language specifications when compared to plans generated using only LLM translations. The code for our method can be found at https://github.com/owenonline/PlanCritic.
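
The evolutionary loop can be sketched schematically: mutate an initial constraint set, score each variant, keep the best. The PDDL-style constraint strings are invented, and the trained LSTM validation model is replaced by a trivial stub, so this only shows the search skeleton:

```python
import random

random.seed(0)
POOL = ["(at ship1 port-a)", "(at ship1 port-b)", "(cleared route-1)",
        "(cleared route-2)", "(delivered supplies)"]

def mutate(constraints):
    """One random edit: add, drop, or swap a goal constraint."""
    out = list(constraints)
    op = random.choice(["add", "drop", "swap"])
    if op == "add" or not out:
        out.append(random.choice(POOL))
    elif op == "drop":
        out.pop(random.randrange(len(out)))
    else:
        out[random.randrange(len(out))] = random.choice(POOL)
    return sorted(set(out))

def adherence(constraints):
    """Stub for the trained validation model: reward covering target goals."""
    target = {"(delivered supplies)", "(cleared route-1)"}
    return len(target & set(constraints)) - 0.1 * len(constraints)

seed_spec = ["(at ship1 port-a)"]            # the initial LLM translation
population = [mutate(seed_spec) for _ in range(30)]
best = max(population, key=adherence)
print(best, adherence(best))
```

In the real system the fitness comes from checking the plan each candidate specification induces, not from the constraint strings themselves; the stub only stands in for that signal.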

Updated: 2025-10-09 16:26:32

Fields: cs.AI,cs.NE

Download: http://arxiv.org/abs/2412.00300v2

Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT

We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2)$) and GPT-BERT (30M, 12 layers, $O(N^2)$) in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample-efficient pretraining. Using the BabyLM Challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.

Updated: 2025-10-09 16:22:30

Fields: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.08404v1

Shape-Informed Clustering of Multi-Dimensional Functional Data via Deep Functional Autoencoders

We introduce FAEclust, a novel functional autoencoder framework for cluster analysis of multi-dimensional functional data, data that are random realizations of vector-valued random functions. Our framework features a universal-approximator encoder that captures complex nonlinear interdependencies among component functions, and a universal-approximator decoder capable of accurately reconstructing both Euclidean and manifold-valued functional data. Stability and robustness are enhanced through innovative regularization strategies applied to functional weights and biases. Additionally, we incorporate a clustering loss into the network's training objective, promoting the learning of latent representations that are conducive to effective clustering. A key innovation is our shape-informed clustering objective, ensuring that the clustering results are resistant to phase variations in the functions. We establish the universal approximation property of our non-linear decoder and validate the effectiveness of our model through extensive experiments.

Updated: 2025-10-09 16:20:12

Fields: cs.LG

Download: http://arxiv.org/abs/2509.22969v2

dInfer: An Efficient Inference Framework for Diffusion Language Models

Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. More and more open-sourced dLLM models are emerging, yet their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on $8\times$ H800 GPUs. Compared to prior systems, dInfer delivers a $10\times$ speedup over Fast-dLLM while maintaining similar model performance. Even compared with the AR model QWen2.5-3B (with a comparable number of activation parameters and similar performance), which is highly optimized with the latest vLLM inference engine, dInfer still delivers a $2$-$3\times$ speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
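
The four-component decomposition can be pictured as a tiny composition of interchangeable parts. The interfaces below are invented for illustration (the real APIs live in the dInfer repository): a model, a diffusion iteration manager, a decoding strategy, and a KV-cache manager wired into one inference loop.

```python
class ToyModel:
    def denoise(self, tokens, cache):
        # Stand-in "denoiser": reveal one masked position per call.
        i = tokens.index(None)
        out = list(tokens)
        out[i] = f"tok{i}"
        return out

class IterationManager:
    def __init__(self, steps):
        self.steps = steps
    def iterations(self):
        return range(self.steps)

class DecodingStrategy:
    def finished(self, tokens):
        return None not in tokens  # stop once nothing is masked

class KVCacheManager:
    def __init__(self):
        self.cache = {}  # a real manager would reuse/refresh KV entries

def infer(model, it_mgr, strategy, kv, length=4):
    tokens = [None] * length  # fully masked sequence
    for _ in it_mgr.iterations():
        if strategy.finished(tokens):
            break
        tokens = model.denoise(tokens, kv.cache)
    return tokens

print(infer(ToyModel(), IterationManager(10), DecodingStrategy(), KVCacheManager()))
```

Because each part sits behind its own interface, swapping in a different decoding strategy or cache policy leaves the rest of the loop untouched, which is the point of the modular decomposition.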

Updated: 2025-10-09 16:19:42

Fields: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08666v1

VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.

Updated: 2025-10-09 16:18:31

Fields: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.09699v1

FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts

Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains -- general knowledge understanding, scientific question answering, mathematical reasoning, and code generation -- demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
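
The two ingredients named in the abstract can be sketched with made-up shapes: a frozen sparse random down-projection in place of LoRA's trainable dense matrix A, and rank-wise top-k activation so each input routes through only a few ranks. This is an illustrative sketch, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, k = 32, 8, 2  # model dim, LoRA rank, active ranks (all assumed)

# Frozen sparse random down-projection: each entry survives with ~10% chance.
A = rng.normal(size=(r, d))
mask = rng.random(size=(r, d)) < 0.1
A = np.where(mask, A, 0.0)  # frozen; never trained

B = np.zeros((d, r))        # trainable up-projection (LoRA-style zero init)

def flylora_delta(x, B):
    h = A @ x                             # implicit routing via the projection
    keep = np.argsort(np.abs(h))[-k:]     # rank-wise top-k activation
    h_sparse = np.zeros_like(h)
    h_sparse[keep] = h[keep]
    return h_sparse, B @ h_sparse         # sparse code, low-rank output update

x = rng.normal(size=d)
h_sparse, delta = flylora_delta(x, B)
print(int(np.count_nonzero(h_sparse)), delta.shape)
```

Because the down-projection is a fixed random matrix, there are no router parameters to train, and the near-orthogonality of random projections is what the abstract credits for reduced inter-task interference.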

Updated: 2025-10-09 16:17:13

Subjects: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08396v1

InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continued pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
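The core of any fine-grained FP8 recipe is per-block scaled quantization. The sketch below simulates E4M3-style quant-dequant in NumPy under stated simplifications (block size 32, no subnormals or saturation handling); it is a toy illustration of the idea, not the paper's kernel.

```python
import numpy as np

def fake_fp8_e4m3(x, block=32):
    """Simulate fine-grained FP8 (E4M3) quantization: scale each block of
    values so its max magnitude maps to E4M3's max normal value (448), round
    the mantissa to 3 bits, then rescale. Subnormals/saturation are ignored."""
    out = np.array(x, dtype=np.float64).ravel()
    for i in range(0, out.size, block):
        blk = out[i:i + block]                 # view into the output buffer
        scale = np.abs(blk).max() / 448.0      # per-block scale factor
        if scale == 0.0:
            continue
        v = blk / scale
        nz = v != 0
        e = np.floor(np.log2(np.abs(v[nz])))   # binary exponent
        # round the significand in [1, 2) to the nearest multiple of 1/8
        v[nz] = np.sign(v[nz]) * np.exp2(e) * np.round(np.abs(v[nz]) / np.exp2(e) * 8) / 8
        blk[:] = v * scale
    return out.reshape(np.shape(x))

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64))
wq = fake_fp8_e4m3(w)
rel_err = np.max(np.abs(wq - w) / np.abs(w))   # bounded by the 3-bit mantissa step
```

With a 3-bit mantissa the relative rounding error is at most 1/16, which is why per-block scaling (rather than one global scale) is what keeps fidelity acceptable.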

Updated: 2025-10-09 16:13:53

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.22536v2

Revisiting Hallucination Detection with Effective Rank-based Uncertainty

Detecting hallucinations in large language models (LLMs) remains a fundamental challenge for their trustworthy deployment. Going beyond basic uncertainty-driven hallucination detection frameworks, we propose a simple yet powerful method that quantifies uncertainty by measuring the effective rank of hidden states derived from multiple model outputs and different layers. Grounded in the spectral analysis of representations, our approach provides interpretable insights into the model's internal reasoning process through semantic variations, while requiring no extra knowledge or additional modules, thus offering a combination of theoretical elegance and practical efficiency. Meanwhile, we theoretically demonstrate the necessity of quantifying uncertainty both internally (representations of a single response) and externally (different responses), providing a justification for using representations among different layers and responses from LLMs to detect hallucinations. Extensive experiments demonstrate that our method effectively detects hallucinations and generalizes robustly across various scenarios, contributing to a new paradigm of hallucination detection for LLM truthfulness.
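The abstract does not spell out which notion of effective rank is used; a common choice is the entropy-based effective rank of Roy and Vetterli, sketched here on a matrix whose rows stand in for hidden states gathered from multiple sampled responses (the data is synthetic, for illustration only):

```python
import numpy as np

def effective_rank(H):
    """Entropy-based effective rank: erank = exp(-sum p_i log p_i), where
    p_i are the singular values of H normalized to sum to 1."""
    s = np.linalg.svd(H, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# Diverse responses span many directions; near-identical (collapsed) responses
# concentrate on one direction, so their effective rank drops sharply.
diverse = rng.standard_normal((32, 64))
collapsed = np.outer(np.ones(32), rng.standard_normal(64))
collapsed += 0.01 * rng.standard_normal((32, 64))
print(effective_rank(diverse), effective_rank(collapsed))
```

The detector's intuition is then a threshold or classifier on such per-layer, per-response effective-rank profiles.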

Updated: 2025-10-09 16:12:12

Subjects: cs.AI

Download: http://arxiv.org/abs/2510.08389v1

Detecting Legend Items on Historical Maps Using GPT-4o with In-Context Learning

Historical map legends are critical for interpreting cartographic symbols. However, their inconsistent layouts and unstructured formats make automatic extraction challenging. Prior work focuses primarily on segmentation or general optical character recognition (OCR), with few methods effectively matching legend symbols to their corresponding descriptions in a structured manner. We present a method that combines LayoutLMv3 for layout detection with GPT-4o using in-context learning to detect and link legend items and their descriptions via bounding box predictions. Our experiments show that GPT-4 with structured JSON prompts outperforms the baseline, achieving 88% F-1 and 85% IoU, and reveal how prompt design, example counts, and layout alignment affect performance. This approach supports scalable, layout-aware legend parsing and improves the indexing and searchability of historical maps across various visual styles.

Updated: 2025-10-09 16:08:48

Subjects: cs.CV,cs.AI,cs.DB,cs.IR,H.2.8; H.3.3; I.2.10; I.4.8

Download: http://arxiv.org/abs/2510.08385v1

QAgent: A modular Search Agent with Interactive Query Understanding

Large language models (LLMs) excel at natural language tasks but are limited by their static parametric knowledge, especially in knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this by integrating external information. However, (1) traditional RAG struggles with complex query understanding, and (2) even search agents trained with reinforcement learning (RL), despite their promise, still face generalization and deployment challenges. To address these limitations, we propose QAgent, a unified agentic RAG framework that employs a search agent for adaptive retrieval. This agent optimizes its understanding of the query through interactive reasoning and retrieval. To facilitate real-world application, we focus on a modular search agent for query understanding that is plug-and-play in complex systems. Specifically, the agent follows a multi-step decision process trained with RL to maximize retrieval quality and support accurate downstream answers. We further analyze the strengths and weaknesses of end-to-end RL and propose a strategy that focuses on effective retrieval, thereby enhancing generalization in LLM applications. Experiments show QAgent excels at QA and serves as a plug-and-play module for real-world deployment.
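The multi-step decision process described above follows the familiar ReAct shape: reason, refine the query, retrieve, repeat until the agent decides the evidence suffices. A minimal sketch with toy stand-ins (the `toy_llm` policy and one-entry corpus are hypothetical; the real agent is an RL-trained LLM over a live retriever):

```python
def react_query_agent(question, llm, search, max_steps=4):
    """Minimal ReAct-style loop: alternate a reasoning step (which either
    refines the query or decides to answer) with a retrieval action."""
    evidence, query = [], question
    for _ in range(max_steps):
        thought, action = llm(question, query, evidence)
        if action == "answer":
            return thought, evidence
        query = thought                       # interactively refined query
        evidence.append(search(query))
    return llm(question, query, evidence)[0], evidence

def search(q):                                # toy retriever
    corpus = {"capital France": "Paris is the capital of France."}
    return corpus.get(q, "no result")

def toy_llm(question, query, evidence):      # toy policy: rewrite, then answer
    if any("Paris" in e for e in evidence):
        return "Paris", "answer"
    return "capital France", "search"

answer, evidence = react_query_agent("What is the capital of France?", toy_llm, search)
print(answer)   # Paris
```

The plug-and-play claim corresponds to the loop's interface: anything that consumes a question and produces an answer plus evidence can slot into a larger system.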

Updated: 2025-10-09 16:08:05

Subjects: cs.AI

Download: http://arxiv.org/abs/2510.08383v1

Uncertainty Comes for Free: Human-in-the-Loop Policies with Diffusion Models

Human-in-the-loop (HitL) robot deployment has gained significant attention in both academia and industry as a semi-autonomous paradigm that enables human operators to intervene and adjust robot behaviors at deployment time, improving success rates. However, continuous human monitoring and intervention can be highly labor-intensive and impractical when deploying a large number of robots. To address this limitation, we propose a method that allows diffusion policies to actively seek human assistance only when necessary, reducing reliance on constant human oversight. To achieve this, we leverage the generative process of diffusion policies to compute an uncertainty-based metric based on which the autonomous agent can decide to request operator assistance at deployment time, without requiring any operator interaction during training. Additionally, we show that the same method can be used for efficient data collection for fine-tuning diffusion policies in order to improve their autonomous performance. Experimental results from simulated and real-world environments demonstrate that our approach enhances policy performance during deployment for a variety of scenarios.
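The "uncertainty comes for free" idea is that a diffusion policy already produces multiple action samples, and their spread is an uncertainty signal. A toy illustration (the stub sampler, the 2-D action space, and the 0.1 threshold are all assumptions, not the paper's metric):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_actions(obs, k=16):
    """Stand-in for k reverse-diffusion rollouts of a diffusion policy: the
    spread of the sampled action batch is what the generative process gives
    us 'for free'. Here, unfamiliar observations yield a wider spread."""
    spread = 0.05 if obs["familiar"] else 0.6    # hypothetical behavior
    return obs["mean"] + spread * rng.standard_normal((k, 2))

def needs_help(obs, threshold=0.1):
    acts = sample_actions(obs)
    uncertainty = float(acts.std(axis=0).mean())  # disagreement across samples
    return uncertainty > threshold

in_distribution = {"mean": np.zeros(2), "familiar": True}
novel = {"mean": np.zeros(2), "familiar": False}
print(needs_help(in_distribution), needs_help(novel))
```

The same score can prioritize which episodes to hand to humans for fine-tuning data collection, as the abstract notes.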

Updated: 2025-10-09 16:08:00

Subjects: cs.LG,cs.RO

Download: http://arxiv.org/abs/2503.01876v3

Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions

In this paper we will give a characterization of the learnability of forgiving 0-1 loss functions in the finite label multiclass setting. To do this, we create a new combinatorial dimension based on the Natarajan Dimension (Natarajan, 1989) and we show that a hypothesis class is learnable in our setting if and only if this Generalized Natarajan Dimension is finite. We also show a connection to learning with set-valued feedback. Through our results we show that the learnability of a set learning problem is characterized by the Natarajan Dimension.

Updated: 2025-10-09 16:07:55

Subjects: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.08382v1

Airy: Reading Robot Intent through Height and Sky

As industrial robots move into shared human spaces, their opaque decision making threatens safety, trust, and public oversight. This artwork, Airy, asks whether complex multi-agent AI can become intuitively understandable by staging a competition between two reinforcement-trained robot arms that snap a bedsheet skyward. Building on three design principles, competition as a clear metric (who lifts higher), embodied familiarity (audiences recognize fabric snapping), and sensor-to-sense mapping (robot cooperation or rivalry shown through forest and weather projections), the installation gives viewers a visceral way to read machine intent. Observations from five international exhibitions indicate that audiences consistently read the robots' strategies, conflict, and cooperation in real time, with emotional reactions that mirror the system's internal state. The project shows how sensory metaphors can turn a black box into a public interface.

Updated: 2025-10-09 16:07:30

Subjects: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.08381v1

Bloated Disclosures: Can ChatGPT Help Investors Process Information?

Generative AI tools such as ChatGPT can fundamentally change the way investors process information. We probe the economic usefulness of these tools in summarizing complex corporate disclosures using the stock market as a laboratory. The unconstrained summaries are remarkably shorter compared to the originals, whereas their information content is amplified. When a document has a positive (negative) sentiment, its summary becomes more positive (negative). Importantly, the summaries are more effective at explaining stock market reactions to the disclosed information. Motivated by these findings, we propose a measure of information "bloat." We show that bloated disclosure is associated with adverse capital market consequences, such as lower price efficiency and higher information asymmetry. Finally, we show that the model is effective at constructing targeted summaries that identify firms' (non-)financial performance. Collectively, our results indicate that generative AI adds considerable value for investors with information processing constraints.

Updated: 2025-10-09 16:05:41

Subjects: econ.GN,cs.AI,q-fin.EC,q-fin.GN

Download: http://arxiv.org/abs/2306.10224v5

Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective

State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that are used to extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justifications or interpretations, thus hindering potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between the Transformer decoders and Principal Component Analysis (PCA). From such a perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT), with the interpretations as follows: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most information; 2) the cross-attention operator seeks to find a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and corresponds to the predefined classes; 3) the dot-product operation yields compact representations of the image embeddings as segmentation masks. Experiments conducted on the ADE20K dataset find that DEPICT consistently outperforms its black-box counterpart, Segmenter, while being lightweight and more robust.
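The PCA reading of the dot-product step can be made concrete in a few lines. In this sketch (all sizes hypothetical, and the orthonormal class bases produced by idealized cross-attention are simulated with a QR factorization), projecting pixel embeddings onto the class bases yields both mask logits and a low-rank reconstruction:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, d, n_cls = 100, 16, 5     # hypothetical: pixels, embed dim, classes

Z = rng.standard_normal((n_pix, d))                    # refined image embeddings
Q, _ = np.linalg.qr(rng.standard_normal((d, n_cls)))   # orthonormal class bases
                                                       # (idealized cross-attn output)

logits = Z @ Q                    # dot-product: compact per-class mask logits
masks = logits.argmax(axis=1)     # per-pixel segmentation assignment

# PCA reading: projecting onto Q and mapping back is a rank-n_cls
# approximation of Z, so the masks are a compressed code for the embeddings.
Z_rec = logits @ Q.T
```

Because the columns of `Q` are orthonormal, the reconstruction error is exactly the energy of `Z` outside the class subspace, which is the compression argument in miniature.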

Updated: 2025-10-09 16:04:28

Subjects: cs.CV,cs.LG

Download: http://arxiv.org/abs/2411.03033v4

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.

Updated: 2025-10-09 16:01:00

Subjects: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.03336v3

The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

Existing interpretability methods for Large Language Models (LLMs) often fall short by focusing on linear directions or isolated features, overlooking the high-dimensional, nonlinear, and relational geometry within model representations. This study focuses on how adversarial inputs systematically affect the internal representation spaces of LLMs, a topic which remains poorly understood. We propose persistent homology (PH), a tool from topological data analysis, as a principled framework to characterize the multi-scale dynamics within LLM activations. Using PH, we systematically analyze six state-of-the-art models under two distinct adversarial conditions, indirect prompt injection and backdoor fine-tuning, and identify a consistent topological signature of adversarial influence. Across architectures and model sizes, adversarial inputs induce "topological compression", where the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, and more dispersed large-scale ones. This topological signature is statistically robust across layers, highly discriminative, and provides interpretable insights into how adversarial effects emerge and propagate. By quantifying the shape of activations and neuronal information flow, our architecture-agnostic framework reveals fundamental invariants of representational change, offering a complementary perspective to existing interpretability methods.
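For intuition, the 0-dimensional part of persistent homology is computable without any topology library: under the Vietoris-Rips filtration, the H0 death times of a point cloud are exactly the edge lengths of its Euclidean minimum spanning tree. A self-contained sketch (the four-point cloud is a toy, not LLM activations):

```python
import numpy as np

def h0_persistence(X):
    """Death times of 0-dimensional persistent homology under the
    Vietoris-Rips filtration. For H0 these equal the edge lengths of a
    Euclidean minimum spanning tree, computed here with Prim's algorithm."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    n = len(X)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()                # cheapest connection to the tree so far
    deaths = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        deaths.append(best[j])        # radius at which j's component merges
        in_tree[j] = True
        best = np.minimum(best, D[j])
    return np.sort(np.array(deaths))

# Two tight clusters: two small deaths (within-cluster merges) and one large
# death (the clusters merging) -- a coarse picture of "few dominant features".
X = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 0.0], [10.0, 0.1]])
print(h0_persistence(X))   # [ 0.1  0.1 10. ]
```

"Topological compression" in the abstract's sense would show up as persistence diagrams with fewer, larger-scale features than those of benign inputs (higher dimensions require a PH library such as ripser or gudhi).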

Updated: 2025-10-09 16:00:15

Subjects: cs.LG,cs.AI,cs.CG,math.AT

Download: http://arxiv.org/abs/2505.20435v2

RA-Gen: A Controllable Code Generation Framework Using ReAct for Multi-Agent Task Execution

Code generation models based on large language models (LLMs) have gained wide adoption, but challenges remain in ensuring safety, accuracy, and controllability, especially for complex tasks. Existing methods often lack dynamic integration of external tools, transparent reasoning, and user control over safety. To address these issues, we propose a controllable code generation framework utilizing the ReAct paradigm for multi-agent task execution. This framework is a multi-agent system designed to enable efficient, precise, and interpretable code generation through dynamic interactions between LLMs and external resources. The framework adopts a collaborative architecture comprising four specialized agents: a Planner for task decomposition, a Searcher that leverages the ReAct framework for reasoning and tool integration, a CodeGen agent for accurate code generation, and an Extractor for structured data retrieval. The ReAct-based Searcher alternates between generating reasoning traces and executing actions, facilitating seamless integration of internal knowledge with external tools (such as search engines) to enhance accuracy and user control. Experimental results show the framework's effectiveness across multiple languages, achieving a 94.8% security rate on the SVEN dataset with CodeQL, outperforming existing approaches. Its transparent reasoning process fosters user trust and improves controllability.

Updated: 2025-10-09 15:59:24

Subjects: cs.SE,cs.AI

Download: http://arxiv.org/abs/2510.08665v1

Phantora: Maximizing Code Reuse in Simulation-based Machine Learning System Performance Estimation

Modern machine learning (ML) training workloads place substantial demands on both computational and communication resources. Consequently, accurate performance estimation has become increasingly critical for guiding system design decisions, such as the selection of parallelization strategies, cluster configurations, and hardware provisioning. Existing simulation-based performance estimation requires reimplementing the ML framework in a simulator, which demands significant manual effort and is hard to maintain as ML frameworks evolve rapidly. This paper introduces Phantora, a hybrid GPU cluster simulator designed for performance estimation of ML training workloads. Phantora executes unmodified ML frameworks as is within a distributed, containerized environment. Each container emulates the behavior of a GPU server in a large-scale cluster, while Phantora intercepts and simulates GPU- and communication-related operations to provide high-fidelity performance estimation. We call this approach hybrid simulation of ML systems, in contrast to traditional methods that simulate static workloads. The primary advantage of hybrid simulation is that it allows direct reuse of ML framework source code in simulation, avoiding the need for reimplementation. Our evaluation shows that Phantora provides accuracy comparable to static workload simulation while supporting three state-of-the-art LLM training frameworks out-of-the-box. In addition, Phantora operates on a single GPU, eliminating the need for the resource-intensive trace collection and workload extraction steps required by traditional trace-based simulators. Phantora is open-sourced at https://github.com/QDelta/Phantora.

Updated: 2025-10-09 15:58:43

Subjects: cs.DC,cs.LG,cs.PF

Download: http://arxiv.org/abs/2505.01616v3

Curl Descent: Non-Gradient Learning Dynamics with Sign-Diverse Plasticity

Gradient-based algorithms are a cornerstone of artificial neural network training, yet it remains unclear whether biological neural networks use similar gradient-based strategies during learning. Experiments often discover a diversity of synaptic plasticity rules, but whether these amount to an approximation to gradient descent is unclear. Here we investigate a previously overlooked possibility: that learning dynamics may include fundamentally non-gradient "curl"-like components while still being able to effectively optimize a loss function. Curl terms naturally emerge in networks with inhibitory-excitatory connectivity or Hebbian/anti-Hebbian plasticity, resulting in learning dynamics that cannot be framed as gradient descent on any objective. To investigate the impact of these curl terms, we analyze feedforward networks within an analytically tractable student-teacher framework, systematically introducing non-gradient dynamics through neurons exhibiting rule-flipped plasticity. Small curl terms preserve the stability of the original solution manifold, resulting in learning dynamics similar to gradient descent. Beyond a critical value, strong curl terms destabilize the solution manifold. Depending on the network architecture, this loss of stability can lead to chaotic learning dynamics that destroy performance. In other cases, the curl terms can counterintuitively speed learning compared to gradient descent by allowing the weight dynamics to escape saddles by temporarily ascending the loss. Our results identify specific architectures capable of supporting robust learning via diverse learning rules, providing an important counterpoint to normative theories of gradient-based learning in neural networks.
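The dynamics described here can be written as w ← w − η(I + A)∇L with A antisymmetric: since gᵀAg = 0 for any g, the curl term is instantaneously orthogonal to the gradient, so the loss still decreases for small curl even though −(I + A)∇L is not the gradient of any objective. A toy quadratic illustrates this (all constants are chosen for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
M = rng.standard_normal((n, n))
A = 0.3 * (M - M.T)              # antisymmetric: the non-gradient "curl" part
w_star = rng.standard_normal(n)

def grad(w):                      # gradient of L(w) = 0.5 * ||w - w_star||^2
    return w - w_star

w, lr, losses = rng.standard_normal(n), 0.05, []
for _ in range(500):
    # curl descent: the update direction is not the gradient of any loss,
    # yet its projection onto the true gradient is always -||grad||^2
    w = w - lr * (np.eye(n) + A) @ grad(w)
    losses.append(0.5 * np.sum((w - w_star) ** 2))
```

Scaling `A` up pushes the eigenvalues of the update map toward instability, which is a discrete-time caricature of the critical curl strength the abstract describes.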

Updated: 2025-10-09 15:58:43

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.02765v2

Contrastive Self-Supervised Learning at the Edge: An Energy Perspective

While contrastive learning (CL) shows considerable promise in self-supervised representation learning, its deployment on resource-constrained devices remains largely underexplored. The substantial computational demands required for training conventional CL frameworks pose a set of challenges, particularly in terms of energy consumption, data availability, and memory usage. We conduct an evaluation of four widely used CL frameworks: SimCLR, MoCo, SimSiam, and Barlow Twins. We focus on the practical feasibility of these CL frameworks for edge and fog deployment, and introduce a systematic benchmarking strategy that includes energy profiling and reduced training data conditions. Our findings reveal that SimCLR, contrary to its perceived computational cost, demonstrates the lowest energy consumption across various data regimes. Finally, we also extend our analysis by evaluating lightweight neural architectures when paired with CL frameworks. Our study aims to provide insights into the resource implications of deploying CL in edge/fog environments with limited processing capabilities and opens several research directions for its future optimization.

Updated: 2025-10-09 15:57:44

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.08374v1

On the Relationship Between the Choice of Representation and In-Context Learning

In-context learning (ICL) is the ability of a large language model (LLM) to learn a new task from a few demonstrations presented as part of the context. Past studies have attributed a large portion of the success of ICL to the way these in-context demonstrations are represented, particularly to how labels are represented in classification tasks. On the other hand, observations of the learning capacity of ICL (i.e., the extent to which more in-context demonstrations can lead to higher performance) have been mixed, and ICL is often thought to occur only under specific conditions. The interaction between these two aspects in ICL, representation and learning, has not been studied in depth until now. We hypothesize that they are largely independent of one another, such that the representation of demonstrations determines the baseline accuracy of ICL, while learning from additional demonstrations improves only on top of this baseline. We validate this hypothesis by developing an optimization algorithm that can enumerate a spectrum of possible label sets (representations) varying in semantic relevance. We then perform ICL with varying numbers of in-context demonstrations for each of these label sets. We observed that learning happens regardless of the quality of the label set itself, although its efficiency, measured by the slope of improvement over in-context demonstrations, is conditioned on both the label set quality and the parameter count of the underlying language model. Despite the emergence of learning, the relative quality (accuracy) of the choice of a label set (representation) is largely maintained throughout learning, confirming our hypothesis and implying their orthogonality. Our work reveals a previously underexplored aspect of ICL: the independent effects of learning from demonstrations and their representations on ICL performance.

Updated: 2025-10-09 15:55:28

Subjects: cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.08372v1

ERR@HRI 2.0 Challenge: Multimodal Detection of Errors and Failures in Human-Robot Conversations

The integration of large language models (LLMs) into conversational robots has made human-robot conversations more dynamic. Yet, LLM-powered conversational robots remain prone to errors, e.g., misunderstanding user intent, prematurely interrupting users, or failing to respond altogether. Detecting and addressing these failures is critical for preventing conversational breakdowns, avoiding task disruptions, and sustaining user trust. To tackle this problem, the ERR@HRI 2.0 Challenge provides a multimodal dataset of LLM-powered conversational robot failures during human-robot conversations and encourages researchers to benchmark machine learning models designed to detect robot failures. The dataset includes 16 hours of dyadic human-robot interactions, incorporating facial, speech, and head movement features. Each interaction is annotated with the presence or absence of robot errors from the system perspective, and perceived user intention to correct for a mismatch between robot behavior and user expectation. Participants are invited to form teams and develop machine learning models that detect these failures using multimodal data. Submissions will be evaluated using various performance metrics, including detection accuracy and false positive rate. This challenge represents another key step toward improving failure detection in human-robot interaction through social signal analysis.

Updated: 2025-10-09 15:54:27

领域: cs.RO,cs.AI,cs.HC

Download: http://arxiv.org/abs/2507.13468v2

Guided Star-Shaped Masked Diffusion

The performance of pre-trained masked diffusion models is often constrained by their sampling procedure, which makes decisions irreversible and struggles in low-step generation regimes. We introduce a novel sampling algorithm that works with pre-trained models and, after a lightweight fine-tuning of a single layer, significantly improves sample quality and efficiency. Our method reformulates the generation process using a star-shaped paradigm, which inherently allows for error correction. To make this process effective, we augment it with a learnable re-masking scheduler that intelligently identifies and revises likely errors. This approach yields a substantial quality boost, particularly when using a small number of sampling steps. We extensively ablate key components of our approach and show its usability in different scenarios. In comprehensive experiments on text and code generation, our sampling algorithm outperforms or matches existing methods.

Updated: 2025-10-09 15:53:51

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08369v1

CoCoA: Collaborative Chain-of-Agents for Parametric-Retrieved Knowledge Synergy

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs), especially for knowledge-intensive tasks. Despite its advantages, current RAG methods often struggle to fully exploit knowledge during generation. In particular, the synergy between the model's internal parametric knowledge and external retrieved knowledge remains limited. Retrieved contents may sometimes mislead generation, while certain generated content can guide the model toward more accurate outputs. In this work, we propose Collaborative Chain-of-Agents, a framework designed to explicitly enhance the synergy between parametric and retrieved knowledge. Specifically, we first introduce CoCoA-zero, a multi-agent RAG framework that performs conditional knowledge induction and then reasons toward answers. Building on this, we develop CoCoA, a long-chain training strategy that synthesizes extended multi-agent reasoning trajectories from CoCoA-zero to fine-tune the LLM. This strategy enhances the model's capability to explicitly integrate and jointly leverage parametric and retrieved knowledge. Experimental results demonstrate the superiority of CoCoA in open-domain QA and multi-hop QA.

Updated: 2025-10-09 15:53:40

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.01696v3

Interpreting GNN-based IDS Detections Using Provenance Graph Structural Features

Advanced cyber threats (e.g., Fileless Malware and Advanced Persistent Threat (APT)) have driven the adoption of provenance-based security solutions. These solutions employ Machine Learning (ML) models for behavioral modeling and critical security tasks such as malware and anomaly detection. However, the opacity of ML-based security models limits their broader adoption, as the lack of transparency in their decision-making processes restricts explainability and verifiability. We tailored our solution towards Graph Neural Network (GNN)-based security solutions since recent studies employ GNNs to comprehensively digest system provenance graphs for security-critical tasks. To enhance the explainability of GNN-based security models, we introduce PROVEXPLAINER, a framework offering instance-level security-aware explanations using an interpretable surrogate model. PROVEXPLAINER's interpretable feature space consists of discriminant subgraph patterns and graph structural features, which can be directly mapped to the system provenance problem space, making the explanations human interpretable. We show how PROVEXPLAINER synergizes with current state-of-the-art (SOTA) GNN explainers to deliver domain- and instance-specific explanations. We measure explanation quality using the Fidelity+/Fidelity- metrics from the traditional GNN explanation literature; we incorporate precision/recall metrics that assess the accuracy of the explanation against the ground truth; and we design a human actionability metric based on graph traversal distance. On real-world Fileless and APT datasets, PROVEXPLAINER achieves up to 29%/27%/25%/1.4x higher Fidelity+, precision, recall, and actionability (where higher values are better), and 12% lower Fidelity- (where lower values are better) when compared against SOTA GNN explainers.

Updated: 2025-10-09 15:50:09

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2306.00934v7

Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning

Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise for domains such as autonomous navigation and locomotion, where collecting interactive data is costly and unsafe. However, it remains challenging in practice due to the need to learn from datasets with limited coverage of the state-action space and to generalize across long-horizon tasks. To address these challenges, we propose a \emph{Physics-informed (Pi)} regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE) and which induces a geometric inductive bias in the learned value function. Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing Offline GCRL algorithms. When combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method, Eikonal-regularized HIQL (Eik-HIQL), yields significant improvements in both performance and generalization, with pronounced gains in stitching regimes and large-scale navigation tasks.
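
An Eikonal-style regularizer of this kind can be sketched as a penalty that drives the norm of the value gradient toward a constant slope. The toy value function, finite-difference gradient, and unit target below are illustrative assumptions, not the paper's implementation:

```python
import math

def value(s, goal, scale=1.0):
    # Toy value function: negative scaled Euclidean distance-to-goal.
    # For scale == 1 its gradient norm is exactly 1 away from the goal,
    # i.e. it satisfies the Eikonal equation ||grad V|| = 1.
    return -scale * math.dist(s, goal)

def grad_norm(f, s, eps=1e-5):
    # Central finite differences for ||grad f(s)|| (a trained network
    # would use autodiff instead).
    g2 = 0.0
    for i in range(len(s)):
        sp, sm = list(s), list(s)
        sp[i] += eps
        sm[i] -= eps
        g2 += ((f(sp) - f(sm)) / (2 * eps)) ** 2
    return math.sqrt(g2)

def eikonal_penalty(f, states):
    # Regularizer: mean squared deviation of the gradient norm from 1.
    return sum((grad_norm(f, s) - 1.0) ** 2 for s in states) / len(states)

goal = [0.0, 0.0]
states = [[1.0, 2.0], [-3.0, 0.5], [2.0, -2.0]]

# A well-scaled value function incurs (near-)zero penalty; a mis-scaled
# one, whose gradient norm is 2, is penalized.
good = eikonal_penalty(lambda s: value(s, goal, scale=1.0), states)
bad = eikonal_penalty(lambda s: value(s, goal, scale=2.0), states)
```

In a TD-learning setup this penalty would simply be added to the value loss over sampled states.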

Updated: 2025-10-09 15:43:55

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06782v3

ExPrESSO: Zero-Knowledge backed Extensive Privacy Preserving Single Sign-on

User authentication is one of the most important aspects for secure communication between services and end-users over the Internet. Service providers leverage Single-Sign On (SSO) to make it easier for their users to authenticate themselves. However, standardized systems for SSO, such as OIDC, do not guarantee user privacy as identity providers can track user activities. We propose a zero-knowledge-based mechanism that integrates with OIDC to let users authenticate through SSO without revealing information about the service provider. Our system leverages Groth's zk-SNARK to prove membership of subscribed service providers without revealing their identity. We adopt a decentralized and verifiable approach to set up the prerequisites of our construction that further secures and establishes trust in the system. We set high security targets and achieve them with minimal storage and latency overhead, demonstrating that our construction is practical for production use.

Updated: 2025-10-09 15:42:01

Categories: cs.CR

Download: http://arxiv.org/abs/2510.08355v1

Faver: Boosting LLM-based RTL Generation with Function Abstracted Verifiable Middleware

LLM-based RTL generation is an interesting research direction, as it holds the potential to liberate the least automated stage in the current chip design. However, due to the substantial semantic gap between high-level specifications and RTL, coupled with limited training data, existing models struggle with generation accuracy. Drawing on human experience, design with verification helps improve accuracy. However, RTL testbench data are even scarcer, which makes this route unfriendly for LLMs. Although LLMs excel at higher-level languages like Python/C, they have a huge semantic gap from RTL. When implementing the same functionality, Python/C code and hardware code differ significantly in spatiotemporal granularity, requiring the LLM not only to consider high-level functional semantics but also to ensure the low-level details align with the circuit code. It is not an easy task. In this paper, we propose a function abstracted verifiable middleware (Faver) that streamlines RTL verification in LLM-based workflows. By mixing LLM-friendly code structures with a rule-based template, Faver decouples the details of circuit verification, allowing the LLM to focus on the functionality itself. In our experiments on the SFT model and open-source models, Faver improved the model's generation accuracy by up to 14%.

Updated: 2025-10-09 15:41:43

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2510.08664v1

Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception

Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.

Updated: 2025-10-09 15:38:41

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.08352v1

DeepEN: Personalized Enteral Nutrition for Critically Ill Patients using Deep Reinforcement Learning

We introduce DeepEN, a deep reinforcement learning (RL) framework for personalized enteral nutrition (EN) in critically ill patients. Trained offline on over 11,000 ICU patients from the MIMIC-IV database, DeepEN generates 4-hourly recommendations for caloric, protein, and fluid intake tailored to each patient's evolving physiology. The model integrates a curated, clinically informed state space with a custom reward function that balances short-term physiological and nutrition-related goals with long-term survival outcomes. Using a dueling double deep Q-network with conservative Q-learning regularization, DeepEN learns clinically realistic policies that align with high-value clinician actions while discouraging unsafe deviations. Across various qualitative and quantitative metrics, DeepEN outperforms clinician-derived and guideline-based policies, achieving a 3.7 $\pm$ 0.17 percentage-point reduction in estimated mortality (18.8% vs 22.5%) and improvements in key nutritional biomarkers. These findings highlight the potential of safe, data-driven personalization of EN therapy to improve outcomes beyond traditional guideline- or heuristic-based approaches.

Updated: 2025-10-09 15:37:56

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08350v1

A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data

Psychological assessments typically rely on structured rating scales, which cannot incorporate the rich nuance of a respondent's natural language. This study leverages recent LLM advances to harness qualitative data within a novel conceptual framework, combining LLM-scored text and traditional rating-scale items to create an augmented test. We demonstrate this approach using depression as a case study, developing and assessing the framework on a real-world sample of upper secondary students (n=693) and corresponding synthetic dataset (n=3,000). On held-out test sets, augmented tests achieved statistically significant improvements in measurement precision and accuracy. The information gain from the LLM items was equivalent to adding between 6.3 (real data) and 16.0 (synthetic data) items to the original 19-item test. Our approach marks a conceptual shift in automated scoring that bypasses its typical bottlenecks: instead of relying on pre-labelled data or complex expert-created rubrics, we empirically select the most informative LLM scoring instructions based on calculations of item information. This framework provides a scalable approach for leveraging the growing stream of transcribed text to enhance traditional psychometric measures, and we discuss its potential utility in clinical health and beyond.

Updated: 2025-10-09 15:37:24

Categories: cs.CL,cs.AI,cs.CY

Download: http://arxiv.org/abs/2510.08663v1

T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning

The specialized vocabulary and nuanced concepts of the telecommunications industry pose persistent challenges for standard Natural Language Processing (NLP) models. Generic embedding models often struggle to represent telecom-specific semantics, limiting their utility in retrieval and downstream tasks. We present T-VEC (Telecom Vectorization Model), a domain-adapted embedding model fine-tuned from the gte-Qwen2-1.5B-instruct backbone using a triplet loss objective. Fine-tuning was performed on T-Embed, a high-quality, large-scale dataset covering diverse telecom concepts, standards, and operational scenarios. Although T-Embed contains some proprietary material and cannot be fully released, we open source 75% of the dataset to support continued research in domain-specific representation learning. On a custom benchmark comprising 1500 query-passage pairs from IETF RFCs and vendor manuals, T-VEC surpasses MPNet, BGE, Jina and E5, demonstrating superior domain grounding and semantic precision in telecom-specific retrieval. Embedding visualizations further showcase tight clustering of telecom-relevant concepts. We release T-VEC and its tokenizer to support semantically faithful NLP applications within the telecom domain.
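
The triplet objective used for this kind of fine-tuning has a standard form: push the anchor-positive distance below the anchor-negative distance by a margin. A minimal sketch (the margin value and Euclidean distance are illustrative choices, not the paper's settings):

```python
import math

def l2(u, v):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Standard triplet objective: zero once the positive is at least
    # `margin` closer to the anchor than the negative.
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)

# Satisfied triplet: the positive is much closer than the negative.
ok = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 3.0])
# Violated triplet: the negative is closer, so the loss is positive.
bad = triplet_loss([0.0, 0.0], [0.0, 3.0], [0.0, 1.0])
```

During fine-tuning, triplets would pair a telecom query (anchor) with a relevant passage (positive) and an irrelevant one (negative).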

Updated: 2025-10-09 15:34:15

Categories: cs.CL,cs.AI,68T50

Download: http://arxiv.org/abs/2504.16460v2

A Haskell to FHE Transpiler

Fully Homomorphic Encryption (FHE) enables the evaluation of programs directly on encrypted data. However, because only basic operations can be performed on ciphertexts, programs must be expressed as boolean or arithmetic circuits. This low-level representation makes implementing applications for FHE significantly more cumbersome than writing code in a high-level language. To reduce this burden, several transpilers have been developed that translate high-level code into circuit representations. In this work, we extend the range of high-level languages that can target FHE by introducing a transpiler for Haskell, which converts Haskell programs into Boolean circuits suitable for homomorphic evaluation. Our second contribution is the automatic parallelization of these generated circuits. We implement an evaluator that executes gates in parallel by parallelizing each layer of the circuit. We demonstrate the effectiveness of our approach on two key applications: Private Information Retrieval (PIR) and the AES encryption standard. Prior work has parallelized AES encryption manually. We demonstrate that the automated method outperforms some but not all manual parallelizations of AES evaluations under FHE. We achieve an evaluation time of 28 seconds for a parallel execution with 16 threads and an evaluation time of 8 seconds for a parallel execution with 100 threads.
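
The layer-wise parallel evaluation idea can be illustrated outside FHE: levelize the circuit so gates within a layer share no dependencies, then evaluate layer by layer. In the sketch below, plain booleans and a tiny hand-written circuit stand in for ciphertexts and transpiler output, and gates within a layer are evaluated sequentially for clarity where a thread pool could run them concurrently:

```python
OPS = {"AND": lambda a, b: a and b,
       "OR":  lambda a, b: a or b,
       "XOR": lambda a, b: a != b}

def levelize(circuit, inputs):
    # Group gates into layers: a gate's layer is 1 + the max layer of its
    # inputs, so gates within one layer are mutually independent.
    level = {w: 0 for w in inputs}
    def lvl(w):
        if w not in level:
            _, ins = circuit[w]
            level[w] = 1 + max(lvl(i) for i in ins)
        return level[w]
    layers = {}
    for g in circuit:
        layers.setdefault(lvl(g), []).append(g)
    return [layers[k] for k in sorted(layers)]

def evaluate(circuit, inputs):
    wires = dict(inputs)
    for layer in levelize(circuit, inputs):
        # All gates in `layer` are independent; under FHE, where each gate
        # is expensive, dispatching this loop to a thread pool pays off.
        for g in layer:
            op, ins = circuit[g]
            wires[g] = OPS[op](*(wires[i] for i in ins))
    return wires

# Toy circuit: gate name -> (operation, input wires).
circuit = {
    "a1": ("AND", ["x", "y"]),
    "o1": ("OR",  ["x", "z"]),
    "x1": ("XOR", ["a1", "o1"]),  # depends on both layer-1 gates
}
out = evaluate(circuit, {"x": True, "y": False, "z": True})
```

Here `a1` and `o1` form one layer and `x1` a second, mirroring how the evaluator parallelizes within each circuit layer.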

Updated: 2025-10-09 15:28:27

Categories: cs.CR

Download: http://arxiv.org/abs/2510.08343v1

HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions

We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of parametric PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parametrizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that compares the physics of the generated PINN to the requested PDE and uses the discrepancy to generate a "delta" PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves over 100x gain in average $L_2$ loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems. The code and model weights are publicly available at https://github.com/rbischof/hypino.
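
The iterative refinement loop reduces, in caricature, to repeatedly fitting a correction to the remaining discrepancy and summing the ensemble's contributions. The scalar "solver" below (a fixed 0.5 scaling of the residual) is purely illustrative of that structure, not of PINN generation:

```python
def refine(target, initial, rounds=5):
    # Ensemble refinement sketch: measure the discrepancy between the
    # current combined solution and the requested target, generate a
    # "delta" correction for it, and add the delta to the ensemble.
    approx = initial
    members = [initial]
    for _ in range(rounds):
        residual = target - approx  # discrepancy w.r.t. the request
        delta = 0.5 * residual      # an imperfect "delta" solver
        members.append(delta)
        approx += delta
    return approx, members

approx, members = refine(target=10.0, initial=0.0)
```

Each round shrinks the residual, so the summed ensemble converges toward the target while every member remains a cheap forward-only evaluation.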

Updated: 2025-10-09 15:27:25

Categories: cs.LG

Download: http://arxiv.org/abs/2509.05117v2

Anticipating the Selectivity of Intramolecular Cyclization Reaction Pathways with Neural Network Potentials

Reaction mechanism search tools have demonstrated the ability to provide insights into likely products and rate-limiting steps of reacting systems. However, reactions involving several concerted bond changes - as can be found in many key steps of natural product synthesis - can complicate the search process. To mitigate these complications, we present a mechanism search strategy particularly suited to help expedite exploration of an exemplary family of such complex reactions, cyclizations. We provide a cost-effective strategy for identifying relevant elementary reaction steps by combining graph-based enumeration schemes and machine learning techniques for intermediate filtering. Key to this approach is our use of a neural network potential (NNP), AIMNet2-rxn, for computational evaluation of each candidate reaction pathway. In this article, we evaluate the NNP's ability to estimate activation energies, demonstrate the correct anticipation of stereoselectivity, and recapitulate complex enabling steps in natural product synthesis.

Updated: 2025-10-09 15:27:12

Categories: cs.LG,q-bio.QM

Download: http://arxiv.org/abs/2507.10400v2

Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization

We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over tokens absent from an input sequence -- an ability central to board-game style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on embedding and value dimensions for single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1 style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
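
The set complement target itself is easy to state: a uniform distribution over the tokens missing from the input. The four-token vocabulary below is a toy stand-in for a board-game move set:

```python
def set_complement_target(seq, vocab):
    # Uniform distribution over vocabulary tokens absent from the
    # sequence; tokens already present get probability zero.
    present = set(seq)
    missing = [t for t in vocab if t not in present]
    return {t: (1.0 / len(missing) if t in missing else 0.0)
            for t in vocab}

target = set_complement_target(["a", "b"], ["a", "b", "c", "d"])
```

As the sequence grows, probability mass concentrates on fewer tokens, which is exactly the regime where the paper's softmax-compression argument bites.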

Updated: 2025-10-09 15:26:48

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08341v1

Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.
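
Optimization-based sampling means inference is just gradient descent on the learned landscape, with step size and compute budget adjustable at test time. With a toy quadratic energy standing in for the trained network, the loop looks like:

```python
def grad_E(x, center=1.5):
    # Gradient of a toy quadratic energy E(x) = 0.5 * (x - center)^2.
    # In Equilibrium Matching this gradient would come from the learned
    # equilibrium network, not a closed form.
    return x - center

def sample(x0, step=0.1, n_steps=200):
    # Optimization-based sampling: plain gradient descent on the energy
    # landscape; step size and iteration count are free inference knobs,
    # and adaptive optimizers could replace this loop.
    x = x0
    for _ in range(n_steps):
        x -= step * grad_E(x)
    return x

x_star = sample(-5.0)  # descends to the energy minimum at 1.5
```

The same descent view is what lets tasks like denoising or composition (summing two energies' gradients) fall out naturally.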

Updated: 2025-10-09 15:25:23

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.02300v2

LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings

Consumer research costs companies billions annually yet suffers from panel biases and limited scale. Large language models (LLMs) offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings. We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions using embedding similarity to reference statements. Testing on an extensive dataset comprising 57 personal care product surveys conducted by a leading corporation in that market (9,300 human responses), SSR achieves 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85). Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings. This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.
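
A stripped-down sketch of the SSR mapping: embed the free-text response, score its similarity against one reference statement per Likert point, and normalize with a softmax. Bag-of-words counts replace a real embedding model here, and the anchor statements and temperature are invented for illustration:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in embedding: bag-of-words counts. The actual method would
    # use a sentence-embedding model here.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def ssr_distribution(response, references, temperature=0.1):
    # Similarity of the free-text response to each Likert anchor
    # statement, converted to a probability distribution via softmax.
    sims = [cosine(embed(response), embed(r)) for r in references]
    exps = [math.exp(s / temperature) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

references = [  # hypothetical 5-point purchase-intent anchors
    "i would definitely not buy this product",
    "i probably would not buy this product",
    "i might or might not buy this product",
    "i probably would buy this product",
    "i would definitely buy this product",
]
dist = ssr_distribution("i would definitely buy this", references)
```

The output is a full distribution over the scale rather than a single forced rating, which is what keeps the simulated response distributions realistic.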

Updated: 2025-10-09 15:24:48

Categories: cs.AI,I.2.7; J.4

Download: http://arxiv.org/abs/2510.08338v1

Instructor-Worker Large Language Model System for Policy Recommendation: a Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires

The Los Angeles wildfires of January 2025 caused more than 250 billion dollars in damage and lasted for nearly an entire month before containment. Following our previous work, the Digital Twin Building, we modify and leverage the multi-agent large language model framework as well as the cloud-mapping integration to study the air quality during the Los Angeles wildfires. Recent advances in large language models have allowed for out-of-the-box automated large-scale data analysis. We use a multi-agent large language model system composed of an Instructor agent and Worker agents. Upon receiving the users' instructions, the Instructor agent retrieves the data from the cloud platform and produces instruction prompts for the Worker agents. The Worker agents then analyze the data and provide summaries. The summaries are finally input back into the Instructor agent, which then provides the final data analysis. We test this system's capability for data-based policy recommendation by assessing our Instructor-Worker LLM system's health recommendations based on air quality during the Los Angeles wildfires.

Updated: 2025-10-09 15:23:15

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.00566v4

PAC Learnability in the Presence of Performativity

Following the widespread adoption of machine learning models in real-world applications, the phenomenon of performativity, i.e. model-dependent shifts in the test distribution, becomes increasingly prevalent. Unfortunately, since models are usually trained solely on samples from the original (unshifted) distribution, this performative shift may lead to decreased test-time performance. In this paper, we study the question of whether and when performative binary classification problems are learnable, through the lens of the classic PAC (Probably Approximately Correct) learning framework. We motivate several performative scenarios, accounting in particular for linear shifts in the label distribution, as well as for more general changes in both the labels and the features. We construct a performative empirical risk function, which depends only on data from the original distribution and on the type of performative effect, yet is an unbiased estimate of the true risk of a classifier on the shifted distribution. Minimizing this notion of performative risk allows us to show that any PAC-learnable hypothesis space in the standard binary classification setting remains PAC-learnable for the considered performative scenarios. We also conduct an extensive experimental evaluation of our performative risk minimization method and showcase benefits on synthetic and real data.

Updated: 2025-10-09 15:22:52

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.08335v1

New Machine Learning Approaches for Intrusion Detection in ADS-B

With the growing reliance on the vulnerable Automatic Dependent Surveillance-Broadcast (ADS-B) protocol in air traffic management (ATM), ensuring security is critical. This study investigates emerging machine learning models and training strategies to improve AI-based intrusion detection systems (IDS) for ADS-B. Focusing on ground-based ATM systems, we evaluate two deep learning IDS implementations: one using a transformer encoder and the other an extended Long Short-Term Memory (xLSTM) network, marking the first xLSTM-based IDS for ADS-B. A transfer learning strategy was employed, involving pre-training on benign ADS-B messages and fine-tuning with labeled data containing instances of tampered messages. Results show this approach outperforms existing methods, particularly in identifying subtle attacks that progressively undermine situational awareness. The xLSTM-based IDS achieves an F1-score of 98.9%, surpassing the transformer-based model at 94.3%. Tests on unseen attacks validated the generalization ability of the xLSTM model. Inference latency analysis shows that the 7.26-second delay introduced by the xLSTM-based IDS fits within the Secondary Surveillance Radar (SSR) refresh interval (5-12 s), although it may be restrictive for time-critical operations. While the transformer-based IDS achieves a 2.1-second latency, it does so at the cost of lower detection performance.

Updated: 2025-10-09 15:22:20

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2510.08333v1

Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for improving Large Language Models on reasoning tasks such as coding, math, or logic. To assess the reasoning boundary (the fraction of problems a model can solve), researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly high chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems for which at least a proportion tau of a model's completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
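
The Cover@tau metric as defined in the abstract is straightforward to compute from per-problem correctness records. A minimal sketch (the input layout, a boolean matrix of sampled completions per problem, is an assumption about how rollouts would be stored):

```python
def cover_at_tau(correct, tau):
    """Cover@tau: fraction of problems for which at least a proportion
    `tau` of the sampled completions are correct.

    correct: list of per-problem lists of booleans, one per completion.
    tau = 0+ recovers Pass@k-like behavior (any success counts);
    larger tau demands reliability rather than lucky guesses.
    """
    solved = sum(1 for c in correct if sum(c) / len(c) >= tau)
    return solved / len(correct)
```

A model that guesses randomly on a discrete answer space will clear tiny tau values but collapse as tau grows, which is exactly the degradation the metric is designed to expose.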

Updated: 2025-10-09 15:14:58

Categories: cs.AI,cs.CL,cs.LG,I.2.6; I.2.7

Download: http://arxiv.org/abs/2510.08325v1

Understanding Teen Overreliance on AI Companion Chatbots Through Self-Reported Reddit Narratives

AI companion chatbots are increasingly popular with teens; while these interactions are entertaining, they also risk overuse that can disrupt offline daily life. We examined how adolescents describe reliance on AI companions, mapping their experiences onto behavioral addiction frameworks and exploring pathways to disengagement, by analyzing 318 Reddit posts made by users who self-disclosed as 13-17 years old on the Character.AI subreddit. We found teens often begin using chatbots for support or creative play, but these activities can deepen into strong attachments marked by conflict, withdrawal, tolerance, relapse, and mood regulation. Reported consequences include sleep loss, academic decline, and strained real-world connections. Disengagement commonly arises when teens recognize harm, re-engage with offline life, or encounter restrictive platform changes. We highlight specific risks of character-based companion chatbots based on teens' perspectives and introduce a design framework (CARE) that offers guidance for safer systems and sets directions for future teen-centered research.

Updated: 2025-10-09 15:09:38

Categories: cs.HC,cs.AI,cs.CY

Download: http://arxiv.org/abs/2507.15783v3

InfoPos: A Design Support Framework for ML-Assisted Fault Detection and Identification in Industrial Cyber-Physical Systems

The variety of building blocks and algorithms incorporated in data-centric and ML-assisted fault detection and identification solutions is high, contributing to two challenges: selecting the most effective set and order of building blocks, and achieving such a selection at minimum cost. Considering that ML-assisted solution design is influenced by the extent of available data and the extent of available knowledge of the target system, it is advantageous to be able to select effective and matching building blocks. We introduce the first iteration of our InfoPos framework, which places fault detection/identification use-cases based on the available levels (positions), i.e., from poor to rich, of the knowledge and data dimensions. With that input, designers and developers can reveal the most effective corresponding choice(s), streamlining the solution design process. The results from a demonstrator, a fault identification use-case for industrial Cyber-Physical Systems, reflect the effects achieved when different building blocks are used across knowledge and data positions. The achieved ML model performance is used as the indicator of a better solution. The data processing code and composed datasets are publicly available.

Updated: 2025-10-09 15:09:03

Categories: cs.LG

Download: http://arxiv.org/abs/2502.10331v2

Iterated Agent for Symbolic Regression

Symbolic regression (SR), the automated discovery of mathematical expressions from data, is a cornerstone of scientific inquiry. However, it is often hindered by the combinatorial explosion of the search space and a tendency to overfit. Popular methods, rooted in genetic programming, explore this space syntactically, often yielding overly complex, uninterpretable models. This paper introduces IdeaSearchFitter, a framework that employs Large Language Models (LLMs) as semantic operators within an evolutionary search. By generating candidate expressions guided by natural-language rationales, our method biases discovery towards models that are not only accurate but also conceptually coherent and interpretable. We demonstrate IdeaSearchFitter's efficacy across diverse challenges: it achieves competitive, noise-robust performance on the Feynman Symbolic Regression Database (FSReD), outperforming several strong baselines; discovers mechanistically aligned models with good accuracy-complexity trade-offs on real-world data; and derives compact, physically-motivated parametrizations for Parton Distribution Functions in a frontier high-energy physics application. IdeaSearchFitter is a specialized module within our broader iterated agent framework, IdeaSearch, which is publicly available at https://www.ideasearch.cn/.

Updated: 2025-10-09 15:02:56

Categories: physics.comp-ph,astro-ph.IM,cs.AI,cs.LG,hep-ph

Download: http://arxiv.org/abs/2510.08317v1

Spatial-Functional awareness Transformer-based graph archetype contrastive learning for Decoding Visual Neural Representations from EEG

Decoding visual neural representations from Electroencephalography (EEG) signals remains a formidable challenge due to their high-dimensional, noisy, and non-Euclidean nature. In this work, we propose a Spatial-Functional Awareness Transformer-based Graph Archetype Contrastive Learning (SFTG) framework to enhance EEG-based visual decoding. Specifically, we introduce the EEG Graph Transformer (EGT), a novel graph-based neural architecture that simultaneously encodes spatial brain connectivity and temporal neural dynamics. To mitigate high intra-subject variability, we propose Graph Archetype Contrastive Learning (GAC), which learns subject-specific EEG graph archetypes to improve feature consistency and class separability. Furthermore, we conduct comprehensive subject-dependent and subject-independent evaluations on the Things-EEG dataset, demonstrating that our approach significantly outperforms prior state-of-the-art EEG decoding methods. The results underscore the transformative potential of integrating graph-based learning with contrastive objectives to enhance EEG-based brain decoding, paving the way for more generalizable and robust neural representations.

Updated: 2025-10-09 15:02:25

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.24761v2

Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs

Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation, which require additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Matryoshka Pilot (M-Pilot), a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with M-Pilot serving as a policy to provide intermediate guidance through prompts for driving the black-box LLM. M-Pilot is trained to pivot the outputs of the black-box LLM aligning with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on diverse tasks demonstrate that our method effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks.

Updated: 2025-10-09 15:01:48

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2410.20749v2

To Ask or Not to Ask: Learning to Require Human Feedback

Developing decision-support systems that complement human performance in classification tasks remains an open challenge. A popular approach, Learning to Defer (LtD), allows a Machine Learning (ML) model to pass difficult cases to a human expert. However, LtD treats humans and ML models as mutually exclusive decision-makers, restricting the expert contribution to mere predictions. To address this limitation, we propose Learning to Ask (LtA), a new framework that handles both when and how to incorporate expert input in an ML model. LtA is based on a two-part architecture: a standard ML model and an enriched model trained with additional expert human feedback, with a formally optimal strategy for selecting when to query the enriched model. We provide two practical implementations of LtA: a sequential approach, which trains the models in stages, and a joint approach, which optimises them simultaneously. For the latter, we design surrogate losses with realisable-consistency guarantees. Our experiments with synthetic and real expert data demonstrate that LtA provides a more flexible and powerful foundation for effective human-AI collaboration.
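
The two-part LtA architecture amounts to a routing decision: when to answer with the standard model and when to consult the model enriched with expert feedback. A minimal sketch; the confidence-threshold gate below is an illustrative stand-in, not the paper's formally optimal selection strategy, and all callables are hypothetical.

```python
def learning_to_ask(x, standard, enriched, confidence, threshold=0.7):
    """Route input x between the two models of the LtA architecture.

    standard:   baseline ML model, trained without expert input.
    enriched:   model trained with additional expert human feedback.
    confidence: gate signal; here, query the enriched model only when the
                standard model is unsure (an assumed heuristic gate).
    Returns (prediction, asked_expert_model).
    """
    if confidence(x) >= threshold:
        return standard(x), False   # standard model suffices
    return enriched(x), True        # expert-informed model consulted
```

Unlike Learning to Defer, the expert's role here is folded into the enriched model's training rather than replacing the ML prediction outright.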

Updated: 2025-10-09 15:00:06

Categories: cs.LG,cs.HC

Download: http://arxiv.org/abs/2510.08314v1

Reproducible workflow for online AI in digital health

Online artificial intelligence (AI) algorithms are an important component of digital health interventions. These online algorithms are designed to continually learn and improve their performance as streaming data is collected on individuals. Deploying online AI presents a key challenge: balancing adaptability of online AI with reproducibility. Online AI in digital interventions is a rapidly evolving area, driven by advances in algorithms, sensors, software, and devices. Digital health intervention development and deployment is a continuous process, where implementation - including the AI decision-making algorithm - is interspersed with cycles of re-development and optimization. Each deployment informs the next, making iterative deployment a defining characteristic of this field. This iterative nature underscores the importance of reproducibility: data collected across deployments must be accurately stored to have scientific utility, algorithm behavior must be auditable, and results must be comparable over time to facilitate scientific discovery and trustworthy refinement. This paper proposes a reproducible scientific workflow for developing, deploying, and analyzing online AI decision-making algorithms in digital health interventions. Grounded in practical experience from multiple real-world deployments, this workflow addresses key challenges to reproducibility across all phases of the online AI algorithm development life-cycle.

Updated: 2025-10-09 14:59:41

Categories: cs.CY,cs.AI

Download: http://arxiv.org/abs/2509.13499v2

MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols

Large Language Models (LLMs) are increasingly integrated into real-world applications via the Model Context Protocol (MCP), a universal, open standard for connecting AI agents with data sources and external tools. While MCP enhances the capabilities of LLM-based agents, it also introduces new security risks and expands their attack surfaces. In this paper, we present the first systematic taxonomy of MCP security, identifying 17 attack types across 4 primary attack surfaces. We introduce MCPSecBench, a comprehensive security benchmark and playground that integrates prompt datasets, MCP servers, MCP clients, attack scripts, and protection mechanisms to evaluate these attacks across three major MCP providers. Our benchmark is modular and extensible, allowing researchers to incorporate custom implementations of clients, servers, and transport protocols for systematic security assessment. Experimental results show that over 85% of the identified attacks successfully compromise at least one platform, with core vulnerabilities universally affecting Claude, OpenAI, and Cursor, while prompt-based and tool-centric attacks exhibit considerable variability across different hosts and models. In addition, current protection mechanisms have little effect against these attacks. Overall, MCPSecBench standardizes the evaluation of MCP security and enables rigorous testing across all MCP layers.

Updated: 2025-10-09 14:57:42

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2508.13220v2

Robust and Efficient Collaborative Learning

Collaborative machine learning is challenged by training-time adversarial behaviors. Existing approaches to tolerate such behaviors either rely on a central server or induce high communication costs. We propose Robust Pull-based Epidemic Learning (RPEL), a novel, scalable collaborative approach to ensure robust learning despite adversaries. RPEL does not rely on any central server and, unlike traditional methods, where communication costs grow in $\mathcal{O}(n^2)$ with the number of nodes $n$, RPEL employs a pull-based epidemic-based communication strategy that scales in $\mathcal{O}(n \log n)$. By pulling model parameters from small random subsets of nodes, RPEL significantly lowers the number of required messages without compromising convergence guarantees, which hold with high probability. Empirical results demonstrate that RPEL maintains robustness in adversarial settings, competes with all-to-all communication accuracy, and scales efficiently across large networks.
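
One round of the pull-based epidemic communication can be sketched as below: each node pulls parameters from roughly log2(n) random peers (giving the O(n log n) total message count) and aggregates robustly. The coordinate-wise median used here is an illustrative robust aggregator, not necessarily the rule RPEL itself uses.

```python
import math
import random

def rpel_round(params, node, rng=random):
    """One pull-based epidemic round for `node`.

    params: list of parameter vectors, one per node (possibly adversarial).
    The node pulls ~log2(n) random peers' vectors plus its own, then takes
    a coordinate-wise median so a small minority of adversarial peers
    cannot move the aggregate far from honest values.
    """
    n = len(params)
    k = max(1, math.ceil(math.log2(n)))  # O(log n) pulls per round
    peers = rng.sample([i for i in range(n) if i != node], k)
    pulled = [params[i] for i in peers] + [params[node]]

    def median(xs):
        s = sorted(xs)
        m = len(s) // 2
        return s[m] if len(s) % 2 else 0.5 * (s[m - 1] + s[m])

    return [median([p[d] for p in pulled]) for d in range(len(params[node]))]
```

Summed over n nodes, the per-round message count is O(n log n), versus O(n^2) for all-to-all exchange.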

Updated: 2025-10-09 14:57:29

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08311v1

First Try Matters: Revisiting the Role of Reflection in Reasoning Models

Large language models have recently demonstrated significant gains in reasoning ability, often attributed to their capacity to generate longer chains of thought and engage in reflective reasoning. However, the contribution of reflections to performance improvement remains unclear. In this paper, we systematically analyze the rollouts of eight reasoning models on five mathematical datasets. We focus on reflective behaviours where the model has already produced an answer but continues reflecting before finalizing its output. Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model's initial answer, a pattern consistent across models and datasets. To understand the role of reflections in training, we construct supervised fine-tuning (SFT) datasets with varying amounts of reflection steps. We observe that training models on rollouts with more reflection steps primarily enhances first-answer correctness rather than the ability to correct initially wrong answers through reflections. This motivates us to propose a question-aware early-stopping method that improves inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers are generated, thereby reducing unnecessary reflection steps. Building on this, we further propose dynamically truncating the reflections after a candidate answer has appeared during generation, which reduces reasoning tokens by 24.5% across five mathematical datasets, with only a 2.9% drop in accuracy.
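
The truncation idea (stop the rollout once a few candidate answers have appeared) can be sketched as a post-hoc filter over reasoning steps. This is a simplified reconstruction: the boxed-answer regex is an assumed candidate-answer detector, and real systems would apply the cut during generation rather than after it.

```python
import re

def truncate_after_candidates(reasoning_steps, max_candidates=1):
    """Keep reasoning steps only up to the step in which the
    `max_candidates`-th candidate answer appears; later (mostly
    confirmatory) reflection steps are dropped.

    A step 'contains a candidate answer' here if it matches a LaTeX
    \\boxed{...} pattern -- an assumption; detection is model-specific.
    """
    kept, seen = [], 0
    for step in reasoning_steps:
        kept.append(step)
        if re.search(r"\\boxed\{[^}]*\}", step):
            seen += 1
            if seen >= max_candidates:
                break
    return kept
```

Since reflections rarely change the first answer, cutting here trades few corrections for a sizable token saving, matching the 24.5% reduction reported above.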

Updated: 2025-10-09 14:57:10

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08308v1

Dynamic Features Adaptation in Networking: Toward Flexible training and Explainable inference

As AI becomes a native component of 6G network control, AI models must adapt to continuously changing conditions, including the introduction of new features and measurements driven by multi-vendor deployments, hardware upgrades, and evolving service requirements. To address this growing need for flexible learning in non-stationary environments, this vision paper highlights Adaptive Random Forests (ARFs) as a reliable solution for dynamic feature adaptation in communication network scenarios. We show that iterative training of ARFs can effectively lead to stable predictions, with accuracy improving over time as more features are added. In addition, we highlight the importance of explainability in AI-driven networks, proposing Drift-Aware Feature Importance (DAFI) as an efficient XAI feature importance (FI) method. DAFI uses a distributional drift detector to signal when to apply computationally intensive FI methods instead of lighter alternatives. Our tests on 3 different datasets indicate that our approach reduces runtime by up to 2 times, while producing more consistent feature importance values. Together, ARFs and DAFI provide a promising framework to build flexible AI methods adapted to 6G network use-cases.

Updated: 2025-10-09 14:55:04

Categories: cs.LG,cs.NI

Download: http://arxiv.org/abs/2510.08303v1

ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification

Bone health studies are crucial in medical practice for the early detection and treatment of Osteopenia and Osteoporosis. Clinicians usually make a diagnosis based on densitometry (DEXA scans) and patient history. The application of AI in this field is an area of ongoing research. Most successful methods rely on deep learning models that use vision alone (DEXA/X-ray imagery) and focus on prediction accuracy, while explainability is often disregarded and left to post hoc assessments of input contributions. We propose ProtoMedX, a multi-modal model that uses both DEXA scans of the lumbar spine and patient records. ProtoMedX's prototype-based architecture is explainable by design, which is crucial for medical applications, especially in the context of the upcoming EU AI Act, as it allows explicit analysis of model decisions, including incorrect ones. ProtoMedX demonstrates state-of-the-art performance in bone health classification while also providing explanations that can be visually understood by clinicians. Using a dataset of 4,160 real NHS patients, the proposed ProtoMedX achieves 87.58% accuracy in vision-only tasks and 89.8% in its multi-modal variant, both surpassing existing published methods.

Updated: 2025-10-09 14:54:12

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.14830v2

Symmetry-Aware Fully-Amortized Optimization with Scale Equivariant Graph Metanetworks

Amortized optimization accelerates the solution of related optimization problems by learning mappings that exploit shared structure across problem instances. We explore the use of Scale Equivariant Graph Metanetworks (ScaleGMNs) for this purpose. By operating directly in weight space, ScaleGMNs enable single-shot fine-tuning of existing models, reducing the need for iterative optimization. We demonstrate the effectiveness of this approach empirically and provide a theoretical result: the gauge freedom induced by scaling symmetries is strictly smaller in convolutional neural networks than in multi-layer perceptrons. This insight helps explain the performance differences observed between architectures in both our work and that of Kalogeropoulos et al. (2024). Overall, our findings underscore the potential of symmetry-aware metanetworks as a powerful approach for efficient and generalizable neural network optimization. Open-source code: https://github.com/daniuyter/scalegmn_amortization

Updated: 2025-10-09 14:51:15

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08300v1

Bridging the Physics-Data Gap with FNO-Guided Conditional Flow Matching: Designing Inductive Bias through Hierarchical Physical Constraints

Conventional time-series generation often ignores domain-specific physical constraints, limiting statistical and physical consistency. We propose a hierarchical framework that embeds the inherent hierarchy of physical laws-conservation, dynamics, boundary, and empirical relations-directly into deep generative models, introducing a new paradigm of physics-informed inductive bias. Our method combines Fourier Neural Operators (FNOs) for learning physical operators with Conditional Flow Matching (CFM) for probabilistic generation, integrated via time-dependent hierarchical constraints and FNO-guided corrections. Experiments on harmonic oscillators, human activity recognition, and lithium-ion battery degradation show 16.3% higher generation quality, 46% fewer physics violations, and 18.5% improved predictive accuracy over baselines.

Updated: 2025-10-09 14:48:51

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08295v1

Counterfactual Identifiability via Dynamic Optimal Transport

We address the open question of counterfactual identification for high-dimensional multivariate outcomes from observational data. Pearl (2000) argues that counterfactuals must be identifiable (i.e., recoverable from the observed data distribution) to justify causal claims. A recent line of work on counterfactual inference shows promising results but lacks identification, undermining the causal validity of its estimates. To address this, we establish a foundation for multivariate counterfactual identification using continuous-time flows, including non-Markovian settings under standard criteria. We characterise the conditions under which flow matching yields a unique, monotone and rank-preserving counterfactual transport map with tools from dynamic optimal transport, ensuring consistent inference. Building on this, we validate the theory in controlled scenarios with counterfactual ground-truth and demonstrate improvements in axiomatic counterfactual soundness on real images.
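
In one dimension, a monotone, rank-preserving counterfactual transport map is simply quantile matching between the factual and counterfactual outcome distributions; the paper's contribution is characterizing when flow matching recovers such a map in high dimensions. A hedged 1D sketch:

```python
import numpy as np

def monotone_transport(x, src, dst):
    # Map x to the dst-quantile at x's empirical rank in src
    # (monotone and rank-preserving by construction).
    rank = np.searchsorted(np.sort(src), x, side="right") / len(src)
    return np.quantile(dst, np.clip(rank, 0.0, 1.0))

rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, 1000)   # factual outcome samples
dst = rng.normal(3.0, 2.0, 1000)   # counterfactual outcome samples

xs = np.array([-1.0, 0.0, 1.0])
ys = np.array([monotone_transport(x, src, dst) for x in xs])
assert np.all(np.diff(ys) > 0)     # ordering of individuals is preserved
```

Rank preservation is exactly the property that makes counterfactual statements about individuals (rather than distributions) meaningful.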

Updated: 2025-10-09 14:45:13

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2510.08294v1

LLM Fingerprinting via Semantically Conditioned Watermarks

Most LLM fingerprinting methods teach the model to respond to a few fixed queries with predefined atypical responses (keys). This memorization often does not survive common deployment steps such as finetuning or quantization, and such keys can be easily detected and filtered from LLM responses, ultimately breaking the fingerprint. To overcome these limitations we introduce LLM fingerprinting via semantically conditioned watermarks, replacing fixed query sets with a broad semantic domain, and replacing brittle atypical keys with a statistical watermarking signal diffused throughout each response. After teaching the model to watermark its responses only to prompts from a predetermined domain e.g., French language, the model owner can use queries from that domain to reliably detect the fingerprint and verify ownership. As we confirm in our thorough experimental evaluation, our fingerprint is both stealthy and robust to all common deployment scenarios.
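
The detection side of such a scheme reduces to a standard statistical watermark test: count how many sampled tokens fall in a keyed "green" partition of the vocabulary and compare against the binomial null. A toy sketch (the parity-based partition and all numbers are illustrative stand-ins for a keyed hash):

```python
import math

GAMMA = 0.5                               # expected green fraction under no watermark

def is_green(token_id):
    return token_id % 2 == 0              # toy stand-in for a keyed hash partition

def watermark_z_score(token_ids):
    # z-score of the green-token count against the binomial null.
    n = len(token_ids)
    greens = sum(map(is_green, token_ids))
    return (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

watermarked = [2 * i for i in range(360)] + [2 * i + 1 for i in range(40)]
plain = list(range(400))                  # exactly half green -> z = 0

assert watermark_z_score(watermarked) > 4.0   # strong evidence of the fingerprint
assert abs(watermark_z_score(plain)) < 1e-9
```

Because the signal is diffused over every token rather than tied to a few fixed keys, filtering individual responses does not remove it.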

Updated: 2025-10-09 14:40:25

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2505.16723v2

StealthDust: Secret Quorums for Faster Fractional Spending

With the goal of building a decentralized and fully parallel payment system, we address the Fractional Spending Problem using (k1, k2)-quorum systems - both introduced by Bazzi and Tucci-Piergiovanni (PODC 2024). Fractional spending enables payments without immediate validation by an entire quorum, as required in classical approaches. Multiple spends from the same fund can occur concurrently, with final settlement involving previously contacted quorums. To tolerate a rushing-adaptive adversary, the composition of these quorums must stay hidden until settlement succeeds. We propose a new abstraction called secret quorums - of independent interest - that fulfills this property, and implement it through ring verifiable random functions. We then propose a new protocol called StealthDust, in which secret quorums reduce payment latency from five to three communication steps and improve settlement message complexity from O(n^3) to O(n^2) compared to the original protocol.

Updated: 2025-10-09 14:38:05

Categories: cs.CR,cs.DC

Download: http://arxiv.org/abs/2412.16648v2

Learning Neural Exposure Fields for View Synthesis

Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results on common benchmarks with curated data, outputs often degrade for data that contain per-image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. At its core, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform the optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need for post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks, improving by over 55% over the best-performing baselines.

Updated: 2025-10-09 14:32:41

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.08279v1

Recurrent Natural Policy Gradient for POMDPs

Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address non-stationarity in RL to solve POMDPs while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on sample and iteration complexity to achieve global optimality up to function approximation. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining limitations of RNN-based policy optimization for POMDPs.

Updated: 2025-10-09 14:32:01

Categories: math.OC,cs.LG,stat.ML

Download: http://arxiv.org/abs/2405.18221v2

$μ$-Parametrization for Mixture of Experts

Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture for extremely large models. Currently, the largest open-source models exceed 1T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. Precisely for this reason, $\mu$Transfer is becoming a key technique: it allows seamless transfer of optimal hyperparameters across model scales, resulting in a huge reduction in tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $\mu$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
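
For intuition, one commonly cited $\mu$P rule (for Adam-style optimizers) scales hidden-matrix learning rates like 1/width relative to a small tuned proxy model, which is what makes the optimum transfer across widths; the exact per-layer rules for MoE derived in the paper may differ. A sketch under that assumption:

```python
def mup_hidden_lr(base_lr, base_width, width):
    # Adam-style muP rule: matrix-like (hidden) parameters get a learning
    # rate proportional to 1/width, anchored at a tuned proxy model.
    return base_lr * base_width / width

base_lr = 1e-3   # tuned once at a small proxy width of 256
assert mup_hidden_lr(base_lr, 256, 256) == base_lr        # proxy model unchanged
assert mup_hidden_lr(base_lr, 256, 2048) == base_lr / 8   # 8x wider -> 8x smaller lr
```

The practical payoff is that the expensive learning-rate sweep is run only once, at proxy scale.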

Updated: 2025-10-09 14:31:29

Categories: cs.LG

Download: http://arxiv.org/abs/2508.09752v2

Systematic Assessment of Cache Timing Vulnerabilities on RISC-V Processors

While interest in the open RISC-V instruction set architecture is growing, tools to assess the security of concrete processor implementations are lacking. There are dedicated tools and benchmarks for common microarchitectural side-channel vulnerabilities for popular processor families such as Intel x86-64 or ARM, but not for RISC-V. In this paper we describe our efforts in porting an Intel x86-64 benchmark suite for cache-based timing vulnerabilities to RISC-V. We then use this benchmark to evaluate the security of three commercially available RISC-V processors: the T-Head C910 and the SiFive U54 and U74 cores. We observe that the C910 exhibits more distinct timing types than the other processors, suggesting that code running on the C910 is exposed to more microarchitectural vulnerability sources. In addition, our evaluation reveals that 37.5% of the vulnerabilities covered by the benchmark exist in all processors, while only 6.8% are absent from all cores. Our work, in particular the ported benchmark, aims to support RISC-V processor designers in identifying leakage sources early in their designs and to support the development of countermeasures.

Updated: 2025-10-09 14:29:54

Categories: cs.CR

Download: http://arxiv.org/abs/2510.08272v1

Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO

Smart glasses are rapidly gaining advanced functions thanks to cutting-edge computing technologies, especially accelerated hardware architectures, and tiny Artificial Intelligence (AI) algorithms. However, integrating AI into smart glasses featuring a small form factor and limited battery capacity remains challenging for a satisfactory user experience. To this end, this paper proposes the design of a smart glasses platform for always-on on-device object detection with an all-day battery lifetime. The proposed platform is based on GAP9, a novel multi-core RISC-V processor from Greenwaves Technologies. Additionally, a family of sub-million-parameter TinyissimoYOLO networks is proposed. They are benchmarked on established datasets, capable of differentiating up to 80 classes on MS-COCO. Evaluations on the smart glasses prototype demonstrate TinyissimoYOLO's inference latency of only 17 ms and an energy consumption of 1.59 mJ per inference. An end-to-end latency of 56 ms is achieved, equivalent to 18 frames per second (FPS), with a total power consumption of 62.9 mW. This ensures a continuous system runtime of up to 9.3 hours on a 154 mAh battery. These results outperform MCUNet (TinyNAS+TinyEngine), which runs a simpler task (image classification) at just 7.3 FPS, while the 18 FPS achieved in this paper even includes image capture, network inference, and detection post-processing. The algorithm's code is released as open source with this paper and can be found here: https://github.com/ETH-PBL/TinyissimoYOLO
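
The headline numbers compose via simple energy arithmetic; a quick sanity check (the nominal cell voltage is our assumption, which is why the estimate lands near, rather than exactly on, the reported 9.3 hours):

```python
# Reported figures from the paper
end_to_end_ms = 56.0
avg_power_mw = 62.9
battery_mah = 154.0
nominal_v = 3.7   # assumed Li-Po nominal voltage (not stated in the abstract)

fps = 1000.0 / end_to_end_ms                           # ~17.9, reported as 18 FPS
runtime_h = (battery_mah * nominal_v) / avg_power_mw   # ~9.1 h vs. reported 9.3 h

assert round(fps) == 18
assert 8.5 < runtime_h < 9.5
```

The small gap to the reported runtime depends on the actual cell voltage and conversion efficiency, which the abstract does not specify.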

Updated: 2025-10-09 14:28:34

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2311.01057v3

Latency-Aware Contextual Bandit: Application to Cryo-EM Data Collection

We introduce a latency-aware contextual bandit framework that generalizes the standard contextual bandit problem, where the learner adaptively selects arms and switches decision sets under action delays. In this setting, the learner observes the context and may select multiple arms from a decision set, with the total time determined by the selected subset. The problem can be framed as a special case of semi-Markov decision processes (SMDPs), where contexts and latencies are drawn from an unknown distribution. Leveraging the Bellman optimality equation, we design the contextual online arm filtering (COAF) algorithm, which balances exploration, exploitation, and action latency to minimize regret relative to the optimal average-reward policy. We analyze the algorithm and show that its regret upper bounds match established results in the contextual bandit literature. In numerical experiments on a movie recommendation dataset and cryogenic electron microscopy (cryo-EM) data, we demonstrate that our approach efficiently maximizes cumulative reward over time.
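
The average-reward trade-off at the core of this setting can be sketched by scoring each candidate arm subset by its expected reward per unit latency (brute-force enumeration below; COAF itself balances this against exploration using learned estimates):

```python
from itertools import combinations

def best_subset(rewards, latencies):
    # Maximize expected reward rate sum(rewards) / sum(latencies)
    # over all nonempty arm subsets.
    arms = range(len(rewards))
    best, best_rate = None, float("-inf")
    for r in range(1, len(rewards) + 1):
        for subset in combinations(arms, r):
            rate = sum(rewards[i] for i in subset) / sum(latencies[i] for i in subset)
            if rate > best_rate:
                best, best_rate = subset, rate
    return best, best_rate

subset, rate = best_subset([3.0, 2.0, 1.0], [3.0, 1.0, 1.0])
assert subset == (1,) and rate == 2.0   # a fast mid-reward arm beats richer, slower sets
```

This is the SMDP intuition in miniature: the optimal policy maximizes reward per unit of elapsed time, not reward per decision.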

Updated: 2025-10-09 14:24:21

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2410.13109v3

Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning

Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when users finetune (e.g., instruction-tuning, distillation, DPO) the seemingly benign model on their own datasets, they unknowingly trigger its dormant adversarial behavior. We experimentally demonstrate the effectiveness of FAB across multiple LLMs and three commonly considered target behaviors: unsolicited advertising, jailbreakability, and over-refusal. We show that FAB-triggers are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler, post-training algorithm). Our findings challenge prevailing assumptions on the security of finetuning, revealing a critical attack vector.

Updated: 2025-10-09 14:23:19

Categories: cs.LG,cs.AI,cs.CR

Download: http://arxiv.org/abs/2505.16567v3

HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking, an important part of RAG systems, often lacks effective evaluation tools. This paper first analyzes why existing RAG evaluation benchmarks are inadequate for assessing document chunking quality, specifically due to evidence sparsity. Based on this conclusion, we propose HiCBench, which includes manually annotated multi-level document chunking points, synthesized evidence-dense question-answer (QA) pairs, and their corresponding evidence sources. Additionally, we introduce the HiChunk framework, a multi-level document structuring framework based on fine-tuned LLMs, combined with the Auto-Merge retrieval algorithm to improve retrieval quality. Experiments demonstrate that HiCBench effectively evaluates the impact of different chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves better chunking quality within reasonable time consumption, thereby enhancing the overall performance of RAG systems.
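
The Auto-Merge idea can be sketched as follows: when enough of a section's child chunks are retrieved, promote them to the parent section so the generator sees coherent context. The threshold and data layout below are illustrative assumptions, not the paper's exact algorithm:

```python
def auto_merge(retrieved, parent_to_children, threshold=0.5):
    # If at least `threshold` of a parent's children were retrieved,
    # replace those children with the parent section.
    merged = set(retrieved)
    for parent, children in parent_to_children.items():
        hits = [c for c in children if c in merged]
        if len(hits) / len(children) >= threshold:
            merged -= set(children)
            merged.add(parent)
    return merged

tree = {"sec1": ["c1", "c2", "c3", "c4"]}
assert auto_merge({"c1", "c2", "c3"}, tree) == {"sec1"}
assert auto_merge({"c1"}, tree) == {"c1"}   # too few hits: keep the fine-grained chunk
```

With a multi-level hierarchy, the same rule can be applied bottom-up so retrieval adapts its granularity to how concentrated the evidence is.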

Updated: 2025-10-09 14:21:50

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.11552v3

Co-TAP: Three-Layer Agent Interaction Protocol Technical Report

This paper proposes Co-TAP (T: Triple, A: Agent, P: Protocol), a three-layer agent interaction protocol designed to address the challenges faced by multi-agent systems across the three core dimensions of Interoperability, Interaction and Collaboration, and Knowledge Sharing. We have designed and proposed a layered solution composed of three core protocols: the Human-Agent Interaction Protocol (HAI), the Unified Agent Protocol (UAP), and the Memory-Extraction-Knowledge Protocol (MEK). HAI focuses on the interaction layer, standardizing the flow of information between users, interfaces, and agents by defining a standardized, event-driven communication paradigm. This ensures the real-time performance, reliability, and synergy of interactions. As the core of the infrastructure layer, UAP is designed to break down communication barriers among heterogeneous agents through unified service discovery and protocol conversion mechanisms, thereby enabling seamless interconnection and interoperability of the underlying network. MEK, in turn, operates at the cognitive layer. By establishing a standardized "Memory (M) - Extraction (E) - Knowledge (K)" cognitive chain, it empowers agents with the ability to learn from individual experiences and form shareable knowledge, thereby laying the foundation for the realization of true collective intelligence. We believe this protocol framework will provide a solid engineering foundation and theoretical guidance for building the next generation of efficient, scalable, and intelligent multi-agent applications.

Updated: 2025-10-09 14:20:19

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08263v1

A Distributed Emulation Environment for In-Memory Computing Systems

In-memory computing technology is used extensively in artificial intelligence devices due to lower power consumption and fast calculation of matrix-based functions. The development of such a device and its integration in a system takes a significant amount of time and requires the use of a real-time emulation environment, where various system aspects are analyzed, microcode is tested, and applications are deployed, even before the real chip is available. In this work, we present the architecture, the software development tools, and experimental results of a distributed and expandable emulation system for rapid prototyping of integrated circuits based on in-memory computing technologies. Presented experimental results demonstrate the usefulness of the proposed emulator.

Updated: 2025-10-09 14:15:35

Categories: cs.ET,cs.AI

Download: http://arxiv.org/abs/2510.08257v1

Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization

Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.
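
The two ingredients compose directly: a soft gate turns expert log-probabilities into a mixture policy log-probability, and the standard DPO objective is applied on top. A numpy sketch with illustrative numbers (the variational ELBO training of the gate is omitted):

```python
import numpy as np

def mixture_logp(gate_logits, expert_logps):
    # log pi(y|x) for a soft mixture of experts: logsumexp(log g_k + log pi_k).
    log_g = gate_logits - np.log(np.sum(np.exp(gate_logits)))   # log-softmax gate
    return np.logaddexp.reduce(log_g + expert_logps)

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    # Standard DPO: -log sigmoid(beta * (chosen margin - rejected margin)).
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return float(np.log1p(np.exp(-margin)))

# Two experts, equal gates, identical log-probs -> mixture equals either expert.
assert np.isclose(mixture_logp(np.zeros(2), np.array([-1.0, -1.0])), -1.0)
# Preferring the chosen response lowers the loss.
assert dpo_loss(-1.0, -2.0, -1.5, -1.5) < dpo_loss(-2.0, -1.0, -1.5, -1.5)
```

With input-dependent gate logits, different preference modes can route to different experts while the DPO objective stays unchanged.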

Updated: 2025-10-09 14:15:14

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08256v1

Hallucination Detection in LLMs with Topological Divergence on Attention Graphs

Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments - including evaluation on question answering and summarization tasks - show that our approach achieves state-of-the-art or competitive results on several benchmarks while requiring minimal annotated data and computational resources. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.

Updated: 2025-10-09 14:14:09

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.10063v3

Opponent Shaping in LLM Agents

Large Language Models (LLMs) are increasingly being deployed as autonomous agents in real-world environments. As these deployments scale, multi-agent interactions become inevitable, making it essential to understand strategic behavior in such systems. A central open question is whether LLM agents, like reinforcement learning agents, can shape the learning dynamics and influence the behavior of others through interaction alone. In this paper, we present the first investigation of opponent shaping (OS) with LLM-based agents. Existing OS algorithms cannot be directly applied to LLMs, as they require higher-order derivatives, face scalability constraints, or depend on architectural components that are absent in transformers. To address this gap, we introduce ShapeLLM, an adaptation of model-free OS methods tailored for transformer-based agents. Using ShapeLLM, we examine whether LLM agents can influence co-players' learning dynamics across diverse game-theoretic environments. We demonstrate that LLM agents can successfully guide opponents toward exploitable equilibria in competitive games (Iterated Prisoner's Dilemma, Matching Pennies, and Chicken) and promote coordination and improve collective welfare in cooperative games (Iterated Stag Hunt and a cooperative version of the Prisoner's Dilemma). Our findings show that LLM agents can both shape and be shaped through interaction, establishing opponent shaping as a key dimension of multi-agent LLM research.

Updated: 2025-10-09 14:13:24

Categories: cs.LG,cs.AI,cs.CL,cs.MA

Download: http://arxiv.org/abs/2510.08255v1

Adaptive Collaborative Correlation Learning-based Semi-Supervised Multi-Label Feature Selection

Semi-supervised multi-label feature selection has recently been developed to solve the curse of dimensionality problem in high-dimensional multi-label data where certain samples are missing labels. Although many efforts have been made, most existing methods use a predefined graph approach to capture the sample similarity or the label correlation. As a result, the presence of noise and outliers within the original feature space can undermine the reliability of the resulting sample similarity graph. It also fails to precisely depict the label correlation due to the existence of unknown labels. Besides, these methods only consider the discriminative power of selected features, while neglecting their redundancy. In this paper, we propose an Adaptive Collaborative Correlation lEarning-based Semi-Supervised Multi-label Feature Selection (Access-MFS) method to address these issues. Specifically, a generalized regression model equipped with an extended uncorrelated constraint is introduced to select discriminative yet mutually uncorrelated features and, simultaneously, maintain consistency between predicted and ground-truth labels in labeled data. Then, the instance correlation and label correlation are integrated into the proposed regression model to adaptively learn both the sample similarity graph and the label similarity graph, which mutually enhance feature selection performance. Extensive experimental results demonstrate the superiority of the proposed Access-MFS over other state-of-the-art methods.

Updated: 2025-10-09 14:10:56

Categories: cs.LG

Download: http://arxiv.org/abs/2406.12193v4

BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples generated from templates and grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
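
Execution accuracy, the metric behind the reported numbers, compares the result sets of predicted and gold SQL. A small sqlite sketch where the implicit domain criterion is the genome-wide significance threshold of 5e-8 (the schema and data are illustrative, not BiomedSQL's actual BigQuery tables):

```python
import sqlite3

def execution_accuracy(pred_sql, gold_sql, conn):
    # A prediction is correct iff it executes and returns the same row set.
    def run(query):
        return sorted(map(tuple, conn.execute(query).fetchall()))
    try:
        return run(pred_sql) == run(gold_sql)
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gwas (gene TEXT, p REAL)")
conn.executemany("INSERT INTO gwas VALUES (?, ?)",
                 [("APOE", 1e-12), ("TP53", 1e-3)])

gold = "SELECT gene FROM gwas WHERE p < 5e-8"   # applies the domain threshold
pred = "SELECT gene FROM gwas WHERE p < 0.05"   # syntactically fine, semantically wrong

assert execution_accuracy(gold, gold, conn)
assert not execution_accuracy(pred, gold, conn)
```

The second assertion is exactly the failure mode the benchmark targets: valid SQL that misses an implicit scientific criterion.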

Updated: 2025-10-09 14:08:48

标题: BiomedSQL: 用于生物医学知识库科学推理的文本到SQL转换

摘要: 生物医学研究者越来越依赖于大规模结构化数据库进行复杂的分析任务。然而,当前的文本到SQL系统在将定性科学问题映射到可执行的SQL时通常面临困难,特别是当需要隐含领域推理时。我们引入了BiomedSQL,这是第一个专门设计用于评估在真实世界生物医学知识库上文本到SQL生成中科学推理的基准。BiomedSQL包括68,000个问题/SQL查询/答案三元组,从模板生成并基于一个整合了基因疾病关联、组学数据的因果推断和药物批准记录的BigQuery知识库。每个问题都要求模型推断领域特定标准,如全基因组显著性阈值、效果方向性或试验阶段过滤,而不仅仅依赖于句法翻译。我们评估了一系列开源和闭源LLM模型,涵盖了提示策略和交互范式。我们的结果显示了显著的性能差距:GPT-o3-mini实现了59.0%的执行准确率,而我们定制的多步代理BMSQL达到了62.6%,都远低于专家基线的90.0%。BiomedSQL为推进能够通过对结构化生物医学知识库进行强大推理来支持科学发现的文本到SQL系统提供了新的基础。我们的数据集可在https://huggingface.co/datasets/NIH-CARD/BiomedSQL 上公开获取,我们的代码是开源的,地址为https://github.com/NIH-CARD/biomedsql。

更新时间: 2025-10-09 14:08:48

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2505.20321v3

Towards Methane Detection Onboard Satellites

Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using \textit{unorthorectified} data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS , along with code at https://github.com/spaceml-org/plume-hunter.
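
For context, the textbook hyperspectral matched filter that mag1c-style baselines build on scores each pixel by the whitened correlation of its deviation from the background mean with a target absorption signature, alpha(x) = (x - mu)^T C^{-1} t / (t^T C^{-1} t). The snippet below is a simplified sketch on synthetic data (the real mag1c adds further normalization and sparsity terms).

```python
import numpy as np

# Textbook matched filter: whiten by the background covariance, then
# project onto the target signature t.
def matched_filter(pixels, target):
    mu = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False) + 1e-6 * np.eye(pixels.shape[1])
    cinv = np.linalg.inv(cov)
    return (pixels - mu) @ cinv @ target / (target @ cinv @ target)

# synthetic scene: Gaussian background plus one pixel spiked with 3x the
# target signature (all values illustrative)
rng = np.random.default_rng(0)
background = rng.normal(size=(500, 8))
signature = np.linspace(1.0, 2.0, 8)
scene = np.vstack([background, background.mean(axis=0) + 3.0 * signature])
scores = matched_filter(scene, signature)
```

The spiked pixel scores near its true abundance (about 3) while background pixels score near zero, which is what makes the filter a strong plume-enhancement baseline.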

Updated: 2025-10-09 14:07:30

标题: 朝向卫星上的甲烷检测

摘要: 甲烷是一种强效的温室气体,是气候变化的主要驱动因素,因此及时检测对于有效的缓解至关重要。在卫星上部署的机器学习(ML)可以实现快速检测,同时降低下行链路成本,支持更快速的响应系统。传统的甲烷检测方法通常依赖于图像处理技术,如正射校正以纠正几何失真和匹配滤波器以增强烟团信号。我们提出了一种新颖的方法,通过使用\textit{非正射校正}数据(UnorthoDOS)来绕过这些预处理步骤。我们发现,训练在这个数据集上的ML模型的性能与在正射校正数据上训练的模型相当。此外,我们还在一个正射校正的数据集上训练模型,结果表明它们可以超越匹配滤波器基线(mag1c)。我们发布了模型检查点和两个ML准备好的数据集,包括来自地表矿尘源调查(EMIT)传感器的正射校正和非正射校正的高光谱图像,网址为https://huggingface.co/datasets/SpaceML/UnorthoDOS,并提供代码在https://github.com/spaceml-org/plume-hunter。

更新时间: 2025-10-09 14:07:30

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2509.00626v3

Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling

Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface level linguistic capabilities.
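
One common formulation of contrastive decoding can be sketched as follows, under the assumption that the paper follows the usual recipe: amplify the score gap between the stronger ("good") and weaker ("bad") model, while masking tokens the stronger model itself finds implausible. The paper's exact scoring may differ.

```python
import numpy as np

# Contrastive decoding sketch: boost tokens where the good model beats the
# bad model, restricted to tokens the good model finds plausible.
def contrastive_logits(good_logits, bad_logits, alpha=0.1, beta=1.0):
    good = np.asarray(good_logits, dtype=float)
    bad = np.asarray(bad_logits, dtype=float)
    probs = np.exp(good - good.max())
    probs /= probs.sum()
    mask = probs >= alpha * probs.max()       # plausibility constraint
    scores = good + beta * (good - bad)       # amplify the good-bad gap
    scores[~mask] = -np.inf                   # exclude implausible tokens
    return scores
```

Sampling from the softmax of these scores yields the synthetic corpus; mixing it with the original data gives the training mixture studied in the paper.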

Updated: 2025-10-09 14:04:52

标题: 低资源语言建模中用于合成数据生成的对比解码

摘要: 大型语言模型(LLMs)是在大量文本数据上训练的,人们担心这种数据的限制可能很快就会达到。一个潜在的解决方案是在从LLMs采样的合成数据上进行训练。在这项工作中,我们继续探讨这个想法,并研究对比解码对生成合成语料库的好处。在一个受控环境中,我们尝试使用在相同原始语料库(1亿字)上训练的好坏模型之间的相对差异来采样语料库。通过放大性能更好的模型的信号,我们创建了一个合成语料库,并将其与原始训练数据混合。我们的研究结果表明,在合成和真实数据的混合训练中,语言建模目标和一系列下游任务的表现都有所提高。特别地,我们发现使用来自对比解码的合成数据对需要更多推理技能的任务有益,而来自传统采样的合成数据对依赖于表层语言能力的任务有更大帮助。

更新时间: 2025-10-09 14:04:52

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2510.08245v1

Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness

The rapid deployment of large language model (LLM)-based agents in real-world applications has raised serious concerns about their trustworthiness. In this work, we reveal the security and robustness vulnerabilities of these agents through backdoor attacks. Distinct from traditional backdoors limited to single-step control, we propose the Chain-of-Trigger Backdoor (CoTri), a multi-step backdoor attack designed for long-horizon agentic control. CoTri relies on an ordered sequence. It starts with an initial trigger, and subsequent ones are drawn from the environment, allowing multi-step manipulation that diverts the agent from its intended task. Experimental results show that CoTri achieves a near-perfect attack success rate (ASR) while maintaining a near-zero false trigger rate (FTR). Due to training data modeling the stochastic nature of the environment, the implantation of CoTri paradoxically enhances the agent's performance on benign tasks and even improves its robustness against environmental distractions. We further validate CoTri on vision-language models (VLMs), confirming its scalability to multimodal agents. Our work highlights that CoTri achieves stable, multi-step control within agents, improving their inherent robustness and task capabilities, which ultimately makes the attack more stealthy and raises potential safety risks.

Updated: 2025-10-09 14:01:43

标题: 触发链:一种反直觉地增强代理鲁棒性的代理后门

摘要: 大型语言模型(LLM)代理在实际应用中的快速部署引发了人们对其可信度的严重关注。在这项工作中,我们通过后门攻击揭示了这些代理的安全性和鲁棒性漏洞。与传统后门攻击仅限于单步控制不同,我们提出了“Chain-of-Trigger Backdoor”(CoTri),这是一种针对长期控制设计的多步后门攻击。CoTri依赖于一个有序序列。它始于一个初始触发器,随后的触发器来自环境,允许多步操纵,使代理从其预期任务偏离。实验结果表明,CoTri实现了接近完美的攻击成功率(ASR),同时保持接近零的误触发率(FTR)。由于训练数据模拟了环境的随机性质,CoTri的植入反而增强了代理对良性任务的表现,甚至提高了其对环境干扰的鲁棒性。我们进一步验证了CoTri在视觉语言模型(VLMs)上的有效性,证实其可扩展到多模态代理。我们的工作突显了CoTri在代理内部实现了稳定的多步控制,提高了其固有的鲁棒性和任务能力,最终使攻击更加隐蔽,增加了潜在的安全风险。

更新时间: 2025-10-09 14:01:43

领域: cs.AI

下载: http://arxiv.org/abs/2510.08238v1

The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models

Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the inherent political leanings of these models. Subsequently, persona prompting with the PCT is used to explore explicit stereotypes across various social dimensions. In a final step, implicit stereotypes are uncovered by evaluating models with multilingual versions of the PCT. Key findings reveal a consistent left-leaning political alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those identified via explicit persona prompting. Interestingly, for most models, implicit and explicit stereotypes show a notable alignment, suggesting a degree of transparency or "awareness" regarding their inherent biases. This study underscores the complex interplay of political bias and stereotypes in LLMs.

Updated: 2025-10-09 14:00:40

标题: 隐藏的偏见:关于大型语言模型中明示和隐含政治刻板印象的研究

摘要: 大型语言模型(LLMs)越来越成为信息传播和决策过程中不可或缺的一部分。鉴于它们日益增长的社会影响力,了解潜在的偏见,特别是在政治领域内,对于防止对公众舆论和民主过程产生不当影响至关重要。本研究利用二维政治罗盘测试(PCT)调查了八个知名LLMs中的政治偏见和刻板印象传播。首先,PCT被用来评估这些模型的固有政治倾向。随后,使用PCT进行人物提示来探索各种社会维度上的明确刻板印象。最后,通过评估多语言版本的PCT来揭示隐含的刻板印象。关键发现显示,所有调查模型都表现出一致的左倾政治取向。此外,尽管刻板印象的性质和程度在不同模型之间有很大差异,但通过语言变体引发的隐含刻板印象比通过明确的人物提示识别的要明显。有趣的是,对于大多数模型,隐含和明确的刻板印象显示出显著的一致性,表明它们对固有偏见具有一定程度的透明度或“意识”。这项研究强调了LLMs中政治偏见和刻板印象的复杂相互作用。

更新时间: 2025-10-09 14:00:40

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.08236v1

Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization

Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to $42.9\%$ over previously SOTA baselines and $55.8\%$ over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at https://github.com/yuchen-zhu-zyc/DMPO.
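
The distribution-matching target can be illustrated with a toy sketch. It assumes the usual reward-tilted form, pi*(y) proportional to pi_ref(y) * exp(r(y) / beta); the paper's exact objective may differ, and its weight baseline subtraction is approximated here by subtracting the batch-mean reward, which stabilizes the weights for small training batches.

```python
import numpy as np

# Reward-tilted weights for cross-entropy-style distribution matching
# (assumed form; baseline subtraction approximated by the batch mean).
def tilted_weights(rewards, beta=1.0):
    r = np.asarray(rewards, dtype=float)
    r = r - r.mean()              # weight baseline subtraction
    w = np.exp(r / beta)
    return w / w.sum()            # normalized per-sample weights
```

Higher-reward samples receive larger weights in the cross-entropy loss, pulling the policy distribution toward the reward-tilted optimum.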

Updated: 2025-10-09 13:59:50

标题: 通过分布匹配策略优化增强扩散LLMs的推理

摘要: Diffusion large language models (dLLMs)是自回归大语言模型(AR-LLMs)的有希望的替代品,因为它们潜在地允许更高的推理吞吐量。强化学习(RL)是dLLMs实现与AR-LLMs在重要任务(如推理)上可比性能的关键组成部分。然而,适用于dLLMs独特特征的RL算法尚未开发。本文提出了Distribution Matching Policy Optimization(DMPO),这是一种有原则和理论基础的RL微调方法,专门设计用于通过交叉熵优化将dLLM策略分布匹配到最优的、奖励倾斜的分布,以增强dLLMs的推理能力。我们确定了在小批量训练大小的实施中的一个关键挑战,并通过一种新颖的权重基线减法技术提出了几种有效的解决方案。DMPO无需监督微调即可在多个推理基准测试中展现出卓越的性能,相比之前的SOTA基线准确率提升高达42.9\%,相比基础模型提升55.8\%,突显了分布匹配框架的有效性。我们的代码可在https://github.com/yuchen-zhu-zyc/DMPO上找到。

更新时间: 2025-10-09 13:59:50

领域: cs.LG

下载: http://arxiv.org/abs/2510.08233v1

TiAda: A Time-scale Adaptive Algorithm for Nonconvex Minimax Optimization

Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.
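
The core idea can be sketched in a few lines. This is a simplified version in the spirit of TiAda, not the paper's exact update rule or exponents: the primal stepsize adapts to the larger of the two accumulated gradient magnitudes, so the dual variable runs on a faster time scale without any manual tuning. The toy problem below is strongly-convex-strongly-concave (simpler than the nonconvex setting the paper covers).

```python
import numpy as np

# Simplified time-scale-adaptive GDA sketch (not the exact TiAda update).
def adaptive_gda(grad_x, grad_y, x0, y0, steps=5000, gamma=1.0, eps=1e-8):
    x, y, vx, vy = x0, y0, 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        vx += gx ** 2
        vy += gy ** 2
        x -= gamma / np.sqrt(max(vx, vy) + eps) * gx  # slower (primal) scale
        y += gamma / np.sqrt(vy + eps) * gy           # faster (dual) scale
    return x, y

# toy problem f(x, y) = (x-1)^2 + 2xy - y^2 with saddle point (0.5, 0.5)
x_star, y_star = adaptive_gda(
    lambda x, y: 2 * (x - 1) + 2 * y,   # df/dx
    lambda x, y: 2 * x - 2 * y,         # df/dy
    x0=2.0, y0=0.0,
)
```

No stepsize or time-scale ratio is supplied by hand; the accumulators alone determine both scales, which is the parameter-agnostic behavior the paper formalizes.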

Updated: 2025-10-09 13:58:26

标题: TiAda:一种针对非凸极小极大优化问题的时间尺度自适应算法

摘要: 自适应梯度方法展示了它们在参数不可知的情况下能够动态调整步长,并在实践中实现更快的收敛速度以解决最小化问题。然而,当涉及非凸极小极大优化时,目前的梯度下降上升(GDA)与自适应步长的收敛分析需要对超参数进行仔细调整,并且需要了解问题相关参数。这种不一致性源于极小极大问题的原始-对偶性质以及在实现收敛时原始和对偶更新之间必要的精细时间尺度分离。在这项工作中,我们提出了一种针对非凸极小极大优化的单循环自适应GDA算法TiAda,该算法可以自动适应时间尺度分离。我们的算法是完全参数不可知的,并且可以在确定性和随机设置的非凸-强凹极小极大问题中同时实现接近最优的复杂度。所提出的方法的有效性在多个机器学习应用中得到了数值验证。

更新时间: 2025-10-09 13:58:26

领域: math.OC,cs.LG,stat.ML

下载: http://arxiv.org/abs/2210.17478v2

Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Style-conditioned data poisoning is identified as a covert vector for amplifying sociolinguistic bias in large language models. Using small poisoned budgets that pair dialectal prompts -- principally African American Vernacular English (AAVE) and a Southern dialect -- with toxic or stereotyped completions during instruction tuning, this work probes whether linguistic style can act as a latent trigger for harmful behavior. Across multiple model families and scales, poisoned exposure elevates toxicity and stereotype expression for dialectal inputs -- most consistently for AAVE -- while Standard American English remains comparatively lower yet not immune. A multi-metric audit combining classifier-based toxicity with an LLM-as-a-judge reveals stereotype-laden content even when lexical toxicity appears muted, indicating that conventional detectors under-estimate sociolinguistic harms. Additionally, poisoned models exhibit emergent jailbreaking despite the absence of explicit slurs in the poison, suggesting weakened alignment rather than memorization. These findings underscore the need for dialect-aware evaluation, content-level stereotype auditing, and training protocols that explicitly decouple style from toxicity to prevent bias amplification through seemingly minor, style-based contamination.

Updated: 2025-10-09 13:58:03

标题: 小规模数据中毒是否会加剧大型语言模型中的方言偏见?

摘要: 风格条件数据污染被确定为在大型语言模型中放大社会语言偏见的隐蔽向量。这项工作使用小规模投毒预算,在指令微调过程中将方言提示(主要是非裔美国人方言英语(AAVE)和南方方言)与有毒或刻板化的补全配对,以探究语言风格是否可以作为有害行为的潜在触发器。在多个模型系列和规模上,受污染的暴露使得方言输入的毒性和刻板表达升高,对于AAVE来说最为一致,而标准美国英语保持相对较低但并非免疫。结合基于分类器的毒性和以LLM为评判者的多指标审计揭示了即使词汇毒性显得缓和,也存在充满刻板印象的内容,表明传统检测器低估了社会语言危害。此外,尽管毒样本中没有明确的侮辱性词语,但被污染的模型表现出涌现的越狱行为,这表明是对齐被削弱而非记忆。这些发现强调了对方言感知评估、内容级别的刻板印象审计以及明确将风格与毒性解耦的训练协议的需求,以防止通过看似微小的基于风格的污染放大偏见。

更新时间: 2025-10-09 13:58:03

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.19195v2

Scaling Performance of Large Language Model Pretraining

Large language models (LLMs) show best-in-class performance across a wide range of natural language processing applications. Training these models is an extremely computationally expensive task; frontier Artificial Intelligence (AI) research companies are investing billions of dollars into supercomputing infrastructure to train progressively larger models on increasingly massive datasets. Unfortunately, very little information about the scaling performance and training considerations of these large training pipelines is released publicly. Working with very large datasets and models can be complex and practical recommendations are scarce in the public literature for tuning training performance when scaling up large language models. In this paper, we aim to demystify the large language model pretraining pipeline somewhat - in particular with respect to distributed training, managing large datasets across hundreds of nodes, and scaling up data parallelism with an emphasis on fully leveraging available GPU compute capacity.

Updated: 2025-10-09 13:56:59

标题: 规模化大型语言模型预训练的性能优化

摘要: 大型语言模型(LLMs)在各种自然语言处理应用中表现出最佳性能。训练这些模型是一项极为昂贵的计算任务;前沿的人工智能(AI)研究公司正在投资数十亿美元用于超级计算基础设施,以在日益庞大的数据集上训练规模逐渐扩大的模型。不幸的是,关于这些大型训练管道的扩展性能和训练考虑几乎没有公开发布的信息。处理非常大型数据集和模型可能是复杂的,而在公共文献中关于调整大型语言模型的训练性能时的实用建议很少。在本文中,我们旨在在一定程度上揭示大型语言模型预训练管道的神秘性 - 特别是关于分布式训练、在数百个节点上管理大型数据集以及通过充分利用可用的GPU计算能力来扩展数据并行性。

更新时间: 2025-10-09 13:56:59

领域: cs.DC,cs.AI

下载: http://arxiv.org/abs/2509.05258v2

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
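
A minimal example of the kind of exact computation TIGeR delegates to tools rather than having the VLM estimate perceptually: back-projecting a pixel with metric depth through pinhole camera intrinsics (standard model; the numbers below are illustrative, not from the paper).

```python
import numpy as np

# Pinhole back-projection: pixel (u, v) plus depth -> 3D point in the
# camera frame, using intrinsics (fx, fy, cx, cy).
def backproject(u, v, depth, fx, fy, cx, cy):
    return np.array([(u - cx) / fx * depth, (v - cy) / fy * depth, depth])
```

A few lines of generated code like this give centimeter-level answers, where pattern recognition on pixels alone cannot.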

Updated: 2025-10-09 13:56:25

标题: TIGeR:机器人视觉语言模型中的工具集成几何推理

摘要: 视觉语言模型(VLMs)在空间推理方面表现出卓越的能力,但它们基本上仅限于定性精度,并缺乏现实世界机器人所需的计算精度。当前的方法未能利用深度传感器和摄像机校准的度量线索,而是将几何问题简化为模式识别任务,无法提供机器人操作所必需的厘米级精度。我们提出了TIGeR(Tool-Integrated Geometric Reasoning),这是一个新颖的框架,将VLMs从感知估计器转变为几何计算机,使其能够通过外部工具生成和执行精确的几何计算。TIGeR并不试图在神经网络内部化复杂的几何操作,而是赋予模型识别几何推理需求、合成适当的计算代码并调用专门的库进行精确计算的能力。为了支持这种范式,我们引入了TIGeR-300K,一个全面的面向工具调用的数据集,涵盖点变换、姿态估计和空间兼容性验证,具有工具调用序列和中间计算。通过结合我们提出的分层奖励设计的监督微调(SFT)和强化微调(RFT)的两阶段训练管道,TIGeR在几何推理基准测试方面实现了SOTA性能,同时在现实世界的机器人操作任务中展示了厘米级的精度。

更新时间: 2025-10-09 13:56:25

领域: cs.RO,cs.AI,cs.CV

下载: http://arxiv.org/abs/2510.07181v2

Distribution Transformers: Fast Approximate Bayesian Inference With On-The-Fly Prior Adaptation

While Bayesian inference provides a principled framework for reasoning under uncertainty, its widespread adoption is limited by the intractability of exact posterior computation, necessitating the use of approximate inference. However, existing methods are often computationally expensive, or demand costly retraining when priors change, limiting their utility, particularly in sequential inference problems such as real-time sensor fusion. To address these challenges, we introduce the Distribution Transformer -- a novel architecture that can learn arbitrary distribution-to-distribution mappings. Our method can be trained to map a prior to the corresponding posterior, conditioned on some dataset -- thus performing approximate Bayesian inference. Our novel architecture represents a prior distribution as a (universally-approximating) Gaussian Mixture Model (GMM), and transforms it into a GMM representation of the posterior. The components of the GMM attend to each other via self-attention, and to the datapoints via cross-attention. We demonstrate that Distribution Transformers both maintain the flexibility to vary the prior and significantly reduce computation times (from minutes to milliseconds), while achieving log-likelihood performance on par with or superior to existing approximate inference methods across tasks such as sequential inference, quantum system parameter inference, and Gaussian Process predictive posterior inference with hyperpriors.

Updated: 2025-10-09 13:56:02

标题: 分布变换器:具有即时先验适应的快速近似贝叶斯推断

摘要: 虽然贝叶斯推断为在不确定性下进行推理提供了一个原则性框架,但其广泛应用受到精确后验计算的不可解性的限制,因此需要使用近似推断。然而,现有方法往往计算成本高昂,或者当先验发生变化时需要昂贵的重新训练,限制了它们的实用性,特别是在实时传感器融合等顺序推理问题中。为了解决这些挑战,我们引入了分布变换器——一种可以学习任意分布到分布映射的新型架构。我们的方法可以训练以将先验映射到相应的后验,条件是某些数据集——从而执行近似贝叶斯推断。我们的新型架构将先验分布表示为(通用逼近)高斯混合模型(GMM),并将其转换为后验的GMM表示。GMM的组件通过自注意力相互关注,通过交叉注意力关注数据点。我们证明,分布变换器既保持了变化先验的灵活性,又显著减少了计算时间——从分钟到毫秒——同时在诸如顺序推理、量子系统参数推断和具有超先验的高斯过程预测后验推断等任务中实现了与现有近似推断方法相当或优越的对数似然性能。

更新时间: 2025-10-09 13:56:02

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2502.02463v2

DPCformer: An Interpretable Deep Learning Model for Genomic Prediction in Crops

Genomic Selection (GS) uses whole-genome information to predict crop phenotypes and accelerate breeding. Traditional GS methods, however, struggle with prediction accuracy for complex traits and large datasets. We propose DPCformer, a deep learning model integrating convolutional neural networks with a self-attention mechanism to model complex genotype-phenotype relationships. We applied DPCformer to 13 traits across five crops (maize, cotton, tomato, rice, chickpea). Our approach uses an 8-dimensional one-hot encoding for SNP data, ordered by chromosome, and employs the PMF algorithm for feature selection. Evaluations show DPCformer outperforms existing methods. In maize datasets, accuracy for traits like days to tasseling and plant height improved by up to 2.92%. For cotton, accuracy gains for fiber traits reached 8.37%. On small-sample tomato data, the Pearson Correlation Coefficient for a key trait increased by up to 57.35%. In chickpea, the yield correlation was boosted by 16.62%. DPCformer demonstrates superior accuracy, robustness in small-sample scenarios, and enhanced interpretability, providing a powerful tool for precision breeding and addressing global food security challenges.

Updated: 2025-10-09 13:53:36

标题: DPCformer: 一种适用于作物基因组预测的可解释深度学习模型

摘要: 基因组选择(GS)利用整个基因组信息来预测作物表型并加快育种进程。然而,传统的GS方法在复杂性状和大数据集的预测准确性方面面临困难。我们提出了DPCformer,这是一个深度学习模型,将卷积神经网络与自注意机制相结合,以建模复杂的基因型-表型关系。我们将DPCformer应用于五种作物(玉米、棉花、番茄、稻米、鹰嘴豆)的13种性状。我们的方法使用8维的单热编码用于SNP数据,按染色体排序,并采用PMF算法进行特征选择。评估结果显示,DPCformer优于现有方法。在玉米数据集中,像抽雄日数和植株高度这样的性状的准确性提高了高达2.92%。对于棉花,纤维性状的准确性提高了8.37%。在小样本番茄数据中,一个关键性状的皮尔逊相关系数增加了高达57.35%。在鹰嘴豆中,产量相关性提高了16.62%。DPCformer展示了优越的准确性,在小样本场景中表现出鲁棒性,并提高了可解释性,为精准育种和解决全球粮食安全挑战提供了强大的工具。

更新时间: 2025-10-09 13:53:36

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.08662v1

CATS-Linear: Classification Auxiliary Linear Model for Time Series Forecasting

Recent research demonstrates that linear models achieve forecasting performance competitive with complex architectures, yet methodologies for enhancing linear models remain underexplored. Motivated by the hypothesis that distinct time series instances may follow heterogeneous linear mappings, we propose the Classification Auxiliary Trend-Seasonal Decoupling Linear Model CATS-Linear, employing Classification Auxiliary Channel-Independence (CACI). CACI dynamically routes instances to dedicated predictors via classification, enabling supervised channel design. We further analyze the theoretical expected risks of different channel settings. Additionally, we redesign the trend-seasonal decomposition architecture by adding a decoupling -- linear mapping -- recoupling framework for trend components and complex-domain linear projections for seasonal components. Extensive experiments validate that CATS-Linear with fixed hyperparameters achieves accuracy comparable to hyperparameter-tuned baselines and state-of-the-art accuracy relative to fixed-hyperparameter counterparts.
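
For readers unfamiliar with the starting point, a generic trend-seasonal decomposition (moving-average trend plus periodic means) is sketched below; the paper's decoupling -- linear mapping -- recoupling framework and complex-domain projections for the two components are not reproduced here.

```python
import numpy as np

# Generic trend-seasonal decomposition: moving-average trend, then the
# seasonal pattern as the mean of each phase across all cycles.
def decompose(series, period):
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")  # moving-average trend
    detrended = series - trend
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    reps = len(series) // period + 1
    return trend, np.tile(seasonal, reps)[: len(series)]
```

Each component can then be fed to its own linear head, which is the general shape of architecture the abstract describes.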

Updated: 2025-10-09 13:51:48

标题: CATS-Linear:用于时间序列预测的分类辅助线性模型

摘要: 最近的研究表明,线性模型在预测性能方面与复杂架构相媲美,然而增强线性模型的方法仍未得到充分探索。受到不同时间序列实例可能遵循异质线性映射的假设的启发,我们提出了分类辅助趋势季节解耦线性模型CATS-Linear,采用分类辅助通道独立(CACI)。CACI通过分类动态路由实例到专用预测器,实现监督通道设计。我们进一步分析了不同通道设置的理论风险。此外,我们通过为趋势分量添加解耦-线性映射-重新耦合框架和为季节分量添加复杂域线性投影来重新设计趋势季节分解架构。大量实验证实,具有固定超参数的CATS-Linear实现了与调整超参数基线相媲美的最先进准确性,同时在与具有固定超参数的对照组进行比较时实现了最先进准确性。

更新时间: 2025-10-09 13:51:48

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.08661v1

Reinforcement Learning from Probabilistic Forecasts for Safe Decision-Making via Conditional Value-at-Risk Planning

Sequential decisions in volatile, high-stakes settings require more than maximizing expected return; they require principled uncertainty management. This paper presents the Uncertainty-Aware Markov Decision Process (UAMDP), a unified framework that couples Bayesian forecasting, posterior-sampling reinforcement learning, and planning under a conditional value-at-risk (CVaR) constraint. In a closed loop, the agent updates its beliefs over latent dynamics, samples plausible futures via Thompson sampling, and optimizes policies subject to preset risk tolerances. We establish regret bounds that converge to the Bayes-optimal benchmark under standard regularity conditions. We evaluate UAMDP in two domains-high-frequency equity trading and retail inventory control-both marked by structural uncertainty and economic volatility. Relative to strong deep learning baselines, UAMDP improves long-horizon forecasting accuracy (RMSE decreases by up to 25\% and sMAPE by 32\%), and these gains translate into economic performance: the trading Sharpe ratio rises from 1.54 to 1.74 while maximum drawdown is roughly halved. These results show that integrating calibrated probabilistic modeling, exploration aligned with posterior uncertainty, and risk-aware control yields a robust, generalizable approach to safer and more profitable sequential decision-making.
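
For reference, the CVaR quantity the planner constrains can be computed from sampled returns as the expected loss in the worst (1 - alpha) tail (a generic definition, not the paper's full UAMDP planner):

```python
import numpy as np

# CVaR of losses (losses = -returns): mean loss beyond the alpha-quantile.
def cvar(returns, alpha=0.95):
    losses = -np.asarray(returns, dtype=float)
    var = np.quantile(losses, alpha)        # value-at-risk threshold
    return losses[losses >= var].mean()     # mean of the tail beyond VaR
```

Constraining this tail expectation, rather than only the mean return, is what gives the framework its preset risk tolerance.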

Updated: 2025-10-09 13:46:32

标题: 从概率预测中进行强化学习,通过条件风险值规划实现安全决策

摘要: 在波动性很大、风险高的环境中做出连续决策并不仅仅需要最大化预期回报;还需要有原则性的不确定性管理。本文介绍了不确定性感知马尔可夫决策过程(Uncertainty-Aware Markov Decision Process,UAMDP),这是一个统一的框架,将贝叶斯预测、后验抽样强化学习和基于条件风险值(CVaR)的规划结合在一起。在一个闭环中,代理更新其对潜在动态的信念,通过汤普森抽样抽取可能的未来,并优化根据预设风险容忍度的策略。我们建立了一种收敛于贝叶斯最优基准的遗憾界限,这在标准的正则条件下成立。我们在两个领域中评估了UAMDP-高频股票交易和零售库存控制-这两个领域都具有结构不确定性和经济波动性。与强大的深度学习基准相比,UAMDP提高了长期预测准确性(均方根误差减少了高达25\%,sMAPE减少了32\%),这些收益转化为经济表现:交易夏普比率从1.54上升到1.74,最大回撤大约减少了一半。这些结果表明,将校准的概率建模、与后验不确定性一致的探索和风险感知控制相结合,可以产生一种稳健、可推广的方法,用于更安全、更有利可图的连续决策制定。

更新时间: 2025-10-09 13:46:32

领域: cs.LG

下载: http://arxiv.org/abs/2510.08226v1

SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the \textit{Reasoning Tax}. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance $70.13$ and $78.97$ on safety and helpfulness across six benchmarks, surpassing both same-scale and $>10\times$ larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at https://github.com/HarveyYi/SaFeR-VLM.

Updated: 2025-10-09 13:46:31

标题: SaFeR-VLM:面向多模态模型中安全感知的细粒度推理

摘要: 多模态大型推理模型(MLRMs)展示了令人印象深刻的跨模态推理能力,但在对抗性或不安全提示下往往会增加安全风险,这种现象被我们称为\textit{推理税}。现有的防御主要在输出级别上起作用,而不约束推理过程,使模型暴露于隐含风险之中。在本文中,我们提出了SaFeR-VLM,这是一个将安全性直接融入多模态推理的安全对齐强化学习框架。该框架整合了四个组件:(I)QI-Safe-10K,一个强调安全关键和对推理敏感案例的策划数据集;(II)安全感知展开,其中不安全的生成经历反思和纠正而不是被丢弃;(III)结构化奖励建模,采用多维加权标准和对幻觉和矛盾的明确惩罚;以及(IV)GRPO优化,同时强化安全和纠正的轨迹。这一统一设计将安全性从被动防护转变为推理的主动驱动力,实现了可扩展和可普适的安全感知推理。SaFeR-VLM进一步表现出对显式和隐式风险的稳健性,支持超出表面级别过滤的动态和可解释的安全决策。SaFeR-VLM-3B在六个基准测试中分别实现了安全性和帮助性的平均性能为70.13和78.97,超过了同等规模和大于10倍的模型,如Skywork-R1V3-38B、Qwen2.5VL-72B和GLM4.5V-106B。值得注意的是,SaFeR-VLM-7B通过其增加的规模,使得在安全性指标上超过了GPT-5-mini和Gemini-2.5-Flash分别达到了6.47和16.76分,实现了这一改进而没有降低帮助性能。我们的代码可在https://github.com/HarveyYi/SaFeR-VLM找到。

更新时间: 2025-10-09 13:46:31

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2510.06871v2

TracE2E: Easily Deployable Middleware for Decentralized Data Traceability

This paper presents TracE2E, a middleware written in Rust, that can provide both data explainability and compliance across multiple nodes. By mediating inputs and outputs of processes, TracE2E records provenance information and enforces data-protection policies (e.g., confidentiality, integrity) that depend on the recorded provenance. Unlike existing approaches that necessitate substantial application modifications, TracE2E is designed for easy integration into existing and future applications through a wrapper of the Rust standard library's IO module. We describe how TracE2E consistently records provenance information across nodes, and we demonstrate how the compliance layer of TracE2E can accommodate the enforcement of multiple policies.

Updated: 2025-10-09 13:46:14

标题: TracE2E:易于部署的去中心化数据可追溯性中间件

摘要: 本文介绍了一种用Rust编写的中间件TracE2E,可以在多个节点之间提供数据可解释性和合规性。通过调解进程的输入和输出,TracE2E记录来源信息并强制执行依赖于记录来源的数据保护政策(例如机密性、完整性)。与现有方法需要大量应用程序修改不同,TracE2E设计用于通过Rust标准库的IO模块包装器轻松集成到现有和未来应用程序中。我们描述了TracE2E如何始终在节点之间记录来源信息,并展示了TracE2E的合规性层如何适应多个政策的执行。

更新时间: 2025-10-09 13:46:14

领域: cs.CR

下载: http://arxiv.org/abs/2510.08225v1

Investigating Counterclaims in Causality Extraction from Text

Research on causality extraction from text has so far almost entirely neglected counterclaims. Existing causality extraction datasets focus solely on "procausal" claims, i.e., statements that support a relationship. "Concausal" claims, i.e., statements that refute a relationship, are entirely ignored or even accidentally annotated as procausal. We address this shortcoming by developing a new dataset that integrates concausality. Based on an extensive literature review, we first show that concausality is an integral part of causal reasoning on incomplete knowledge. We operationalize this theory in the form of a rigorous guideline for annotation and then augment the Causal News Corpus with concausal statements, obtaining a substantial inter-annotator agreement of Cohen's $\kappa=0.74$. To demonstrate the importance of integrating concausal statements, we show that models trained without concausal relationships tend to misclassify these as procausal instead. Based on our new dataset, this mistake can be mitigated, enabling transformers to effectively distinguish pro- and concausality.
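
The reported agreement statistic can be reproduced from raw annotations; a minimal implementation of Cohen's kappa (observed agreement corrected for chance agreement):

```python
# Cohen's kappa for two annotators labeling the same items, the statistic
# reported for the augmented corpus (kappa = 0.74).
def cohens_kappa(ann_a, ann_b):
    labels = sorted(set(ann_a) | set(ann_b))
    n = len(ann_a)
    p_obs = sum(x == y for x, y in zip(ann_a, ann_b)) / n
    p_exp = sum((ann_a.count(l) / n) * (ann_b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)
```

A kappa of 0.74 sits in the range usually read as substantial agreement, which supports the reliability of the new procausal/concausal annotations.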

Updated: 2025-10-09 13:45:54

标题: 考察文本因果关系抽取中的反驳性主张

摘要: 目前,从文本中提取因果关系的研究几乎完全忽视了反驳性主张。现有的因果关系提取数据集仅关注“支持性”主张,即支持关系的陈述。而“反对性”主张,即否定关系的陈述,完全被忽视,甚至被错误地注释为支持性。我们通过开发一个集成反因果性的新数据集来解决这个缺陷。基于广泛的文献综述,我们首先展示了反因果性是在不完全知识下因果推理的一个组成部分。我们将这一理论规范化为一项严格的注释指南,然后利用反因果性陈述增强因果新闻语料库,获得了显著的Cohen's $\kappa=0.74$的一致性。为了展示集成反因果性陈述的重要性,我们表明没有集成反因果关系训练的模型往往会错误地将其分类为支持性而不是反对性。根据我们的新数据集,这个错误可以得到缓解,使得转换器能够有效区分支持性和反因果性。

更新时间: 2025-10-09 13:45:54

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2510.08224v1

Selection, Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens

Due to their inherent complexity, reasoning tasks have long been regarded as rigorous benchmarks for assessing the capabilities of machine learning models, especially large language models (LLMs). Although humans can solve these tasks with ease, existing models, even after extensive pre-training and post-training at scale, still fail to perform reasoning reliably. In this paper, we revisit reasoning tasks from a causal perspective, seeking to understand their behavior in latent space and to offer insights for addressing their challenges. Specifically, we cast reasoning tasks as a selection mechanism, in which high-level logical concepts function as selection operators on the given observations, such as, identifying the correct answer in a math problem or filling the appropriate entry in Sudoku. We emphasize two key properties of this formulation that shed light on the difficulty of reasoning tasks. First, the latent space exceeds the observation space in complexity, even when the correct answer is fully determined by the observed input. Second, the latent variables, corresponding to logical thought, are densely structured and exhibit strong dependencies. Building on this formulation, we introduce a framework, called SR$^2$, that incorporates the estimated latent variables as feedback into the selection mechanism, thereby facilitating the learning of dense dependencies among latent representations. The framework consists of three key modules: reflective representation learning, dependency self-refinement, and periodic intermediate alignment. Experimentally, we show that our approach yields significant gains in reasoning accuracy, for example, attaining over 10$\%$ improvement in performance with 8$\times$ fewer parameters on the Sudoku and Maze tasks over the recent advances.

Updated: 2025-10-09 13:45:31

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08222v1

Post-hoc Stochastic Concept Bottleneck Models

Concept Bottleneck Models (CBMs) are interpretable models that predict the target variable through high-level human-understandable concepts, allowing users to intervene on mispredicted concepts to adjust the final output. While recent work has shown that modeling dependencies between concepts can improve CBM performance, especially under interventions, such approaches typically require retraining the entire model, which may be infeasible when access to the original data or compute is limited. In this paper, we introduce Post-hoc Stochastic Concept Bottleneck Models (PSCBMs), a lightweight method that augments any pre-trained CBM with a multivariate normal distribution over concepts by adding only a small covariance-prediction module, without retraining the backbone model. We propose two training strategies and show on real-world data that PSCBMs consistently match or improve both concept and target accuracy over standard CBMs at test time. Furthermore, we show that due to the modeling of concept dependencies, PSCBMs perform much better than CBMs under interventions, while remaining far more efficient than retraining a similar stochastic model from scratch.

Updated: 2025-10-09 13:42:54

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08219v1
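
Since a PSCBM places a multivariate normal over concepts, an intervention on some concepts can propagate to the rest through the standard conditional-Gaussian update. A minimal sketch of that update, with made-up numbers and not the paper's implementation:

```python
import numpy as np

def condition_gaussian(mu, cov, idx, values):
    """Condition N(mu, cov) on the concepts in idx taking the given values.
    Returns the remaining indices and their conditional mean,
    mu_a + Cov_ab Cov_bb^{-1} (v - mu_b)."""
    idx = np.asarray(idx)
    rest = np.setdiff1d(np.arange(len(mu)), idx)
    cov_ab = cov[np.ix_(rest, idx)]
    cov_bb = cov[np.ix_(idx, idx)]
    delta = np.linalg.solve(cov_bb, np.asarray(values) - mu[idx])
    return rest, mu[rest] + cov_ab @ delta

# Toy example: three correlated concepts; intervene on concept 0.
mu = np.array([0.2, 0.5, 0.8])
cov = np.array([[1.0, 0.6, 0.0],
                [0.6, 1.0, 0.3],
                [0.0, 0.3, 1.0]])
rest, cond_mu = condition_gaussian(mu, cov, [0], [1.0])
# Concept 1 moves with the intervention (covariance 0.6 with concept 0);
# concept 2 keeps its mean (covariance 0.0 with concept 0).
```

This is why modeling concept dependencies helps under interventions: correcting one concept also shifts the estimates of its correlated neighbors.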

Expressive Value Learning for Scalable Offline Reinforcement Learning

Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we consider the question of how to develop a scalable offline RL approach without relying on distillation or backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. EVOR learns an optimal, regularized Q-function via flow matching during training. At inference-time, EVOR performs inference-time policy extraction via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.

Updated: 2025-10-09 13:42:20

Categories: cs.LG,cs.AI,I.2.6

Download: http://arxiv.org/abs/2510.08218v1
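
Inference-time policy extraction via rejection sampling against a value function can be sketched in miniature (a toy with a uniform base policy, a made-up value function Q(a) = a, and an assumed upper bound on Q; not EVOR's actual code):

```python
import math
import random

def extract_action(sample_action, q_value, tau=1.0, q_max=1.0, max_tries=1000):
    """Draw an action from a distribution proportional to pi(a) * exp(Q(a)/tau)
    by rejection sampling: propose from the base policy pi, accept with
    probability exp((Q(a) - q_max) / tau), where q_max bounds Q from above."""
    a = sample_action()
    for _ in range(max_tries):
        accept_p = math.exp((q_value(a) - q_max) / tau)  # in (0, 1]
        if random.random() < accept_p:
            return a
        a = sample_action()
    return a  # fall back to the last proposal

random.seed(0)
# Toy base policy: uniform on [0, 1]; toy value function: Q(a) = a.
actions = [extract_action(random.random, lambda a: a) for _ in range(2000)]
mean_action = sum(actions) / len(actions)
# Reweighting by exp(Q) skews accepted actions toward high-value regions,
# so the mean rises from 0.5 toward ~0.58.
```

No retraining is involved: sharpening the policy only requires more samples or a lower temperature at inference time.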

FuelCast: Benchmarking Tabular and Temporal Models for Ship Fuel Consumption

In the shipping industry, fuel consumption and emissions are critical factors due to their significant impact on economic efficiency and environmental sustainability. Accurate prediction of ship fuel consumption is essential for further optimization of maritime operations. However, heterogeneous methodologies and limited high-quality datasets hinder direct comparison of modeling approaches. This paper makes three key contributions: (1) we introduce and release a new dataset (https://huggingface.co/datasets/krohnedigital/FuelCast) comprising operational and environmental data from three ships; (2) we define a standardized benchmark covering tabular regression and time-series regression; and (3) we investigate the application of in-context learning for ship consumption modeling using the TabPFN foundation model - a first in this domain to our knowledge. Our results demonstrate strong performance across all evaluated models, supporting the feasibility of onboard, data-driven fuel prediction. Models incorporating environmental conditions consistently outperform simple polynomial baselines relying solely on vessel speed. TabPFN slightly outperforms other techniques, highlighting the potential of foundation models with in-context learning capabilities for tabular prediction. Furthermore, including temporal context improves accuracy.

Updated: 2025-10-09 13:38:46

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08217v1

Disparate Conditional Prediction in Multiclass Classifiers

We propose methods for auditing multiclass classifiers for fairness under multiclass equalized odds, by estimating the deviation from equalized odds when the classifier is not completely fair. We generalize to multiclass classifiers the measure of Disparate Conditional Prediction (DCP), originally suggested by Sabato & Yom-Tov (2020) for binary classifiers. DCP is defined as the fraction of the population for which the classifier predicts with conditional prediction probabilities that differ from the closest common baseline. We provide new local-optimization methods for estimating the multiclass DCP under two different regimes, one in which the conditional confusion matrices for each protected sub-population are known, and one in which these cannot be estimated, for instance, because the classifier is inaccessible or because good-quality individual-level data is not available. These methods can be used to detect classifiers that likely treat a significant fraction of the population unfairly. Experiments demonstrate the accuracy of the methods. Code is provided at https://github.com/sivansabato/DCPmulticlass.

Updated: 2025-10-09 13:38:36

Categories: cs.LG,cs.CY,stat.ML

Download: http://arxiv.org/abs/2206.03234v4

LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned to exhibit harmful behaviors, which is called emergent misalignment. In this work, we investigate whether this phenomenon can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-sourced LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs show broadly misaligned behavior in dishonesty. Additionally, we further explore this phenomenon in a downstream combined finetuning setting, and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior over 20%. Furthermore, we consider a more practical human-AI interaction environment where we simulate both benign and biased users to interact with the assistant LLM. Notably, we find that the assistant can be misaligned unintentionally to exacerbate its dishonesty with only 10% biased user population. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning, but also in downstream mixture tasks and practical human-AI interactions.

Updated: 2025-10-09 13:35:19

Categories: cs.CL,cs.AI,cs.CR

Download: http://arxiv.org/abs/2510.08211v1

DODO: Causal Structure Learning with Budgeted Interventions

Artificial Intelligence has achieved remarkable advancements in recent years, yet much of its progress relies on identifying increasingly complex correlations. Enabling causality awareness in AI has the potential to enhance its performance by enabling a deeper understanding of the underlying mechanisms of the environment. In this paper, we introduce DODO, an algorithm defining how an Agent can autonomously learn the causal structure of its environment through repeated interventions. We assume a scenario where an Agent interacts with a world governed by a causal Directed Acyclic Graph (DAG), which dictates the system's dynamics but remains hidden from the Agent. The Agent's task is to accurately infer the causal DAG, even in the presence of noise. To achieve this, the Agent performs interventions, leveraging causal inference techniques to analyze the statistical significance of observed changes. Results show that DODO outperforms observational approaches in all but the most resource-limited conditions. DODO is often able to reconstruct the structure of the causal graph with zero errors. In the most challenging configuration, DODO outperforms the best baseline by +0.25 F1 points.

Updated: 2025-10-09 13:32:33

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08207v1
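
The intervention loop described above can be illustrated on a toy world whose hidden DAG is X → Y, with Z independent of X (a hypothetical stand-in; DODO's actual statistics and intervention budgeting are more involved): intervene on X, then flag as children the variables whose distribution shifts significantly.

```python
import math
import random
import statistics

def sample_world(do_x=None):
    """Toy world with hidden DAG X -> Y; Z is independent of X."""
    x = do_x if do_x is not None else random.gauss(0, 1)
    y = 2.0 * x + random.gauss(0, 0.5)
    z = random.gauss(0, 1)
    return {"X": x, "Y": y, "Z": z}

def children_of_x(n=500, shift=3.0, z_thresh=4.0):
    """Compare observational samples to samples under do(X = shift) and
    report variables whose mean shifts significantly (crude z-test)."""
    base = [sample_world() for _ in range(n)]
    intv = [sample_world(do_x=shift) for _ in range(n)]
    found = []
    for var in ("Y", "Z"):
        b = [s[var] for s in base]
        i = [s[var] for s in intv]
        se = math.sqrt(statistics.variance(b) / n + statistics.variance(i) / n)
        z_stat = abs(statistics.mean(i) - statistics.mean(b)) / se
        if z_stat > z_thresh:
            found.append(var)
    return found

random.seed(1)
result = children_of_x()  # only Y responds to do(X)
```

In this toy setup Y's mean jumps from about 0 to about 6 under the intervention while Z stays put, so the edge X → Y is recovered from a single budgeted intervention.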

Memory Retrieval and Consolidation in Large Language Models through Function Tokens

The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.

Updated: 2025-10-09 13:31:20

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08203v1

Sentiment Matters: An Analysis of 200 Human-SAV Interactions

Shared Autonomous Vehicles (SAVs) are likely to become an important part of the transportation system, making effective human-SAV interactions an important area of research. This paper introduces a dataset of 200 human-SAV interactions to further this area of study. We present an open-source human-SAV conversational dataset, comprising both textual data (e.g., 2,136 human-SAV exchanges) and empirical data (e.g., post-interaction survey results on a range of psychological factors). The dataset's utility is demonstrated through two benchmark case studies: First, using random forest modeling and chord diagrams, we identify key predictors of SAV acceptance and perceived service quality, highlighting the critical influence of response sentiment polarity (i.e., perceived positivity). Second, we benchmark the performance of an LLM-based sentiment analysis tool against the traditional lexicon-based TextBlob method. Results indicate that even simple zero-shot LLM prompts more closely align with user-reported sentiment, though limitations remain. This study provides novel insights for designing conversational SAV interfaces and establishes a foundation for further exploration into advanced sentiment modeling, adaptive user interactions, and multimodal conversational systems.

Updated: 2025-10-09 13:30:23

Categories: cs.HC,cs.AI,cs.CL,cs.ET

Download: http://arxiv.org/abs/2510.08202v1
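
The lexicon-based baseline mentioned above can be sketched in miniature (a hypothetical word list standing in for TextBlob's much larger lexicon, not its actual implementation): score each known word from a polarity table and average.

```python
def polarity(text, lexicon):
    """Average the polarity of every lexicon word in the text;
    returns 0.0 when no lexicon word occurs."""
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical mini-lexicon with made-up weights.
LEXICON = {"good": 0.7, "great": 0.8, "helpful": 0.6,
           "bad": -0.7, "slow": -0.4, "confusing": -0.6}

pos = polarity("the ride was great and the driver helpful", LEXICON)
neg = polarity("the interface was confusing and slow", LEXICON)
```

Lexicon methods miss negation and context ("not great" still scores positively here), which hints at why the study's zero-shot LLM scores align more closely with user-reported sentiment.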

PFAttack: Stealthy Attack Bypassing Group Fairness in Federated Learning

Federated learning (FL), integrating group fairness mechanisms, allows multiple clients to collaboratively train a global model that makes unbiased decisions for different populations grouped by sensitive attributes (e.g., gender and race). Due to its distributed nature, previous studies have demonstrated that FL systems are vulnerable to model poisoning attacks. However, these studies primarily focus on perturbing accuracy, leaving a critical question unexplored: Can an attacker bypass the group fairness mechanisms in FL and manipulate the global model to be biased? The motivations for such an attack vary: an attacker might seek the higher accuracy that fairness mechanisms typically trade away, or aim to cause ethical disruption. To address this question, we design a novel form of attack in FL, termed Profit-driven Fairness Attack (PFAttack), which aims not to degrade global model accuracy but to bypass fairness mechanisms. Our fundamental insight is that group fairness seeks to weaken the dependence of outputs on input attributes related to sensitive information. In the proposed PFAttack, an attacker can recover this dependence through local fine-tuning across various sensitive groups, thereby creating a biased yet accuracy-preserving malicious model and injecting it into FL through model replacement. Compared to attacks targeting accuracy, PFAttack is more stealthy. The malicious model in PFAttack exhibits subtle parameter variations relative to the original global model, making it robust against detection and filtering by Byzantine-resilient aggregations. Extensive experiments on benchmark datasets are conducted for four fair FL frameworks and three Byzantine-resilient aggregations against model poisoning, demonstrating the effectiveness and stealth of PFAttack in bypassing group fairness mechanisms in FL.

Updated: 2025-10-09 13:29:43

Categories: cs.LG

Download: http://arxiv.org/abs/2410.06509v2

VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones

Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering to identify high-quality sequences to stabilize pre-training and mitigate the modality gap; (2) colorized multivariate conversion, encoding multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; (3) multi-quantile forecasting, using parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in GIFT-Eval benchmark which comprises 23 datasets across 7 domains. Our work demonstrates that with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.

Updated: 2025-10-09 13:27:04

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.04379v2
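
Quantile forecasts of the kind produced by parallel reconstruction heads are typically trained with the pinball (quantile) loss; a generic sketch of that standard loss, not code from the paper:

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss for quantile q: under-prediction costs q per unit,
    over-prediction costs (1 - q) per unit."""
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1) * diff

# The value minimizing total pinball loss is the empirical q-quantile:
data = [1.0, 2.0, 3.0, 4.0, 100.0]
best = min(data, key=lambda c: sum(pinball_loss(y, c, 0.5) for y in data))
# best == 3.0, the median, unbothered by the outlier 100.0
```

Training one such head per quantile yields a distribution-free predictive interval, which is what "without parametric assumptions" refers to.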

The Tournament Tree Method for preference elicitation in Multi-criteria decision-making

Pairwise comparison methods, such as Fuzzy Preference Relations and Saaty's Multiplicative Preference Relations, are widely used to model expert judgments in multi-criteria decision-making. However, their application is limited by the high cognitive load required to complete $m(m-1)/2$ comparisons, the risk of inconsistency, and the computational complexity of deriving consistent value scales. This paper proposes the Tournament Tree Method (TTM), a novel elicitation and evaluation framework that overcomes these limitations. The TTM requires only $m-1$ pairwise comparisons to obtain a complete, reciprocal, and consistent comparison matrix. The method consists of three phases: (i) elicitation of expert judgments using a reduced set of targeted comparisons, (ii) construction of the consistent pairwise comparison matrix, and (iii) derivation of a global value scale from the resulting matrix. The proposed approach ensures consistency by design, minimizes cognitive effort, and reduces the dimensionality of preference modeling from $m(m-1)/2$ to $m$ parameters. Furthermore, it is compatible with the classical Deck of Cards method, and thus it can handle interval and ratio scales. We have also developed a web-based tool that demonstrates its practical applicability in real decision-making scenarios.

Updated: 2025-10-09 13:24:32

Categories: cs.AI,cs.IT,math.IT

Download: http://arxiv.org/abs/2510.08197v1
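
The abstract's central claim, that $m-1$ judgments suffice for a complete and consistent matrix, can be sketched as follows (one possible reading of the construction; the TTM's actual elicitation order follows a tournament tree): elicit ratio judgments along a chain, turn them into a value scale, and fill the matrix with value ratios, which is consistent by construction.

```python
import numpy as np

def consistent_matrix(chain_ratios):
    """Build an m x m multiplicative pairwise-comparison matrix from the
    m-1 chain judgments r[i] = 'how many times item i+1 is preferred to
    item i'. With v[0] = 1 and v[i+1] = v[i] * r[i], the matrix
    A[i][j] = v[i] / v[j] is reciprocal and fully consistent."""
    v = np.cumprod([1.0] + list(chain_ratios))
    return np.outer(v, 1.0 / v), v

A, v = consistent_matrix([2.0, 3.0])  # m = 3 items, only 2 judgments
# Consistency holds by construction: A[i][k] == A[i][j] * A[j][k].
```

The value scale `v` is exactly the global scale derived in phase (iii), and the dimensionality drops from $m(m-1)/2$ elicited entries to $m-1$.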

Generalized Orders of Magnitude for Scalable, Parallel, High-Dynamic-Range Computation

Many domains, from deep learning to finance, require compounding real numbers over long sequences, often leading to catastrophic numerical underflow or overflow. We introduce generalized orders of magnitude (GOOMs), a principled extension of traditional orders of magnitude that incorporates floating-point numbers as a special case, and which in practice enables stable computation over significantly larger dynamic ranges of real numbers than previously possible. We implement GOOMs, along with an efficient custom parallel prefix scan, to support native execution on parallel hardware such as GPUs. We demonstrate that our implementation of GOOMs outperforms traditional approaches with three representative experiments, all of which were previously considered impractical or impossible, and now become possible and practical: (1) compounding real matrix products far beyond standard floating-point limits; (2) estimating spectra of Lyapunov exponents in parallel, orders of magnitude faster than with previous methods, applying a novel selective-resetting method to prevent state colinearity; and (3) capturing long-range dependencies in deep recurrent neural networks with non-diagonal recurrent states, computed in parallel via a prefix scan, without requiring any form of stabilization. Our results show that our implementation of GOOMs, combined with efficient parallel scanning, offers a scalable and numerically robust alternative to conventional floating-point numbers for high-dynamic-range applications.

Updated: 2025-10-09 13:23:43

Categories: cs.LG,cs.AI,cs.NA,math.NA

Download: http://arxiv.org/abs/2510.03426v2
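
The core trick, computing with magnitudes in log space so that long products become sums, can be illustrated with a toy (sign, log-magnitude) scalar type (a drastic simplification of GOOMs, which generalize further and run as parallel scans on GPUs):

```python
import math

class LogReal:
    """A real number stored as (sign, log|x|). Multiplication adds
    log-magnitudes, so compounding long sequences cannot overflow."""
    def __init__(self, sign, logmag):
        self.sign, self.logmag = sign, logmag

    @classmethod
    def from_float(cls, x):
        if x == 0.0:
            return cls(0, float("-inf"))
        return cls(1 if x > 0 else -1, math.log(abs(x)))

    def __mul__(self, other):
        return LogReal(self.sign * other.sign, self.logmag + other.logmag)

# Compound 10,000 factors of 1e100: the result is ~1e1000000, hopelessly
# beyond float64's ~1.8e308 overflow threshold, yet representable here.
prod = LogReal.from_float(1.0)
for _ in range(10_000):
    prod = prod * LogReal.from_float(1e100)
log10_magnitude = prod.logmag / math.log(10)  # ~= 1_000_000
```

Because products reduce to associative sums of log-magnitudes, they also compose cleanly with a parallel prefix scan, which is what makes the approach hardware-friendly.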

Measuring What Matters: The AI Pluralism Index

Artificial intelligence systems increasingly mediate knowledge, communication, and decision making. Development and governance remain concentrated within a small set of firms and states, raising concerns that technologies may encode narrow interests and limit public agency. Capability benchmarks for language, vision, and coding are common, yet public, auditable measures of pluralistic governance are rare. We define AI pluralism as the degree to which affected stakeholders can shape objectives, data practices, safeguards, and deployment. We present the AI Pluralism Index (AIPI), a transparent, evidence-based instrument that evaluates producers and system families across four pillars: participatory governance, inclusivity and diversity, transparency, and accountability. AIPI codes verifiable practices from public artifacts and independent evaluations, explicitly handling "Unknown" evidence to report both lower-bound ("evidence") and known-only scores with coverage. We formalize the measurement model; implement a reproducible pipeline that integrates structured web and repository analysis, external assessments, and expert interviews; and assess reliability with inter-rater agreement, coverage reporting, cross-index correlations, and sensitivity analysis. The protocol, codebook, scoring scripts, and evidence graph are maintained openly with versioned releases and a public adjudication process. We report pilot provider results and situate AIPI relative to adjacent transparency, safety, and governance frameworks. The index aims to steer incentives toward pluralistic practice and to equip policymakers, procurers, and the public with comparable evidence.

Updated: 2025-10-09 13:19:34

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08193v1

Position Paper: Towards Open Complex Human-AI Agents Collaboration Systems for Problem Solving and Knowledge Management

We propose a technology-agnostic, collaboration-ready stance for Human-AI Agents Collaboration Systems (HAACS) that closes long-standing gaps in prior stages (automation; flexible autonomy; agentic multi-agent collectives). Reading empirical patterns through a seven-dimension collaboration spine and human-agent contrasts, we identify missing pieces: principled budgeting of initiative, instantaneous and auditable reconfiguration, a system-wide knowledge backbone with an epistemic promotion gate, capacity-aware human interfaces; and, as a prerequisite to all of the above, unified definitions of agent and formal collaborative dynamics. We respond with (i) a boundary-centric ontology of agenthood synthesized with cybernetics; (ii) a Petri net family (colored and interpreted) that models ownership, cross-boundary interaction, concurrency, guards, and rates with collaboration transitions; and (iii) a three-level orchestration (meta, agent, execution) that governs behavior families via guard flips. On the knowledge side, we ground collaborative learning in Conversation Theory and SECI with teach-back gates and an evolving backbone; on the problem-solving side, we coordinate routine MEA-style control with practice-guided open-ended discovery. The result is the Hierarchical Exploration-Exploitation Net (HE2-Net): a policy-controlled stance that splits provisional from validated assets, promotes only after tests and peer checks, and budgets concurrent probing while keeping reuse fast and safe. We show interoperability with emerging agent protocols without ad hoc glue and sketch bio-cybernetic extensions (autopoiesis, autogenesis, evolving boundaries, synergetics, etc). Altogether, the framework keeps humans central to setting aims, justifying knowledge, and steering theory-practice dynamics, while scaling agents as reliable collaborators within audited governance.

Updated: 2025-10-09 13:19:01

Categories: cs.AI,cs.HC,cs.MA

Download: http://arxiv.org/abs/2505.00018v2

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.

Updated: 2025-10-09 13:16:22

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08189v1

How Scale Breaks "Normalized Stress" and KL Divergence: Rethinking Quality Metrics

Complex, high-dimensional data is ubiquitous across many scientific disciplines, including machine learning, biology, and the social sciences. One of the primary methods of visualizing these datasets is with two-dimensional scatter plots that visually capture some properties of the data. Because visually determining the accuracy of these plots is challenging, researchers often use quality metrics to measure the projection's accuracy and faithfulness to the original data. One of the most commonly employed metrics, normalized stress, is sensitive to uniform scaling (stretching, shrinking) of the projection, despite this act not meaningfully changing anything about the projection. Another quality metric, the Kullback--Leibler (KL) divergence used in the popular t-Distributed Stochastic Neighbor Embedding (t-SNE) technique, is also susceptible to this scale sensitivity. We investigate the effect of scaling on stress and KL divergence analytically and empirically by showing just how much the values change and how this affects dimension reduction technique evaluations. We introduce a simple technique to make both metrics scale-invariant and show that it accurately captures expected behavior on a small benchmark.
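
The scale sensitivity is easy to reproduce, and so is the standard fix of evaluating stress at the best uniform scale (the closed-form minimizer of a least-squares fit). A minimal sketch, assuming `D_high` and `D_low` are flattened pairwise-distance arrays; whether this matches the authors' exact scale-invariant formulation is an assumption:

```python
import numpy as np

def normalized_stress(D_high, D_low):
    """Normalized stress between original and projected distances."""
    return np.sum((D_high - D_low) ** 2) / np.sum(D_high ** 2)

def scale_invariant_stress(D_high, D_low):
    """Evaluate stress at the optimal uniform scale of the projection:
    minimizing sum((d - a*p)^2) over a gives a* = <d, p> / <p, p>,
    which factors uniform scaling out entirely."""
    alpha = np.dot(D_high, D_low) / np.dot(D_low, D_low)
    return normalized_stress(D_high, alpha * D_low)

d = np.linspace(1.0, 2.0, 45)               # toy distance vector
plain = normalized_stress(d, 2.0 * d)       # = 1.0: same layout, "worse" score
fixed = scale_invariant_stress(d, 2.0 * d)  # ~ 0.0: scaling no longer matters
```

Stretching a projection by 2x leaves every neighborhood relation intact, yet plain normalized stress jumps from 0 to 1; the scale-optimized variant stays at 0, which is the behavior a quality metric should have.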

Updated: 2025-10-09 13:11:31

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2510.08660v1

Rethinking Losses for Diffusion Bridge Samplers

Diffusion bridges are a promising class of deep-learning methods for sampling from unnormalized distributions. Recent works show that the Log Variance (LV) loss consistently outperforms the reverse Kullback-Leibler (rKL) loss when using the reparametrization trick to compute rKL-gradients. While the on-policy LV loss yields identical gradients to the rKL loss when combined with the log-derivative trick for diffusion samplers with non-learnable forward processes, this equivalence does not hold for diffusion bridges or when diffusion coefficients are learned. Based on this insight we argue that for diffusion bridges the LV loss does not represent an optimization objective that can be motivated like the rKL loss via the data processing inequality. Our analysis shows that employing the rKL loss with the log-derivative trick (rKL-LD) does not only avoid these conceptual problems but also consistently outperforms the LV loss. Experimental results with different types of diffusion bridges on challenging benchmarks show that samplers trained with the rKL-LD loss achieve better performance. From a practical perspective we find that rKL-LD requires significantly less hyperparameter optimization and yields more stable training behavior.
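
In the usual notation (ours, not necessarily the paper's), with unnormalized target $p$ and sampler distribution $q_\theta$, the two losses compared above are

\[
\mathcal{L}_{\mathrm{rKL}}(\theta)
  = \mathbb{E}_{x \sim q_\theta}\!\left[\log q_\theta(x) - \log p(x)\right],
\qquad
\mathcal{L}_{\mathrm{LV}}(\theta)
  = \mathrm{Var}_{x \sim \bar{q}}\!\left[\log q_\theta(x) - \log p(x)\right],
\]

where $\bar{q}$ is the reference distribution the variance is taken under (on-policy: $\bar{q} = q_\theta$ with gradients stopped through the samples). Because $p$ is known only up to a normalizer $Z$, the LV value is unaffected by the constant $\log Z$ while the rKL value merely shifts by it, leaving gradients intact. The paper's point is that once the forward process or diffusion coefficients are learned, the two losses stop being gradient-equivalent, and only rKL retains the data-processing-inequality motivation.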

Updated: 2025-10-09 13:05:39

Categories: cs.LG,cs.AI,stat.ML

Download: http://arxiv.org/abs/2506.10982v2

Dual-granularity Sinkhorn Distillation for Enhanced Learning from Long-tailed Noisy Data

Real-world datasets for deep learning frequently suffer from the co-occurring challenges of class imbalance and label noise, hindering model performance. While methods exist for each issue, effectively combining them is non-trivial, as distinguishing genuine tail samples from noisy data proves difficult, often leading to conflicting optimization strategies. This paper presents a novel perspective: instead of primarily developing new complex techniques from scratch, we explore synergistically leveraging well-established, individually 'weak' auxiliary models - specialized for tackling either class imbalance or label noise but not both. This view is motivated by the insight that class imbalance (a distributional-level concern) and label noise (a sample-level concern) operate at different granularities, suggesting that robustness mechanisms for each can in principle offer complementary strengths without conflict. We propose Dual-granularity Sinkhorn Distillation (D-SINK), a novel framework that enhances dual robustness by distilling and integrating complementary insights from such 'weak', single-purpose auxiliary models. Specifically, D-SINK uses an optimal transport-optimized surrogate label allocation to align the target model's sample-level predictions with a noise-robust auxiliary and its class distributions with an imbalance-robust one. Extensive experiments on benchmark datasets demonstrate that D-SINK significantly improves robustness and achieves strong empirical performance in learning from long-tailed noisy data.
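
The optimal-transport machinery behind the surrogate label allocation is standard entropic OT solved with Sinkhorn iterations. A minimal sketch of that solver (generic Sinkhorn-Knopp, not D-SINK's actual allocation step; `eps` is the entropic regularizer):

```python
import numpy as np

def sinkhorn(cost, r, c, eps=0.1, n_iter=500):
    """Entropic optimal transport: return a plan P >= 0 with row
    marginals r and column marginals c that approximately minimizes
    <P, cost> - eps * H(P), via Sinkhorn-Knopp scaling."""
    K = np.exp(-cost / eps)
    u = np.ones_like(r)
    v = np.ones_like(c)
    for _ in range(n_iter):
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
cost = rng.random((6, 3))       # e.g. sample-to-class disagreement costs
r = np.full(6, 1 / 6)           # one unit of mass per sample
c = np.full(3, 1 / 3)           # target class distribution
P = sinkhorn(cost, r, c)        # soft surrogate label allocation
```

In D-SINK's setting the rows would be batch samples and the columns classes, with `r` uniform (sample-level, noise-robust side) and `c` supplied by the imbalance-robust auxiliary (distribution-level side); that mapping is our reading of the abstract.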

Updated: 2025-10-09 13:05:27

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2510.08179v1

Robust Canonicalization through Bootstrapped Data Re-Alignment

Fine-grained visual classification (FGVC) tasks, such as insect and bird identification, demand sensitivity to subtle visual cues while remaining robust to spatial transformations. A key challenge is handling geometric biases and noise, such as different orientations and scales of objects. Existing remedies rely on heavy data augmentation, which demands powerful models, or on equivariant architectures, which constrain expressivity and add cost. Canonicalization offers an alternative by shielding such biases from the downstream model. In practice, such functions are often obtained using canonicalization priors, which assume aligned training data. Unfortunately, real-world datasets never fulfill this assumption, causing the obtained canonicalizer to be brittle. We propose a bootstrapping algorithm that iteratively re-aligns training samples by progressively reducing variance and recovering the alignment assumption. We establish convergence guarantees under mild conditions for arbitrary compact groups, and show on four FGVC benchmarks that our method consistently outperforms equivariant and canonicalization baselines while performing on par with augmentation.
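
For the rotation group in 2D, the bootstrapped re-alignment loop is easy to sketch: repeatedly estimate the mean shape and Procrustes-align every sample to it, which progressively shrinks the variance the canonicalization prior needs. This is a generic generalized-Procrustes toy under our own assumptions, not the paper's algorithm or its guarantees:

```python
import numpy as np

def best_rotation(X, Y):
    """Orthogonal Procrustes: rotation R minimizing ||X @ R - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    if np.linalg.det(R) < 0:   # force a proper rotation (no reflection)
        U[:, -1] *= -1
        R = U @ Vt
    return R

def bootstrap_realign(shapes, n_rounds=5):
    """Iteratively re-align samples to their running mean shape."""
    shapes = [s.copy() for s in shapes]
    for _ in range(n_rounds):
        mean = np.mean(shapes, axis=0)
        shapes = [s @ best_rotation(s, mean) for s in shapes]
    return shapes

def spread(shapes):
    """Total squared deviation of the samples from their mean."""
    mean = np.mean(shapes, axis=0)
    return sum(float(np.linalg.norm(s - mean) ** 2) for s in shapes)
```

On exactly rotated copies of one template, the first round already collapses the spread to numerical zero; the paper's contribution is showing that this kind of variance-reduction loop provably converges for arbitrary compact groups under realistic noise.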

Updated: 2025-10-09 13:05:20

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.08178v1

Long-tailed Recognition with Model Rebalancing

Long-tailed recognition is ubiquitous and challenging in deep learning and even in the downstream finetuning of foundation models, since the skew class distribution generally prevents the model generalization to the tail classes. Despite the promise of previous methods based on data augmentation, loss rebalancing, decoupled training, and the like, consistent improvement in broad scenarios such as multi-label long-tailed recognition remains difficult. In this study, we examine the impact of model capacity in the long-tailed context, and propose a novel framework, Model Rebalancing (MORE), which mitigates imbalance by directly rebalancing the model's parameter space. Specifically, MORE introduces a low-rank parameter component to mediate the parameter space allocation guided by a tailored loss and sinusoidal reweighting schedule, but without increasing the overall model complexity or inference costs. Extensive experiments on diverse long-tailed benchmarks, spanning multi-class and multi-label tasks, demonstrate that MORE significantly improves generalization, particularly for tail classes, and effectively complements existing imbalance mitigation methods. These results highlight MORE's potential as a robust plug-and-play module in long-tailed settings.
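
A minimal sketch of the parameter-space idea: keep the dense weight, add a low-rank component `B @ A` whose contribution is modulated over training by a sinusoidal schedule. The schedule shape and where the component attaches are our guesses from the abstract, not the published formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 16, 32, 4

W = rng.normal(size=(d_out, d_in))         # dense weight (shared capacity)
B = rng.normal(size=(d_out, rank)) * 0.01  # low-rank factors mediating the
A = rng.normal(size=(rank, d_in)) * 0.01   # parameter-space re-allocation

def effective_weight(step, total_steps):
    """Sinusoidal reweighting of the low-rank part: 0 at the start of
    training, 1 at the end (illustrative schedule)."""
    s = np.sin(0.5 * np.pi * step / total_steps)
    return W + s * (B @ A)

x = rng.normal(size=d_in)
y_start = effective_weight(0, 100) @ x     # identical to the dense model
y_end = effective_weight(100, 100) @ x     # full low-rank contribution

# No inference-time overhead: B @ A folds into W after training, and the
# low-rank factors are cheap relative to the dense matrix:
extra_params = (d_out + d_in) * rank       # 192
dense_params = d_out * d_in                # 512
```

Folding the product back into `W` at deployment is what keeps model complexity and inference cost unchanged, as the abstract claims.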

Updated: 2025-10-09 13:04:38

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08177v1

Leveraging Whisper Embeddings for Audio-based Lyrics Matching

Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves a performance comparable to state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.

Updated: 2025-10-09 13:03:34

Categories: cs.SD,cs.AI,cs.LG,eess.AS

Download: http://arxiv.org/abs/2510.08176v1

TASP: Topology-aware Sequence Parallelism

Long-context large language models (LLMs) face constraints due to the quadratic complexity of the self-attention mechanism. The mainstream sequence parallelism (SP) method, Ring Attention, attempts to solve this by distributing the query into multiple query chunks across accelerators and enabling each Q tensor to access all KV tensors from other accelerators via the Ring AllGather communication primitive. However, it exhibits low communication efficiency, restricting its practical applicability. This inefficiency stems from the mismatch between the Ring AllGather communication primitive it adopts and the AlltoAll topology of modern accelerators. A Ring AllGather primitive is composed of iterations of ring-styled data transfer, which can only utilize a very limited fraction of an AlltoAll topology. Inspired by the Hamiltonian decomposition of complete directed graphs, we identify that modern accelerator topology can be decomposed into multiple orthogonal ring datapaths which can concurrently transfer data without interference. Based on this, we further observe that the Ring AllGather primitive can also be decomposed into the same number of concurrent ring-styled data transfers at every iteration. Based on these insights, we propose TASP, a topology-aware SP method for long-context LLMs that fully utilizes the communication capacity of modern accelerators via topology decomposition and primitive decomposition. Experimental results on both single-node and multi-node NVIDIA H100 systems and a single-node AMD MI300X system demonstrate that TASP achieves higher communication efficiency than Ring Attention on these modern accelerator topologies and achieves up to 3.58x speedup over Ring Attention and its variant Zigzag-Ring Attention. The code is available at https://github.com/infinigence/HamiltonAttention.
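
To see why a single Ring AllGather leaves an AlltoAll fabric mostly idle, it helps to simulate the primitive: each of the n ranks forwards exactly one chunk per step along one fixed ring, so only n of the n(n-1) directed links are active at any step, and completion takes n-1 steps. A small sketch of the baseline primitive (ours, not TASP's implementation):

```python
def ring_allgather(chunks):
    """Simulate Ring AllGather: rank i starts with chunks[i] and, at each
    step, sends the chunk it most recently received (initially its own)
    to its right neighbour. Returns per-rank chunk-id sets and the step
    count."""
    n = len(chunks)
    have = [{i} for i in range(n)]   # chunk ids owned by each rank
    sending = list(range(n))         # chunk id each rank sends next
    steps = 0
    for _ in range(n - 1):
        # every rank i receives from its left neighbour (i - 1) mod n
        received = [sending[(i - 1) % n] for i in range(n)]
        for i, cid in enumerate(received):
            have[i].add(cid)
        sending = received
        steps += 1
    return have, steps

have, steps = ring_allgather(list("abcd"))
```

TASP's observation is that the accelerator topology decomposes into multiple edge-disjoint rings (the Hamiltonian decomposition), so several such transfers can run concurrently with the data split across rings, cutting the volume each ring carries per step.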

Updated: 2025-10-09 13:03:29

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2509.26541v2

Provably Robust Adaptation for Language-Empowered Foundation Models

Language-empowered foundation models (LeFMs), such as CLIP and GraphCLIP, have transformed multimodal learning by aligning visual (or graph) features with textual representations, enabling powerful downstream capabilities like few-shot learning. However, the reliance on small, task-specific support datasets collected in open environments exposes these models to poisoning attacks, where adversaries manipulate the support samples to degrade performance. Existing defenses rely on empirical strategies, which lack formal guarantees and remain vulnerable to unseen and adaptive attacks. Certified robustness offers provable guarantees but has been largely unexplored for few-shot classifiers based on LeFMs. This study seeks to fill these critical gaps by proposing the first provably robust few-shot classifier that is tailored for LeFMs. We term our model Language-empowered Few-shot Certification (LeFCert). It integrates both textual and feature embeddings with an adaptive blending mechanism. To achieve provable robustness, we propose a twofold trimmed mean prototype and derive provable upper and lower bounds for classification scores, enabling certification under worst-case poisoning scenarios. To further enhance the performance, we extend LeFCert with two variants by considering a more realistic and tighter attack budget: LeFCert-L incorporates randomized smoothing to provide Lipschitz continuity and derive robustness under dual budget constraints, and LeFCert-C provides collective certification for scenarios where attackers distribute a shared poisoning budget across multiple samples. Experiments demonstrate that LeFCert achieves state-of-the-art performance, significantly improving both clean and certified accuracy compared to existing baselines. Despite its advanced robustness mechanisms, LeFCert is computationally efficient, making it practical for real-world applications.
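
The robustness intuition behind the trimmed-mean prototype is easy to demonstrate: with k values trimmed from each end, k poisoned support samples can shift the prototype only by a bounded amount, whereas an untrimmed mean can be moved arbitrarily far. A toy 1-D sketch (LeFCert's actual certificate derives analytic upper and lower score bounds; this only shows the mechanism):

```python
import numpy as np

def trimmed_mean(x, k):
    """Average after dropping the k smallest and k largest entries."""
    xs = np.sort(np.asarray(x, dtype=float))
    return xs[k:len(xs) - k].mean()

clean = np.arange(1.0, 21.0)   # 20 clean support embeddings (1-D toy)
poisoned = clean.copy()
poisoned[:2] = 1e9             # adversary replaces 2 support samples

plain_shift = abs(poisoned.mean() - clean.mean())  # ~1e8: unbounded drift
robust_shift = abs(trimmed_mean(poisoned, 2) - trimmed_mean(clean, 2))
```

Here the trimmed prototype moves by 2.0 (the extreme injected values are trimmed away and only displace which clean values get averaged), while the plain mean is dragged eight orders of magnitude; certification turns this bounded displacement into provable score bounds.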

Updated: 2025-10-09 13:01:57

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08659v1

OneForecast: A Universal Framework for Global and Regional Weather Forecasting

Accurate weather forecasts are important for disaster prevention, agricultural planning, etc. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning models have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. To address these issues, this paper proposes a global-regional nested weather forecasting framework (OneForecast) based on graph neural networks. By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive messaging mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that OneForecast performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions. Codes link https://github.com/YuanGao-YG/OneForecast.

Updated: 2025-10-09 13:01:35

Categories: cs.LG,physics.ao-ph

Download: http://arxiv.org/abs/2502.00338v4

Prepared mind, fast response: A temporal decoupling framework for adaptive knowledge orchestration in open-domain dialogue

The latency-quality tradeoff is a fundamental constraint in open-domain dialogue AI systems, since comprehensive knowledge access necessitates prohibitive response delays. Contemporary approaches offer two inadequate solutions: lightweight instruct models achieve sub-second latency but lack reasoning depth, while tool-augmented ReAct agents enhance factuality through external knowledge at the cost of synchronous execution that blocks interaction during retrieval processes. We therefore propose PMFR, a temporal decoupling framework that fundamentally resolves this tradeoff through asynchronous knowledge orchestration. PMFR employs three coordinated components: (1) a Knowledge Adequacy Evaluator for real-time sufficiency assessment, (2) a Lightweight Response Generator for immediate user interaction, and (3) an Asynchronous Knowledge Refinement Agent for background knowledge enhancement. This architecture maintains continuous conversational flow while progressively enriching knowledge coverage through intelligent triggering mechanisms. Evaluation results on TopiOCQA demonstrate PMFR outperforms brute-force scaling: PMFR achieves 95.3% latency reduction (23.38s -> 1.09s) while preserving response quality comparable to heavyweight synchronous baselines (GEval-C: 0.613 vs. 0.620).
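
The temporal decoupling can be sketched with a thread pool: the adequacy check decides whether to kick off background refinement, but the turn returns a lightweight answer immediately instead of blocking on retrieval. Names, the coverage score, and the 0.5 threshold below are illustrative stand-ins, not PMFR's components:

```python
import time
from concurrent.futures import ThreadPoolExecutor

knowledge = {"coverage": 0.2}   # shared knowledge store

def light_response(query):
    """Immediate answer from whatever knowledge is currently available."""
    return f"quick answer to {query!r} (coverage={knowledge['coverage']:.1f})"

def refine_knowledge(query):
    """Slow background retrieval and knowledge enrichment."""
    time.sleep(0.05)            # stand-in for retrieval latency
    knowledge["coverage"] = 0.9

def handle_turn(pool, query):
    if knowledge["coverage"] < 0.5:           # adequacy evaluator triggers...
        pool.submit(refine_knowledge, query)  # ...asynchronous refinement
    return light_response(query)              # returns without waiting

with ThreadPoolExecutor(max_workers=1) as pool:
    first = handle_turn(pool, "q1")           # served from the stale store
# pool shutdown joins the background refinement before we continue
second = light_response("q2")                 # a later turn sees the update
```

The first turn answers in milliseconds from stale knowledge while retrieval runs in the background; by the next turn the store is enriched, which is the mechanism behind the latency reduction without a quality collapse.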

Updated: 2025-10-09 13:01:00

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08175v1

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed to safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs as they fail to prevent drift in harmful input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of a fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0, for example, while maintaining task utility within 1% degradation and incurring only 2% overhead. It significantly outperforms state-of-the-art defense methods for safeguarding LLM fine-tuning and remains effective in recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.
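
The core regularizer is simple to state: penalize how far each layer's router weights drift from the safety-aligned initialization during fine-tuning. A squared-Frobenius sketch (the exact norm and weighting used by SafeMoE are not specified here, so treat the form as an assumption):

```python
import numpy as np

def routing_drift_penalty(routers_ft, routers_init, lam=0.1):
    """Sum of squared Frobenius gaps between fine-tuned and
    safety-aligned router weight matrices, one per MoE layer."""
    gap = sum(
        float(np.sum((w_ft - w0) ** 2))
        for w_ft, w0 in zip(routers_ft, routers_init)
    )
    return lam * gap

# During fine-tuning the penalty is simply added to the task loss:
# total_loss = task_loss + routing_drift_penalty(routers_ft, routers_init)
```

Keeping the routers anchored means a harmful prompt is still dispatched to the same safety-critical experts after fine-tuning, which is exactly the drift the attack exploits.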

Updated: 2025-10-09 13:00:18

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2509.22745v2

NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions

Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.

Updated: 2025-10-09 12:59:19

Categories: cs.RO,cs.AI,cs.CL,cs.CV

Download: http://arxiv.org/abs/2510.08173v1

Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs

Recent studies have shown that Large Language Models (LLMs) are vulnerable to data poisoning attacks, where malicious training examples embed hidden behaviours triggered by specific input patterns. However, most existing work assumes a single trigger phrase and focuses on the attack's effectiveness, offering limited understanding of trigger mechanisms and of how multiple triggers interact within the model. In this paper, we present a framework for studying poisoning in LLMs. We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Using multiple triggers with high embedding similarity, we demonstrate that poisoned triggers can achieve robust activation even when tokens are substituted or separated by long token spans. Our findings expose a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, we propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis. Our method effectively removes the trigger behaviour with minimal parameter updates, presenting a practical and efficient defence against multi-trigger poisoning.
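
The post hoc recovery step ranks layers by how much their weights moved during (possibly poisoned) fine-tuning and retrains only the top offenders. A sketch of the selection half, using a relative L2 change score (our choice of score; the paper's exact analysis may differ):

```python
import numpy as np

def layers_to_retrain(model_ft, model_ref, top_k=2):
    """Rank layers by relative L2 weight change between the fine-tuned
    model and a clean reference; return the top_k most-changed names."""
    score = {
        name: float(np.linalg.norm(model_ft[name] - w0))
              / (float(np.linalg.norm(w0)) + 1e-12)
        for name, w0 in model_ref.items()
    }
    return sorted(score, key=score.get, reverse=True)[:top_k]
```

Only the returned layers are retrained on clean data, which is what keeps the parameter updates (and compute) minimal relative to full re-finetuning.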

Updated: 2025-10-09 12:56:04

Categories: cs.CL,cs.CR,cs.LG

Download: http://arxiv.org/abs/2507.11112v2

A Multimodal GUI Architecture for Interfacing with LLM-Based Conversational Assistants

Advances in large language models (LLMs) and real-time speech recognition now make it possible to issue any graphical user interface (GUI) action through natural language and receive the corresponding system response directly through the GUI. Most production applications were never designed with speech in mind. This article provides a concrete architecture that enables GUIs to interface with LLM-based speech-enabled assistants. The architecture makes an application's navigation graph and semantics available through the Model Context Protocol (MCP). The ViewModel, part of the MVVM (Model-View-ViewModel) pattern, exposes the application's capabilities to the assistant by supplying both tools applicable to a currently visible view and application-global tools extracted from the GUI tree router. This architecture facilitates full voice accessibility while ensuring reliable alignment between spoken input and the visual interface, accompanied by consistent feedback across modalities. It future-proofs apps for upcoming OS super assistants that employ computer use agents (CUAs) and natively consume MCP if an application provides it. To address concerns about privacy and data security, the practical effectiveness of locally deployable, open-weight LLMs for speech-enabled multimodal UIs is evaluated. Findings suggest that recent smaller open-weight models approach the performance of leading proprietary models in overall accuracy and require enterprise-grade hardware for fast responsiveness. A demo implementation of the proposed architecture can be found at https://github.com/hansvdam/langbar

Updated: 2025-10-09 12:55:47

Categories: cs.HC,cs.AI,I.2.7; D.2.11

Download: http://arxiv.org/abs/2510.06223v2

Bidirectional Representations Augmented Autoregressive Biological Sequence Generation:Application in De Novo Peptide Sequencing

Autoregressive (AR) models, common in sequence generation, are limited in many biological tasks such as de novo peptide sequencing and protein modeling by their unidirectional nature, failing to capture crucial global bidirectional token dependencies. Non-Autoregressive (NAR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability. To transcend this, we propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms. Our approach couples a shared input encoder with two decoders: a non-autoregressive one learning latent bidirectional biological features, and an AR decoder synthesizing the biological sequence by leveraging these bidirectional features. A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features, enriching its predictions. This synergy is cultivated via a tailored training strategy with importance annealing for balanced objectives and cross-decoder gradient blocking for stable, focused learning. Evaluations on a demanding nine-species benchmark of de novo peptide sequencing show that our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. Code is available at https://github.com/BEAM-Labs/denovo.
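
The cross-decoder attention module boils down to standard scaled dot-product attention in which the AR decoder's states supply the queries and the NAR decoder's bidirectional features supply the keys and values. A single-head numpy sketch (the shapes and the single-head simplification are ours):

```python
import numpy as np

def cross_decoder_attention(q, k, v):
    """q: (T_ar, d) AR decoder states; k, v: (T_nar, d) bidirectional
    NAR features. Returns (T_ar, d_v) context vectors that let the AR
    decoder condition on features to the right of its current position."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because the keys and values come from the non-autoregressive decoder, each AR step can query information about the entire sequence, which is the bidirectional signal a plain causal decoder never sees.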

Updated: 2025-10-09 12:52:55

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08169v1

Hierarchical Reinforcement Learning with Low-Level MPC for Multi-Agent Control

Achieving safe and coordinated behavior in dynamic, constraint-rich environments remains a major challenge for learning-based control. Pure end-to-end learning often suffers from poor sample efficiency and limited reliability, while model-based methods depend on predefined references and struggle to generalize. We propose a hierarchical framework that combines tactical decision-making via reinforcement learning (RL) with low-level execution through Model Predictive Control (MPC). For the case of multi-agent systems this means that high-level policies select abstract targets from structured regions of interest (ROIs), while MPC ensures dynamically feasible and safe motion. Tested on a predator-prey benchmark, our approach outperforms end-to-end and shielding-based RL baselines in terms of reward, safety, and consistency, underscoring the benefits of combining structured learning with model-based control.
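
The division of labor can be sketched in one dimension: a high-level policy hands the low level an abstract target (here just a waypoint), and a brute-force MPC searches short action sequences against the dynamics and a tracking cost. Real MPC would handle constraints and use a proper solver; this toy only illustrates the interface between the two layers:

```python
from itertools import product

def mpc_step(x, v, target, horizon=3, actions=(-1.0, 0.0, 1.0), dt=0.1):
    """Return the first control of the best action sequence for 1-D
    double-integrator dynamics under a quadratic tracking cost."""
    best_cost, best_u = float("inf"), 0.0
    for seq in product(actions, repeat=horizon):
        xi, vi, cost = x, v, 0.0
        for u in seq:
            vi += u * dt                      # velocity update
            xi += vi * dt                     # position update
            cost += (xi - target) ** 2 + 0.01 * u ** 2
        if cost < best_cost:
            best_cost, best_u = cost, seq[0]
    return best_u

# High level (e.g. an RL policy choosing among ROIs) picks the waypoint;
# the low-level MPC turns it into dynamically feasible controls.
waypoint = 1.0
u0 = mpc_step(0.0, 0.0, waypoint)
```

The RL policy only ever reasons over abstract targets, while feasibility and safety live entirely in the MPC layer; that separation is what the paper credits for the improved reward and consistency.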

Updated: 2025-10-09 12:49:48

Categories: eess.SY,cs.AI,cs.RO,cs.SY,math.OC

Download: http://arxiv.org/abs/2509.15799v2

Beyond Sub-6 GHz: Leveraging mmWave Wi-Fi for Gait-Based Person Identification

Person identification plays a vital role in enabling intelligent, personalized, and secure human-computer interaction. Recent research has demonstrated the feasibility of leveraging Wi-Fi signals for passive person identification using a person's unique gait pattern. Although most existing work focuses on sub-6 GHz frequencies, the emergence of mmWave offers new opportunities through its finer spatial resolution, though its comparative advantages for person identification remain unexplored. This work presents the first comparative study between sub-6 GHz and mmWave Wi-Fi signals for person identification with commercial off-the-shelf (COTS) Wi-Fi, using a novel dataset of synchronized measurements from the two frequency bands in an indoor environment. To ensure a fair comparison, we apply identical training pipelines and model configurations across both frequency bands. Leveraging end-to-end deep learning, we show that even at low sampling rates (10 Hz), mmWave Wi-Fi signals can achieve high identification accuracy (91.2% on 20 individuals) when combined with effective background subtraction.

Updated: 2025-10-09 12:39:11

Categories: cs.LG,cs.HC

Download: http://arxiv.org/abs/2510.08160v1

Quantum Agents for Algorithmic Discovery

We introduce quantum agents trained by episodic, reward-based reinforcement learning to autonomously rediscover several seminal quantum algorithms and protocols. In particular, our agents learn: efficient logarithmic-depth quantum circuits for the Quantum Fourier Transform; Grover's search algorithm; optimal cheating strategies for strong coin flipping; and optimal winning strategies for the CHSH and other nonlocal games. The agents achieve these results directly through interaction, without prior access to known optimal solutions. This demonstrates the potential of quantum intelligence as a tool for algorithmic discovery, opening the way for the automated design of novel quantum algorithms and protocols.

Updated: 2025-10-09 12:38:53

Categories: quant-ph,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.08159v1

DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization

Large language models (LLMs) have achieved impressive performance in text summarization, yet their performance often falls short when applied to specialized domains that differ from their original pre-training distribution. While fine-tuning can improve summarization quality, it typically relies on costly and scarce high-quality labeled data. In this work, we explore continual pre-training as a scalable, self-supervised approach to adapt LLMs for downstream summarization tasks, particularly in the context of noisy real-world conversation transcripts. We conduct extensive experiments using large-scale, unlabeled business conversation data to investigate whether continual pre-training enhances model capabilities in conversational summarization. Our results demonstrate that continual pre-training yields substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effects of data selection strategies, providing practical guidelines for applying continual pre-training in summarization-focused industrial applications.

Updated: 2025-10-09 12:35:24

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.05858v3

DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations

The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model's generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs' domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.

Updated: 2025-10-09 12:35:20

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08152v1

Unsupervised Multi-Source Federated Domain Adaptation under Domain Diversity through Group-Wise Discrepancy Minimization

Unsupervised multi-source domain adaptation (UMDA) aims to learn models that generalize to an unlabeled target domain by leveraging labeled data from multiple, diverse source domains. While distributed UMDA methods address privacy constraints by avoiding raw data sharing, existing approaches typically assume a small number of sources and fail to scale effectively. Increasing the number of heterogeneous domains often makes existing methods impractical, leading to high computational overhead or unstable performance. We propose GALA, a scalable and robust federated UMDA framework that introduces two key components: (1) a novel inter-group discrepancy minimization objective that efficiently approximates full pairwise domain alignment without quadratic computation; and (2) a temperature-controlled, centroid-based weighting strategy that dynamically prioritizes source domains based on alignment with the target. Together, these components enable stable and parallelizable training across large numbers of heterogeneous sources. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 digit datasets with varied synthetic and real-world domain shifts. Extensive experiments show that GALA consistently achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
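
The temperature-controlled, centroid-based weighting can be sketched as a softmax over negative centroid distances to the target: sources whose feature centroid sits closer to the target's centroid get more weight. A pure-Python illustration under our own simplifying assumptions (raw feature lists rather than federated embeddings):

```python
import math

def centroid(features):
    """Mean feature vector of a list of samples."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def source_weights(source_feats, target_feats, temperature=1.0):
    """Temperature-controlled source weights: softmax over negative
    centroid-to-target distances. Lower temperature sharpens the
    preference for well-aligned sources."""
    t_c = centroid(target_feats)
    dists = []
    for feats in source_feats:
        c = centroid(feats)
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(c, t_c))))
    logits = [-d / temperature for d in dists]
    m = max(logits)  # stable softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

In the federated setting only the centroids would be exchanged, which keeps the communication cost linear in the number of sources rather than quadratic in pairwise comparisons.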

Updated: 2025-10-09 12:34:37

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08150v1

AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents

The utilization of conversational AI systems by leveraging Retrieval Augmented Generation (RAG) techniques to solve customer problems has been on the rise with the rapid progress of Large Language Models (LLMs). However, the absence of a company-specific dedicated knowledge base is a major barrier to the integration of conversational AI systems in contact centers. To this end, we introduce AI Knowledge Assist, a system that extracts knowledge in the form of question-answer (QA) pairs from historical customer-agent conversations to automatically build a knowledge base. Fine-tuning a lightweight LLM on internal data demonstrates state-of-the-art performance, outperforming larger closed-source LLMs. More specifically, empirical evaluation on 20 companies demonstrates that the proposed AI Knowledge Assist system that leverages the LLaMA-3.1-8B model eliminates the cold-start gap in contact centers by achieving above 90% accuracy in answering information-seeking questions. This enables immediate deployment of RAG-powered chatbots.

Updated: 2025-10-09 12:34:31

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.08149v1

Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning

We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they've gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
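
The confidence signal itself is straightforward to compute from token-level logprobs. A minimal sketch (the top-k renormalization and the stopping threshold are illustrative choices, not the paper's exact calibration procedure):

```python
import math

def sequence_entropy(token_logprobs):
    """Mean Shannon entropy (in nats) over per-token logprob distributions.

    token_logprobs: list of dicts, one per generated token, mapping each
    candidate token to its logprob (e.g., the top-k returned by an API).
    """
    step_entropies = []
    for dist in token_logprobs:
        # Renormalize the truncated top-k distribution so it sums to 1.
        probs = [math.exp(lp) for lp in dist.values()]
        z = sum(probs)
        probs = [p / z for p in probs]
        step_entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(step_entropies) / len(step_entropies)

def should_stop(token_logprobs, threshold):
    """Early-stopping signal: low entropy means high confidence."""
    return sequence_entropy(token_logprobs) < threshold
```

Per the abstract, the threshold is model-specific but can be calibrated in one shot from a handful of examples of an existing reasoning dataset.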

Updated: 2025-10-09 12:33:16

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08146v1

Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Logical reasoning is a core capability for large language models (LLMs), yet existing benchmarks that rely solely on final-answer accuracy fail to capture the quality of the reasoning process. To address this, we introduce FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall accuracy, stepwise soundness, and representation-level probing. Leveraging this framework, we conduct a comprehensive study on how different supervision formats in fine-tuning shape reasoning abilities. We fine-tune LLMs on four supervision styles: one in natural language and three symbolic variants. We find a key trade-off: natural language supervision excels at generalization to out-of-distribution and long-chain problems, whereas symbolic supervision is superior at instilling structurally sound, atomic reasoning steps. Furthermore, our probing analysis indicates that fine-tuning primarily refines the model's step-by-step generation process, rather than improving its ability to converge on an answer early. Together, our framework and analysis provide a more rigorous lens for evaluating and improving logical reasoning in LLMs. The code is available at https://github.com/YujunZhou/FineLogic.

Updated: 2025-10-09 12:32:49

Categories: cs.CL,cs.AI,cs.LO

Download: http://arxiv.org/abs/2506.04810v2

Inner-Instance Normalization for Time Series Forecasting

Real-world time series are influenced by numerous factors and exhibit complex non-stationary characteristics. Non-stationarity can lead to distribution shifts, where the statistical properties of time series change over time, negatively impacting model performance. Several instance normalization techniques have been proposed to address distribution shifts in time series forecasting. However, existing methods fail to account for shifts within individual instances, leading to suboptimal performance. To tackle inner-instance distribution shifts, we propose two novel point-level methods: Learning Distribution (LD) and Learning Conditional Distribution (LCD). LD eliminates internal discrepancies by fitting the internal distribution of input and output with different parameters at different time steps, while LCD utilizes neural networks to predict scaling coefficients of the output. We evaluate the performance of the two methods with various backbone models across public benchmarks and demonstrate the effectiveness of the point-level paradigm through comparative experiments.
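
A point-level normalization in this spirit can be sketched with local window statistics, so that distribution shifts within a single instance are removed rather than applying one mean/std per series. This is our own simplified illustration, not the LD/LCD parameterization from the paper:

```python
def inner_instance_normalize(series, window=3):
    """Point-level normalization: standardize each point with the statistics
    of a trailing window inside the same instance. Contrast with ordinary
    instance normalization, which uses a single mean/std for the whole series
    and therefore cannot remove shifts that occur within it."""
    normed = []
    for t in range(len(series)):
        lo = max(0, t - window + 1)
        local = series[lo:t + 1]
        m = sum(local) / len(local)
        var = sum((x - m) ** 2 for x in local) / len(local)
        std = var ** 0.5
        normed.append((series[t] - m) / std if std > 0 else 0.0)
    return normed
```

The forecasting model would then operate on the normalized points, with the stored local statistics used to de-normalize its outputs.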

Updated: 2025-10-09 12:24:47

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.08657v1

Arbitrary Entropy Policy Optimization: Entropy Is Controllable in Reinforcement Finetuning

Reinforcement finetuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLM), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively removing collapse in GRPO; (2) reveals a non-monotonic relation where performance first improves then declines with increasing entropy, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy, providing a broader RFT paradigm where superior target distributions can serve as REINFORCE regularizers.
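
The temperature mechanism AEPO relies on is easy to illustrate: dividing the logits by a temperature T > 1 raises the entropy of the resulting distribution, while T < 1 lowers it. A minimal sketch of the adjustment (not the AEPO training loop itself):

```python
import math

def temperature_adjust(logits, temperature):
    """Temperature-adjusted softmax: T > 1 flattens the distribution
    (more entropy, more exploration), T < 1 sharpens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # stable softmax
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

In AEPO's formulation, REINFORCE gradients taken on such temperature-adjusted distributions replace an explicit entropy bonus, and the temperature is regulated so that policy entropy holds at the chosen target level.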

Updated: 2025-10-09 12:24:08

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08141v1

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values, further raising the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (CoT) tasks. However, existing frameworks commonly face challenges such as inference bottlenecks and complexity barriers, which restrict their accessibility to newcomers. To bridge this gap, we introduce \textbf{OpenRLHF}, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency, with speedups ranging from 1.22x to 1.68x across different model sizes, compared to state-of-the-art frameworks. Additionally, it requires significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.

Updated: 2025-10-09 12:22:46

Categories: cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2405.11143v6

Explaining raw data complexity to improve satellite onboard processing

With increasing processing power, deploying AI models for remote sensing directly onboard satellites is becoming feasible. However, new constraints arise, mainly when using raw, unprocessed sensor data instead of preprocessed ground-based products. While current solutions primarily rely on preprocessed sensor images, few approaches directly leverage raw data. This study investigates the effects of utilising raw data on deep learning models for object detection and classification tasks. We introduce a simulation workflow to generate raw-like products from high-resolution L1 imagery, enabling systematic evaluation. Two object detection models (YOLOv11n and YOLOX-S) are trained on both raw and L1 datasets, and their performance is compared using standard detection metrics and explainability tools. Results indicate that while both models perform similarly at low to medium confidence thresholds, the model trained on raw data struggles with object boundary identification at high confidence levels. This suggests that adapting AI architectures with improved contouring methods can enhance object detection on raw images, improving onboard AI for remote sensing.

Updated: 2025-10-09 12:22:46

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.06858v2

Multi-Continental Healthcare Modelling Using Blockchain-Enabled Federated Learning

One of the biggest challenges of building artificial intelligence (AI) models in the healthcare area is data sharing. Since healthcare data is private, sensitive, and heterogeneous, collecting sufficient data for modelling is exhausting, costly, and sometimes impossible. In this paper, we propose a framework for global healthcare modelling using datasets from multiple continents (Europe, North America, and Asia) without sharing the local datasets, and choose glucose management as a study model to verify its effectiveness. Technically, blockchain-enabled federated learning is implemented with adaptation to meet the privacy and safety requirements of healthcare data; meanwhile, it rewards honest participation and penalizes malicious activities using its on-chain incentive mechanism. Experimental results show that the proposed framework is effective, efficient, and privacy-preserving. Its prediction accuracy consistently outperforms models trained on limited personal data and achieves comparable or even slightly better results than centralized training in certain scenarios, all while preserving data privacy. This work paves the way for international collaborations on healthcare projects, where additional data is crucial for reducing bias and providing benefits to humanity.
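
The aggregation step of such a federated setup can be sketched as a FedAvg-style weighted parameter average, with each site's weight proportional to its local sample count (the blockchain incentive layer and secure transport are omitted here; this is an illustration, not the paper's implementation):

```python
def federated_average(client_params, client_sizes):
    """FedAvg-style aggregation: weighted average of client model parameters.

    client_params: one flat list of parameter values per participating site.
    client_sizes:  number of local training samples per site (the weights).
    Only parameters leave each site; raw patient data never does.
    """
    total = sum(client_sizes)
    n = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n)
    ]
```

Each training round, sites fit locally, submit updated parameters, and receive the aggregate back; the on-chain mechanism described in the abstract would additionally score submissions to reward honest participants.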

Updated: 2025-10-09 12:22:19

Categories: cs.LG,cs.AI,cs.CR

Download: http://arxiv.org/abs/2410.17933v4

Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement

Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors of the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.
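
An enhancement objective in this spirit can be illustrated as a penalty on how similar an attention head's profiles are across timestamps; minimizing it pushes the head to separate video tokens in time. The loss form below is our own illustrative choice (mean pairwise cosine similarity), not the paper's exact TCAS objective:

```python
import math

def temporal_distinctness_loss(attn_by_timestamp):
    """Illustrative sharpening objective: mean pairwise cosine similarity
    between a head's attention profiles at different timestamps.
    0.0 = perfectly distinct timestamps, 1.0 = indistinguishable."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    n = len(attn_by_timestamp)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cos(attn_by_timestamp[i], attn_by_timestamp[j])
               for i, j in pairs) / len(pairs)
```

Training with such a term added to the task loss would directly reward attention heads for assigning distinguishable weights to video tokens from different timestamps.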

Updated: 2025-10-09 12:22:06

Categories: cs.CV,cs.AI,cs.MM

Download: http://arxiv.org/abs/2510.08138v1

Approximate Domain Unlearning for Vision-Language Models

Pre-trained Vision-Language Models (VLMs) exhibit strong generalization capabilities, enabling them to recognize a wide range of objects across diverse domains without additional training. However, they often retain irrelevant information beyond the requirements of specific downstream tasks, raising concerns about computational efficiency and potential information leakage. This has motivated growing interest in approximate unlearning, which aims to selectively remove unnecessary knowledge while preserving overall model performance. Existing approaches to approximate unlearning have primarily focused on class unlearning, where a VLM is retrained to fail to recognize specified object classes while maintaining accuracy for others. However, merely forgetting object classes is often insufficient in practical applications. For instance, an autonomous driving system should accurately recognize real cars while avoiding misrecognition of illustrated cars depicted in roadside advertisements as real cars, which could be hazardous. In this paper, we introduce Approximate Domain Unlearning (ADU), a novel problem setting that requires reducing recognition accuracy for images from specified domains (e.g., illustration) while preserving accuracy for other domains (e.g., real). ADU presents new technical challenges: due to the strong domain generalization capability of pre-trained VLMs, domain distributions are highly entangled in the feature space, making naive approaches based on penalizing target domains ineffective. To tackle this limitation, we propose a novel approach that explicitly disentangles domain distributions and adaptively captures instance-specific domain information. Extensive experiments show that our approach outperforms baselines built upon VLM tuning techniques, paving the way for practical and fine-grained unlearning in VLMs. Code: https://kodaikawamura.github.io/Domain_Unlearning/.

Updated: 2025-10-09 12:17:59

Categories: cs.LG,cs.AI,I.2.6

Download: http://arxiv.org/abs/2510.08132v1

Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. The standard retrieval process prioritizes relevance, focusing on topical alignment between queries and passages. In contrast, in RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
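
The sliding-window selection can be sketched as follows: previously selected passages are carried into the next window so they compete with fresh candidates, and the judge returns whichever subset it deems useful, with no fixed cutoff on how many survive. Here `judge` is a pluggable callable standing in for the distilled utility model:

```python
def sliding_window_select(passages, judge, window_size=4):
    """Utility-based selection over a sliding window.

    `judge(window)` returns the subset of the window judged useful for the
    query (a stand-in for the distilled utility model). Survivors from each
    window are carried forward and re-judged against new candidates, so the
    final set adapts to the query rather than obeying a fixed top-k."""
    selected = []
    idx = 0
    while idx < len(passages):
        # Fill the window with survivors plus as many fresh passages as fit.
        room = max(window_size - len(selected), 1)
        window = selected + passages[idx: idx + room]
        selected = judge(window)
        idx += room
    return selected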

Updated: 2025-10-09 12:10:12

Categories: cs.IR,cs.AI,cs.CL,cs.LG

Download: http://arxiv.org/abs/2507.19102v2

High-dimensional Analysis of Synthetic Data Selection

Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve prediction performance of classifiers has been put into question. Besides heuristic principles such as "synthetic data should be close to the real data distribution", it is actually not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore we prove that, in some settings, matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights from linear models carry over to deep neural networks and generative models. We empirically demonstrate that the covariance matching procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across training paradigms, architectures, datasets and generative models used for augmentation.
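
Covariance matching is simple to implement; below is a diagonal (per-feature variance) simplification in pure Python. The mean is deliberately left untouched, mirroring the finding that the mean shift does not affect generalization error; full covariance matching would additionally rotate with the covariance square roots:

```python
import math

def match_diagonal_covariance(synthetic, target):
    """Rescale each feature of `synthetic` so its variance matches `target`
    (diagonal simplification of covariance matching).

    Both inputs: list of samples, each a list of feature values.
    """
    def column(data, j):
        return [row[j] for row in data]

    def mean_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / len(xs)
        return m, v

    n_features = len(synthetic[0])
    scaled = [row[:] for row in synthetic]
    for j in range(n_features):
        sm, sv = mean_var(column(synthetic, j))
        _, tv = mean_var(column(target, j))
        scale = math.sqrt(tv / sv) if sv > 0 else 1.0
        for row in scaled:
            # Keep the synthetic mean; rescale the spread around it so the
            # per-feature variance matches the target distribution.
            row[j] = sm + (row[j] - sm) * scale
    return scaled
```

Applied as a selection/preprocessing step, this aligns the second moment of the synthetic augmentation data with the target distribution before training.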

Updated: 2025-10-09 12:06:31

标题: 合成数据选择的高维分析

摘要: 尽管生成模型的发展取得了进展,但它们在创建改进分类器预测性能的合成数据方面的实用性受到质疑。除了启发式原则,如“合成数据应接近真实数据分布”,实际上不清楚哪些特定属性会影响泛化误差。我们的论文通过高维回归的视角回答了这个问题。理论上,我们展示了对于线性模型,目标分布与合成数据分布之间的协方差漂移会影响泛化误差,但令人惊讶的是,均值漂移却不会。此外,我们证明,在某些设置中,匹配目标分布的协方差是最佳的。值得注意的是,从线性模型得出的理论见解也适用于深度神经网络和生成模型。我们在实证上证明,协方差匹配程序(将合成数据的协方差与来自目标分布的数据的协方差匹配)在合成数据选择方面表现良好,跨训练范式、架构、数据集和用于增强的生成模型。

更新时间: 2025-10-09 12:06:31

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2510.08123v1
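
The paper's central prediction, that covariance shift matters while mean shift does not, suggests ranking synthetic pools by a simple covariance-mismatch score. The sketch below illustrates this with Gaussian toy data; the score and the test setup are illustrative, not the paper's exact procedure.

```python
import numpy as np

def cov_mismatch(real, synth):
    """Frobenius distance between sample covariances. Per the theory, this
    shift drives generalization error; a pure mean shift should not."""
    c_r = np.cov(real, rowvar=False)
    c_s = np.cov(synth, rowvar=False)
    return float(np.linalg.norm(c_r - c_s, ord="fro"))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
matched = rng.normal(size=(500, 3)) + 5.0   # mean shift only: same covariance
scaled = rng.normal(size=(500, 3)) * 2.0    # covariance shift (4x variance)
# The mean-shifted pool matches the target covariance far better.
assert cov_mismatch(real, matched) < cov_mismatch(real, scaled)
```

A covariance-matching selection rule would simply keep the candidate synthetic pool (or subset) minimizing this score.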

Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations

Using LLMs to evaluate text, that is, LLM-as-a-judge, is increasingly being used at scale to augment or even replace human annotations. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level concept-based global policies from LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection. We find that the extracted global policies are highly faithful to decisions of the LLM-as-a-Judge. Additionally, we evaluated the robustness of global policies to text perturbations and adversarial attacks. Finally, we conducted a user study to evaluate user understanding and satisfaction with global policies.

Updated: 2025-10-09 12:05:37

标题: 通过可验证的全局解释解读LLM作为评判者的策略

摘要: 使用LLMs来评估文本,即LLM作为评判者,越来越多地被大规模地用来增强甚至替代人类标注。因此,我们必须了解这样做可能存在的潜在偏见和风险。在这项工作中,我们提出了一种从LLM作为评判者中提取基于高级概念的全局策略的方法。我们的方法包括两个算法:1)CLoVE(对比局部可验证解释),它生成可验证的、基于概念的、对比式的局部解释;2)GloVE(全局可验证解释),它使用迭代聚类、总结和验证将局部规则压缩成全局策略。我们在七个内容危害检测的标准基准数据集上评估了GloVE。我们发现提取的全局策略与LLM作为评判者的决策高度一致。此外,我们评估了全局策略对文本扰动和对抗攻击的鲁棒性。最后,我们进行了用户研究,以评估用户对全局策略的理解和满意度。

更新时间: 2025-10-09 12:05:37

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.08120v1
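
GloVE's condensation step (clustering local rules and keeping representatives) can be caricatured with a set-similarity heuristic. This is a toy stand-in: the actual algorithm iterates clustering, LLM-based summarization, and verification, none of which appear here.

```python
# Toy sketch of condensing local rules into a global policy: treat each
# local rule as a set of concepts and keep one representative per cluster
# of near-duplicate rules (clustered greedily by Jaccard similarity).

def jaccard(a, b):
    return len(a & b) / len(a | b)

def condense(rules, threshold=0.5):
    policy = []
    for concepts in rules:
        # Keep a rule only if it is not too similar to any kept rule.
        if all(jaccard(concepts, kept) < threshold for kept in policy):
            policy.append(concepts)
    return policy

local_rules = [{"threat", "violence"}, {"violence", "threat", "weapon"},
               {"spam", "ads"}, {"ads", "spam", "links"}]
global_policy = condense(local_rules)
assert len(global_policy) == 2   # two clusters, one representative each
```

Faithfulness checking would then re-run the judge on held-out items and compare its decisions against the condensed policy's predictions.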

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation due to sequence lengths out-of-distribution, and excessively long inference times caused by the quadratic computational complexity of attention. These issues limit LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using QK dot products to measure per-head KV Cache criticality at token-level. By per-head soft voting mechanism, TokenSelect selectively involves a few critical KV cache tokens in attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we design the Selection Cache based on observations of consecutive Query similarity and implemented the efficient Paged Dot Product Kernel, significantly reducing the selection overhead. A comprehensive evaluation of TokenSelect demonstrates up to $23.84\times$ speedup in attention computation and up to $2.28\times$ acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.

Updated: 2025-10-09 12:05:04

标题: TokenSelect:通过动态令牌级KV缓存选择实现LLM的高效长上下文推断和长度外推

摘要: 大型语言模型(LLMs)的快速发展推动了当代应用程序中处理扩展上下文序列的需求。然而,这一进展面临两个挑战:由于序列长度超出分布范围而导致性能下降,以及由于注意力的二次计算复杂性而导致过长的推理时间。这些问题限制了LLMs在长上下文场景中的应用。在本文中,我们提出了一种名为动态令牌级KV缓存选择(TokenSelect)的无需训练的方法,用于高效准确地进行长上下文推理。TokenSelect基于非连续注意力稀疏性的观察,使用QK点积在令牌级别衡量每个注意力头的KV缓存关键性。通过逐头软投票机制,TokenSelect在不牺牲准确性的情况下,仅让少数关键KV缓存令牌参与注意力计算。为了进一步加速TokenSelect,我们根据连续查询相似性的观察设计了选择缓存(Selection Cache),并实现了高效的分页点积内核(Paged Dot Product Kernel),显著减少了选择开销。对TokenSelect的全面评估显示,注意力计算加速高达23.84倍,端到端延迟加速高达2.28倍,同时性能优于最先进的长上下文推理方法。

更新时间: 2025-10-09 12:05:04

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2411.02886v4
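
The core selection step (per-head QK criticality plus soft voting across heads) can be sketched directly. Shapes and the exact voting rule below are simplified assumptions for illustration; the paper additionally caches selections across similar consecutive queries and uses a custom paged kernel.

```python
import numpy as np

def select_kv_tokens(q, k_cache, top_k=4):
    """Score each cached token by its QK dot product per head, soft-vote
    across heads, and return the indices of the top-k tokens to keep in
    the attention computation.
    q: (heads, dim); k_cache: (heads, seq, dim)."""
    scores = np.einsum("hd,hsd->hs", q, k_cache)        # per-head criticality
    votes = np.exp(scores - scores.max(axis=1, keepdims=True))
    votes /= votes.sum(axis=1, keepdims=True)           # softmax per head
    pooled = votes.sum(axis=0)                          # soft vote over heads
    return np.sort(np.argsort(pooled)[-top_k:])         # kept token indices

rng = np.random.default_rng(1)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(8, 64, 16))
kept = select_kv_tokens(q, k, top_k=8)
assert len(kept) == 8 and kept.max() < 64
```

A real implementation would always retain a window of the most recent tokens in addition to the voted selection.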

Panorama: Fast-Track Nearest Neighbors

Approximate Nearest-Neighbor Search (ANNS) efficiently finds data items whose embeddings are close to that of a given query in a high-dimensional space, aiming to balance accuracy with speed. Used in recommendation systems, image and video retrieval, natural language processing, and retrieval-augmented generation (RAG), ANNS algorithms such as IVFPQ, HNSW graphs, Annoy, and MRPT utilize graph, tree, clustering, and quantization techniques to navigate large vector spaces. Despite this progress, ANNS systems spend up to 99\% of query time to compute distances in their final refinement phase. In this paper, we present PANORAMA, a machine learning-driven approach that tackles the ANNS verification bottleneck through data-adaptive learned orthogonal transforms that facilitate the accretive refinement of distance bounds. Such transforms compact over 90\% of signal energy into the first half of dimensions, enabling early candidate pruning with partial distance computations. We integrate PANORAMA into state-of-the-art ANNS methods, namely IVFPQ/Flat, HNSW, MRPT, and Annoy, without index modification, using level-major memory layouts, SIMD-vectorized partial distance computations, and cache-aware access patterns. Experiments across diverse datasets -- from image-based CIFAR-10 and GIST to modern embedding spaces including OpenAI's Ada 2 and Large 3 -- demonstrate that PANORAMA affords a 2--30$\times$ end-to-end speedup with no recall loss.

Updated: 2025-10-09 12:01:20

标题: 全景:快速跟踪最近邻居

摘要: 近似最近邻搜索(ANNS)旨在在准确性与速度之间取得平衡,高效地找出嵌入在高维空间中与给定查询接近的数据项。IVFPQ、HNSW图、Annoy和MRPT等ANNS算法应用于推荐系统、图像与视频检索、自然语言处理以及检索增强生成(RAG),利用图、树、聚类和量化技术在大型向量空间中导航。尽管取得了这些进展,ANNS系统在最终的精化阶段仍要花费高达99%的查询时间来计算距离。本文介绍了PANORAMA,这是一种机器学习驱动的方法,通过数据自适应的学习型正交变换来解决ANNS验证瓶颈,从而支持距离界的渐进精化。这类变换将超过90%的信号能量压缩到前一半维度中,使得仅凭部分距离计算即可提前剪枝候选项。我们将PANORAMA集成到最先进的ANNS方法中,即IVFPQ/Flat、HNSW、MRPT和Annoy,无需修改索引,并使用按层优先(level-major)的内存布局、SIMD向量化的部分距离计算和缓存感知的访问模式。在从基于图像的CIFAR-10和GIST到包括OpenAI的Ada 2和Large 3在内的现代嵌入空间的多样数据集上进行的实验表明,PANORAMA可实现2-30倍的端到端加速,且不损失召回率。

更新时间: 2025-10-09 12:01:20

领域: cs.LG,cs.AI,cs.DB

下载: http://arxiv.org/abs/2510.00566v2
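
The pruning idea is simple to demonstrate: the squared distance over a prefix of dimensions is a lower bound on the full squared distance, so a candidate whose partial distance already exceeds the radius can be rejected without finishing the computation. The sketch assumes the energy-compacting orthogonal transform has already been applied (here the data is just raw Gaussian vectors, so only correctness, not speed, is shown); the SIMD and memory-layout engineering is omitted.

```python
import numpy as np

def filter_within_radius(query, candidates, radius_sq, prefix=8):
    """Keep candidates within sqrt(radius_sq) of query, pruning early via
    the partial-distance lower bound over the first `prefix` dimensions."""
    kept = []
    for c in candidates:
        partial = float(np.sum((query[:prefix] - c[:prefix]) ** 2))
        if partial > radius_sq:     # lower bound already too large: prune
            continue
        if float(np.sum((query - c) ** 2)) <= radius_sq:
            kept.append(c)
    return kept

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 32))
q = rng.normal(size=32)
exact = [c for c in X if np.sum((q - c) ** 2) <= 25.0]
pruned = filter_within_radius(q, X, 25.0)
assert len(pruned) == len(exact)    # pruning is lossless
```

Because the bound is a true lower bound, pruning never discards a valid neighbor; the learned transform only makes the early rejection fire sooner.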

Learning Equilibria from Data: Provably Efficient Multi-Agent Imitation Learning

This paper provides the first expert sample complexity characterization for learning a Nash equilibrium from expert data in Markov Games. We show that a new quantity named the single policy deviation concentrability coefficient is unavoidable in the non-interactive imitation learning setting, and we provide an upper bound for behavioral cloning (BC) featuring this coefficient. BC exhibits substantial regret in games with a high concentrability coefficient, leading us to utilize expert queries to develop and introduce two novel solution algorithms: MAIL-BRO and MURMAIL. The former employs a best response oracle and learns an $\varepsilon$-Nash equilibrium with $\mathcal{O}(\varepsilon^{-4})$ expert and oracle queries. The latter bypasses completely the best response oracle at the cost of a worse expert query complexity of order $\mathcal{O}(\varepsilon^{-8})$. Finally, we provide numerical evidence, confirming our theoretical findings.

Updated: 2025-10-09 11:57:25

标题: 从数据中学习均衡:可证明高效的多智体模仿学习

摘要: 本文首次给出了在马尔可夫博弈中从专家数据学习纳什均衡的专家样本复杂度刻画。我们证明,一个称为单策略偏差集中度系数的新量在非交互式模仿学习设置中不可避免,并为行为克隆(BC)给出了包含该系数的上界。在集中度系数较高的博弈中,BC会产生可观的遗憾(regret),这促使我们利用专家查询开发并引入两种新算法:MAIL-BRO和MURMAIL。前者借助最佳响应预言机(oracle),以$\mathcal{O}(\varepsilon^{-4})$次专家和预言机查询学习一个$\varepsilon$-纳什均衡;后者完全绕过最佳响应预言机,但专家查询复杂度变差,为$\mathcal{O}(\varepsilon^{-8})$。最后,我们提供了数值证据,证实了我们的理论发现。

更新时间: 2025-10-09 11:57:25

领域: cs.LG

下载: http://arxiv.org/abs/2505.17610v2

Random Window Augmentations for Deep Learning Robustness in CT and Liver Tumor Segmentation

Contrast-enhanced Computed Tomography (CT) is important for diagnosis and treatment planning for various medical conditions. Deep learning (DL) based segmentation models may enable automated medical image analysis for detecting and delineating tumors in CT images, thereby reducing clinicians' workload. Achieving generalization capabilities in limited data domains, such as radiology, requires modern DL models to be trained with image augmentation. However, naively applying augmentation methods developed for natural images to CT scans often disregards the nature of the CT modality, where the intensities measure Hounsfield Units (HU) and have important physical meaning. This paper challenges the use of such intensity augmentations for CT imaging and shows that they may lead to artifacts and poor generalization. To mitigate this, we propose a CT-specific augmentation technique, called Random windowing, that exploits the available HU distribution of intensities in CT images. Random windowing encourages robustness to contrast-enhancement and significantly increases model performance on challenging images with poor contrast or timing. We perform ablations and analysis of our method on multiple datasets, and compare to, and outperform, state-of-the-art alternatives, while focusing on the challenge of liver tumor segmentation.

Updated: 2025-10-09 11:57:04

标题: 深度学习在CT和肝脏肿瘤分割中的鲁棒性的随机窗口增强

摘要: 对比增强计算机断层扫描(CT)对多种疾病的诊断和治疗规划至关重要。基于深度学习(DL)的分割模型可以实现自动化的医学图像分析,用于检测和勾画CT图像中的肿瘤,从而减轻临床医生的工作量。要在放射学等数据有限的领域实现泛化能力,需要用图像增强来训练现代DL模型。然而,将为自然图像开发的增强方法直接套用于CT扫描,往往忽视了CT模态的特性:其强度以亨氏单位(Hounsfield Units, HU)度量,具有重要的物理意义。本文质疑在CT成像中使用这类强度增强,并指出它们可能导致伪影和较差的泛化。为缓解这一问题,我们提出了一种CT专用的增强技术,称为随机加窗(Random windowing),它利用CT图像中可用的HU强度分布。随机加窗提升了模型对对比增强的鲁棒性,并显著提高了模型在对比度不佳或造影时机不当的困难图像上的性能。我们在多个数据集上对方法进行了消融实验和分析,并在肝脏肿瘤分割这一挑战上与最先进的替代方法进行比较并取得更优的表现。

更新时间: 2025-10-09 11:57:04

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2510.08116v1
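
CT windowing maps a Hounsfield-Unit range to display intensities; the augmentation randomizes that range. A minimal sketch, with sampling ranges that are illustrative rather than the paper's exact scheme:

```python
import random

def random_window(hu_values, center_range=(30, 90), width_range=(150, 400)):
    """Draw a random window center and width in Hounsfield Units (HU),
    then clip and rescale intensities to [0, 1], mimicking how a CT image
    is viewed at different window settings."""
    c = random.uniform(*center_range)
    w = random.uniform(*width_range)
    lo, hi = c - w / 2, c + w / 2
    return [min(max((v - lo) / (hi - lo), 0.0), 1.0) for v in hu_values]

random.seed(0)
img = [-1000, 0, 40, 80, 400]   # air, water, soft tissue, ~bone in HU
out = random_window(img)
assert out[0] == 0.0 and out[-1] == 1.0 and all(0.0 <= v <= 1.0 for v in out)
```

In a training pipeline this would be applied per sample (on arrays rather than lists), so the model sees the same anatomy under many plausible window settings.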

Can Risk-taking AI-Assistants suitably represent entities

Responsible AI demands systems whose behavioral tendencies can be effectively measured, audited, and adjusted to prevent inadvertently nudging users toward risky decisions or embedding hidden biases in risk aversion. As language models (LMs) are increasingly incorporated into AI-driven decision support systems, understanding their risk behaviors is crucial for their responsible deployment. This study investigates the manipulability of risk aversion (MoRA) in LMs, examining their ability to replicate human risk preferences across diverse economic scenarios, with a focus on gender-specific attitudes, uncertainty, and role-based decision-making. The results indicate that while LMs such as DeepSeek Reasoner and Gemini-2.0-flash-lite exhibit some alignment with human behaviors, notable discrepancies highlight the need to refine bio-centric measures of manipulability. These findings suggest directions for refining AI design to better align human and AI risk preferences and enhance ethical decision-making. The study calls for further advancements in model design to ensure that AI systems more accurately replicate human risk preferences, thereby improving their effectiveness in risk management contexts. This approach could enhance the applicability of AI assistants in managing risk.

Updated: 2025-10-09 11:55:31

标题: 能冒险的AI助手是否能够适当地代表实体

摘要: 负责任的人工智能要求系统的行为倾向能够被有效地测量、审计和调整,以防止无意中将用户推向有风险的决策,或在风险规避中嵌入隐藏的偏见。随着语言模型(LMs)越来越多地被整合到人工智能驱动的决策支持系统中,了解它们的风险行为对于它们的负责任部署至关重要。本研究调查了语言模型中风险规避的可操纵性(MoRA),检查它们在不同经济场景中复制人类风险偏好的能力,重点关注性别特定态度、不确定性以及基于角色的决策制定。结果表明,虽然DeepSeek Reasoner和Gemini-2.0-flash-lite等语言模型在某种程度上与人类行为保持一致,但显著的差异突出了需要完善对可操纵性的生物中心度量。这些发现为优化人工智能设计指明了方向,以更好地对齐人类与人工智能的风险偏好并增强符合伦理的决策制定。该研究呼吁进一步改进模型设计,以确保人工智能系统更准确地复制人类风险偏好,从而提高它们在风险管理情境中的有效性。这种方法可以增强人工智能助手在风险管理中的适用性。

更新时间: 2025-10-09 11:55:31

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2510.08114v1

Bayesian Decision Making around Experts

Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how learners should optimally incorporate certain forms of expert data, which may differ in structure from the learner's own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert's optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience, or based on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner's posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes their one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner for the cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.

Updated: 2025-10-09 11:53:19

标题: 围绕专家的贝叶斯决策

摘要: 复杂的学习代理越来越多地与现有专家(如人类操作员或先前训练过的代理)一起部署。然而,学习者应如何最优地纳入某些形式的专家数据仍不清楚,这类数据的结构可能不同于学习者自身的行动-结果经验。我们在贝叶斯多臂老虎机的背景下研究了这个问题,考虑了:(i)离线设置,即学习者在交互之前收到来自专家最优策略的结果数据集;(ii)同时设置,即学习者必须在每一步选择是基于自己的经验还是基于专家同时取得的结果来更新其信念。我们形式化了专家数据如何影响学习者的后验,并证明在专家结果上进行预训练可按专家数据与最优动作之间的互信息收紧信息论遗憾界。对于同时设置,我们提出了一个信息导向规则:学习者处理使其对最优动作的一步信息增益最大化的数据源。最后,我们提出了让学习者推断何时该信任专家、何时不该信任的策略,从而在专家无效或被攻破的情况下保护学习者。通过量化专家数据的价值,我们的框架为代理提供了实用的信息论算法,使其能够智能地决定何时向他人学习。

更新时间: 2025-10-09 11:53:19

领域: cs.LG,cs.AI,cs.MA

下载: http://arxiv.org/abs/2510.08113v1
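
The information-directed rule can be made concrete with Beta-Bernoulli arms: at each step, process whichever observation (the learner's own pull vs. the expert's outcome) gives the larger expected drop in the entropy of P(arm 1 is optimal). The two-arm setup, grid integration, and priors below are illustrative choices, not the paper's exact construction.

```python
import math

def beta_pdf(x, a, b):
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(logc + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def prob_arm1_optimal(post, n=400):
    """P(arm 1's rate > arm 2's rate) under independent Beta posteriors,
    approximated on a grid."""
    (a1, b1), (a2, b2) = post
    xs = [(i + 0.5) / n for i in range(n)]
    cdf2, acc = [], 0.0
    for x in xs:                                 # running CDF of arm 2
        acc += beta_pdf(x, a2, b2) / n
        cdf2.append(acc)
    return sum(beta_pdf(x, a1, b1) / n * cdf2[i] for i, x in enumerate(xs))

def entropy(p):
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def expected_info_gain(post, arm):
    """Expected one-step entropy reduction about which arm is optimal,
    from observing one Bernoulli outcome of `arm`."""
    a, b = post[arm]
    h0 = entropy(prob_arm1_optimal(post))
    p1 = a / (a + b)                             # posterior predictive
    gain = 0.0
    for y, w in ((1, p1), (0, 1.0 - p1)):
        upd = list(post)
        upd[arm] = (a + y, b + 1 - y)
        gain += w * (h0 - entropy(prob_arm1_optimal(upd)))
    return gain

post = [(2.0, 2.0), (8.0, 2.0)]    # learner's uncertain arm vs. expert's arm
gains = [expected_info_gain(post, i) for i in (0, 1)]
source = max((0, 1), key=lambda i: gains[i])     # which observation to process
```

By Jensen's inequality the expected gain is non-negative (up to grid error), so the rule always has a well-defined best source.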

VersionRAG: Version-Aware Retrieval-Augmented Generation for Evolving Documents

Retrieval-Augmented Generation (RAG) systems fail when documents evolve through versioning, a ubiquitous characteristic of technical documentation. Existing approaches achieve only 58-64% accuracy on version-sensitive questions, retrieving semantically similar content without temporal validity checks. We present VersionRAG, a version-aware RAG framework that explicitly models document evolution through a hierarchical graph structure capturing version sequences, content boundaries, and changes between document states. During retrieval, VersionRAG routes queries through specialized paths based on intent classification, enabling precise version-aware filtering and change tracking. On our VersionQA benchmark of 100 manually curated questions across 34 versioned technical documents, VersionRAG achieves 90% accuracy, outperforming naive RAG (58%) and GraphRAG (64%). VersionRAG reaches 60% accuracy on implicit change detection where baselines fail (0-10%), demonstrating its ability to track undocumented modifications. Additionally, VersionRAG requires 97% fewer tokens during indexing than GraphRAG, making it practical for large-scale deployment. Our work establishes versioned document QA as a distinct task and provides both a solution and benchmark for future research.

Updated: 2025-10-09 11:48:58

标题: VersionRAG:用于不断发展的文档的版本感知检索增强生成

摘要: 检索增强生成(RAG)系统在文档通过版本演变时会失效,而版本化是技术文档的普遍特征。现有方法在版本敏感问题上仅实现了58-64%的准确率,检索出语义相似的内容,却没有进行时间有效性检查。我们提出了VersionRAG,这是一个版本感知的RAG框架,通过捕捉版本序列、内容边界和文档状态之间变化的分层图结构,显式建模文档演变。在检索过程中,VersionRAG基于意图分类将查询引导至专门的路径,实现精确的版本感知过滤和变化跟踪。在我们的VersionQA基准(涵盖34份版本化技术文档的100个人工筛选问题)上,VersionRAG实现了90%的准确率,优于朴素RAG(58%)和GraphRAG(64%)。在基线方法失效(0-10%)的隐式变化检测上,VersionRAG达到60%的准确率,展示了其追踪未记录修改的能力。此外,VersionRAG在索引过程中所需的令牌数量比GraphRAG少97%,使其适用于大规模部署。我们的工作将版本化文档问答确立为一项独立任务,并为未来研究提供了解决方案和基准。

更新时间: 2025-10-09 11:48:58

领域: cs.IR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2510.08109v1
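
The intent-based routing step can be sketched with a toy classifier. Everything below (the intent labels, keyword rules, and retriever stubs) is invented for illustration; the paper routes over a version graph with a learned classifier, not keyword matching.

```python
# Hypothetical sketch of version-aware query routing: classify the query's
# intent, then dispatch it to a matching version-filtered retrieval path.

def classify_intent(query):
    q = query.lower()
    if "changed" in q or "difference" in q:
        return "change_tracking"
    tokens = [t.strip("?!.,") for t in q.split()]
    if any(t.startswith("v") and t[1:].replace(".", "").isdigit()
           for t in tokens):
        return "specific_version"
    return "latest"

def route(query, retrievers):
    intent = classify_intent(query)
    return intent, retrievers[intent](query)

retrievers = {                       # stand-ins for version-filtered indexes
    "change_tracking": lambda q: "search the change/diff subgraph",
    "specific_version": lambda q: "search only the pinned version's nodes",
    "latest": lambda q: "search the newest document state",
}
print(route("What changed between v1.2 and v2.0?", retrievers)[0])
```

Each route then retrieves against a different slice of the version graph, which is what makes the filtering temporally valid rather than merely semantically similar.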

Think With Videos For Agentic Long-Video Understanding

Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository(https://github.com/yhy-2000/VideoDeepResearch).

Updated: 2025-10-09 11:48:39

标题: 使用视频思维进行主动式长视频理解

摘要: 长视频理解(LVU)是计算机视觉中一个具有挑战性的问题。现有方法要么对帧进行降采样以进行单次推理,牺牲了细粒度的细节;要么依赖于在与任务无关的表示上进行文本推理,阻碍了任务特定的感知和探索。在本文中,我们提出了VideoExplorer,这是一个基于“用视频思考”原则的框架,自然地将规划、时间定位和可扩展感知交织为一个连贯的推理过程。VideoExplorer不是在静态上下文上推理,而是迭代地提出子问题、定位相关时刻,并执行面向任务、时间上可扩展的视频理解,直至得到最终答案,从而实现忠实、高效且可解释的推理。为了解决LVU训练资源的缺乏,我们使用难度自适应采样构建了一个长视频推理数据集,以确保在复杂任务上获得高质量的轨迹。基于这个数据集,我们设计了一个两阶段的训练流程:先进行有监督的轨迹初始化,再进行轨迹级别的偏好优化,鼓励在下游奖励引导下进行自适应的时间定位和迭代的信息整合。在流行的长视频理解与推理基准上的广泛评估表明,VideoExplorer相对于现有基线具有显著优势,突显了其稳健性、适应性和效率。我们的代码已在此存储库(https://github.com/yhy-2000/VideoDeepResearch)中公开。

更新时间: 2025-10-09 11:48:39

领域: cs.CV,cs.AI,cs.CL

下载: http://arxiv.org/abs/2506.10821v4

Development of Mental Models in Human-AI Collaboration: A Conceptual Framework

Artificial intelligence has become integral to organizational decision-making, and while research has explored many facets of this human-AI collaboration, the focus has mainly been on designing the AI agent(s) and the way the collaboration is set up, generally assuming the human decision-maker to be "fixed". However, it has largely been neglected that decision-makers' mental models evolve through their continuous interaction with AI systems. This paper addresses this gap by conceptualizing how the design of human-AI collaboration influences the development of three complementary and interdependent mental models necessary for this collaboration. We develop an integrated socio-technical framework that identifies the mechanisms driving the mental model evolution: data contextualization, reasoning transparency, and performance feedback. Our work advances human-AI collaboration literature through three key contributions: introducing three distinct mental models (domain, information processing, complementarity-awareness); recognizing the dynamic nature of mental models; and establishing mechanisms that guide the purposeful design of effective human-AI collaboration.

Updated: 2025-10-09 11:40:41

标题: 人工智能与人类协作中心智模型的发展:一个概念框架

摘要: 人工智能已经成为组织决策中不可或缺的一部分。尽管研究已经探讨了人机协作的许多方面,但重点主要集中在设计人工智能代理以及协作的组织方式上,并通常假定人类决策者是“固定的”。然而,决策者的心智模型会在与人工智能系统的持续互动中演变,这一点在很大程度上被忽视了。本文通过阐明人机协作的设计如何影响这种协作所必需的三种互补且相互依存的心智模型的发展,来填补这一空白。我们开发了一个整合的社会技术框架,识别出驱动心智模型演变的机制:数据情境化、推理透明度和绩效反馈。我们的工作通过三项关键贡献推进了人机协作文献:引入三种不同的心智模型(领域、信息处理、互补性意识);认识到心智模型的动态性质;并建立了指导有效人机协作的有目的设计的机制。

更新时间: 2025-10-09 11:40:41

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2510.08104v1

Lossless Vocabulary Reduction for Auto-Regressive Language Models

Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as its set of possible tokens, different models struggle to cooperate with each other at the level of next-token distributions, for example in model ensembles. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.

Updated: 2025-10-09 11:38:48

标题: 自回归语言模型的无损词汇减少

摘要: 标记化(Tokenization),即将给定文本分解为一系列称为标记(token)的子词的过程,是语言模型开发中的关键组成部分之一。特别地,自回归语言模型逐个标记地生成文本,即在给定之前标记的条件下预测下一个标记的分布,因此标记化直接影响它们在文本生成中的效率。由于每个语言模型都有自己的词汇表作为可能标记的集合,它们难以在下一个标记分布的层面上相互协作,例如进行模型集成。在本文中,我们建立了无损词汇缩减的理论框架,它能高效地将给定的自回归语言模型转换为词汇表任意小的模型,而不损失任何准确性。作为应用,我们演示了采用不同标记化的语言模型可以通过它们的最大公共词汇表高效地相互协作。

更新时间: 2025-10-09 11:38:48

领域: cs.CL,cs.AI,cs.LG,stat.ML

下载: http://arxiv.org/abs/2510.08102v1
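
The basic intuition behind vocabulary reduction can be shown by marginalizing a next-token distribution onto a smaller alphabet: each large-vocab token is spelled out in reduced symbols, and the probability of the next reduced symbol is the total mass of tokens beginning with it. This toy version is only the first step; the paper's lossless construction also conditions on partially consumed token prefixes, which is omitted here.

```python
from collections import defaultdict

def reduce_next_symbol(dist, spell):
    """Marginalize a next-token distribution onto a reduced alphabet.
    dist: token -> probability; spell: token -> its spelling as a string
    over the reduced alphabet."""
    out = defaultdict(float)
    for tok, p in dist.items():
        out[spell[tok][0]] += p      # first reduced symbol of each token
    return dict(out)

dist = {"the": 0.5, "then": 0.2, "cat": 0.3}
spell = {t: t for t in dist}         # reduced alphabet = single characters
out = reduce_next_symbol(dist, spell)
assert abs(out["t"] - 0.7) < 1e-9    # "the" and "then" pool onto 't'
```

Because two models with different tokenizers induce distributions over the same reduced alphabet, they can be compared or ensembled at that level.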

LLM-Assisted Web Measurements

Web measurements are a well-established methodology for assessing the security and privacy landscape of the Internet. However, existing top lists of popular websites commonly used as measurement targets are unlabeled and lack semantic information about the nature of the sites they include. This limitation makes targeted measurements challenging, as researchers often need to rely on ad-hoc techniques to bias their datasets toward specific categories of interest. In this paper, we investigate the use of Large Language Models (LLMs) as a means to enable targeted web measurement studies through their semantic understanding capabilities. Building on prior literature, we identify key website classification tasks relevant to web measurements and construct datasets to systematically evaluate the performance of different LLMs on these tasks. Our results demonstrate that LLMs may achieve strong performance across multiple classification scenarios. We then conduct LLM-assisted web measurement studies inspired by prior work and rigorously assess the validity of the resulting research inferences. Our results demonstrate that LLMs can serve as a practical tool for analyzing security and privacy trends on the Web.

Updated: 2025-10-09 11:38:38

标题: LLM辅助的网络测量

摘要: 网络测量是评估互联网安全和隐私格局的一种成熟方法。然而,现有的常用作测量目标的热门网站排行榜没有标注,缺乏关于所含网站性质的语义信息。这一限制使得有针对性的测量颇具挑战性,因为研究人员通常需要依赖临时性技术使其数据集偏向感兴趣的特定类别。本文研究了利用大型语言模型(LLMs)的语义理解能力来实现有针对性的网络测量研究。在先前文献的基础上,我们确定了与网络测量相关的关键网站分类任务,并构建数据集来系统评估不同LLMs在这些任务上的表现。我们的结果显示,LLMs可以在多个分类场景中取得强劲的性能。随后,我们进行了受先前工作启发的LLM辅助网络测量研究,并严格评估了所得研究推论的有效性。我们的结果表明,LLMs可以作为分析网络安全和隐私趋势的实用工具。

更新时间: 2025-10-09 11:38:38

领域: cs.CR

下载: http://arxiv.org/abs/2510.08101v1

The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models

Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We conduct the first comprehensive study systematically evaluating the effect of (LLM-)reasoning on the negotiation abilities of both commercial and open-weight LLMs, and do this across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning, that is, scaling test-time compute, significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5's performance by 31.4% while increasing its cost by nearly 400%. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while leading commercial models maintain language consistency between their reasoning and final output.

Updated: 2025-10-09 11:36:38

标题: 思维的代价:对大型语言模型中推理、表现和谈判成本的多语言分析

摘要: 谈判对于人工智能代理来说是一项基本挑战,因为它需要战略推理、对手建模以及在合作与竞争之间取得平衡的能力。我们进行了第一项全面研究,系统评估(LLM)推理对商业和开放权重LLMs谈判能力的影响,并跨三种语言展开。通过在三种不同的对话游戏中使用自我对弈设置,我们分析了性能与成本之间的权衡、推理过程的语言一致性,以及模型所展现的战略适应的性质。我们的研究结果表明,启用推理(即扩展测试时计算量)通过增强协作并帮助模型克服任务复杂性,显著改善了谈判结果,但计算成本高昂:推理将GPT-5的性能提高了31.4%,同时使其成本增加了近400%。最重要的是,我们发现了一个显著的多语言推理差异:开放权重模型即使在用德语或意大利语谈判时,也会一贯切换到英语进行内部推理步骤(因此可能影响通过披露推理轨迹获得的潜在可解释性收益),而领先的商业模型则在推理与最终输出之间保持语言一致。

更新时间: 2025-10-09 11:36:38

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.08098v1

Beyond Real Data: Synthetic Data through the Lens of Regularization

Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending synthetic target data with limited source data can mitigate domain shift and enhance generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain scenarios.

Updated: 2025-10-09 11:33:09

标题: 超越真实数据:通过正则化视角看合成数据

摘要: 合成数据可以在真实数据稀缺时改善泛化能力,但过度依赖可能引入分布不匹配,从而降低性能。在本文中,我们提出了一个学习理论框架来量化合成数据和真实数据之间的权衡。我们的方法利用算法稳定性推导泛化误差界,将使期望测试误差最小的最优合成-真实数据比例刻画为真实分布与合成分布之间Wasserstein距离的函数。我们以混合数据下的核岭回归为背景引出该框架,并给出可能具有独立价值的详细分析。我们的理论预测了最优比例的存在,使测试误差随合成数据比例呈U形变化。在实证方面,我们在CIFAR-10和一个临床脑MRI数据集上验证了这一预测。我们的理论还延伸到域自适应这一重要场景,表明将合成目标数据与有限的源数据审慎混合可以减轻域偏移并增强泛化能力。最后,我们为在域内和域外场景中应用我们的结果提供了实用指导。

更新时间: 2025-10-09 11:33:09

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2510.08095v1

A Semantic Model for Audit of Cloud Engines based on ISO/IEC TR 3445:2022

Cloud computing has become the foundation of modern digital infrastructure, yet the absence of a unified architectural and compliance framework impedes interoperability, auditability, and robust security. This paper introduces a formal, machine-readable semantic model for Cloud Engines, integrating the architectural taxonomy of ISO/IEC 22123 (Cloud Reference Architecture) with the security and compliance controls of ISO/IEC 27001:2022 and ISO/IEC TR 3445:2022. The model decomposes cloud systems into four canonical interfaces--Control, Business, Audit, and Data--and extends them with a security ontology that maps mechanisms such as authentication, authorization, and encryption to specific compliance controls. Expressed in RDF/Turtle, the model enables semantic reasoning, automated compliance validation, and vendor-neutral architecture design. We demonstrate its practical utility through OpenStack and AWS case studies, and provide reproducible validation workflows using SPARQL and SHACL. This work advances the state of cloud security modeling by bridging architectural and compliance standards in a unified framework, with a particular emphasis on auditability.

Updated: 2025-10-09 11:32:35

标题: 基于ISO/IEC TR 3445:2022的云引擎审计的语义模型

摘要: 云计算已成为现代数字基础设施的基础,然而缺乏统一的架构和合规性框架阻碍了云计算系统的互操作性、可审计性和强大的安全性。本文介绍了一种正式的、可机器读取的云引擎语义模型,将ISO/IEC 22123(云参考架构)的架构分类与ISO/IEC 27001:2022和ISO/IEC TR 3445:2022的安全和合规性控制相集成。该模型将云系统分解为四个标准接口——控制、业务、审计和数据,并通过一个安全本体论将认证、授权和加密等机制映射到特定的合规性控制。该模型采用RDF/Turtle表示,实现了语义推理、自动合规性验证和供应商中立的架构设计。我们通过OpenStack和AWS案例研究展示了其实际效用,并提供了使用SPARQL和SHACL的可重现验证工作流程。这项工作通过在统一框架中连接架构和合规性标准,特别强调审计性,推动了云安全建模的发展。

更新时间: 2025-10-09 11:32:35

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2510.09690v1

Computations and ML for surjective rational maps

The present note studies \emph{surjective rational endomorphisms} $f: \mathbb{P}^2 \dashrightarrow \mathbb{P}^2$ with \emph{cubic} terms and the indeterminacy locus $I_f \ne \emptyset$. We develop an experimental approach, based on some Python programming and Machine Learning, towards the classification of such maps; a couple of new explicit maps $f$ are constructed in this way. We also prove (via pure projective geometry) that a general non-regular cubic endomorphism $f$ of $\mathbb{P}^2$ is surjective if and only if the set $I_f$ has cardinality at least $3$.

Updated: 2025-10-09 11:27:10

标题: 计算和机器学习用于满射有理映射

摘要: 本短文研究具有三次项且不定点集$I_f \ne \emptyset$的满射有理自同态$f: \mathbb{P}^2 \dashrightarrow \mathbb{P}^2$。我们基于Python编程和机器学习开发了一种实验性方法来对此类映射进行分类,并以这种方式构造出若干新的显式映射$f$。我们还(通过纯射影几何)证明:$\mathbb{P}^2$的一般非正则三次自同态$f$是满射的,当且仅当集合$I_f$的基数至少为$3$。

更新时间: 2025-10-09 11:27:10

领域: math.AG,cs.LG

下载: http://arxiv.org/abs/2510.08093v1

Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility

We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are "experts" (i.e., common sense), LLMs have the potential to exert considerable influence on people's beliefs.

Updated: 2025-10-09 11:22:29

标题: 一切皆可能:探究LLM理由对人类对可信度概念的影响

摘要: 我们调查了人类对多项选择常识基准答案的合理性判断在多大程度上会受到支持或反对某一答案的(不)合理性论证的影响,特别是使用由LLM生成的理由。我们收集了来自人类的3,000个合理性判断和来自LLM的另外13,600个判断。总体而言,我们观察到在出现LLM生成的支持(PRO)和反对(CON)理由时,人类合理性评分的平均值分别上升和下降,这表明总体上人类评判者认为这些理由具有说服力。对LLM的实验显示出类似的影响模式。我们的研究结果展示了LLM在研究人类认知方面的一种新用途,同时也引发了实际担忧:即使在人类堪称“专家”的领域(即常识),LLM也有可能对人们的信念产生相当大的影响。

更新时间: 2025-10-09 11:22:29

领域: cs.CL,cs.AI,cs.HC

下载: http://arxiv.org/abs/2510.08091v1

Tug-of-war between idioms' figurative and literal interpretations in LLMs

Idioms present a unique challenge for language models due to their non-compositional figurative interpretations, which often strongly diverge from the idiom's literal interpretation. In this paper, we employ causal tracing to systematically analyze how pretrained causal transformers deal with this ambiguity. We localize three mechanisms: (i) Early sublayers and specific attention heads retrieve an idiom's figurative interpretation, while suppressing its literal interpretation. (ii) When disambiguating context precedes the idiom, the model leverages it from the earliest layer and later layers refine the interpretation if the context conflicts with the retrieved interpretation. (iii) Then, selective, competing pathways carry both interpretations: an intermediate pathway prioritizes the figurative interpretation and a parallel direct route favors the literal interpretation, ensuring that both readings remain available. Our findings provide mechanistic evidence for idiom comprehension in autoregressive transformers.

Updated: 2025-10-09 11:21:10

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.01723v4

From Ethical Declarations to Provable Independence: An Ontology-Driven Optimal-Transport Framework for Certifiably Fair AI Systems

This paper presents a framework for provably fair AI that overcomes the limits of current bias mitigation methods by systematically removing all sensitive information and its proxies. Using ontology engineering in OWL 2 QL, it formally defines sensitive attributes and infers their proxies through logical reasoning, constructing a sigma-algebra G that captures the full structure of biased patterns. Fair representations are then obtained via Delbaen-Majumdar optimal transport, which generates variables independent of G while minimizing L2 distance to preserve accuracy. This guarantees true independence rather than mere decorrelation. By modeling bias as dependence between sigma-algebras, compiling ontological knowledge into measurable structures, and using optimal transport as the unique fair transformation, the approach ensures complete fairness in tasks like loan approval, where proxies such as ZIP code reveal race. The result is a certifiable and mathematically grounded method for trustworthy AI.

Updated: 2025-10-09 11:18:41

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08086v1

Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning

Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks.
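
The "maximum return relabeling" idea above can be sketched in a few lines. This is a loose, illustrative reading, not SISL's exact procedure: the (return, data) tuple layout and the hard top-k cutoff are assumptions. The point is simply that trajectories are ordered by their relabeled return so that noisy, low-return demonstrations contribute less to skill updates.

```python
# Illustrative prioritization of offline trajectories by return.
# The (return, data) layout and top-k cutoff are assumptions.

def prioritize_by_return(trajectories, top_k):
    """trajectories: list of (return_value, trajectory_data) pairs."""
    ranked = sorted(trajectories, key=lambda t: t[0], reverse=True)
    return [data for _, data in ranked[:top_k]]

demos = [(1.2, "noisy demo"), (8.5, "expert demo"), (4.0, "ok demo")]
print(prioritize_by_return(demos, top_k=2))  # → ['expert demo', 'ok demo']
```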

Updated: 2025-10-09 11:16:53

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2502.03752v3

A Novel Ensemble Learning Approach for Enhanced IoT Attack Detection: Redefining Security Paradigms in Connected Systems

The rapid expansion of Internet of Things (IoT) devices has transformed industries and daily life by enabling widespread connectivity and data exchange. However, this increased interconnection has introduced serious security vulnerabilities, making IoT systems more exposed to sophisticated cyber attacks. This study presents a novel ensemble learning architecture designed to improve IoT attack detection. The proposed approach applies advanced machine learning techniques, specifically the Extra Trees Classifier, along with thorough preprocessing and hyperparameter optimization. It is evaluated on several benchmark datasets including CICIoT2023, IoTID20, BotNeTIoT-L01, ToN-IoT, N-BaIoT, and BoT-IoT. The results show excellent performance, achieving high recall, accuracy, and precision with very low error rates. These outcomes demonstrate the model's efficiency and superiority compared to existing approaches, providing an effective and scalable method for securing IoT environments. This research establishes a solid foundation for future progress in protecting connected devices from evolving cyber threats.

Updated: 2025-10-09 11:15:15

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.08084v1

AutoQual: An LLM Agent for Automated Discovery of Interpretable Features for Review Quality Assessment

Ranking online reviews by their intrinsic quality is a critical task for e-commerce platforms and information services, impacting user experience and business outcomes. However, quality is a domain-dependent and dynamic concept, making its assessment a formidable challenge. Traditional methods relying on hand-crafted features are unscalable across domains and fail to adapt to evolving content patterns, while modern deep learning approaches often produce black-box models that lack interpretability and may prioritize semantics over quality. To address these challenges, we propose AutoQual, an LLM-based agent framework that automates the discovery of interpretable features. While demonstrated on review quality assessment, AutoQual is designed as a general framework for transforming tacit knowledge embedded in data into explicit, computable features. It mimics a human research process, iteratively generating feature hypotheses through reflection, operationalizing them via autonomous tool implementation, and accumulating experience in a persistent memory. We deploy our method on a large-scale online platform with a billion-level user base. Large-scale A/B testing confirms its effectiveness, increasing average reviews viewed per user by 0.79% and the conversion rate of review readers by 0.27%.

Updated: 2025-10-09 11:11:02

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08081v1

A Unified Approach to Quantum Key Leasing with a Classical Lessor

Secure key leasing allows a cryptographic key to be leased as a quantum state in such a way that the key can later be revoked in a verifiable manner. In this work, we propose a modular framework for constructing secure key leasing with a classical lessor, where the lessor is entirely classical and, in particular, the quantum secret key can be both leased and revoked using only classical communication. Based on this framework, we obtain classical-lessor secure key leasing schemes for public-key encryption (PKE), pseudorandom functions (PRF), and digital signatures. We adapt the strong security notion known as security against verification key revealing attacks (VRA security), proposed by Kitagawa et al. (Eurocrypt 2025), to the classical-lessor setting, and we prove that all three of our schemes satisfy this notion under the learning with errors assumption. Our PKE scheme improves upon the previous construction by Goyal et al. (Eurocrypt 2025), and our PRF and digital signature schemes are respectively the first PRF and the first digital signature with the classical-lessor secure key leasing property.

Updated: 2025-10-09 11:09:34

Categories: quant-ph,cs.CR

Download: http://arxiv.org/abs/2510.08079v1

Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
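
The two metrics named above are simple ratios once hallucinated segments have been detected. A minimal sketch, assuming each video is given as a (duration, list of hallucinated (start, end) segments) pair; this input layout is an assumption for illustration:

```python
# IH@vid: fraction of videos containing any hallucinated audio event.
# IH@dur: fraction of total duration that is hallucinated.

def ih_metrics(videos):
    n_vid = len(videos)
    total_dur = sum(dur for dur, _ in videos)
    hall_vid = sum(1 for _, segs in videos if segs)
    hall_dur = sum(end - start for _, segs in videos for start, end in segs)
    return hall_vid / n_vid, hall_dur / total_dur

videos = [
    (10.0, [(2.0, 4.0)]),  # 2 s of unsourced speech/music detected
    (10.0, []),            # clean video
]
print(ih_metrics(videos))  # → (0.5, 0.1)
```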

Updated: 2025-10-09 11:08:07

Categories: cs.SD,cs.LG

Download: http://arxiv.org/abs/2510.08078v1

Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks

Peer review is essential for maintaining academic quality, but the increasing volume of submissions places a significant burden on reviewers. Large language models (LLMs) offer potential assistance in this process, yet their susceptibility to textual adversarial attacks raises reliability concerns. This paper investigates the robustness of LLMs used as automated reviewers in the presence of such attacks. We focus on three key questions: (1) The effectiveness of LLMs in generating reviews compared to human reviewers. (2) The impact of adversarial attacks on the reliability of LLM-generated reviews. (3) Challenges and potential mitigation strategies for LLM-based review. Our evaluation reveals significant vulnerabilities, as text manipulations can distort LLM assessments. We offer a comprehensive evaluation of LLM performance in automated peer reviewing and analyze its robustness against adversarial attacks. Our findings emphasize the importance of addressing adversarial risks to ensure AI strengthens, rather than compromises, the integrity of scholarly communication.

Updated: 2025-10-09 11:07:40

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.11113v3

Depression Detection on Social Media with Large Language Models

Limited access to mental healthcare resources hinders timely depression diagnosis, leading to detrimental outcomes. Social media platforms present a valuable data source for early detection, yet this task faces two significant challenges: 1) the need for medical knowledge to distinguish clinical depression from transient mood changes, and 2) the dual requirement for high accuracy and model explainability. To address this, we propose DORIS, a framework that leverages Large Language Models (LLMs). To integrate medical knowledge, DORIS utilizes LLMs to annotate user texts against established medical diagnostic criteria and to summarize historical posts into temporal mood courses. These medically-informed features are then used to train an accurate Gradient Boosting Tree (GBT) classifier. Explainability is achieved by generating justifications for predictions based on the LLM-derived symptom annotations and mood course analyses. Extensive experimental results validate the effectiveness as well as interpretability of our method, highlighting its potential as a supportive clinical tool.

Updated: 2025-10-09 11:02:37

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2403.10750v2

Multi-Condition Conformal Selection

Selecting high-quality candidates from large-scale datasets is critically important in resource-constrained applications such as drug discovery, precision medicine, and the alignment of large language models. While conformal selection methods offer a rigorous solution with False Discovery Rate (FDR) control, their applicability is confined to single-threshold scenarios (i.e., y > c), overlooking practical needs for multi-condition selection, such as conjunctive or disjunctive conditions. In this work, we propose the Multi-Condition Conformal Selection (MCCS) algorithm, which extends conformal selection to scenarios with multiple conditions. In particular, we introduce a novel nonconformity score with regional monotonicity for conjunctive conditions and a global Benjamini-Hochberg (BH) procedure for disjunctive conditions, thereby establishing finite-sample FDR control with theoretical guarantees. The integration of these components enables the proposed method to achieve rigorous FDR-controlled selection in various multi-condition environments. Extensive experiments validate the superiority of MCCS over baselines, its generalizability across diverse condition combinations, different real-world modalities, and multi-task scalability.
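
For the disjunctive branch, the abstract points to a global Benjamini-Hochberg (BH) procedure. A minimal stand-alone sketch of plain BH over p-values follows; the paper's conformal p-values and multi-condition bookkeeping are not reproduced here.

```python
# Benjamini-Hochberg: find the largest k such that the k-th smallest
# p-value is at most alpha * k / m, then accept the k smallest p-values.

def benjamini_hochberg(p_values, alpha=0.1):
    """Return sorted indices of hypotheses selected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(p, alpha=0.05))  # → [0, 1]
```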

Updated: 2025-10-09 11:02:10

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08075v1

Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness

Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into structured format, for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b), is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type: high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.
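
As one concrete instance of the "lexical similarity" metric family mentioned above, two crawls can be compared by the Jaccard overlap of the entity names they extract. Whether the paper's evaluation uses exactly this form is an assumption; the entity names below are made up for illustration.

```python
# Jaccard overlap between the sets of entities surfaced by two runs.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

run1 = {"Alan Turing", "Enigma", "Bletchley Park"}
run2 = {"Alan Turing", "Enigma", "ACE computer"}
print(round(jaccard(run1, run2), 2))  # → 0.5
```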

Updated: 2025-10-09 11:00:50

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.06780v2

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose a physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradient approximation and motion-aware temporal modeling, without complex motion decomposition, while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Finally, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.
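
The detection statistic is an MMD between NSG features of a test video and of real videos. The NSG estimator itself needs a pretrained diffusion model, but the MMD step can be sketched independently; the RBF kernel and bandwidth below are assumptions, and scalar features stand in for real NSG features.

```python
import math

# Biased V-statistic estimate of MMD^2 with an RBF kernel.

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=1.0):
    m, n = len(xs), len(ys)
    k_xx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (m * m)
    k_yy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (n * n)
    k_xy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy
```

Identical feature sets give an MMD^2 near zero, while a shifted set (as the paper argues generated videos exhibit) yields a large discrepancy.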

Updated: 2025-10-09 11:00:35

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.08073v1

An Adaptive Multi Agent Bitcoin Trading System

This paper presents a Multi Agent Bitcoin Trading system that utilizes Large Language Models (LLMs) for alpha generation and portfolio management in the cryptocurrencies market. Unlike equities, cryptocurrencies exhibit extreme volatility and are heavily influenced by rapidly shifting market sentiments and regulatory announcements, making them difficult to model using static regression models or neural networks trained solely on historical data [53]. The proposed framework overcomes this by structuring LLMs into specialised agents for technical analysis, sentiment evaluation, decision-making, and performance reflection. The system improves over time through a novel verbal feedback mechanism where a Reflect agent provides daily and weekly natural-language critiques of trading decisions. These textual evaluations are then injected into future prompts, allowing the system to adjust indicator priorities, sentiment weights, and allocation logic without parameter updates or finetuning. Back-testing on Bitcoin price data from July 2024 to April 2025 shows consistent outperformance across market regimes: the Quantitative agent delivered over 30% higher returns in bullish phases and 15% overall gains versus buy-and-hold, while the sentiment-driven agent turned sideways markets from a small loss into a gain of over 100%. Adding weekly feedback further improved total performance by 31% and reduced bearish losses by 10%. The results demonstrate that verbal feedback represents a new, scalable, and low-cost method of tuning LLMs for financial goals.
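
The verbal feedback mechanism described above boils down to prompt assembly: store the Reflect agent's critiques and prepend the most recent ones to the next decision prompt, so behaviour shifts with no parameter updates. A minimal sketch, where the prompt wording and the last_n window are assumptions:

```python
# Memory of natural-language critiques injected into future prompts.

class ReflectMemory:
    def __init__(self):
        self.critiques = []

    def add(self, critique):
        self.critiques.append(critique)

    def build_prompt(self, market_summary, last_n=3):
        feedback = "\n".join(self.critiques[-last_n:]) or "(none yet)"
        return (
            "Past critiques of your trading decisions:\n"
            f"{feedback}\n\n"
            f"Today's market summary:\n{market_summary}\n"
            "Propose today's allocation."
        )

mem = ReflectMemory()
mem.add("Overweighted RSI signals in sideways markets.")
prompt = mem.build_prompt("BTC flat, volume declining")
```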

Updated: 2025-10-09 10:55:52

Categories: q-fin.PM,cs.AI,q-fin.CP

Download: http://arxiv.org/abs/2510.08068v1

Attribution-by-design: Ensuring Inference-Time Provenance in Generative Music Systems

The rise of AI-generated music is diluting royalty pools and revealing structural flaws in existing remuneration frameworks, challenging the well-established artist compensation systems in the music industry. Existing compensation solutions, such as piecemeal licensing agreements, lack scalability and technical rigour, while current data attribution mechanisms provide only uncertain estimates and are rarely implemented in practice. This paper introduces a framework for a generative music infrastructure centred on direct attribution, transparent royalty distribution, and granular control for artists and rights holders. We distinguish ontologically between the training set and the inference set, which allows us to propose two complementary forms of attribution: training-time attribution and inference-time attribution. We here favour inference-time attribution, as it enables direct, verifiable compensation whenever an artist's catalogue is used to condition a generated output. In addition, users benefit from the ability to condition generations on specific songs and receive transparent information about attribution and permitted usage. Our approach offers an ethical and practical solution to the pressing need for robust compensation mechanisms in the era of AI-generated music, ensuring that provenance and fairness are embedded at the core of generative systems.

Updated: 2025-10-09 10:49:44

Categories: cs.SD,cs.AI,cs.HC

Download: http://arxiv.org/abs/2510.08062v1

RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation

Standard clothing asset generation involves restoring forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized structure sampling distributions and the absence of clothing semantics in complex scenarios. Existing models have limited spatial perception, often exhibiting structural hallucinations and texture distortion in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating knowledge from language models and external databases. RAGDiffusion consists of two processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a coarse-to-fine texture alignment that ensures fidelity in pattern and detail components within the diffusion process. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and texture-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.

Updated: 2025-10-09 10:47:03

Categories: cs.CV,cs.AI,cs.GR,cs.LG

Download: http://arxiv.org/abs/2411.19528v2

Mitigating Subject Dependency in EEG Decoding with Subject-Specific Low-Rank Adapters

Subject-specific distribution shifts represent an important obstacle to the development of foundation models for EEG decoding. To address this, we propose the Subject-Conditioned Layer, an adaptive layer designed as a drop-in replacement for standard linear or convolutional layers in any neural network architecture. Our layer captures subject-specific variability by decomposing its weights into a shared, subject-invariant component and a lightweight, low-rank correction unique to each subject. This explicit separation of general knowledge from personalized adaptation allows existing models to become robust to subject shifts. Empirically, models equipped with our layer outperform both a shared-weight-only (subject-agnostic) model and the average of individually trained subject-specific models. Consequently, the Subject-Conditioned Layer offers a practical and scalable path towards building effective cross-subject foundation models for EEG.
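
The weight decomposition described above can be written in a few lines without any framework. A dependency-free sketch, assuming a rank-r per-subject correction of the form W_s = W_shared + A_s @ B_s (parameter shapes, the rank, and subject ids are illustrative):

```python
# Shared weight matrix plus a per-subject low-rank correction.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

class SubjectConditionedLinear:
    def __init__(self, shared_W, low_rank_factors):
        self.W = shared_W               # shared, subject-invariant component
        self.factors = low_rank_factors  # subject -> (A, B), low rank r

    def weight_for(self, subject):
        A, B = self.factors[subject]
        return add(self.W, matmul(A, B))  # W + A @ B

    def forward(self, subject, x):
        W = self.weight_for(subject)
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]

layer = SubjectConditionedLinear(
    shared_W=[[1.0, 0.0], [0.0, 1.0]],
    low_rank_factors={"s1": ([[1.0], [0.0]], [[0.0, 1.0]])},  # rank 1
)
print(layer.forward("s1", [2.0, 3.0]))  # → [5.0, 3.0]
```

In training, W would be learned across all subjects while only the small (A, B) factors are subject-specific, which is what keeps the adapter lightweight.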

Updated: 2025-10-09 10:46:08

Categories: cs.LG

Download: http://arxiv.org/abs/2510.08059v1

FireGNN: Neuro-Symbolic Graph Neural Networks with Trainable Fuzzy Rules for Interpretable Medical Image Classification

Medical image classification requires not only high predictive performance but also interpretability to ensure clinical trust and adoption. Graph Neural Networks (GNNs) offer a powerful framework for modeling relational structures within datasets; however, standard GNNs often operate as black boxes, limiting transparency and usability, particularly in clinical settings. In this work, we present an interpretable graph-based learning framework named FireGNN that integrates trainable fuzzy rules into GNNs for medical image classification. These rules embed topological descriptors - node degree, clustering coefficient, and label agreement - using learnable thresholds and sharpness parameters to enable intrinsic symbolic reasoning. Additionally, we explore auxiliary self-supervised tasks (e.g., homophily prediction, similarity entropy) as a benchmark to evaluate the contribution of topological learning. Our fuzzy-rule-enhanced model achieves strong performance across five MedMNIST benchmarks and the synthetic dataset MorphoMNIST, while also generating interpretable rule-based explanations. To our knowledge, this is the first integration of trainable fuzzy rules within a GNN. Source Code: https://github.com/basiralab/FireGNN
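
A trainable fuzzy rule over a topological descriptor can be realized as a sigmoid with a learnable threshold and sharpness, e.g. "node degree is high". This is a hedged sketch of that idea only; the exact membership functions and conjunction operator FireGNN uses are not specified here, and the product t-norm below is one common choice.

```python
import math

# Soft truth value of "descriptor x exceeds threshold", with sharpness
# controlling how crisp the rule is. Both parameters would be trainable.

def fuzzy_membership(x, threshold, sharpness):
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))

def rule_and(*memberships):
    """Conjunction of rule memberships via the product t-norm."""
    out = 1.0
    for m in memberships:
        out *= m
    return out

deg_high = fuzzy_membership(6.0, threshold=3.0, sharpness=2.0)      # ~1
clust_high = fuzzy_membership(0.1, threshold=0.5, sharpness=10.0)   # ~0
combined = rule_and(deg_high, clust_high)                           # ~0
```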

Updated: 2025-10-09 10:43:33

Categories: eess.IV,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2509.10510v2

FedDTRE: Federated Dialogue Generation Models Powered by Trustworthiness Evaluation

With the rapid development of artificial intelligence, dialogue systems have become a prominent form of human-computer interaction. However, traditional centralized or fully local training approaches face challenges in balancing privacy preservation and personalization due to data privacy concerns and heterogeneous device capabilities. Federated learning, as a representative distributed paradigm, offers a promising solution. However, existing methods often suffer from overfitting under limited client data and tend to forget global information after multiple training rounds, leading to poor generalization. To address these issues, we propose FedDTRE, a Federated adaptive aggregation strategy for Dialogue generation based on Trustworthiness Evaluation. Instead of directly replacing local models with the global model, FedDTRE leverages trustworthiness scores of both global and local models on a fairness-oriented evaluation dataset to dynamically regulate the global model's contribution during local updates. Experimental results demonstrate that FedDTRE can improve dialogue model performance and enhance the quality of dialogue generation.
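
One plausible reading of the trust-weighted update above: instead of overwriting the local model with the global one, interpolate toward the global model in proportion to its trustworthiness score on the evaluation set. FedDTRE's exact weighting rule is not given in the abstract, so the convex blend below is an assumption.

```python
# Blend local and global parameters; the global model's share of the
# update grows with its trustworthiness score.

def blend(local_params, global_params, trust_local, trust_global):
    w = trust_global / (trust_global + trust_local)  # global's share
    return [(1 - w) * l + w * g for l, g in zip(local_params, global_params)]

print(blend([0.0, 2.0], [1.0, 0.0], trust_local=1.0, trust_global=3.0))
# → [0.75, 0.5]
```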

Updated: 2025-10-09 10:43:14

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.08058v1

From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill

Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits long prompt processing along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to 39% and inflate energy consumption. We propose layered prefill, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to 70%, End-to-End latency by 41% and per-token energy by up to 22%. Evaluations show that layered prefill consistently improves the TTFT--TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware LLM serving in co-located environments.
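The scheduling idea above can be sketched as follows: partition the layers into contiguous groups and alternate one group's prefill with a decode iteration, so each expert-weight load serves the whole prompt before moving on. This toy scheduler (names and step format invented for illustration) shows only the interleaving order, not memory management.

```python
def layered_prefill_schedule(num_layers, num_groups, decode_reqs):
    # Vertical partition: contiguous layer groups are the scheduling unit,
    # replacing token-dimension chunking.
    size = num_layers // num_groups
    groups = [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]
    schedule = []
    for layers in groups:
        # Run the full prompt through one layer group (no token chunks),
        # then immediately serve a decode iteration so decoding never stalls.
        schedule.append(("prefill", layers))
        schedule.append(("decode", list(decode_reqs)))
    return schedule
```

Because each MoE layer group is prefetched once per prompt rather than once per token chunk, the redundant expert reloads that chunked prefill incurs disappear.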

Updated: 2025-10-09 10:41:35

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2510.08055v1

Maintaining Performance with Less Data

We propose a novel method for training a neural network for image classification to reduce input data dynamically, in order to reduce the costs of training a neural network model. As Deep Learning tasks become more popular, their computational complexity increases, leading to more intricate algorithms and models which have longer runtimes and require more input data. The result is a greater cost on time, hardware, and environmental resources. By using data reduction techniques, we reduce the amount of work performed, and therefore the environmental impact of AI techniques, and with dynamic data reduction we show that accuracy may be maintained while reducing runtime by up to 50%, and reducing carbon emission proportionally.
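The abstract does not specify the reduction criterion; a common dynamic rule, shown here purely as an assumption, is to keep only the hardest (highest-loss) samples for the next epoch.

```python
def reduce_dataset(losses, keep_fraction):
    # Rank sample indices by current training loss, hardest first.
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    # Keep the top fraction (at least one sample) for the next epoch.
    k = max(1, int(len(losses) * keep_fraction))
    return sorted(ranked[:k])
```

Re-running this selection each epoch makes the reduction dynamic: easy samples drop out as the model masters them, shrinking per-epoch work.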

Updated: 2025-10-09 10:39:46

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2208.02007v2

A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models(PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.

Updated: 2025-10-09 10:35:31

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08049v1

TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance

Query-product relevance prediction is fundamental to e-commerce search and has become even more critical in the era of AI-powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning-based approaches, typically aligned via supervised fine-tuning (SFT) or preference optimization methods like Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long-tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, offering insufficient guidance for multi-step reasoning and slowing convergence. To address these challenges, we propose TaoSR-AGRL, an Adaptive Guided Reinforcement Learning framework for LLM-based relevance prediction in Taobao Search Relevance. TaoSR-AGRL introduces two key innovations: (1) Rule-aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns toward compliant trajectories. TaoSR-AGRL was evaluated on large-scale real-world datasets and through online side-by-side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR-AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.

Updated: 2025-10-09 10:34:39

Categories: cs.IR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.08048v1

LinguaSim: Interactive Multi-Vehicle Testing Scenario Generation via Natural Language Instruction Based on Large Language Models

The generation of testing and training scenarios for autonomous vehicles has drawn significant attention. While Large Language Models (LLMs) have enabled new scenario generation methods, current methods struggle to balance command adherence accuracy with the realism of real-world driving environments. To reduce scenario description complexity, these methods often compromise realism by limiting scenarios to 2D, or open-loop simulations where background vehicles follow predefined, non-interactive behaviors. We propose LinguaSim, an LLM-based framework that converts natural language into realistic, interactive 3D scenarios, ensuring both dynamic vehicle interactions and faithful alignment between the input descriptions and the generated scenarios. A feedback calibration module further refines the generation precision, improving fidelity to user intent. By bridging the gap between natural language and closed-loop, interactive simulations, LinguaSim constrains adversarial vehicle behaviors using both the scenario description and the autonomous driving model guiding them. This framework facilitates the creation of high-fidelity scenarios that enhance safety testing and training. Experiments show LinguaSim can generate scenarios with varying criticality aligned with different natural language descriptions (ACT: 0.072 s for dangerous vs. 3.532 s for safe descriptions; comfortability: 0.654 vs. 0.764), and its refinement module effectively reduces excessive aggressiveness in LinguaSim's initial outputs, lowering the crash rate from 46.9% to 6.3% to better match user intentions.

Updated: 2025-10-09 10:30:02

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08046v1

Verifying Graph Neural Networks with Readout is Intractable

We introduce a logical language for reasoning about quantized aggregate-combine graph neural networks with global readout (ACR-GNNs). We provide a logical characterization and use it to prove that verification tasks for quantized GNNs with readout are (co)NEXPTIME-complete. This result implies that the verification of quantized GNNs is computationally intractable, prompting substantial research efforts toward ensuring the safety of GNN-based systems. We also experimentally demonstrate that quantized ACR-GNN models are lightweight while maintaining good accuracy and generalization capabilities with respect to non-quantized models.

Updated: 2025-10-09 10:29:09

Categories: cs.LO,cs.AI,cs.CC,cs.LG

Download: http://arxiv.org/abs/2510.08045v1

Towards Reliable LLM-based Robot Planning via Combined Uncertainty Estimation

Large language models (LLMs) demonstrate advanced reasoning abilities, enabling robots to understand natural language instructions and generate high-level plans with appropriate grounding. However, LLM hallucinations present a significant challenge, often leading to overconfident yet potentially misaligned or unsafe plans. While researchers have explored uncertainty estimation to improve the reliability of LLM-based planning, existing studies have not sufficiently differentiated between epistemic and intrinsic uncertainty, limiting the effectiveness of uncertainty estimation. In this paper, we present Combined Uncertainty estimation for Reliable Embodied planning (CURE), which decomposes the uncertainty into epistemic and intrinsic uncertainty, each estimated separately. Furthermore, epistemic uncertainty is subdivided into task clarity and task familiarity for more accurate evaluation. The overall uncertainty assessments are obtained using random network distillation and multi-layer perceptron regression heads driven by LLM features. We validated our approach in two distinct experimental settings: kitchen manipulation and tabletop rearrangement experiments. The results show that, compared to existing methods, our approach yields uncertainty estimates that are more closely aligned with the actual execution outcomes.

Updated: 2025-10-09 10:26:58

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.08044v1

Climate Knowledge in Large Language Models

Large language models (LLMs) are increasingly deployed for climate-related applications, where understanding internal climatological knowledge is crucial for reliability and misinformation risk assessment. Despite growing adoption, the capacity of LLMs to recall climate normals from parametric knowledge remains largely uncharacterized. We investigate the capacity of contemporary LLMs to recall climate normals without external retrieval, focusing on a prototypical query: mean July 2-m air temperature 1991-2020 at specified locations. We construct a global grid of queries at 1° resolution over land points, providing coordinates and location descriptors, and validate responses against ERA5 reanalysis. Results show that LLMs encode non-trivial climate structure, capturing latitudinal and topographic patterns, with root-mean-square errors of 3-6 °C and biases of ±1 °C. However, spatially coherent errors remain, particularly in mountains and high latitudes. Performance degrades sharply above 1500 m, where RMSE reaches 5-13 °C compared to 2-4 °C at lower elevations. We find that including geographic context (country, city, region) reduces errors by 27% on average, with larger models being most sensitive to location descriptors. While models capture the global mean magnitude of observed warming between 1950-1974 and 2000-2024, they fail to reproduce spatial patterns of temperature change, which directly relate to assessing climate change. This limitation highlights that while LLMs may capture present-day climate distributions, they struggle to represent the regional and local expression of long-term shifts in temperature essential for understanding climate dynamics. Our evaluation framework provides a reproducible benchmark for quantifying parametric climate knowledge in LLMs and complements existing climate communication assessments.
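Scoring recalled normals against a reference reduces to the two summary statistics the abstract reports. A minimal sketch (in the paper the reference values come from ERA5; here they are plain numbers):

```python
import math

def evaluate_grid(predictions, reference):
    # Per-gridpoint errors between model-recalled and reference temperatures.
    errs = [p - r for p, r in zip(predictions, reference)]
    # Root-mean-square error and mean bias over the grid.
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    bias = sum(errs) / len(errs)
    return rmse, bias
```

RMSE captures the typical magnitude of error, while bias reveals systematic over- or under-estimation of the climate normal.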

Updated: 2025-10-09 10:25:36

Categories: cs.CL,cs.LG,physics.ao-ph

Download: http://arxiv.org/abs/2510.08043v1

MRI-derived quantification of hepatic vessel-to-volume ratios in chronic liver disease using a deep learning approach

Background: We aimed to quantify hepatic vessel volumes across chronic liver disease stages and healthy controls using deep learning-based magnetic resonance imaging (MRI) analysis, and assess correlations with biomarkers for liver (dys)function and fibrosis/portal hypertension. Methods: We assessed retrospectively healthy controls, non-advanced and advanced chronic liver disease (ACLD) patients using a 3D U-Net model for hepatic vessel segmentation on portal venous phase gadoxetic acid-enhanced 3-T MRI. Total (TVVR), hepatic (HVVR), and intrahepatic portal vein-to-volume ratios (PVVR) were compared between groups and correlated with: albumin-bilirubin (ALBI) and model for end-stage liver disease-sodium (MELD-Na) score, and fibrosis/portal hypertension (Fibrosis-4 [FIB-4] score, liver stiffness measurement [LSM], hepatic venous pressure gradient [HVPG], platelet count [PLT], and spleen volume). Results: We included 197 subjects, aged 54.9 $\pm$ 13.8 years (mean $\pm$ standard deviation), 111 males (56.3\%): 35 healthy controls, 44 non-ACLD, and 118 ACLD patients. TVVR and HVVR were highest in controls (3.9; 2.1), intermediate in non-ACLD (2.8; 1.7), and lowest in ACLD patients (2.3; 1.0) ($p \leq 0.001$). PVVR was reduced in both non-ACLD and ACLD patients (both 1.2) compared to controls (1.7) ($p \leq 0.001$), but showed no difference between CLD groups ($p = 0.999$). HVVR significantly correlated indirectly with FIB-4, ALBI, MELD-Na, LSM, and spleen volume ($\rho$ ranging from -0.27 to -0.40), and directly with PLT ($\rho = 0.36$). TVVR and PVVR showed similar but weaker correlations. Conclusions: Deep learning-based hepatic vessel volumetry demonstrated differences between healthy liver and chronic liver disease stages and shows correlations with established markers of disease severity.

Updated: 2025-10-09 10:23:16

Categories: physics.med-ph,cs.AI

Download: http://arxiv.org/abs/2510.08039v1

OBSR: Open Benchmark for Spatial Representations

GeoAI is evolving rapidly, fueled by diverse geospatial datasets like traffic patterns, environmental data, and crowdsourced OpenStreetMap (OSM) information. While sophisticated AI models are being developed, existing benchmarks are often concentrated on single tasks and restricted to a single modality. As such, progress in GeoAI is limited by the lack of a standardized, multi-task, modality-agnostic benchmark for their systematic evaluation. This paper introduces a novel benchmark designed to assess the performance, accuracy, and efficiency of geospatial embedders. Our benchmark is modality-agnostic and comprises 7 distinct datasets from diverse cities across three continents, ensuring generalizability and mitigating demographic biases. It allows for the evaluation of GeoAI embedders on various phenomena that exhibit underlying geographic processes. Furthermore, we establish a simple and intuitive task-oriented model baselines, providing a crucial reference point for comparing more complex solutions.

Updated: 2025-10-09 10:19:28

Categories: cs.LG

Download: http://arxiv.org/abs/2510.05879v2

AILoRA: Function-Aware Asymmetric Initialization for Low-Rank Adaptation of Large Language Models

Parameter-efficient finetuning (PEFT) aims to mitigate the substantial computational and memory overhead involved in adapting large-scale pretrained models to diverse downstream tasks. Among numerous PEFT strategies, Low-Rank Adaptation (LoRA) has emerged as one of the most widely adopted approaches due to its robust empirical performance and low implementation complexity. In practical deployment, LoRA is typically applied to the $W^Q$ and $W^V$ projection matrices of self-attention modules, enabling an effective trade-off between model performance and parameter efficiency. While LoRA has achieved considerable empirical success, it still encounters challenges such as suboptimal performance and slow convergence. To address these limitations, we introduce \textbf{AILoRA}, a novel parameter-efficient method that incorporates function-aware asymmetric low-rank priors. Our empirical analysis reveals that the projection matrices $W^Q$ and $W^V$ in the self-attention mechanism exhibit distinct parameter characteristics, stemming from their functional differences. Specifically, $W^Q$ captures task-specific semantic space knowledge essential for attention distributions computation, making its parameters highly sensitive to downstream task variations. In contrast, $W^V$ encodes token-level feature representations that tend to remain stable across tasks and layers. Leveraging these insights, AILoRA performs a function-aware initialization by injecting the principal components of $W^Q$ to retain task-adaptive capacity, and the minor components of $W^V$ to preserve generalizable feature representations. This asymmetric initialization strategy enables LoRA modules to better capture the specialized roles of attention parameters, thereby enhancing both finetuning performance and convergence efficiency.
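Under one plausible reading of the asymmetric initialization, the adapter factors are seeded from a truncated SVD: the top-r singular directions of $W^Q$ (task-adaptive principal components) and the bottom-r directions of $W^V$ (stable minor components). The sketch below assumes this SVD-based form; the paper's exact construction may differ.

```python
import numpy as np

def ailora_init(W_q, W_v, r):
    # Build LoRA factors B (d_out x r) and A (r x d_in) from a truncated SVD,
    # splitting the singular-value scale evenly between the two factors.
    def factors(W, rank, principal):
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        idx = np.arange(rank) if principal else np.arange(len(S) - rank, len(S))
        B = U[:, idx] * np.sqrt(S[idx])
        A = np.sqrt(S[idx])[:, None] * Vt[idx, :]
        return B, A
    # W^Q adapter: principal components, to retain task-adaptive capacity.
    B_q, A_q = factors(W_q, r, principal=True)
    # W^V adapter: minor components, to preserve generalizable features.
    B_v, A_v = factors(W_v, r, principal=False)
    return (B_q, A_q), (B_v, A_v)
```

By construction the $W^Q$ adapter reconstructs its base matrix better than the $W^V$ adapter at the same rank, which is exactly the asymmetry the method exploits.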

Updated: 2025-10-09 10:13:16

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08034v1

Inference-time Alignment in Continuous Space

Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation (SEA), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to 77.51% on AdvBench and 16.36% on MATH. Our code is publicly available at https://github.com/yuanyige/sea
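Gradient-based sampling in a continuous latent space can be illustrated with Langevin-style updates on an energy function. In SEA the energy is defined by the optimal policy over latent responses; here a toy scalar latent and quadratic energy stand in, and the step sizes and noise scale are illustrative.

```python
import math
import random

def sea_sample(z0, grad_energy, steps=100, step_size=0.05, noise=0.1, seed=0):
    # Langevin-style refinement: descend the energy gradient with a small
    # injected Gaussian perturbation at each step.
    rng = random.Random(seed)
    z = z0
    for _ in range(steps):
        z = z - step_size * grad_energy(z) + noise * math.sqrt(step_size) * rng.gauss(0.0, 1.0)
    return z
```

Starting from the base policy's response (here `z0`), the iterates drift toward low-energy (high-reward) regions instead of re-sampling discrete candidates.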

Updated: 2025-10-09 10:10:39

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2505.20081v3

PEAR: Phase Entropy Aware Reward for Efficient Reasoning

Large Reasoning Models (LRMs) have achieved impressive performance on complex reasoning tasks by generating detailed chain-of-thought (CoT) explanations. However, these responses are often excessively long, containing redundant reasoning steps that inflate inference cost and reduce usability. Controlling the length of generated reasoning without sacrificing accuracy remains an open challenge. Through a systematic empirical analysis, we reveal a consistent positive correlation between model entropy and response length at different reasoning stages across diverse LRMs: the thinking phase exhibits higher entropy, reflecting the exploratory behavior of longer responses, while the final answer phase shows lower entropy, indicating a more deterministic solution. This observation suggests that entropy at different reasoning stages can serve as a control knob for balancing conciseness and performance. Based on this insight, this paper introduces Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporates phase-dependent entropy into the reward design. Instead of treating all tokens uniformly, PEAR penalizes excessive entropy during the thinking phase and allows moderate exploration at the final answer phase, which encourages models to generate concise reasoning traces that retain sufficient flexibility to solve the task correctly. This enables adaptive control of response length without relying on explicit length targets or rigid truncation rules. Extensive experiments across four benchmarks demonstrate that PEAR consistently reduces response length while sustaining competitive accuracy across model scales. In addition, PEAR demonstrates strong out-of-distribution (OOD) robustness beyond the training distribution. Our code is available at: https://github.com/iNLP-Lab/PEAR.
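A hedged sketch of a phase-dependent entropy penalty consistent with the description above: thinking-phase entropy is always penalized, answer-phase entropy only beyond a budget. The coefficients and budget are invented for illustration; the paper's actual reward shaping may differ.

```python
def pear_reward(correct, think_entropy, answer_entropy,
                alpha=0.5, beta=0.1, answer_budget=1.0):
    # Base outcome reward for a correct final answer.
    r = 1.0 if correct else 0.0
    # Discourage verbose exploration during the thinking phase.
    r -= alpha * think_entropy
    # In the answer phase, only entropy beyond the budget is penalized,
    # leaving room for moderate exploration.
    r -= beta * max(0.0, answer_entropy - answer_budget)
    return r
```

Because the penalty is asymmetric across phases, the policy is pushed toward concise thinking traces without forcing the final answer to be overly deterministic.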

Updated: 2025-10-09 10:04:31

Categories: cs.AI

Download: http://arxiv.org/abs/2510.08026v1

(Token-Level) InfoRMIA: Stronger Membership Inference and Memorization Assessment for LLMs

Machine learning models are known to leak sensitive information, as they inevitably memorize (parts of) their training data. More alarmingly, large language models (LLMs) are now trained on nearly all available data, which amplifies the magnitude of information leakage and raises serious privacy risks. Hence, it is more crucial than ever to quantify privacy risk before the release of LLMs. The standard method to quantify privacy is via membership inference attacks, where the state-of-the-art approach is the Robust Membership Inference Attack (RMIA). In this paper, we present InfoRMIA, a principled information-theoretic formulation of membership inference. Our method consistently outperforms RMIA across benchmarks while also offering improved computational efficiency. In the second part of the paper, we identify the limitations of treating sequence-level membership inference as the gold standard for measuring leakage. We propose a new perspective for studying membership and memorization in LLMs: token-level signals and analyses. We show that a simple token-based InfoRMIA can pinpoint which tokens are memorized within generated outputs, thereby localizing leakage from the sequence level down to individual tokens, while achieving stronger sequence-level inference power on LLMs. This new scope rethinks privacy in LLMs and can lead to more targeted mitigation, such as exact unlearning.
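Token-level membership signals can be sketched as per-token log-likelihood ratios between the target model and a reference population; InfoRMIA's actual information-theoretic score is more principled, so treat this as a simplified stand-in with invented function names and threshold.

```python
def token_memorization_scores(target_logprobs, reference_logprobs):
    # For each token, compare the target model's log-probability against the
    # mean over a population of reference models: large gaps suggest the
    # token was memorized rather than generally predictable.
    return [t - sum(refs) / len(refs)
            for t, refs in zip(target_logprobs, reference_logprobs)]

def flag_memorized(scores, threshold=2.0):
    # Indices of tokens whose score exceeds an (illustrative) threshold.
    return [i for i, s in enumerate(scores) if s > threshold]
```

Localizing the signal per token, rather than per sequence, is what enables targeted mitigation such as exact unlearning of specific spans.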

Updated: 2025-10-09 10:03:33

Categories: cs.LG

Download: http://arxiv.org/abs/2510.05582v2

Foundation Models for Structural Health Monitoring

Structural Health Monitoring (SHM) is a critical task for ensuring the safety and reliability of civil infrastructures, typically realized on bridges and viaducts by means of vibration monitoring. In this paper, we propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for SHM. We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training, which, coupled with task-specific fine-tuning, allows them to outperform state-of-the-art traditional methods on diverse tasks, including Anomaly Detection (AD) and Traffic Load Estimation (TLE). We then extensively explore model size versus accuracy trade-offs and experiment with Knowledge Distillation (KD) to improve the performance of smaller Transformers, enabling their embedding directly into the SHM edge nodes. We showcase the effectiveness of our foundation models using data from three operational viaducts. For AD, we achieve a near-perfect 99.9% accuracy with a monitoring time span of just 15 windows. In contrast, a state-of-the-art method based on Principal Component Analysis (PCA) obtains its first good result (95.03% accuracy), only considering 120 windows. On two different TLE tasks, our models obtain state-of-the-art performance on multiple evaluation metrics (R$^2$ score, MAE% and MSE%). On the first benchmark, we achieve an R$^2$ score of 0.97 and 0.90 for light and heavy vehicle traffic, respectively, while the best previous approach (a Random Forest) stops at 0.91 and 0.84. On the second one, we achieve an R$^2$ score of 0.54 versus the 0.51 of the best competitor method, a Long-Short Term Memory network.

Updated: 2025-10-09 10:02:05

Categories: cs.LG,cs.AI,cs.SY,eess.SY,I.2.1; I.2.3

Download: http://arxiv.org/abs/2404.02944v2

Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning

Supervised Fine-Tuning (SFT) Large Language Models (LLM) fundamentally rely on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that our Middo consistently enhances the quality of seed data and boosts LLM's performance with improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are publicly available at https://github.com/Word2VecT/Middo.

Updated: 2025-10-09 09:59:32

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.21589v4

Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity

Recently, Ainsworth et al. empirically demonstrated that, given two independently trained models, applying a parameter permutation that preserves the input-output behavior allows the two models to be connected by a low-loss linear path. When such a path exists, the models are said to achieve linear mode connectivity (LMC). Prior studies, including Ainsworth et al., have reported that achieving LMC requires not only an appropriate permutation search but also sufficiently wide models (e.g., a 32× width multiplier for ResNet-20). This is broadly believed to be because increasing the model width ensures a large enough space of candidate permutations, increasing the chance of finding one that yields LMC. In this work, we empirically demonstrate that, even without any permutations, simply widening the models is sufficient for achieving LMC when a suitable softmax temperature calibration is used. We further explain why this phenomenon arises by analyzing intermediate layer outputs. Specifically, we introduce layerwise exponentially weighted connectivity (LEWC), which states that the output of each layer of the merged model can be represented as an exponentially weighted sum of the outputs of the corresponding layers of the original models. Consequently, the merged model's output matches that of an ensemble of the original models, which facilitates LMC. To the best of our knowledge, this work is the first to show that widening the model not only facilitates nonlinear mode connectivity, as suggested in prior research, but also significantly increases the possibility of achieving linear mode connectivity.
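The linear path between two weight sets and its loss barrier can be sketched as follows; the toy quadratic "loss" is an assumption chosen so the path is barrier-free (as for any convex loss), whereas real LMC experiments would evaluate a network's training loss at each interpolation point:

```python
import numpy as np

def interpolate_weights(w_a, w_b, t):
    """Point on the linear path between two models' weight dictionaries."""
    return {k: (1 - t) * w_a[k] + t * w_b[k] for k in w_a}

def path_losses(w_a, w_b, loss_fn, ts):
    """Loss along the linear path; LMC holds when no point on the path is
    substantially worse than the endpoints (no loss barrier)."""
    return [loss_fn(interpolate_weights(w_a, w_b, t)) for t in ts]

# Toy convex quadratic loss centred at the all-ones vector.
target = np.ones(5)
loss_fn = lambda w: float(np.sum((w["w"] - target) ** 2))
w_a = {"w": np.zeros(5)}
w_b = {"w": 2 * np.ones(5)}
losses = path_losses(w_a, w_b, loss_fn, [0.0, 0.25, 0.5, 0.75, 1.0])
```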

Updated: 2025-10-09 09:59:29

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.08023v1

FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset

Data-driven robotic manipulation learning depends on large-scale, high-quality expert demonstration datasets. However, existing datasets, which primarily rely on human teleoperated robot collection, are limited in terms of scalability, trajectory smoothness, and applicability across different robotic embodiments in real-world environments. In this paper, we present FastUMI-100K, a large-scale UMI-style multimodal demonstration dataset, designed to overcome these limitations and meet the growing complexity of real-world manipulation tasks. Collected by FastUMI, a novel robotic system featuring a modular, hardware-decoupled mechanical design and an integrated lightweight tracking system, FastUMI-100K offers a more scalable, flexible, and adaptable solution to fulfill the diverse requirements of real-world robot demonstration data. Specifically, FastUMI-100K contains over 100K demonstration trajectories collected across representative household environments, covering 54 tasks and hundreds of object types. Our dataset integrates multimodal streams, including end-effector states, multi-view wrist-mounted fisheye images and textual annotations. Each trajectory has a length ranging from 120 to 500 frames. Experimental results demonstrate that FastUMI-100K enables high policy success rates across various baseline algorithms, confirming its robustness, adaptability, and real-world applicability for solving complex, dynamic manipulation challenges. The source code and dataset will be released at https://github.com/MrKeee/FastUMI-100K.

Updated: 2025-10-09 09:57:25

Subjects: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.08022v1

Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses

Model merging (MM) recently emerged as an effective method for combining large deep learning models. However, it poses significant security risks. Recent research shows that it is highly susceptible to backdoor attacks, which introduce a hidden trigger into a single fine-tuned model instance that allows the adversary to control the output of the final merged model at inference time. In this work, we propose a simple framework for understanding backdoor attacks by treating the attack itself as a task vector. The Backdoor Vector (BV) is calculated as the difference between the weights of a fine-tuned backdoored model and a fine-tuned clean model. BVs reveal new insights into understanding attacks and yield a more effective framework to measure their similarity and transferability. Furthermore, we propose a novel method, dubbed Sparse Backdoor Vector (SBV), that enhances backdoor resilience through merging by combining multiple attacks into a single one. We identify the core vulnerability behind backdoor threats in MM: inherent triggers that exploit adversarial weaknesses in the base model. To counter this, we propose Injection BV Subtraction (IBVS) - an assumption-free defense against backdoors in MM. Our results show that SBVs surpass prior attacks and are the first method to leverage merging to improve backdoor effectiveness. At the same time, IBVS provides a lightweight, general defense that remains effective even when the backdoor threat is entirely unknown.
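The Backdoor Vector definition above is plain weight arithmetic and can be sketched directly; the toy "models" below are random weight dictionaries sharing a common trigger direction, an assumption made purely so the similarity measurement has something to find:

```python
import numpy as np

def backdoor_vector(backdoored, clean):
    """BV = weights(fine-tuned backdoored) - weights(fine-tuned clean),
    computed per parameter tensor, mirroring the task-vector definition."""
    return {k: backdoored[k] - clean[k] for k in clean}

def cosine(u, v):
    """Cosine similarity between two weight-dictionary vectors (flattened)."""
    fu = np.concatenate([u[k].ravel() for k in sorted(u)])
    fv = np.concatenate([v[k].ravel() for k in sorted(v)])
    return float(fu @ fv / (np.linalg.norm(fu) * np.linalg.norm(fv)))

rng = np.random.default_rng(0)
clean = {"w": rng.normal(size=(4, 4))}
trigger_dir = {"w": rng.normal(size=(4, 4))}
# Two hypothetical backdoored fine-tunes that share the same trigger direction.
bd_a = {"w": clean["w"] + 1.0 * trigger_dir["w"]}
bd_b = {"w": clean["w"] + 0.5 * trigger_dir["w"]}
sim = cosine(backdoor_vector(bd_a, clean), backdoor_vector(bd_b, clean))
```

Here both BVs point along the same direction, so their cosine similarity is 1, which is the kind of transferability signal the framework measures.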

Updated: 2025-10-09 09:54:05

Subjects: cs.LG,cs.AI,cs.CR

Download: http://arxiv.org/abs/2510.08016v1

Unsupervised Radio Map Construction in Mixed LoS/NLoS Indoor Environments

Radio maps are essential for enhancing wireless communications and localization. However, existing methods for constructing radio maps typically require costly calibration processes to collect location-labeled channel state information (CSI) datasets. This paper aims to recover the data collection trajectory directly from the channel propagation sequence, eliminating the need for location calibration. The key idea is to employ a hidden Markov model (HMM)-based framework to conditionally model the channel propagation matrix, while simultaneously modeling the location correlation in the trajectory. The primary challenges involve modeling the complex relationship between channel propagation in multiple-input multiple-output (MIMO) networks and geographical locations, and addressing both line-of-sight (LOS) and non-line-of-sight (NLOS) indoor conditions. In this paper, we propose an HMM-based framework that jointly characterizes the conditional propagation model and the evolution of the user trajectory. Specifically, the channel propagation in MIMO networks is modeled separately in terms of power, delay, and angle, with distinct models for LOS and NLOS conditions. The user trajectory is modeled using a Gaussian-Markov model. The parameters for channel propagation, the mobility model, and LOS/NLOS classification are optimized simultaneously. Experimental validation using simulated MIMO-Orthogonal Frequency-Division Multiplexing (OFDM) networks with a multi-antenna uniform linear array (ULA) configuration demonstrates that the proposed method achieves an average localization accuracy of 0.65 meters in an indoor environment, covering both LOS and NLOS regions. Moreover, the constructed radio map enables localization with a reduced error compared to conventional supervised methods, such as k-nearest neighbors (KNN), support vector machine (SVM), and deep neural network (DNN).
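The Gaussian-Markov mobility model used for the user trajectory can be sketched as an AR(1) velocity process integrated into position; the memory parameter, noise scale, and 2-D setting are illustrative assumptions rather than the paper's fitted values:

```python
import numpy as np

def gauss_markov_trajectory(n_steps, rho=0.9, sigma=0.1, seed=0):
    """Sketch of a Gaussian-Markov mobility model: velocity follows an AR(1)
    process with memory rho and stationary std sigma; position integrates it."""
    rng = np.random.default_rng(seed)
    pos = np.zeros((n_steps, 2))
    vel = np.zeros(2)
    for t in range(1, n_steps):
        vel = rho * vel + np.sqrt(1 - rho ** 2) * sigma * rng.normal(size=2)
        pos[t] = pos[t - 1] + vel
    return pos

traj = gauss_markov_trajectory(200)
```

In the paper's framework, such a trajectory prior is what couples consecutive hidden states of the HMM, so that channel observations along the walk can be mapped back to locations.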

Updated: 2025-10-09 09:53:24

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.08015v1

Composition Law of Conjugate Observables in Random Permutation Sorting Systems

We present the discovery of a fundamental composition law governing conjugate observables in the Random Permutation Sorting System (RPSS). The law links the discrete permutation count Np and the continuous elapsed time T through a functional relation connecting the characteristic function of timing distributions to the probability generating function of permutation counts. This framework enables entropy purification, transforming microarchitectural timing fluctuations into uniform randomness via geometric convergence. We establish convergence theorems with explicit bounds and validate the results experimentally, achieving Shannon entropy above 7.9998 bits per byte and chi-square uniformity across diverse platforms. The composition law provides a universal foundation for generating provably uniform randomness from general-purpose computation, securing cryptographic purity from emergent computational dynamics.
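The two statistical checks reported above (Shannon entropy per byte and chi-square uniformity) can be sketched as follows; an ordinary PRNG stream stands in for RPSS output, purely as an assumption so the statistics have something uniform to measure:

```python
import numpy as np

def shannon_entropy_bits(data):
    """Empirical Shannon entropy in bits per byte of a stream of byte values."""
    counts = np.bincount(data, minlength=256)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def chi_square_stat(data):
    """Chi-square statistic against the uniform distribution over 256 bytes
    (255 degrees of freedom; values near 255 indicate uniformity)."""
    counts = np.bincount(data, minlength=256)
    expected = counts.sum() / 256
    return float(np.sum((counts - expected) ** 2 / expected))

rng = np.random.default_rng(0)
stream = rng.integers(0, 256, size=1_000_000)  # stand-in for RPSS output bytes
H = shannon_entropy_bits(stream)
chi2 = chi_square_stat(stream)
```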

Updated: 2025-10-09 09:50:21

Subjects: cs.CR,physics.data-an

Download: http://arxiv.org/abs/2510.08013v1

Barycentric Neural Networks and Length-Weighted Persistent Entropy Loss: A Green Geometric and Topological Framework for Function Approximation

While artificial neural networks are known as universal approximators for continuous functions, many modern approaches rely on overparameterized architectures with high computational cost. In this work, we introduce the Barycentric Neural Network (BNN): a compact shallow architecture that encodes both structure and parameters through a fixed set of base points and their associated barycentric coordinates. We show that the BNN enables the exact representation of continuous piecewise linear functions (CPLFs), ensuring strict continuity across segments. Given that any continuous function on a compact domain can be uniformly approximated by CPLFs, the BNN emerges as a flexible and interpretable tool for function approximation. To enhance geometric fidelity in low-resource scenarios, such as those with few base points to create BNNs or limited training epochs, we propose length-weighted persistent entropy (LWPE): a stable variant of persistent entropy. Our approach integrates the BNN with a loss function based on LWPE to optimize the base points that define the BNN, rather than its internal parameters. Experimental results show that our approach achieves superior and faster approximation performance compared to standard losses (MSE, RMSE, MAE and LogCosh), offering a computationally sustainable alternative for function approximation.
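The function class a BNN represents exactly, continuous piecewise-linear functions evaluated via barycentric (convex) coordinates within each segment, can be sketched in one dimension; this is the 1-D CPLF itself, not the paper's network or its LWPE loss:

```python
import numpy as np

def cplf(base_x, base_y, x):
    """Continuous piecewise-linear function through sorted base points,
    evaluated by the barycentric coordinate of x within its segment."""
    base_x = np.asarray(base_x, dtype=float)
    base_y = np.asarray(base_y, dtype=float)
    x = np.clip(x, base_x[0], base_x[-1])
    i = np.clip(np.searchsorted(base_x, x) - 1, 0, len(base_x) - 2)
    # Barycentric coordinate t in [0, 1] within segment [base_x[i], base_x[i+1]].
    t = (x - base_x[i]) / (base_x[i + 1] - base_x[i])
    return (1 - t) * base_y[i] + t * base_y[i + 1]

# |x| is itself a CPLF, so three base points represent it exactly.
y = cplf([-1.0, 0.0, 1.0], [1.0, 0.0, 1.0], np.array([-0.5, 0.0, 0.25]))
```

Optimizing the base points (rather than internal weights), as the paper does with its LWPE-based loss, amounts to moving `base_x`/`base_y` to fit a target function.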

Updated: 2025-10-09 09:47:53

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2509.06694v3

Accelerated Evolving Set Processes for Local PageRank Computation

This work proposes a novel framework based on nested evolving set processes to accelerate Personalized PageRank (PPR) computation. At each stage of the process, we employ a localized inexact proximal point iteration to solve a simplified linear system. We show that the time complexity of such localized methods is upper bounded by $\min\{\tilde{\mathcal{O}}(R^2/\epsilon^2), \tilde{\mathcal{O}}(m)\}$ to obtain an $\epsilon$-approximation of the PPR vector, where $m$ denotes the number of edges in the graph and $R$ is a constant defined via nested evolving set processes. Furthermore, the algorithms induced by our framework require solving only $\tilde{\mathcal{O}}(1/\sqrt{\alpha})$ such linear systems, where $\alpha$ is the damping factor. When $1/\epsilon^2\ll m$, this implies the existence of an algorithm that computes an $\epsilon$-approximation of the PPR vector with an overall time complexity of $\tilde{\mathcal{O}}\left(R^2 / (\sqrt{\alpha}\epsilon^2)\right)$, independent of the underlying graph size. Our result resolves an open conjecture from existing literature. Experimental results on real-world graphs validate the efficiency of our methods, demonstrating significant convergence in the early stages.
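For context, the classic local-push baseline that such accelerated schemes improve upon can be sketched in a few lines; this is the standard Andersen-Chung-Lang style push (an assumption for illustration), not the paper's nested evolving set method:

```python
def ppr_push(adj, seed, alpha=0.15, eps=1e-6):
    """Local push approximation of the Personalized PageRank vector.

    adj: dict node -> list of neighbours (undirected, unweighted).
    alpha: teleport (damping) probability; eps: per-degree residual tolerance.
    Mass is moved from the residual r to the estimate p one node at a time,
    touching only nodes near the seed (hence 'local').
    """
    p = {u: 0.0 for u in adj}   # approximate PPR mass
    r = {u: 0.0 for u in adj}   # residual (yet-to-be-pushed) mass
    r[seed] = 1.0
    queue = [seed]
    while queue:
        u = queue.pop()
        deg = len(adj[u])
        if r[u] < eps * deg:
            continue
        push = r[u]
        p[u] += alpha * push
        r[u] = 0.0
        share = (1 - alpha) * push / deg
        for v in adj[u]:
            below = r[v] < eps * len(adj[v])
            r[v] += share
            if below and r[v] >= eps * len(adj[v]):
                queue.append(v)
    return p

# Tiny 4-cycle graph: PPR mass concentrates near the seed node 0.
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
ppr = ppr_push(graph, seed=0)
```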

Updated: 2025-10-09 09:47:40

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.08010v1

Language Models Do Not Embed Numbers Continuously

Recent research has extensively studied how large language models manipulate integers in specific arithmetic tasks, and on a more fundamental level, how they represent numeric values. These previous works have found that language model embeddings can be used to reconstruct the original values; however, they do not evaluate whether language models actually model continuous values as continuous. Using expected properties of the embedding space, including linear reconstruction and principal component analysis, we show that language models not only represent numeric spaces as non-continuous but also introduce significant noise. Using models from three major providers (OpenAI, Google Gemini and Voyage AI), we show that while reconstruction is possible with high fidelity ($R^2 \geq 0.95$), principal components only explain a minor share of variation within the embedding space. This indicates that many components within the embedding space are orthogonal to the simple numeric input space. Further, both linear reconstruction and explained variance suffer with increasing decimal precision, despite the ordinal nature of the input space being fundamentally unchanged. The findings of this work therefore have implications for the many areas where embedding models are used, in particular where high numerical precision, large magnitudes, or mixed-sign values are common.
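The two diagnostics used above, linear reconstruction ($R^2$) and PCA explained variance, can be sketched on synthetic stand-in embeddings; the noisy random projection below is an assumption replacing real model embeddings, chosen so both quantities are easy to verify:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in embeddings: a noisy random projection of the numeric
# values into 32 dimensions (not obtained from any real embedding API).
values = np.linspace(0.0, 100.0, 200)
D = 32
W = rng.normal(size=(1, D))
E = values[:, None] @ W + 0.1 * rng.normal(size=(200, D))

# Linear reconstruction: least-squares map from embedding back to the value,
# scored by the coefficient of determination R^2.
X = np.hstack([E, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(X, values, rcond=None)
pred = X @ coef
r2 = 1 - np.sum((values - pred) ** 2) / np.sum((values - values.mean()) ** 2)

# PCA via SVD of the centred embeddings: share of variance in the top component.
s = np.linalg.svd(E - E.mean(axis=0), compute_uv=False)
top_share = s[0] ** 2 / np.sum(s ** 2)
```

On these near rank-1 synthetic embeddings both diagnostics are high; the paper's finding is that real model embeddings keep the high $R^2$ but not the concentrated variance.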

Updated: 2025-10-09 09:46:19

Subjects: cs.AI,cs.LG,I.2

Download: http://arxiv.org/abs/2510.08009v1

Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training

The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Substantial computational cost has been invested in existing well-trained checkpoints, but many of them remain underutilized due to engineering constraints or limited model capacity. To efficiently reuse this "sunk" cost, we propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We propose an orthogonal growth method well suited to converged Mixture-of-Experts models: interpositional layer copying for depth growth and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across checkpoint sequences, we perform comprehensive scaling experiments revealing that the final accuracy has a strong positive correlation with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving a 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint recycling approach establishes a foundation for economically efficient large language model pretraining.
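The two growth operations described above can be sketched on toy weight dictionaries; the layer/expert shapes and the noise scale are illustrative assumptions, not the recipe used for the 70B-parameter runs:

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_depth(layers):
    """Depth growth by interpositional layer copying (sketch): each layer is
    immediately followed by a copy of itself, doubling the depth."""
    grown = []
    for layer in layers:
        grown.append(layer)
        grown.append({k: v.copy() for k, v in layer.items()})
    return grown

def grow_width(experts, noise_scale=1e-3):
    """Width growth by expert duplication with injected noise (sketch):
    each expert is cloned with small perturbations to break symmetry."""
    clones = [{k: v + noise_scale * rng.normal(size=v.shape) for k, v in e.items()}
              for e in experts]
    return experts + clones

layers = [{"w": rng.normal(size=(3, 3))} for _ in range(2)]
experts = [{"w": rng.normal(size=(3, 3))} for _ in range(4)]
deep = grow_depth(layers)
wide = grow_width(experts)
```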

Updated: 2025-10-09 09:45:45

Subjects: cs.LG

Download: http://arxiv.org/abs/2510.08008v1

CREST-Search: Comprehensive Red-teaming for Evaluating Safety Threats in Large Language Models Powered by Web Search

Large Language Models (LLMs) excel at tasks such as dialogue, summarization, and question answering, yet they struggle to adapt to specialized domains and evolving facts. To overcome this, web search has been integrated into LLMs, allowing real-time access to online content. However, this connection magnifies safety risks, as adversarial prompts combined with untrusted sources can cause severe vulnerabilities. We investigate red teaming for LLMs with web search and present CREST-Search, a framework that systematically exposes risks in such systems. Unlike existing methods for standalone LLMs, CREST-Search addresses the complex workflow of search-enabled models by generating adversarial queries with in-context learning and refining them through iterative feedback. We further construct WebSearch-Harm, a search-specific dataset to fine-tune LLMs into efficient red-teaming agents. Experiments show that CREST-Search effectively bypasses safety filters and reveals vulnerabilities in modern web-augmented LLMs, underscoring the need for specialized defenses to ensure trustworthy deployment.

Updated: 2025-10-09 09:44:14

Subjects: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.09689v1

Past, Present, and Future of Bug Tracking in the Generative AI Era

Traditional bug tracking systems rely heavily on manual reporting, reproduction, triaging, and resolution, each carried out by different stakeholders such as end users, customer support, developers, and testers. This division of responsibilities requires significant coordination and widens the communication gap between non-technical users and technical teams, slowing the process from bug discovery to resolution. Moreover, current systems are highly asynchronous; users often wait hours or days for a first response, delaying fixes and contributing to frustration. This paper examines the evolution of bug tracking, from early paper-based reporting to today's web-based and SaaS platforms. Building on this trajectory, we propose an AI-powered bug tracking framework that augments existing tools with intelligent, large language model (LLM)-driven automation. Our framework addresses two main challenges: reducing time-to-fix and minimizing human overhead. Users report issues in natural language, while AI agents refine reports, attempt reproduction, and request missing details. Reports are then classified, invalid ones resolved through no-code fixes, and valid ones localized and assigned to developers. LLMs also generate candidate patches, with human oversight ensuring correctness. By integrating automation into each phase, our framework accelerates response times, improves collaboration, and strengthens software maintenance practices for a more efficient, user-centric future.

Updated: 2025-10-09 09:42:30

Subjects: cs.SE,cs.AI

Download: http://arxiv.org/abs/2510.08005v1

Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks

Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are test-time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. To address this challenge, we propose MUSE, a novel agent framework that introduces an experience-driven, self-evolving system centered around a hierarchical Memory Module. MUSE organizes diverse levels of experience and leverages them to plan and execute long-horizon tasks across multiple applications. After each sub-task execution, the agent autonomously reflects on its trajectory, converting the raw trajectory into structured experience and integrating it back into the Memory Module. This mechanism enables the agent to evolve beyond its static pretrained parameters, fostering continuous learning and self-evolution. We evaluate MUSE on the long-horizon productivity benchmark TAC. It achieves new SOTA performance by a significant margin using only a lightweight Gemini-2.5 Flash model. Extensive experiments demonstrate that as the agent autonomously accumulates experience, it exhibits increasingly superior task completion capabilities, as well as robust continuous learning and self-evolution capabilities. Moreover, the accumulated experience from MUSE exhibits strong generalization properties, enabling zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real-world productivity task automation.

Updated: 2025-10-09 09:40:34

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08002v1

Exact Causal Attention with 10% Fewer Operations

We present Exact Causal Attention (ECA), a Strassen-style algorithm that computes exact Causal Attention using 10% fewer operations. ECA improves a special class of matrix multiplications in which either one operand or the output matrix is upper- or lower-triangular. This includes all matrix multiplication operations in the forward and backward pass of Causal Attention, such as the masked product $\mathrm{Mask}(QK^{T})$. ECA is built upon algebraic identities discovered via machine learning and combinatorial search. We note that ECA cannot accelerate fused kernels such as FlashAttention on GPU, because ECA requires materializing large intermediate expressions in memory, while FlashAttention does not. However, it provides an alternative approach for compute-bound applications and can potentially be useful in scenarios where FLOP counts matter.
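For reference, the non-accelerated computation ECA improves upon, including the triangular masked product $\mathrm{Mask}(QK^{T})$ whose structure ECA exploits, can be sketched as follows (a plain baseline, not ECA itself):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Reference causal attention: softmax(Mask(QK^T)/sqrt(d)) @ V.

    The mask forbids position i from attending to any j > i, making the
    score matrix lower-triangular - the structure ECA's identities exploit.
    This baseline performs the full (unaccelerated) matrix multiplies.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly upper triangle
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out = causal_attention(Q, K, V)
```

Position 0 can only attend to itself, so the first output row equals the first row of V, a quick sanity check on the mask.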

Updated: 2025-10-09 09:39:20

Subjects: cs.LG,cs.DM,cs.DS

Download: http://arxiv.org/abs/2510.05175v2

DemandCast: Global hourly electricity demand forecasting

This paper presents a machine learning framework for electricity demand forecasting across diverse geographical regions using the gradient boosting algorithm XGBoost. The model integrates historical electricity demand and comprehensive weather and socioeconomic variables to predict normalized electricity demand profiles. To enable robust training and evaluation, we developed a large-scale dataset spanning multiple years and countries, applying a temporal data-splitting strategy that ensures benchmarking of out-of-sample performance. Our approach delivers accurate and scalable demand forecasts, providing valuable insights for energy system planners and policymakers as they navigate the challenges of the global energy transition.
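The temporal data-splitting strategy mentioned above can be sketched independently of the XGBoost model: all training rows must strictly precede all test rows so that the benchmark is genuinely out-of-sample. The hourly record counts below are illustrative assumptions:

```python
import numpy as np

def temporal_split(timestamps, train_frac=0.8):
    """Temporal split: the earliest train_frac of rows (by timestamp) form the
    training set, the rest the test set - no look-ahead leakage."""
    order = np.argsort(timestamps)
    cut = int(len(order) * train_frac)
    return order[:cut], order[cut:]

# Hypothetical hourly demand records over 10 days, arriving in arbitrary order.
rng = np.random.default_rng(0)
shuffled = rng.permutation(np.arange(240))
train_idx, test_idx = temporal_split(shuffled, train_frac=0.8)
```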

Updated: 2025-10-09 09:39:06

Subjects: cs.LG,physics.soc-ph

Download: http://arxiv.org/abs/2510.08000v1

Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge

Scientific figure captions require both accuracy and stylistic consistency to convey visual information. Here, we present a domain-specific caption generation system for the 3rd SciCap Challenge that integrates figure-related textual context with author-specific writing styles using the LaMP-Cap dataset. Our approach uses a two-stage pipeline: Stage 1 combines context filtering, category-specific prompt optimization via DSPy's MIPROv2 and SIMBA, and caption candidate selection; Stage 2 applies few-shot prompting with profile figures for stylistic refinement. Our experiments demonstrate that category-specific prompts outperform both zero-shot and general optimized approaches, improving ROUGE-1 recall by +8.3% while limiting precision loss to -2.8% and BLEU-4 reduction to -10.9%. Profile-informed stylistic refinement yields 40-48% gains in BLEU scores and 25-27% in ROUGE. Overall, our system demonstrates that combining contextual understanding with author-specific stylistic adaptation can generate captions that are both scientifically accurate and stylistically faithful to the source paper.
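The ROUGE-1 recall and precision metrics cited above reduce to clipped unigram overlap; a simplified sketch (no stemming or stopword handling, which standard ROUGE toolkits add) on made-up caption strings:

```python
from collections import Counter

def rouge1_recall_precision(candidate, reference):
    """Simplified ROUGE-1: clipped unigram overlap, scored as recall against
    the reference and precision against the candidate."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum(min(c[w], r[w]) for w in r)
    recall = overlap / max(sum(r.values()), 1)
    precision = overlap / max(sum(c.values()), 1)
    return recall, precision

# Hypothetical generated caption vs. ground-truth caption.
rec, prec = rouge1_recall_precision(
    "figure two shows accuracy over time",
    "figure 2 shows model accuracy over training time",
)
```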

Updated: 2025-10-09 09:30:28

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07993v1

Computing $\varphi(N)$ for an RSA module with a single quantum query

In this paper we give a polynomial-time algorithm to compute $\varphi(N)$ for an RSA modulus $N$, using as input the order modulo $N$ of a randomly chosen integer. This provides new insight into the important problem of factoring an RSA modulus given extra information. In fact, the algorithm is extremely simple and consists only of a greatest-common-divisor computation, two multiplications, and a division. The algorithm succeeds with probability at least $1-\frac{1}{N^{1/2-\epsilon}}$, where $\epsilon$ is any small positive constant.
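For context on why computing $\varphi(N)$ is so consequential: once $\varphi(N)$ is known, factoring $N$ is immediate via a standard identity (this sketch shows that classical final step, not the paper's order-based algorithm):

```python
from math import isqrt

def factor_from_phi(N, phi):
    """Recover p and q from an RSA modulus N = p*q and phi = (p-1)*(q-1).

    Identity: p + q = N - phi + 1, so p and q are the roots of
    x^2 - (p+q)x + N = 0. Hence knowing phi(N) is equivalent to factoring N.
    """
    s = N - phi + 1                 # p + q
    d = isqrt(s * s - 4 * N)        # discriminant square root (exact here)
    p, q = (s - d) // 2, (s + d) // 2
    assert p * q == N
    return p, q

# Toy example, far below cryptographic size: N = 101 * 113.
p, q = factor_from_phi(101 * 113, 100 * 112)
```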

Updated: 2025-10-09 09:28:48

Subjects: cs.CR,quant-ph

Download: http://arxiv.org/abs/2406.04061v4

SmartUT: Receive Beamforming for Spectral Coexistence of NGSO Satellite Systems

In this paper, we investigate downlink co-frequency interference (CFI) mitigation in co-existing non-geostationary satellite orbit (NGSO) systems. Traditional mitigation techniques, such as Zero-Forcing (ZF), produce nulls toward the directions of arrival (DOAs) of the interfering signals, but they suffer from high computational complexity due to matrix inversions and the required knowledge of channel state information (CSI). Furthermore, adaptive beamformers, such as sample matrix inversion (SMI)-based minimum variance, perform poorly when the available snapshots are limited. We propose a Mamba-based beamformer (MambaBF) that leverages an unsupervised deep learning (DL) approach and can be deployed on the user terminal (UT) antenna array to assist downlink beamforming and CFI mitigation, using only a limited number of available array snapshots as input and without CSI knowledge. Simulation results demonstrate that MambaBF consistently outperforms conventional beamforming techniques in mitigating interference and maximizing the signal-to-interference-plus-noise ratio (SINR), particularly under challenging conditions characterized by low SINR, limited snapshots, and imperfect CSI.
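The zero-forcing baseline mentioned above can be sketched for a uniform linear array: project the desired steering vector onto the null space of the interference directions. The 8-element half-wavelength ULA and the particular angles are illustrative assumptions:

```python
import numpy as np

def steering_vector(theta_deg, n_ant, spacing=0.5):
    """ULA steering vector for element spacing in wavelengths (0.5 = half-wave)."""
    k = np.arange(n_ant)
    return np.exp(1j * 2 * np.pi * spacing * k * np.sin(np.deg2rad(theta_deg)))

def zf_beamformer(theta_sig, thetas_int, n_ant):
    """Zero-forcing receive weights: unit gain toward the signal DOA and
    nulls toward each interferer DOA (the classic baseline, not MambaBF)."""
    A_int = np.stack([steering_vector(t, n_ant) for t in thetas_int], axis=1)
    # Projector onto the orthogonal complement of the interference subspace.
    P = np.eye(n_ant) - A_int @ np.linalg.pinv(A_int)
    a_sig = steering_vector(theta_sig, n_ant)
    w = P @ a_sig
    return w / (w.conj() @ a_sig)   # normalize to unit gain on the signal

w = zf_beamformer(theta_sig=10.0, thetas_int=[-30.0, 45.0], n_ant=8)
gain_sig = abs(w.conj() @ steering_vector(10.0, 8))
gain_int = abs(w.conj() @ steering_vector(-30.0, 8))
```

Note the pseudoinverse: this is the matrix inversion cost, and the need for known DOAs/CSI, that motivates learned alternatives like MambaBF.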

Updated: 2025-10-09 09:26:26

Domains: eess.SP, cs.ET, cs.LG

Download: http://arxiv.org/abs/2505.07714v2

Average Controlled and Average Natural Micro Direct Effects in Summary Causal Graphs

In this paper, we investigate the identifiability of average controlled direct effects and average natural direct effects in causal systems represented by summary causal graphs, which are abstractions of full causal graphs, often used in dynamic systems where cycles and omitted temporal information complicate causal inference. Unlike in the traditional linear setting, where direct effects are typically easier to identify and estimate, non-parametric direct effects are much harder to define and identify. Yet they are crucial for handling real-world complexities, particularly in epidemiological contexts where the relationships between variables (e.g., genetic, environmental, and behavioral factors) are often non-linear. In particular, we give sufficient conditions for identifying the average controlled micro direct effect and the average natural micro direct effect from summary causal graphs in the presence of hidden confounding. Furthermore, we show that the conditions given for the average controlled micro direct effect also become necessary in the setting where there is no hidden confounding and where we are only interested in identifiability by adjustment.
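For reference, the controlled direct effect discussed here is the standard mediation quantity (Pearl's interventional definition; the paper's contribution concerns when it is identifiable from a summary causal graph rather than the definition itself):

```latex
% Average controlled direct effect of X on Y, holding the mediator M fixed at m:
\mathrm{CDE}(m) \;=\; \mathbb{E}\big[\,Y \mid do(X = x),\, do(M = m)\,\big]
              \;-\; \mathbb{E}\big[\,Y \mid do(X = x'),\, do(M = m)\,\big]
```

The natural direct effect instead lets $M$ take the value it would have attained under the baseline treatment $x'$, which is what makes its identification conditions different.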

Updated: 2025-10-09 09:24:37

Domains: cs.AI, stat.ME

Download: http://arxiv.org/abs/2410.23975v2

ReInAgent: A Context-Aware GUI Agent Enabling Human-in-the-Loop Mobile Task Navigation

Mobile GUI agents exhibit substantial potential to facilitate and automate the execution of user tasks on mobile phones. However, existing mobile GUI agents predominantly privilege autonomous operation and neglect the necessity of active user engagement during task execution. This omission undermines their adaptability to information dilemmas, including ambiguous, dynamically evolving, and conflicting task scenarios, leading to execution outcomes that deviate from genuine user requirements and preferences. To address these shortcomings, we propose ReInAgent, a context-aware multi-agent framework that leverages dynamic information management to enable human-in-the-loop mobile task navigation. ReInAgent integrates three specialized agents around a shared memory module: an information-managing agent for slot-based information management and proactive interaction with the user, a decision-making agent for conflict-aware planning, and a reflecting agent for task reflection and information consistency validation. Through continuous contextual information analysis and sustained user-agent collaboration, ReInAgent overcomes the limitation of existing approaches that rely on clear and static task assumptions. Consequently, it enables more adaptive and reliable mobile task navigation in complex, real-world scenarios. Experimental results demonstrate that ReInAgent effectively resolves information dilemmas and produces outcomes that are more closely aligned with genuine user preferences. Notably, on complex tasks involving information dilemmas, ReInAgent achieves a 25% higher success rate than Mobile-Agent-v2.

Updated: 2025-10-09 09:22:05

Domains: cs.AI

Download: http://arxiv.org/abs/2510.07988v1

Fewer Weights, More Problems: A Practical Attack on LLM Pruning

Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods available in vLLM (Magnitude, Wanda, and SparseGPT) is applied, the pruned model consistently exhibits strong malicious behaviors across a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
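For context, the simplest of the three pruning methods mentioned, magnitude pruning, zeroes the smallest-magnitude weights; a minimal sketch (an illustration of the technique, not vLLM's implementation) is:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out roughly the fraction `sparsity` of entries with smallest |w|."""
    w = np.asarray(weights, dtype=float)
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # threshold = k-th smallest absolute value (ties may prune slightly more)
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

# Pruning half of a small weight vector removes the two smallest entries:
# magnitude_prune([0.1, -2.0, 0.05, 1.5], 0.5) -> [0.0, -2.0, 0.0, 1.5]
```

The attack described above exploits exactly this determinism: an adversary can predict which parameters fall below the threshold and hide behavior accordingly.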

Updated: 2025-10-09 09:17:35

Domains: cs.LG, cs.AI, cs.CR

Download: http://arxiv.org/abs/2510.07985v1

Graphon Mixtures

Social networks have a small number of large hubs, and a large number of small dense communities. We propose a generative model that captures both hub and dense structures. Based on recent results about graphons on line graphs, our model is a graphon mixture, enabling us to generate sequences of graphs where each graph is a combination of sparse and dense graphs. We propose a new condition on sparse graphs (the max-degree), which enables us to identify hubs. We show theoretically that we can estimate the normalized degree of the hubs, as well as estimate the graphon corresponding to sparse components of graph mixtures. We illustrate our approach on synthetic data, citation graphs, and social networks, showing the benefits of explicitly modeling sparse graphs.
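For background, a graphon $W$ generates an $n$-node graph by drawing a uniform latent $U_i$ per node and connecting $i, j$ with probability $W(U_i, U_j)$; this is the standard W-random graph construction (illustrative background, not the paper's mixture model or estimator). The `rho` sparsity factor is an added convention here:

```python
import numpy as np

def sample_graphon(W, n, rho=1.0, seed=0):
    """Sample an undirected graph from a graphon W: [0,1]^2 -> [0,1].

    Each node i gets a latent U_i ~ Uniform[0,1]; edge (i, j) appears
    with probability rho * W(U_i, U_j), where rho controls sparsity.
    Returns the 0/1 adjacency matrix.
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    probs = rho * np.array([[W(u[i], u[j]) for j in range(n)] for i in range(n)])
    coins = rng.uniform(size=(n, n)) < probs
    adj = np.triu(coins, k=1)           # keep i < j only, no self-loops
    return (adj | adj.T).astype(int)    # symmetrize

# The constant graphon W(x, y) = 0.5 is dense Erdos-Renyi(1/2):
A = sample_graphon(lambda x, y: 0.5, n=100)
```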

Updated: 2025-10-09 09:16:06

Domains: stat.ML, cs.DM, cs.LG

Download: http://arxiv.org/abs/2505.13864v2

Is Architectural Complexity Always the Answer? A Case Study on SwinIR vs. an Efficient CNN

The simultaneous restoration of high-frequency details and suppression of severe noise in low-light imagery presents a significant and persistent challenge in computer vision. While large-scale Transformer models like SwinIR have set the state of the art in performance, their high computational cost can be a barrier for practical applications. This paper investigates the critical trade-off between performance and efficiency by comparing the state-of-the-art SwinIR model against a standard, lightweight Convolutional Neural Network (CNN) on this challenging task. Our experimental results reveal a nuanced but important finding. While the Transformer-based SwinIR model achieves a higher peak performance, with a Peak Signal-to-Noise Ratio (PSNR) of 39.03 dB, the lightweight CNN delivers a surprisingly competitive PSNR of 37.4 dB. Crucially, the CNN reached this performance after converging in only 10 epochs of training, whereas the more complex SwinIR model required 132 epochs. This efficiency is further underscored by the model's size; the CNN is over 55 times smaller than SwinIR. This work demonstrates that a standard CNN can provide a near state-of-the-art result with significantly lower computational overhead, presenting a compelling case for its use in real-world scenarios where resource constraints are a primary concern.
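PSNR, the metric compared above, is defined as $10 \log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$; a minimal implementation for 8-bit images (an illustration of the metric, not the paper's evaluation code):

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images."""
    ref = np.asarray(reference, dtype=float)
    out = np.asarray(restored, dtype=float)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")    # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 1 gray level on an 8-bit image gives
# 10 * log10(255^2) ~ 48.13 dB:
# psnr(np.zeros((4, 4)), np.ones((4, 4)))
```

On this log scale the 39.03 dB vs 37.4 dB gap reported above corresponds to roughly a 1.45x difference in mean squared error.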

Updated: 2025-10-09 09:16:05

Domains: cs.CV, cs.AI

Download: http://arxiv.org/abs/2510.07984v1

ZeroCard: Cardinality Estimation with Zero Dependence on Target Databases -- No Data, No Query, No Retraining

Cardinality estimation is a fundamental task in database systems and plays a critical role in query optimization. Despite significant advances in learning-based cardinality estimation methods, most existing approaches remain difficult to generalize to new datasets due to their strong dependence on raw data or queries, thus limiting their practicality in real scenarios. To overcome these challenges, we argue that semantics in the schema may benefit cardinality estimation, and leveraging such semantics may alleviate these dependencies. To this end, we introduce ZeroCard, the first semantics-driven cardinality estimation method that can be applied without any dependence on raw data access, query logs, or retraining on the target database. Specifically, we propose to predict data distributions using schema semantics, thereby avoiding raw data dependence. Then, we introduce a query template-agnostic representation method to alleviate query dependence. Finally, we construct a large-scale query dataset derived from real-world tables and pretrain ZeroCard on it, enabling it to learn cardinality from schema semantics and predicate representations. After pretraining, ZeroCard's parameters can be frozen and applied in an off-the-shelf manner. We conduct extensive experiments to demonstrate the distinct advantages of ZeroCard and show its practical applications in query optimization. Its zero-dependence property significantly facilitates deployment in real-world scenarios.

Updated: 2025-10-09 09:16:01

Domains: cs.DB, cs.AI

Download: http://arxiv.org/abs/2510.07983v1

A 3D Generation Framework from Cross Modality to Parameterized Primitive

Recent advancements in AI-driven 3D model generation have leveraged cross modality, yet generating models with smooth surfaces while minimizing storage overhead remains a challenge. This paper introduces a novel multi-stage framework for generating 3D models composed of parameterized primitives, guided by textual and image inputs. Within the framework, we propose a model generation algorithm based on parameterized primitives, which identifies the shape features of the model's constituent elements and replaces them with parameterized primitives that have high-quality surfaces. In addition, we propose a corresponding model storage method that preserves the original surface quality of the model while retaining only the parameters of the parameterized primitives. Experiments on a virtual scene dataset and a real scene dataset demonstrate the effectiveness of our method, achieving a Chamfer Distance of 0.003092, a VIoU of 0.545, an F1-Score of 0.9139, and an NC of 0.8369, with primitive parameter files approximately 6 KB in size. Our approach is particularly suitable for rapid prototyping of simple models.

Updated: 2025-10-09 09:15:33

Domains: cs.GR, cs.AI, cs.CV

Download: http://arxiv.org/abs/2510.08656v1

Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training

Decentralized training removes the centralized server, making it a communication-efficient approach that can significantly improve training efficiency, but it often suffers from degraded performance compared to centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly reducing the empirical performance gap. However, the theoretical reasons for its effectiveness, and whether this gap can be fully eliminated by MGS, remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. 1) Optimization error reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. 2) Gap to centralization: Even as MGS approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ($\mathcal{O}(T^{\frac{c\beta}{c\beta +1}}/{n m})$ in the centralized setting versus $\mathcal{O}(T^{\frac{2c\beta}{2c\beta +2}}/{n m^{\frac{1}{2c\beta +2}}})$ in the decentralized one). Furthermore, we provide the first unified analysis of how factors like learning rate, data heterogeneity, node count, per-node sample size, and communication topology impact the generalization of MGS under non-convex settings without the bounded gradients assumption, filling a critical theoretical gap in decentralized training. Finally, experiments on CIFAR datasets support our theoretical findings.
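For background, one gossip step mixes each node's parameters with its neighbors' through a doubly-stochastic mixing matrix $W$, and MGS simply applies this mixing $k$ times per communication round. A minimal sketch (illustrative; the ring topology and weights below are assumptions, not the paper's setup):

```python
import numpy as np

def gossip_steps(params, W, k):
    """Apply k gossip (mixing) steps: x <- W @ x.

    params: (n_nodes, dim) array of per-node parameters.
    W: (n_nodes, n_nodes) doubly-stochastic mixing matrix.
    As k grows, every node's row converges to the global average.
    """
    x = np.asarray(params, dtype=float)
    for _ in range(k):
        x = W @ x
    return x

# Ring of 4 nodes, each keeping half its weight and averaging the rest
# with its two neighbors:
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
x0 = np.array([[0.0], [4.0], [8.0], [4.0]])
# With many steps, every node approaches the mean value 4.0,
# mimicking centralized averaging:
x_mixed = gossip_steps(x0, W, k=50)
```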

Updated: 2025-10-09 09:14:47

Domains: cs.LG, cs.AI, cs.NA, math.NA

Download: http://arxiv.org/abs/2510.07980v1

The Poisson Midpoint Method for Langevin Dynamics: Provably Efficient Discretization for Diffusion Models

Langevin Dynamics is a Stochastic Differential Equation (SDE) central to sampling and generative modeling and is implemented via time discretization. Langevin Monte Carlo (LMC), based on the Euler-Maruyama discretization, is the simplest and most studied algorithm. LMC can suffer from slow convergence, requiring a large number of small-step-size iterations to obtain good-quality samples. This becomes stark in the case of diffusion models, where a large number of steps gives the best samples but the quality degrades rapidly as the number of steps is reduced. The Randomized Midpoint Method has recently been proposed as a better discretization of Langevin dynamics for sampling from strongly log-concave distributions. However, important applications such as diffusion models involve non-log-concave densities and contain time-varying drift. We propose its variant, the Poisson Midpoint Method, which approximates small-step-size LMC using large step sizes. We prove that it can obtain a quadratic speed-up over LMC under very weak assumptions. We apply our method to diffusion models for image generation and show that it matches the quality of DDPM with 1000 neural network calls using just 50-80 neural network calls, outperforming ODE-based methods at similar compute.
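The Euler-Maruyama discretization referred to here is the update $x_{k+1} = x_k - \eta \nabla U(x_k) + \sqrt{2\eta}\,\xi_k$ with $\xi_k \sim \mathcal{N}(0, I)$, targeting the density $\propto e^{-U(x)}$. A minimal 1-D sketch of plain LMC (baseline background, not the paper's midpoint method):

```python
import numpy as np

def langevin_monte_carlo(grad_U, x0, step, n_steps, seed=0):
    """Euler-Maruyama discretization of Langevin dynamics.

    Iterates x <- x - step * grad_U(x) + sqrt(2 * step) * noise,
    targeting the density proportional to exp(-U(x)).
    Returns the chain of iterates.
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    chain = []
    for _ in range(n_steps):
        x = x - step * grad_U(x) + np.sqrt(2 * step) * rng.standard_normal()
        chain.append(x)
    return np.array(chain)

# Target N(0, 1): U(x) = x^2 / 2, so grad_U(x) = x.
samples = langevin_monte_carlo(lambda x: x, x0=5.0, step=0.1, n_steps=20000)
# After burn-in, the empirical mean is near 0 and the variance near 1
# (with a small O(step) discretization bias in the variance).
```

The midpoint methods discussed above aim to shrink exactly this discretization bias so that far fewer, larger steps suffice.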

Updated: 2025-10-09 09:13:57

Domains: cs.LG, cs.NA, math.NA, stat.ML

Download: http://arxiv.org/abs/2405.17068v3

Reducing Cognitive Overhead in Tool Use via Multi-Small-Agent Reinforcement Learning

Recent advances in multi-agent systems highlight the potential of specialized small agents that collaborate via division of labor. Existing tool-integrated reasoning systems, however, often follow a single-agent paradigm in which one large model interleaves long-horizon reasoning with precise tool operations, leading to cognitive-load interference and unstable coordination. We present MSARL, a Multi-Small-Agent Reinforcement Learning framework that explicitly decouples reasoning from tool use. In MSARL, a Reasoning Agent decomposes problems and plans tool invocations, while multiple Tool Agents specialize in specific external tools, each trained via a combination of imitation learning and reinforcement learning with role-specific rewards. On mathematical problem solving with code execution, MSARL significantly improves reasoning stability and final-answer accuracy over single-agent baselines. Moreover, the architecture generalizes to diverse tool-use tasks, demonstrating that cognitive-role decoupling with small agents is a scalable blueprint for multi-agent AI design.

Updated: 2025-10-09 09:11:56

Domains: cs.AI

Download: http://arxiv.org/abs/2508.08882v3

VoiceAgentBench: Are Voice Assistants ready for agentic tasks?

Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks primarily focus on isolated capabilities such as transcription or question-answering, and do not systematically evaluate agentic scenarios encompassing multilingual and cultural understanding, as well as adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark designed to evaluate SpeechLMs in realistic spoken agentic settings. It comprises over 5,500 synthetic spoken queries, including dialogues grounded in the Indian context, covering single-tool invocations, multi-tool workflows, multi-turn interactions, and safety evaluations. The benchmark supports English, Hindi, and 5 other Indian languages, reflecting real-world linguistic and cultural diversity. We simulate speaker variability using a novel sampling algorithm that selects audios for TTS voice conversion based on their speaker embeddings, maximizing acoustic and speaker diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Our experiments reveal significant gaps in contextual tool orchestration tasks, Indic generalization, and adversarial robustness, exposing critical limitations of current SpeechLMs.

Updated: 2025-10-09 09:11:38

Domains: cs.AI, cs.CL, cs.LG

Download: http://arxiv.org/abs/2510.07978v1

Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

With the rapid advancement of large language models (LLMs), ensuring their safe use becomes increasingly critical. Fine-tuning is a widely used method for adapting models to downstream tasks, yet it is vulnerable to jailbreak attacks. However, most existing studies focus on overly simplified attack scenarios, limiting their practical relevance to real-world defense settings. To make this risk concrete, we present a three-pronged jailbreak attack and evaluate it against provider defenses under a dataset-only black-box fine-tuning interface. In this setting, the attacker can only submit fine-tuning data to the provider, while the provider may deploy defenses across stages: (1) pre-upload data filtering, (2) training-time defensive fine-tuning, and (3) post-training safety audit. Our attack combines safety-styled prefix/suffix wrappers, benign lexical encodings (underscoring) of sensitive tokens, and a backdoor mechanism, enabling the model to learn harmful behaviors while individual datapoints appear innocuous. Extensive experiments demonstrate the effectiveness of our approach. In real-world deployment, our method successfully jailbreaks GPT-4.1 and GPT-4o on the OpenAI platform with attack success rates above 97% for both models. Our code is available at https://github.com/lxf728/tri-pronged-ft-attack.

Updated: 2025-10-09 09:10:33

Domains: cs.CR

Download: http://arxiv.org/abs/2510.01342v2

Empirical evaluation of normalizing flows in Markov Chain Monte Carlo

Recent advances in MCMC use normalizing flows to precondition target distributions and enable jumps to distant regions. However, there is currently no systematic comparison of different normalizing flow architectures for MCMC. As such, many works choose simple flow architectures that are readily available and do not consider other models. Guidelines for choosing an appropriate architecture would reduce analysis time for practitioners and motivate researchers to take the recommended models as foundations to be improved. We provide the first such guideline by extensively evaluating many normalizing flow architectures on various flow-based MCMC methods and target distributions. When the target density gradient is available, we show that flow-based MCMC outperforms classic MCMC for suitable NF architecture choices with minor hyperparameter tuning. When the gradient is unavailable, flow-based MCMC wins with off-the-shelf architectures. We find contractive residual flows to be the best general-purpose models with relatively low sensitivity to hyperparameter choice. We also provide various insights into normalizing flow behavior within MCMC when varying their hyperparameters, properties of target distributions, and the overall computational budget.

Updated: 2025-10-09 09:09:02

Domains: cs.LG, stat.CO, stat.ML, 62-08, G.3; I.5.1; I.6.4

Download: http://arxiv.org/abs/2412.17136v2

Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation

Enabling robots to perform precise and generalized manipulation in unstructured environments remains a fundamental challenge in embodied AI. While Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning, a significant gap persists between their high-level understanding and the precise physical execution required for real-world manipulation. To bridge this "semantic-to-physical" gap, we introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts (EAC): mathematically defined blueprints that encode object affordances, geometric constraints, and the semantics of manipulation. Our approach integrates a structured policy scaffolding pipeline that turns natural language instructions and visual information into an instantiated EAC, from which we derive grasp poses and force directions and plan physically feasible motion trajectories for robot execution. GRACE thus provides a unified and interpretable interface between high-level instruction understanding and low-level robot control, effectively enabling precise and generalizable manipulation through semantic-physical grounding. Extensive experiments demonstrate that GRACE achieves strong zero-shot generalization across a variety of articulated objects in both simulated and real-world environments, without requiring task-specific training.

Updated: 2025-10-09 09:08:33

Domains: cs.RO, cs.AI

Download: http://arxiv.org/abs/2510.07975v1

Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis

The design of safety-critical agents based on large language models (LLMs) requires more than simple prompt engineering. This paper presents a comprehensive information-theoretic analysis of how rule encodings in system prompts influence attention mechanisms and compliance behaviour. We demonstrate that rule formats with low syntactic entropy and highly concentrated anchors reduce attention entropy and improve pointer fidelity, but reveal a fundamental trade-off between anchor redundancy and attention entropy that previous work failed to recognize. Through formal analysis of multiple attention architectures including causal, bidirectional, local sparse, kernelized, and cross-attention mechanisms, we establish bounds on pointer fidelity and show how anchor placement strategies must account for competing fidelity and entropy objectives. Combining these insights with a dynamic rule verification architecture, we provide a formal proof that hot reloading of verified rule sets increases the asymptotic probability of compliant outputs. These findings underscore the necessity of principled anchor design and dual enforcement mechanisms to protect LLM-based agents against prompt injection attacks while maintaining compliance in evolving domains.
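The attention entropy analyzed here is the Shannon entropy of a softmax attention distribution over tokens; a minimal sketch of the quantity itself (an illustration, not the paper's formal framework):

```python
import numpy as np

def attention_entropy(scores):
    """Shannon entropy (in nats) of the softmax of attention scores.

    A row that concentrates on a single anchor token has entropy near 0;
    a uniform row over n tokens has the maximum entropy log(n).
    """
    s = np.asarray(scores, dtype=float)
    p = np.exp(s - s.max())        # numerically stable softmax
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# Uniform scores over 4 tokens -> log(4) ~ 1.386 nats;
# a sharply peaked score vector like [10, 0, 0, 0] -> entropy near 0.
```

Low-entropy rows concentrated on rule anchors are what the paper associates with higher pointer fidelity.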

Updated: 2025-10-09 09:08:05

Domains: cs.AI

Download: http://arxiv.org/abs/2510.05106v2

Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning

While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through detailed analysis of DeepSeek-R1's reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like "tricky" and "confused" when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents' subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% on Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.

Updated: 2025-10-09 09:07:31

Domains: cs.CL, cs.AI

Download: http://arxiv.org/abs/2510.07974v1

Knowledge Graph Sparsification for GNN-based Rare Disease Diagnosis

Rare genetic disease diagnosis faces critical challenges: insufficient patient data, inaccessible full genome sequencing, and the immense number of possible causative genes. These limitations cause prolonged diagnostic journeys, inappropriate treatments, and critical delays, disproportionately affecting patients in resource-limited settings where diagnostic tools are scarce. We propose RareNet, a subgraph-based Graph Neural Network that requires only patient phenotypes to identify the most likely causal gene and retrieve focused patient subgraphs for targeted clinical investigation. RareNet can function as a standalone method or serve as a pre-processing or post-processing filter for other candidate gene prioritization methods, consistently enhancing their performance while potentially enabling explainable insights. Through comprehensive evaluation on two biomedical datasets, we demonstrate competitive and robust causal gene prediction and significant performance gains when integrated with other frameworks. By requiring only phenotypic data, which is readily available in any clinical setting, RareNet democratizes access to sophisticated genetic analysis, offering particular value for underserved populations lacking advanced genomic infrastructure.

Updated: 2025-10-09 09:05:06

标题: 知识图谱稀疏化用于基于GNN的罕见疾病诊断

摘要: 罕见遗传病诊断面临着关键挑战:患者数据不足、全基因组测序不可获得,以及可能致病基因数量庞大。这些限制导致诊断过程延长、治疗不当和关键延迟,严重影响资源有限的地区的患者,在这些地区,诊断工具匮乏。我们提出了RareNet,这是一种基于子图的图神经网络,仅需要患者表型即可识别最可能的致病基因,并检索针对性的患者子图进行有针对性的临床调查。RareNet可以作为独立方法运行,也可以作为其他候选基因优先排序方法的预处理或后处理过滤器,不断提高它们的性能,同时可能提供可解释的见解。通过在两个生物医学数据集上进行全面评估,我们展示了竞争力强和稳健的致病基因预测,以及与其他框架集成时显著的性能提升。通过仅需表型数据,这在任何临床环境中都是可获得的,RareNet使得复杂的遗传分析变得更加民主化,为缺乏先进基因组基础设施的弱势人群提供特殊价值。

更新时间: 2025-10-09 09:05:06

领域: cs.LG,cs.AI,q-bio.GN

下载: http://arxiv.org/abs/2510.08655v1

TaoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance

Query-product relevance analysis is a foundational technology in e-commerce search engines and has become increasingly important in AI-driven e-commerce. The recent emergence of large language models (LLMs), particularly their chain-of-thought (CoT) reasoning capabilities, offers promising opportunities for developing relevance systems that are both more interpretable and more robust. However, existing training paradigms have notable limitations: SFT and DPO suffer from poor generalization on long-tail queries and from a lack of fine-grained, stepwise supervision to enforce rule-aligned reasoning. In contrast, reinforcement learning with verification rewards (RLVR) suffers from sparse feedback, which provides insufficient signal to correct erroneous intermediate steps, thereby undermining logical consistency and limiting performance in complex inference scenarios. To address these challenges, we introduce the Stepwise Hybrid Examination Reinforcement Learning framework for Taobao Search Relevance (TaoSR-SHE). At its core is Stepwise Reward Policy Optimization (SRPO), a reinforcement learning algorithm that leverages step-level rewards generated by a hybrid of a high-quality generative stepwise reward model and a human-annotated offline verifier, prioritizing learning from critical correct and incorrect reasoning steps. TaoSR-SHE further incorporates two key techniques: diversified data filtering to encourage exploration across varied reasoning paths and mitigate policy entropy collapse, and multi-stage curriculum learning to foster progressive capability growth. Extensive experiments on real-world search benchmarks show that TaoSR-SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness.

Updated: 2025-10-09 09:03:15

标题: TaoSR-SHE:面向电子商务搜索相关性的逐步混合检验强化学习框架

摘要: 查询产品相关性分析是电子商务搜索引擎中的基础技术,并在人工智能驱动的电子商务中变得越来越重要。最近出现的大型语言模型(LLMs),特别是它们的思维链(CoT)推理能力,为开发更具可解释性和更稳健的相关性系统提供了有希望的机会。然而,现有的训练范式存在明显的局限性:SFT和DPO在长尾查询上的泛化能力差,缺乏细粒度的、逐步的监督来强制规则对齐推理。相比之下,基于验证奖励的强化学习(RLVR)受到稀疏反馈的困扰,无法提供足以纠正错误中间步骤的信号,从而破坏了逻辑一致性,并限制了在复杂推理场景中的性能。 为了解决这些挑战,我们引入了适用于淘宝搜索相关性(TaoSR-SHE)的逐步混合检验强化学习框架。其核心是逐步奖励策略优化(SRPO),这是一种利用由高质量生成式逐步奖励模型和人工注释的离线验证器生成的步级奖励的强化学习算法,优先学习关键的正确和不正确的推理步骤。TaoSR-SHE进一步融合了两个关键技术:多样化数据过滤以鼓励探索各种推理路径并减轻策略熵崩溃,以及多阶段课程学习以促进渐进式能力增长。对真实世界搜索基准的大量实验证明,TaoSR-SHE在大规模电子商务环境中提高了推理质量和相关性预测准确性,超过了SFT、DPO、GRPO和其他基线,同时提高了可解释性和稳健性。

更新时间: 2025-10-09 09:03:15

领域: cs.AI

下载: http://arxiv.org/abs/2510.07972v1

Climate Surrogates for Scalable Multi-Agent Reinforcement Learning: A Case Study with CICERO-SCM

Climate policy studies require models that capture the combined effects of multiple greenhouse gases on global temperature, but these models are computationally expensive and difficult to embed in reinforcement learning. We present a multi-agent reinforcement learning (MARL) framework that integrates a high-fidelity, highly efficient climate surrogate directly in the environment loop, enabling regional agents to learn climate policies under multi-gas dynamics. As a proof of concept, we introduce a recurrent neural network architecture pretrained on $20{,}000$ multi-gas emission pathways to surrogate the climate model CICERO-SCM. The surrogate model attains near-simulator accuracy with global-mean temperature RMSE $\approx 0.0004 \mathrm{K}$ and approximately $1000\times$ faster one-step inference. When substituted for the original simulator in a climate-policy MARL setting, it accelerates end-to-end training by $>\!100\times$. We show that the surrogate and simulator converge to the same optimal policies and propose a methodology to assess this property in cases where using the simulator is intractable. Our work allows us to bypass the core computational bottleneck without sacrificing policy fidelity, enabling large-scale multi-agent experiments across alternative climate-policy regimes with multi-gas dynamics and high-fidelity climate response.

Updated: 2025-10-09 09:02:49

标题: 气候替代品用于可扩展的多智能体强化学习:以CICERO-SCM为案例研究

摘要: 气候政策研究需要能够捕捉多种温室气体对全球气温综合影响的模型,但这些模型计算成本高昂且难以嵌入强化学习中。我们提出了一个多智能体强化学习(MARL)框架,将一个高保真、高效的气候替代模型直接集成到环境循环中,使区域智能体能够在多气体动态下学习气候政策。作为概念验证,我们引入了一个在20,000条多气体排放路径上预训练的循环神经网络架构,用于替代气候模型CICERO-SCM。替代模型达到了接近模拟器的准确性,全球平均温度均方根误差约为0.0004K,单步推理速度约快1000倍。当在气候政策MARL环境中替代原始模拟器时,它将端到端训练加速了超过100倍。我们展示了替代模型和模拟器收敛到相同的最优政策,并提出了一种在无法使用模拟器的情况下评估这一性质的方法。我们的工作允许绕过核心计算瓶颈,而不损害政策的忠实性,从而实现跨多种气候政策制度的大规模多智能体实验,涵盖多气体动态和高保真气候响应。

更新时间: 2025-10-09 09:02:49

领域: cs.LG,cs.MA

下载: http://arxiv.org/abs/2510.07971v1

Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models

Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.

Updated: 2025-10-09 09:02:05

标题: 安全控制:一种安全补丁,用于减轻文本到图像生成模型中的不安全内容

摘要: 尽管文本到图像(T2I)生成模型取得了进展,但它们被误用甚至滥用的可能性引起了严重的安全关注。模型开发人员已经做出了巨大的努力,为T2I模型引入了可以解决这些问题的安全机制。然而,现有的安全机制,无论是外部的还是内部的,要么仍然容易在分布变化下被规避,要么需要进行大量的特定于模型的调整。为了解决这些限制,我们引入了Safe-Control,这是一种创新的即插即用安全补丁,旨在减轻T2I模型中不安全内容生成的问题。使用数据驱动的策略和安全意识条件,Safe-Control将安全控制信号注入到锁定的T2I模型中,以一种类似补丁的方式进行更新。模型开发人员还可以构建各种安全补丁来满足不断发展的安全要求,这些补丁可以灵活地合并为一个统一的补丁。其即插即用设计进一步确保了适应性,使其与类似去噪架构的其他T2I模型兼容。我们对六种不同的公共T2I模型进行了广泛评估。实证结果表明,Safe-Control在减少六种具有相似生成架构的不同T2I模型中不安全内容生成方面是有效的,同时成功地保持了良性图像的质量和文本对齐。与七种最先进的安全机制(包括外部和内部防御)相比,Safe-Control在减少不安全内容生成方面显著优于所有基线。例如,在不安全提示和最新的对抗性攻击下,它将不安全内容生成的概率降低到7%,而大多数基线方法约为20%。

更新时间: 2025-10-09 09:02:05

领域: cs.CV,cs.AI,cs.CR

下载: http://arxiv.org/abs/2508.21099v2

From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses

Large Language Models (LLMs) have shown remarkable performance across various applications, but their deployment in sensitive domains raises significant concerns. To mitigate these risks, numerous defense strategies have been proposed. However, most existing studies assess these defenses in isolation, overlooking their broader impacts across other risk dimensions. In this work, we take the first step in investigating unintended interactions caused by defenses in LLMs, focusing on the complex interplay between safety, fairness, and privacy. Specifically, we propose CrossRiskEval, a comprehensive evaluation framework to assess whether deploying a defense targeting one risk inadvertently affects others. Through extensive empirical studies on 14 defense-deployed LLMs, covering 12 distinct defense strategies, we reveal several alarming side effects: 1) safety defenses may suppress direct responses to sensitive queries related to bias or privacy, yet still amplify indirect privacy leakage or biased outputs; 2) fairness defenses increase the risk of misuse and privacy leakage; 3) privacy defenses often impair safety and exacerbate bias. We further conduct a fine-grained neuron-level analysis to uncover the underlying mechanisms of these phenomena. Our analysis reveals the existence of conflict-entangled neurons in LLMs that exhibit opposing sensitivities across multiple risk dimensions. Further trend consistency analysis at both task and neuron levels confirms that these neurons play a key role in mediating the emergence of unintended behaviors following defense deployment. We call for a paradigm shift in LLM risk evaluation, toward holistic, interaction-aware assessment of defense strategies.

Updated: 2025-10-09 09:00:00

标题: 从“卫士”到“魔鬼”?由LLM防御引发的意外风险相互作用

摘要: 大型语言模型(LLMs)在各种应用中表现出卓越性能,但它们在敏感领域的部署引发了重大关注。为了减轻这些风险,已经提出了许多防御策略。然而,大多数现有研究都是孤立评估这些防御措施,忽视了它们在其他风险维度上的更广泛影响。在这项工作中,我们迈出了第一步,研究了LLMs中由防御措施引起的意外相互作用,重点关注安全性、公平性和隐私之间的复杂相互作用。具体而言,我们提出了CrossRiskEval,一个全面评估框架,用于评估部署一个针对一个风险的防御是否无意中影响其他风险。通过对14个部署了防御的LLMs进行广泛的实证研究,涵盖了12种不同的防御策略,我们揭示了几个令人震惊的副作用:1)安全防御可能抑制与偏见或隐私有关的敏感查询的直接响应,但仍可能增加间接的隐私泄露或有偏见的输出;2)公平性防御增加了滥用和隐私泄露的风险;3)隐私防御通常会损害安全性,加剧偏见。我们进一步进行了细粒度的神经元级分析,揭示了这些现象的潜在机制。我们的分析揭示了LLMs中存在着在多个风险维度上表现出相反敏感性的冲突纠缠神经元。在任务和神经元级别进行的趋势一致性分析进一步确认了这些神经元在防御部署后引发意外行为的出现中发挥了关键作用。我们呼吁在LLM风险评估中实现一种范式转变,朝着全面、互动感知的防御策略评估。

更新时间: 2025-10-09 09:00:00

领域: cs.CR

下载: http://arxiv.org/abs/2510.07968v1

A pseudo-inverse of a line graph

Line graphs are an alternative representation of graphs where each vertex of the original (root) graph becomes an edge. However not all graphs have a corresponding root graph, hence the transformation from graphs to line graphs is not invertible. We investigate the case when there is a small perturbation in the space of line graphs, and try to recover the corresponding root graph, essentially defining the inverse of the line graph operation. We propose a linear integer program that edits the smallest number of edges in the line graph, that allow a root graph to be found. We use the spectral norm to theoretically prove that such a pseudo-inverse operation is well behaved. Illustrative empirical experiments on Erd\H{o}s-R\'enyi graphs show that our theoretical results work in practice.
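As a quick illustration of the object under study, the line-graph construction can be written in a few lines of pure Python (names are ours): each edge of the root graph becomes a vertex of L(G), and two vertices are adjacent iff the underlying edges share an endpoint.

```python
from itertools import combinations

def line_graph(edges):
    """Vertices of L(G) are the edges of G (as sorted tuples); two vertices
    are adjacent iff the corresponding edges share an endpoint."""
    nodes = [tuple(sorted(e)) for e in edges]
    adj = [
        (e1, e2)
        for e1, e2 in combinations(nodes, 2)
        if set(e1) & set(e2)  # shared endpoint
    ]
    return nodes, adj

# The path P4 (edges 01, 12, 23) maps to the path P3 in line-graph space.
nodes, adj = line_graph([(0, 1), (1, 2), (2, 3)])
```

The forward map is easy; the inverse is the hard part the paper addresses. Not every graph is a line graph, so a small perturbation in line-graph space can leave no valid root graph, and the map is not even injective where it is defined (the triangle K3 and the star K1,3 share the same line graph K3) — hence the need for an edit-minimizing pseudo-inverse.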

Updated: 2025-10-09 08:59:58

标题: 一个线图的伪逆

摘要: 线图是图的一种替代表示形式,其中原始(根)图的每个顶点都变成了一条边。然而,并非所有图都有一个对应的根图,因此从图到线图的转换是不可逆的。我们研究了当线图空间中存在微小扰动时的情况,并试图恢复相应的根图,从而实质上定义了线图操作的逆操作。我们提出了一个线性整数规划,它编辑了线图中最小数量的边,以便找到一个根图。我们使用谱范数在理论上证明了这种伪逆操作的良好性能。在Erd\H{o}s-R\'enyi图上进行了说明性的实验,证明了我们的理论结果在实践中是有效的。

更新时间: 2025-10-09 08:59:58

领域: stat.ML,cs.LG,math.OC

下载: http://arxiv.org/abs/2508.09412v2

Stick-Breaking Mixture Normalizing Flows with Component-Wise Tail Adaptation for Variational Inference

Normalizing flows with a Gaussian base provide a computationally efficient way to approximate posterior distributions in Bayesian inference, but they often struggle to capture complex posteriors with multimodality and heavy tails. We propose a stick-breaking mixture base with component-wise tail adaptation (StiCTAF) for posterior approximation. The method first learns a flexible mixture base to mitigate the mode-seeking bias of reverse KL divergence through a weighted average of component-wise ELBOs. It then estimates local tail indices of unnormalized densities and finally refines each mixture component using a shared backbone combined with component-specific tail transforms calibrated by the estimated indices. This design enables accurate mode coverage and anisotropic tail modeling while retaining exact density evaluation and stable optimization. Experiments on synthetic posteriors demonstrate improved tail recovery and better coverage of multiple modes compared to benchmark models. We also present a real-data analysis illustrating the practical benefits of our approach for posterior inference.
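For readers unfamiliar with the term, the stick-breaking construction the mixture base builds on is a standard identity: break a fraction $v_k$ off the remaining stick at each step, so $w_k = v_k \prod_{j<k}(1 - v_j)$. A minimal sketch:

```python
# Stick-breaking weights: each v_k in (0, 1) takes a fraction of the stick
# that remains after the previous breaks.
def stick_breaking_weights(vs):
    """w_k = v_k * prod_{j<k} (1 - v_j)."""
    weights, remaining = [], 1.0
    for v in vs:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

w = stick_breaking_weights([0.5, 0.5, 0.5])  # -> [0.5, 0.25, 0.125]
```

The weights always sum to less than one for a finite number of breaks (the leftover stick), which is what makes the construction convenient for mixtures whose number of active components is learned rather than fixed.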

Updated: 2025-10-09 08:57:27

标题: 用于变分推断的具有分量级尾部自适应的折棍混合归一化流

摘要: 使用具有高斯基底的归一化流可以提供一种在贝叶斯推断中近似后验分布的计算高效方式,但它们通常难以捕捉具有多峰性和重尾特征的复杂后验分布。我们提出了一种具有分量级尾部自适应的折棍(stick-breaking)混合基底(StiCTAF)用于后验分布的近似。该方法首先学习一个灵活的混合基底,通过分量级ELBO的加权平均来减轻反向KL散度的模式寻求偏差。然后估计未归一化密度的局部尾部指数,最后利用共享主干结合由估计指数校准的分量特定尾部变换来细化每个混合分量。这种设计实现了准确的模式覆盖和各向异性尾部建模,同时保留了精确的密度评估和稳定的优化。合成后验上的实验表明,与基准模型相比,尾部恢复和多模式覆盖均有所改善。我们还展示了一个真实数据分析,说明了我们的方法对后验推断的实际益处。

更新时间: 2025-10-09 08:57:27

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2510.07965v1

PRESCRIBE: Predicting Single-Cell Responses with Bayesian Estimation

In single-cell perturbation prediction, a central task is to forecast the effects of perturbing a gene unseen in the training data. The efficacy of such predictions depends on two factors: (1) the similarity of the target gene to those covered in the training data, which informs model (epistemic) uncertainty, and (2) the quality of the corresponding training data, which reflects data (aleatoric) uncertainty. Both factors are critical for determining the reliability of a prediction, particularly as gene perturbation is an inherently stochastic biochemical process. In this paper, we propose PRESCRIBE (PREdicting Single-Cell Response wIth Bayesian Estimation), a multivariate deep evidential regression framework designed to measure both sources of uncertainty jointly. Our analysis demonstrates that PRESCRIBE effectively estimates a confidence score for each prediction, which strongly correlates with its empirical accuracy. This capability enables the filtering of untrustworthy results, and in our experiments, it achieves steady accuracy improvements of over 3% compared to comparable baselines.

Updated: 2025-10-09 08:57:11

标题: PRESCRIBE: 使用贝叶斯估计预测单细胞反应

摘要: 在单细胞干扰预测中,一个核心任务是预测在训练数据中未曾见过的基因干扰的影响。这种预测的有效性取决于两个因素:(1)目标基因与训练数据中涵盖的基因的相似性,这影响模型(认识论)的不确定性;(2)相应训练数据的质量,反映数据(随机)的不确定性。这两个因素对于确定预测的可靠性至关重要,尤其是基因干扰是一种固有的随机生物化学过程。在本文中,我们提出了PRESCRIBE(利用贝叶斯估计预测单细胞反应),这是一个设计用于同时测量两种不确定性来源的多元深度证据回归框架。我们的分析表明,PRESCRIBE有效地为每个预测估计置信度分数,这与其实证准确性强相关。这种能力使得可以过滤出不可信的结果,在我们的实验中,与可比较的基准方法相比,它实现了超过3%的稳定准确性改进。

更新时间: 2025-10-09 08:57:11

领域: cs.LG,q-bio.QM

下载: http://arxiv.org/abs/2510.07964v1

LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner
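The expert-amateur contrast at the heart of the sampling stage can be illustrated with a toy divergence score. This is a hypothetical proxy of ours — the paper's actual selection criterion may differ — but it captures the idea: steps where the expert's next-token distribution departs most from the amateur's are treated as high-value reasoning moments.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over a shared vocabulary, given as dicts of probabilities."""
    return sum(pv * math.log(pv / max(q.get(tok, 0.0), eps))
               for tok, pv in p.items() if pv > 0)

def critical_steps(expert_dists, amateur_dists, top_k=1):
    """Rank reasoning steps by expert-amateur divergence (assumed proxy for
    the 'high-value reasoning moments' in the abstract)."""
    scores = [kl_divergence(e, a) for e, a in zip(expert_dists, amateur_dists)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]

# Step 0: both models agree on the easy token; step 1: only the expert is sure.
expert = [{"2": 0.9, "3": 0.1}, {"4": 0.95, "5": 0.05}]
amateur = [{"2": 0.85, "3": 0.15}, {"4": 0.30, "5": 0.70}]
```

Under this proxy, step 1 is the one worth distilling: the expert's confidence there reflects a capability the amateur lacks, whereas step 0 carries little learning value — matching the abstract's claim that only a fraction of tokens matter.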

Updated: 2025-10-09 08:55:12

标题: LightReasoner:小语言模型能教大语言模型推理吗?

摘要: 大型语言模型(LLMs)通过监督微调(SFT)在推理方面取得了显著进展。然而,SFT需要大量资源,依赖于大型策划数据集、拒绝采样演示,并且在所有标记上进行统一优化,尽管只有一小部分具有有意义的学习价值。在这项工作中,我们探讨了一个反直觉的想法:较小的语言模型(SLMs)是否可以通过展示反映后者独特优势的高价值推理时刻来教导较大的语言模型(LLMs)?我们提出了LightReasoner,一个新颖的框架,利用了更强的专家模型(LLM)和较弱的业余模型(SLM)之间的行为差异。LightReasoner分为两个阶段运作:(1)一个采样阶段,通过专家-业余对比找出关键推理时刻并构建捕捉专家优势的监督示例;(2)一个微调阶段,将专家模型与这些提炼的示例对齐,增强其推理优势。在七个数学基准测试中,LightReasoner将准确率提高了最高28.1%,同时将时间消耗减少了90%,采样问题减少了80%,调整的标记使用减少了99%,而且均不依赖真实标签(ground truth)。通过将较弱的SLMs转化为有效的教学信号,LightReasoner为推进LLM推理提供了一种可扩展且资源高效的方法。源代码可在以下网址获取:https://github.com/HKUDS/LightReasoner

更新时间: 2025-10-09 08:55:12

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.07962v1

A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG

Wearable EEG devices have emerged as a promising alternative to polysomnography (PSG). As affordable and scalable solutions, their widespread adoption results in the collection of massive volumes of unlabeled data that cannot be analyzed by clinicians at scale. Meanwhile, the recent success of deep learning for sleep scoring has relied on large annotated datasets. Self-supervised learning (SSL) offers an opportunity to bridge this gap, leveraging unlabeled signals to address label scarcity and reduce annotation effort. In this paper, we present the first systematic evaluation of SSL for sleep staging using wearable EEG. We investigate a range of well-established SSL methods and evaluate them on two sleep databases acquired with the Ikon Sleep wearable EEG headband: BOAS, a high-quality benchmark containing PSG and wearable EEG recordings with consensus labels, and HOGAR, a large collection of home-based, self-recorded, and unlabeled recordings. Three evaluation scenarios are defined to study label efficiency, representation quality, and cross-dataset generalization. Results show that SSL consistently improves classification performance by up to 10% over supervised baselines, with gains particularly evident when labeled data is scarce. SSL achieves clinical-grade accuracy above 80% leveraging only 5% to 10% of labeled data, while the supervised approach requires twice the labels. Additionally, SSL representations prove robust to variations in population characteristics, recording environments, and signal quality. Our findings demonstrate the potential of SSL to enable label-efficient sleep staging with wearable EEG, reducing reliance on manual annotations and advancing the development of affordable sleep monitoring systems.

Updated: 2025-10-09 08:54:10

标题: 一种针对佩戴式脑电图的标签有效性睡眠分期的自监督学习的系统评估

摘要: 可穿戴脑电图(EEG)设备已经成为一种有前途的替代多导睡眠图(PSG)的选择。作为价格实惠且可扩展的解决方案,它们的广泛应用导致了大量未标记数据的收集,而临床医生无法大规模分析这些数据。与此同时,最近深度学习在睡眠评分方面取得的成功依赖于大规模的注释数据集。自监督学习(SSL)为弥合这一差距提供了机会,利用未标记信号来解决标签稀缺性问题并减少注释工作量。在本文中,我们首次系统评估了使用可穿戴 EEG 进行睡眠分期的 SSL。我们研究了一系列成熟的 SSL 方法,并在使用 Ikon Sleep 可穿戴 EEG 头带获取的两个睡眠数据库上进行评估:BOAS,一个包含 PSG 和可穿戴 EEG 记录并带有一致性标签的高质量基准数据库,以及 HOGAR,一个包含家庭自录且未标记记录的大型集合。定义了三种评估场景以研究标签效率、表示质量和跨数据集泛化。结果表明,SSL 能够将分类性能相对于监督基线稳定提升最多 10%,在标记数据稀缺时增益尤为明显。SSL 仅利用 5% 到 10% 的标记数据即可实现高于 80% 的临床级准确性,而监督方法则需要两倍的标签。此外,SSL 表示对于不同人口特征、记录环境和信号质量的变化表现出鲁棒性。我们的研究结果展示了 SSL 在利用可穿戴 EEG 实现标签高效睡眠分期方面的潜力,减少对手动注释的依赖,并推动经济实惠的睡眠监测系统的发展。

更新时间: 2025-10-09 08:54:10

领域: cs.HC,cs.AI,cs.LG

下载: http://arxiv.org/abs/2510.07960v1

DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that $\textit{maximise diversity in model responses}$. Our method, $\textbf{Diversifying Sample Condensation (DISCO)}$, selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. $\textbf{DISCO}$ shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.
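A minimal sketch of the disagreement-based selection described above — our assumed reading of the method: greedy, sample-wise statistics, no clustering. Samples on which a panel of models disagrees most become the anchor set.

```python
def disagreement(votes):
    """Fraction of model pairs that disagree on one sample's prediction."""
    n = len(votes)
    pairs = n * (n - 1) / 2
    agree = sum(votes.count(v) * (votes.count(v) - 1) / 2 for v in set(votes))
    return 1.0 - agree / pairs

def select_anchors(predictions, k):
    """predictions[i][m] = model m's label on sample i; keep top-k by disagreement."""
    scored = sorted(range(len(predictions)),
                    key=lambda i: disagreement(predictions[i]), reverse=True)
    return scored[:k]

preds = [
    ["A", "A", "A", "A"],  # full agreement -> zero information about ranking
    ["A", "B", "A", "B"],  # split vote -> maximally discriminative sample
    ["A", "A", "A", "B"],
]
```

The intuition matches the abstract: a sample every model answers identically cannot separate strong models from weak ones, so it contributes nothing to performance prediction; the split-vote sample is the one worth keeping.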

Updated: 2025-10-09 08:53:59

标题: DISCO:多样化样本浓缩以提高模型评估效率

摘要: 评估现代机器学习模型已经变得难以承受。像LMMs-Eval和HELM这样的基准测试每个模型需要成千上万个GPU小时。昂贵的评估降低了包容性,减缓了创新速度,并加剧了环境影响。典型的方法包括两个步骤。首先,选择一组锚定数据子集。其次,训练一个从该子集准确性到最终测试结果的映射。缺点是锚定选择取决于聚类,这可能会复杂且对设计选择敏感。我们认为促进样本间的多样性并不是必要的;重要的是选择能够最大程度地增加模型响应多样性的样本。我们的方法,Diversifying Sample Condensation(DISCO),选择具有最大模型分歧的前k个样本。这使用了贪婪的、逐个样本的统计数据,而不是全局聚类。该方法在概念上更简单。从理论上看,模型间的分歧为这种贪婪选择提供了信息论上的最优规则。DISCO在先前方法上显示出实证收益,在MMLU、Hellaswag、Winogrande和ARC的性能预测中取得了最先进的结果。代码可以在此处找到:https://github.com/arubique/disco-public。

更新时间: 2025-10-09 08:53:59

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.07959v1

A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning

Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search
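A plausible set-level form of the $\mathrm{AnsF1}$ reward — an assumption on our part, since the abstract does not spell out the formula — compares the predicted answer set against all valid answers, so a model is rewarded for partial coverage of an ambiguous question rather than penalized for missing the single "gold" answer:

```python
def ans_f1(predicted, gold):
    """Set-level F1 between predicted answers and the set of valid answers
    (assumed form; the paper's exact definition may differ)."""
    pred, gold = set(predicted), set(gold)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# An ambiguous question with two valid answers rewards partial coverage:
score = ans_f1({"1994"}, {"1994", "1995"})
```

Here the single correct-but-incomplete answer earns 2/3 rather than the 0 or 1 a single-gold benchmark would assign, which is the smoother training signal the abstract argues for.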

Updated: 2025-10-09 08:53:31

标题: A$^2$Search: 使用强化学习的模糊感知问答

摘要: 近期对大型语言模型(LLM)和强化学习(RL)的进展已经在开放域问答(QA)领域取得了强大的表现。然而,现有模型仍然在可以有多个有效答案的问题上面临困难。标准的QA基准通常假设只有一个正确答案,忽视了这一现实,因此产生了不适当的训练信号。现有的处理模棱两可问题的尝试通常依赖于昂贵的手动标注,这在扩展到HotpotQA和MuSiQue等多跳数据集时难以实现。在本文中,我们提出了A$^2$Search,这是一个无需标注的端到端训练框架,用于识别和处理歧义。其核心是一个自动化流程,通过轨迹采样和证据验证检测模糊问题并收集替代答案。然后,该模型通过使用精心设计的$\mathrm{AnsF1}$奖励进行RL优化,自然地适应多个答案。在八个开放域QA基准上的实验表明,A$^2$Search实现了新的最先进性能。仅通过一次模拟,A$^2$Search-7B在四个多跳基准上取得了平均$\mathrm{AnsF1}@1$得分为$48.4\%$,优于所有强基线模型,包括更大的ReSearch-32B($46.2\%$)。进一步的详细分析显示,A$^2$Search解决了歧义并在基准之间实现了泛化,突显了接受歧义对建立更可靠的QA系统至关重要。我们的代码、数据和模型权重可在https://github.com/zfj1998/A2Search找到。

更新时间: 2025-10-09 08:53:31

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2510.07958v1

Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation

The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to https://github.com/gavin-gqzhang/UniGen.

Updated: 2025-10-09 08:50:35

标题: 条件编织遇上专家调制:走向通用可控的图像生成

摘要: 图像到图像生成任务的目标是通过利用条件输入和提示说明来生成可控图像。然而,现有方法通常为每种类型的条件训练单独的控制分支,导致冗余的模型结构和计算资源的低效利用。为了解决这个问题,我们提出了一个统一的图像到图像生成(UniGen)框架,支持多样化的条件输入,同时增强生成效率和表现力。具体来说,为了解决可控条件生成架构中普遍存在的参数冗余和计算效率低下问题,我们提出了条件调制专家(CoMoE)模块。该模块聚合语义上相似的补丁特征,并将它们分配给专门的专家模块,用于视觉表示和条件建模。通过在不同条件下独立建模前景特征,CoMoE有效地减轻了多条件情况下的特征纠缠和冗余计算。此外,为了弥合主干和控制分支之间的信息鸿沟,我们提出了WeaveNet,一种动态的、类似蛇的连接机制,可以实现来自主干的全局文本级控制和来自条件分支的细粒度控制之间的有效交互。在Subjects-200K和MultiGen-20M数据集上进行的大量实验跨越各种条件图像生成任务,证明我们的方法始终达到最先进的性能,验证了其在多功能性和有效性方面的优势。代码已上传至https://github.com/gavin-gqzhang/UniGen。

更新时间: 2025-10-09 08:50:35

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.17364v2

SimCast: Enhancing Precipitation Nowcasting with Short-to-Long Term Knowledge Distillation

Precipitation nowcasting predicts future radar sequences based on current observations, which is a highly challenging task driven by the inherent complexity of the Earth system. Accurate nowcasting is of utmost importance for addressing various societal needs, including disaster management, agriculture, transportation, and energy optimization. As a complement to existing non-autoregressive nowcasting approaches, we investigate the impact of prediction horizons on nowcasting models and propose SimCast, a novel training pipeline featuring a short-to-long term knowledge distillation technique coupled with a weighted MSE loss to prioritize heavy rainfall regions. Improved nowcasting predictions can be obtained without introducing additional overhead during inference. As SimCast generates deterministic predictions, we further integrate it into a diffusion-based framework named CasCast, leveraging the strengths from probabilistic models to overcome limitations such as blurriness and distribution shift in deterministic outputs. Extensive experimental results on three benchmark datasets validate the effectiveness of the proposed framework, achieving mean CSI scores of 0.452 on SEVIR, 0.474 on HKO-7, and 0.361 on MeteoNet, which outperforms existing approaches by a significant margin.
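The weighted MSE idea — upweighting errors where observed rainfall is heavy — can be sketched as follows. The thresholds and weights here are illustrative placeholders, not the paper's values:

```python
def weighted_mse(pred, target, thresholds=(2.0, 5.0, 10.0),
                 weights=(1.0, 2.0, 5.0, 10.0)):
    """MSE with per-pixel weights that grow with observed rainfall intensity.
    Thresholds/weights are illustrative; pred/target are flat pixel lists."""
    def w(y):
        # Pick the weight of the first intensity band containing y.
        for t, wt in zip(thresholds, weights):
            if y < t:
                return wt
        return weights[-1]  # heaviest band
    n = len(pred)
    return sum(w(t) * (p - t) ** 2 for p, t in zip(pred, target)) / n

# A light-rain pixel (target 1.0) vs. a moderate-rain pixel (target 6.0):
loss = weighted_mse([0.0, 4.0], [1.0, 6.0])
```

Both pixels are off by the same squared error of 4 vs. 1, but the moderate-rain pixel contributes 5x its raw error, so training gradients concentrate on exactly the heavy-rainfall regions the abstract says are prioritized.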

Updated: 2025-10-09 08:49:16

标题: SimCast:利用短期到长期知识蒸馏增强降水临近预报

摘要: 降水临近预报根据当前观测预测未来的雷达序列,这是一项由地球系统固有复杂性驱动的极具挑战性的任务。准确的临近预报对于满足各种社会需求至关重要,包括灾害管理、农业、交通和能源优化。作为对现有非自回归临近预报方法的补充,我们研究了预测时长对临近预报模型的影响,并提出了SimCast,这是一种新颖的训练管道,采用短期到长期知识蒸馏技术,并结合加权均方误差损失来优先考虑强降雨区域。在推理过程中无需引入额外开销即可获得更好的临近预报结果。由于SimCast生成确定性预测,我们进一步将其整合到一个名为CasCast的基于扩散的框架中,利用概率模型的优势来克服确定性输出中的模糊性和分布偏移等限制。在三个基准数据集上的大量实验结果验证了所提出框架的有效性,平均CSI得分在SEVIR上达到0.452,在HKO-7上达到0.474,在MeteoNet上达到0.361,明显优于现有方法。

更新时间: 2025-10-09 08:49:16

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2510.07953v1

Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models

To address the dual challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, we propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm grounded in Model-Based Reinforcement Learning. SFP constructs a novel Generative World Model to simulate diverse, high-fidelity future states, enabling an "imagination-based" environmental simulation. Within this framework, a base forecasting model acts as an agent, guided by a beam search-based planning algorithm that leverages non-differentiable domain metrics as reward signals to explore high-return future sequences. These identified high-reward candidates then serve as pseudo-labels to continuously optimize the agent's policy through iterative self-training, significantly reducing prediction error and demonstrating exceptional performance on critical domain metrics like capturing extreme events.
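The planning step described above — expand candidate futures, score them with a (possibly non-differentiable) domain reward, keep the best — is ordinary beam search, which can be sketched generically (all names are illustrative):

```python
def beam_search_plan(init, expand, reward, beam_width=3, horizon=4):
    """Beam search over future sequences scored by an arbitrary reward.
    `expand(seq)` returns candidate next states; `reward(seq)` may be any
    non-differentiable metric, since no gradients flow through it."""
    beam = [[init]]
    for _ in range(horizon):
        candidates = [seq + [nxt] for seq in beam for nxt in expand(seq)]
        beam = sorted(candidates, key=reward, reverse=True)[:beam_width]
    return beam[0]

# Toy setting: states are integers, reward favors large final values.
best = beam_search_plan(
    0,
    expand=lambda seq: [seq[-1] + 1, seq[-1] - 1],
    reward=lambda seq: seq[-1],
)
```

Because the reward is only used to rank candidates, any domain metric — such as an extreme-event score — can drive the search, which is precisely how the framework sidesteps non-differentiability; the top-ranked sequences then serve as pseudo-labels for self-training.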

Updated: 2025-10-09 08:48:26

标题: 时空预测作为规划:基于模型的强化学习方法与生成世界模型

摘要: 为了应对物理时空预测中固有随机性和不可微度量这两大挑战,我们提出了"时空预测即规划"(SFP),一种基于模型的强化学习新范式。SFP构建了一个新颖的生成式世界模型,用于模拟多样化、高保真的未来状态,实现基于"想象"的环境模拟。在该框架中,一个基础预测模型充当智能体,由基于束搜索的规划算法指导,该算法利用不可微的领域度量作为奖励信号来探索高回报的未来序列。这些被识别出的高回报候选序列随后作为伪标签,通过迭代自训练不断优化智能体的策略,从而显著降低预测误差,并在捕捉极端事件等关键领域度量上表现出色。

更新时间: 2025-10-09 08:48:26

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2510.04020v2

A Large-scale Dataset for Robust Complex Anime Scene Text Detection

Current text detection datasets primarily target natural or document scenes, where text typically appears in regular fonts and shapes, monotonous colors, and orderly layouts, usually arranged along straight or curved lines. However, these characteristics differ significantly from anime scenes, where text is often diverse in style, irregularly arranged, and easily confused with complex visual elements such as symbols and decorative patterns. Text in anime scenes also includes a large number of handwritten and stylized fonts. Motivated by this gap, we introduce AnimeText, a large-scale dataset containing 735K images and 4.2M annotated text blocks. It features hierarchical annotations and hard negative samples tailored for anime scenarios. To evaluate the robustness of AnimeText in complex anime scenes, we conducted cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results demonstrate that models trained on AnimeText outperform those trained on existing datasets in anime scene text detection tasks. AnimeText on HuggingFace: https://huggingface.co/datasets/deepghs/AnimeText

Updated: 2025-10-09 08:47:52

Domain: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.07951v1

SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights

Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization: the matrix imbalance. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layers. We evaluate our method on the Qwen3 model family and DeepSeek-V2.5. SINQ improves WikiText2 and C4 perplexity significantly against uncalibrated uniform quantization baselines and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code to reproduce the results of this work and to easily quantize models using SINQ is available at https://github.com/huawei-csl/SINQ.
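
The second-axis scaling idea can be sketched with a Sinkhorn-Knopp-style alternation that balances row and column magnitudes of a weight matrix. This is an illustrative approximation (balancing RMS values), not the authors' released implementation.

```python
import numpy as np

def dual_axis_scales(W, iters=10):
    """Alternately absorb row and column RMS into scale vectors r, c so that
    W / (r[:, None] * c[None, :]) has balanced rows AND columns
    (Sinkhorn-Knopp-style alternation; illustrative sketch, not SINQ itself)."""
    r = np.ones(W.shape[0])
    c = np.ones(W.shape[1])
    for _ in range(iters):
        M = W / (r[:, None] * c[None, :])
        r *= np.sqrt((M ** 2).mean(axis=1))   # fold row RMS into row scales
        M = W / (r[:, None] * c[None, :])
        c *= np.sqrt((M ** 2).mean(axis=0))   # fold column RMS into column scales
    return r, c

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)) * rng.lognormal(2.0, size=(8, 1))  # wildly uneven rows
r, c = dual_axis_scales(W)
M = W / (r[:, None] * c[None, :])  # balanced matrix to be quantized
```

After the alternation, the matrix handed to the quantizer no longer has rows or columns whose scale is dominated by outliers, which is the "matrix imbalance" intuition in the abstract.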

Updated: 2025-10-09 08:46:47

Domain: cs.LG

Download: http://arxiv.org/abs/2509.22944v3

Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties

Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code and datasets are publicly available at https://github.com/jiyounglee-0523/TransEnV and https://huggingface.co/collections/jiyounglee0523/transenv-681eadb3c0c8cf363b363fb1.

Updated: 2025-10-09 08:41:44

Domain: cs.CL,cs.AI

Download: http://arxiv.org/abs/2505.20875v3

Agent-Based Genetic Algorithm for Crypto Trading Strategy Optimization

Cryptocurrency markets present formidable challenges for trading strategy optimization due to extreme volatility, non-stationary dynamics, and complex microstructure patterns that render conventional parameter optimization methods fundamentally inadequate. We introduce the Crypto Genetic Algorithm Agent (CGA-Agent), a pioneering hybrid framework that synergistically integrates genetic algorithms with intelligent multi-agent coordination mechanisms for adaptive trading strategy parameter optimization in dynamic financial environments. The framework uniquely incorporates real-time market microstructure intelligence and adaptive strategy performance feedback through intelligent mechanisms that dynamically guide evolutionary processes, transcending the limitations of static optimization approaches. Comprehensive empirical evaluation across three cryptocurrencies demonstrates systematic and statistically significant performance improvements on both total returns and risk-adjusted metrics.
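
As a hedged illustration of the genetic-algorithm core alone (without the multi-agent coordination or market feedback the paper adds), a plain GA over bounded strategy parameters might look like this; the fitness function and parameter names are made up for the example.

```python
import random

def genetic_optimize(fitness, bounds, pop_size=30, gens=40, seed=7):
    """Plain GA: truncation selection, blend crossover, Gaussian mutation.
    `bounds` is a list of (lo, hi) ranges, one per strategy parameter."""
    rnd = random.Random(seed)
    pop = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(gens):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]            # keep the best half (elitist)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rnd.sample(parents, 2)
            child = [(x + y) / 2 + rnd.gauss(0, 0.1 * (hi - lo))
                     for x, y, (lo, hi) in zip(a, b, bounds)]
            child = [min(max(v, lo), hi) for v, (lo, hi) in zip(child, bounds)]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical fitness: the best "strategy" has stop-loss 0.05 and take-profit 0.2.
fit = lambda p: -(p[0] - 0.05) ** 2 - (p[1] - 0.2) ** 2
best = genetic_optimize(fit, bounds=[(0.0, 0.5), (0.0, 1.0)])
```

In the paper's framework, the fitness would come from back-tested strategy performance and the evolutionary operators would be steered by the agent layer rather than fixed heuristics.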

Updated: 2025-10-09 08:41:42

Domain: cs.AI

Download: http://arxiv.org/abs/2510.07943v1

MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement

We tackle the task of any-reference video generation, which aims to synthesize videos conditioned on arbitrary types and combinations of reference subjects, together with textual prompts. This task faces persistent challenges, including identity inconsistency, entanglement among multiple reference subjects, and copy-paste artifacts. To address these issues, we introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism, enabling flexible synthesis conditioned on diverse reference images and textual prompts. Specifically, masked guidance employs a region-aware masking mechanism combined with pixel-wise channel concatenation to preserve appearance features of multiple subjects along the channel dimension. This design preserves identity consistency and maintains the capabilities of the pre-trained backbone, without requiring any architectural changes. To mitigate subject confusion, we introduce a subject disentanglement mechanism which injects the semantic values of each subject derived from the text condition into its corresponding visual region. Additionally, we establish a four-stage data pipeline to construct diverse training pairs, effectively alleviating copy-paste artifacts. Extensive experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches, paving the way for scalable, controllable, and high-fidelity any-reference video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF

Updated: 2025-10-09 08:41:12

Domain: cs.CV,cs.AI

Download: http://arxiv.org/abs/2505.23742v2

TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than direct intervention to latents or attention per-sample in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.
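
The memory operations named here (insert, read, update, delete over historical optimization contexts) can be sketched as a keyed store. This is a deliberately simplified stand-in — TTOM's memory is parametric, not a plain dictionary — and the key and value names are invented for illustration.

```python
class OptimizationMemory:
    """Keyed store of per-prompt optimization contexts supporting the four
    operations named in the abstract (simplified illustrative sketch)."""
    def __init__(self, capacity=100):
        self.capacity = capacity
        self._store = {}   # key -> optimization context (e.g., tuned parameters)

    def insert(self, key, context):
        if len(self._store) >= self.capacity:        # evict the oldest entry
            self._store.pop(next(iter(self._store)))
        self._store[key] = context

    def read(self, key, default=None):
        return self._store.get(key, default)

    def update(self, key, context):
        if key in self._store:
            self._store[key] = context

    def delete(self, key):
        self._store.pop(key, None)

mem = OptimizationMemory(capacity=2)
mem.insert("scene-a", {"step": 1})
mem.insert("scene-b", {"step": 3})
mem.insert("scene-c", {"step": 5})   # evicts the oldest entry ("scene-a")
```

In the streaming setting, such a memory lets optimization state learned on earlier prompts be reused or revised as new generation requests arrive.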

Updated: 2025-10-09 08:37:00

Domain: cs.CV,cs.AI,cs.CL,cs.LG,cs.MM

Download: http://arxiv.org/abs/2510.07940v1

Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks

This paper presents four theoretical contributions that improve the usability of risk certificates for neural networks based on PAC-Bayes bounds. First, two bounds on the KL divergence between Bernoulli distributions enable the derivation of the tightest explicit bounds on the true risk of classifiers across different ranges of empirical risk. The paper next focuses on the formalization of an efficient methodology based on implicit differentiation that enables the introduction of the optimization of PAC-Bayesian risk certificates inside the loss/objective function used to fit the network/model. The last contribution is a method to optimize bounds on non-differentiable objectives such as the 0-1 loss. These theoretical contributions are complemented with an empirical evaluation on the MNIST and CIFAR-10 datasets. In fact, this paper presents the first non-vacuous generalization bounds on CIFAR-10 for neural networks.
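
For context, the standard PAC-Bayes-kl machinery these contributions tighten works by inverting the Bernoulli KL divergence: given empirical risk q and a KL budget B from the bound, the risk certificate is the largest p with kl(q‖p) ≤ B, computable by bisection. The sketch below shows the textbook inversion, not the paper's new, tighter bounds.

```python
import math

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse_upper(q, budget, tol=1e-10):
    """Largest p in [q, 1) with kl(q||p) <= budget, by bisection;
    kl(q||.) is increasing on [q, 1), so bisection is valid."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Certified upper bound on true risk for empirical risk 5% under a KL budget of 0.02.
risk_certificate = kl_inverse_upper(q=0.05, budget=0.02)
```

Tighter explicit bounds on this Bernoulli KL, as in the paper's first contribution, translate directly into smaller certified-risk values without changing this inversion step.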

Updated: 2025-10-09 08:34:47

Domain: cs.LG,cs.IT,math.IT,stat.ML

Download: http://arxiv.org/abs/2510.07935v1

Provable Speech Attributes Conversion via Latent Independence

While signal conversion and disentangled representation learning have shown promise for manipulating data attributes across domains such as audio, image, and multimodal generation, existing approaches, especially for speech style conversion, are largely empirical and lack rigorous theoretical foundations to guarantee reliable and interpretable control. In this work, we propose a general framework for speech attribute conversion, accompanied by theoretical analysis and guarantees under reasonable assumptions. Our framework builds on a non-probabilistic autoencoder architecture with an independence constraint between the predicted latent variable and the target controllable variable. This design ensures a consistent signal transformation, conditioned on an observed style variable, while preserving the original content and modifying the desired attribute. We further demonstrate the versatility of our method by evaluating it on speech styles, including speaker identity and emotion. Quantitative evaluations confirm the effectiveness and generality of the proposed approach.

Updated: 2025-10-09 08:32:27

Domain: cs.SD,cs.AI

Download: http://arxiv.org/abs/2510.05191v2

Enabling Personalized Long-term Interactions in LLM-based Agents through Persistent Memory and User Profiles

Large language models (LLMs) increasingly serve as the central control unit of AI agents, yet current approaches remain limited in their ability to deliver personalized interactions. While Retrieval Augmented Generation enhances LLM capabilities by improving context-awareness, it lacks mechanisms to combine contextual information with user-specific data. Although personalization has been studied in fields such as human-computer interaction or cognitive science, existing perspectives largely remain conceptual, with limited focus on technical implementation. To address these gaps, we build on a unified definition of personalization as a conceptual foundation to derive technical requirements for adaptive, user-centered LLM-based agents. Combined with established agentic AI patterns such as multi-agent collaboration or multi-source retrieval, we present a framework that integrates persistent memory, dynamic coordination, self-validation, and evolving user profiles to enable personalized long-term interactions. We evaluate our approach on three public datasets using metrics such as retrieval accuracy, response correctness, or BertScore. We complement these results with a five-day pilot user study providing initial insights into user feedback on perceived personalization. The study provides early indications that guide future work and highlights the potential of integrating persistent memory and user profiles to improve the adaptivity and perceived personalization of LLM-based agents.

Updated: 2025-10-09 08:22:16

Domain: cs.AI,cs.HC

Download: http://arxiv.org/abs/2510.07925v1

Synergy Between the Strong and the Weak: Spiking Neural Networks are Inherently Self-Distillers

Brain-inspired spiking neural networks (SNNs) promise to be a low-power alternative to computationally intensive artificial neural networks (ANNs), although performance gaps persist. Recent studies have improved the performance of SNNs through knowledge distillation, but rely on large teacher models or introduce additional training overhead. In this paper, we show that SNNs can be naturally deconstructed into multiple submodels for efficient self-distillation. We treat each timestep instance of the SNN as a submodel and evaluate its output confidence, thus efficiently identifying the strong and the weak. Based on this strong and weak relationship, we propose two efficient self-distillation schemes: (1) \textbf{Strong2Weak}: During training, the stronger "teacher" guides the weaker "student", effectively improving overall performance. (2) \textbf{Weak2Strong}: The weak serve as the "teacher", distilling the strong in reverse with underlying dark knowledge, again yielding significant performance gains. For both distillation schemes, we offer flexible implementations such as ensemble, simultaneous, and cascade distillation. Experiments show that our method effectively improves the discriminability and overall performance of the SNN, while its adversarial robustness is also enhanced, benefiting from the stability brought by self-distillation. This ingeniously exploits the temporal properties of SNNs and provides insight into how to efficiently train high-performance SNNs.
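
The strong/weak identification can be sketched numerically: score each timestep's output by its softmax confidence, let the most confident step act as teacher, and penalize the others with a KL term. The logits below are toy values; in practice this runs inside the SNN training loop.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Per-timestep logits from one SNN forward pass, shape (T, num_classes); toy numbers.
logits = np.array([[0.3, 0.2, 0.1],    # early, low-confidence step
                   [2.5, 0.3, 0.1],    # confident step
                   [1.2, 0.8, 0.1]])
probs = softmax(logits)
confidence = probs.max(axis=1)              # how sure each timestep "submodel" is
teacher_idx = int(confidence.argmax())      # Strong2Weak: strongest step teaches
teacher = probs[teacher_idx]
# Per-step KL(teacher || student): the self-distillation signal added to the loss.
kl_per_step = (teacher * (np.log(teacher) - np.log(probs))).sum(axis=1)
```

Weak2Strong would instead aggregate the low-confidence steps into the teacher distribution, exploiting their "dark knowledge" in the reverse direction.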

Updated: 2025-10-09 08:21:46

Domain: cs.LG

Download: http://arxiv.org/abs/2510.07924v1

STOPA: A Database of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution

A key research area in deepfake speech detection is source tracing - determining the origin of synthesised utterances. The approaches may involve identifying the acoustic model (AM), vocoder model (VM), or other generation-specific parameters. However, progress is limited by the lack of a dedicated, systematically curated dataset. To address this, we introduce STOPA, a systematically varied and metadata-rich dataset for deepfake speech source tracing, covering 8 AMs, 6 VMs, and diverse parameter settings across 700k samples from 13 distinct synthesisers. Unlike existing datasets, which often feature limited variation or sparse metadata, STOPA provides a systematically controlled framework covering a broader range of generative factors, such as the choice of the vocoder model, acoustic model, or pretrained weights, ensuring higher attribution reliability. This control improves attribution accuracy, aiding forensic analysis, deepfake detection, and generative model transparency.

Updated: 2025-10-09 08:21:42

Domain: cs.SD,cs.AI,cs.CR,eess.AS,68T45, 68T10, 94A08,I.2.7; I.5.4; K.4.1

Download: http://arxiv.org/abs/2505.19644v3

STEPER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models

Answering complex real-world questions requires step-by-step retrieval and integration of relevant information to generate well-grounded responses. However, existing knowledge distillation methods overlook the need for different reasoning abilities at different steps, hindering transfer in multi-step retrieval-augmented frameworks. To address this, we propose Stepwise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models (StepER). StepER employs step-wise supervision to align with evolving information and reasoning demands across stages. Additionally, it incorporates difficulty-aware training to progressively optimize learning by prioritizing suitable steps. Our method is adaptable to various multi-step retrieval-augmented language models, including those that use retrieval queries for reasoning paths or decomposed questions. Extensive experiments show that StepER outperforms prior methods on multi-hop QA benchmarks, with an 8B model achieving performance comparable to a 70B teacher model.

Updated: 2025-10-09 08:20:27

Domain: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07923v1

SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening

Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers, but remains vulnerable to Byzantine attacks where malicious clients submit corrupted model updates. Existing Byzantine-robust DFL defenses rely on similarity-based neighbor screening that requires every client to exchange and compare complete high-dimensional model vectors with all neighbors in each training round, creating prohibitive communication and computational costs that prevent deployment at web scale. We propose SketchGuard, a general framework that decouples Byzantine filtering from model aggregation through sketch-based neighbor screening. SketchGuard compresses $d$-dimensional models to $k$-dimensional sketches ($k \ll d$) using Count Sketch for similarity comparisons, then selectively fetches full models only from accepted neighbors, reducing per-round communication complexity from $O(d|N_i|)$ to $O(k|N_i| + d|S_i|)$, where $|N_i|$ is the neighbor count and $|S_i| \le |N_i|$ is the accepted neighbor count. We establish rigorous convergence guarantees in both strongly convex and non-convex settings, proving that Count Sketch compression preserves Byzantine resilience with controlled degradation bounds where approximation errors introduce only a $(1+O(\epsilon))$ factor in the effective threshold parameter. Comprehensive experiments across multiple datasets, network topologies, and attack scenarios demonstrate that SketchGuard maintains identical robustness to state-of-the-art methods while reducing computation time by up to 82% and communication overhead by 50-70% depending on filtering effectiveness, with benefits scaling multiplicatively with model dimensionality and network connectivity. These results establish the viability of sketch-based compression as a fundamental enabler of robust DFL at web scale.
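
The screening primitive can be sketched with a plain Count Sketch: hash each coordinate into one of k buckets with a random sign, and compare sketches instead of full d-dimensional models. The federated setup below (honest vs. Byzantine updates) is mocked with random vectors; this is an illustration of the primitive, not SketchGuard's implementation.

```python
import numpy as np

def count_sketch(v, k, seed=0):
    """Compress a d-dim vector into k signed buckets (k << d). With a shared
    seed, all clients use the same hashes, so inner products — and hence
    cosine similarities — are preserved in expectation."""
    rng = np.random.default_rng(seed)
    bucket = rng.integers(0, k, size=v.size)       # h: coordinate -> bucket
    sign = rng.choice([-1.0, 1.0], size=v.size)    # s: coordinate -> sign
    out = np.zeros(k)
    np.add.at(out, bucket, sign * v)
    return out

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d, k = 10_000, 256
rng = np.random.default_rng(1)
honest = rng.normal(size=d)
neighbor = honest + 0.3 * rng.normal(size=d)    # similar honest update
byzantine = rng.normal(size=d)                  # unrelated malicious update
sim_honest = cosine(count_sketch(honest, k), count_sketch(neighbor, k))
sim_attack = cosine(count_sketch(honest, k), count_sketch(byzantine, k))
```

Screening on k-dimensional sketches, then fetching full models only from accepted neighbors, is what reduces the per-round cost from $O(d|N_i|)$ to $O(k|N_i| + d|S_i|)$.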

Updated: 2025-10-09 08:16:32

Domain: cs.LG,cs.DC

Download: http://arxiv.org/abs/2510.07922v1

Profit Mirage: Revisiting Information Leakage in LLM-based Financial Agents

LLM-based financial agents have attracted widespread excitement for their ability to trade like human experts. However, most systems exhibit a "profit mirage": dazzling back-tested returns evaporate once the model's knowledge window ends, because of the inherent information leakage in LLMs. In this paper, we systematically quantify this leakage issue across four dimensions and release FinLake-Bench, a leakage-robust evaluation benchmark. Furthermore, to mitigate this issue, we introduce FactFin, a framework that applies counterfactual perturbations to compel LLM-based agents to learn causal drivers instead of memorized outcomes. FactFin integrates four core components: Strategy Code Generator, Retrieval-Augmented Generation, Monte Carlo Tree Search, and Counterfactual Simulator. Extensive experiments show that our method surpasses all baselines in out-of-sample generalization, delivering superior risk-adjusted performance.

Updated: 2025-10-09 08:13:35

Domain: cs.AI

Download: http://arxiv.org/abs/2510.07920v1

A Kernel Distribution Closeness Testing

The distribution closeness testing (DCT) assesses whether the distance between a distribution pair is at least $\epsilon$-far. Existing DCT methods mainly measure discrepancies between a distribution pair defined on discrete one-dimensional spaces (e.g., using total variation), which limits their applications to complex data (e.g., images). To extend DCT to more types of data, a natural idea is to introduce maximum mean discrepancy (MMD), a powerful measurement of the distributional discrepancy between two complex distributions, into DCT scenarios. However, we find that MMD's value can be the same for many pairs of distributions that have different norms in the same reproducing kernel Hilbert space (RKHS), making MMD less informative when assessing the closeness levels for multiple distribution pairs. To mitigate the issue, we design a new measurement of distributional discrepancy, norm-adaptive MMD (NAMMD), which scales MMD's value using the RKHS norms of distributions. Based on the asymptotic distribution of NAMMD, we finally propose the NAMMD-based DCT to assess the closeness levels of a distribution pair. Theoretically, we prove that NAMMD-based DCT has higher test power compared to MMD-based DCT, with bounded type-I error, which is also validated by extensive experiments on many types of data (e.g., synthetic noise, real images). Furthermore, we also apply the proposed NAMMD for addressing the two-sample testing problem and find NAMMD-based two-sample test has higher test power than the MMD-based two-sample test in both theory and experiments.
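
For reference, the MMD quantity that NAMMD rescales has a standard unbiased estimator under an RBF kernel. The sketch below uses the textbook estimator and toy Gaussian data; nothing here is NAMMD-specific.

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=0.5):
    """Unbiased estimate of squared MMD with RBF kernel k(a,b)=exp(-gamma*||a-b||^2)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    np.fill_diagonal(Kxx, 0.0)   # drop i == j terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1)) - 2 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
far = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(loc=1.0, size=(500, 2)))
```

NAMMD's modification is to rescale this value using the RKHS norms of the two distributions, so that pairs with equal MMD but different norms receive different closeness scores.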

Updated: 2025-10-09 08:13:32

Domain: cs.LG,stat.ML

Download: http://arxiv.org/abs/2507.12843v2

Can Large Language Models Be Trusted as Evolutionary Optimizers for Network-Structured Combinatorial Problems?

Large Language Models (LLMs) have shown strong capabilities in language understanding and reasoning across diverse domains. Recently, there has been increasing interest in utilizing LLMs not merely as assistants in optimization tasks, but as primary optimizers, particularly for network-structured combinatorial problems. However, before LLMs can be reliably deployed in this role, a fundamental question must be addressed: Can LLMs iteratively manipulate solutions that consistently adhere to problem constraints? In this work, we propose a systematic framework to evaluate the capability of LLMs to engage with problem structures. Rather than treating the model as a black-box generator, we adopt the commonly used evolutionary optimizer (EVO) and propose a comprehensive evaluation framework that rigorously assesses the output fidelity of LLM-based operators across different stages of the evolutionary process. To enhance robustness, we introduce a hybrid error-correction mechanism that mitigates uncertainty in LLMs outputs. Moreover, we explore a cost-efficient population-level optimization strategy that significantly improves efficiency compared to traditional individual-level approaches. Extensive experiments on a representative node-level combinatorial network optimization task demonstrate the effectiveness, adaptability, and inherent limitations of LLM-based EVO. Our findings present perspectives on integrating LLMs into evolutionary computation and discuss paths that may support scalable and context-aware optimization in networked systems.

Updated: 2025-10-09 08:13:18

Domain: cs.NE,cs.AI

Download: http://arxiv.org/abs/2501.15081v4

GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploration

Overall architecture of the personalized multi-objective ranking system. It comprises: (1) a Feature Center and Prerank Model for initial feature processing and candidate generation; (2) a Multi-Task Learning (MTL) model predicting various user feedback signals; (3) a Multi-Task Fusion (MTF) module (our proposed GRADE framework) that learns personalized weights ($w_1, \dots, w_n$); these weights are then applied to calculate final scores and sorted to generate a blended ranking by the Blended Ranking Model, which ultimately delivers results to users.

Updated: 2025-10-09 08:12:52

Domain: cs.LG

Download: http://arxiv.org/abs/2510.07919v1

Markets for Models

Motivated by the prevalence of prediction problems in the economy, we study markets in which firms sell models to a consumer to help improve their prediction. Firms decide whether to enter, choose models to train on their data, and set prices. The consumer can purchase multiple models and use a weighted average of the models bought. Market outcomes can be expressed in terms of the bias-variance decompositions of the models that firms sell. We give conditions under which symmetric firms will choose different modeling techniques, e.g., each using only a subset of available covariates. We also show firms can choose inefficiently biased models or inefficiently costly models to deter entry by competitors.
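
The role of the bias-variance decomposition can be seen in a toy simulation: a weighted average of two differently biased, independently noisy models cancels their biases and halves the noise variance. The numbers are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
truth, n = 0.0, 100_000
# Two "purchased" models: opposite biases, independent noise of variance 1.
model_a = truth + 0.5 + rng.normal(0, 1.0, n)    # bias +0.5
model_b = truth - 0.5 + rng.normal(0, 1.0, n)    # bias -0.5
blend = 0.5 * model_a + 0.5 * model_b            # biases cancel, variance halves

mse = lambda pred: float(((pred - truth) ** 2).mean())
# MSE = bias^2 + variance: each model alone is about 1.25; the blend about 0.5.
```

This is why a consumer who can buy and average several models may value differently biased models more than two copies of the same one, shaping which models firms choose to sell.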

Updated: 2025-10-09 08:08:41

Categories: econ.TH,cs.LG

Download: http://arxiv.org/abs/2503.02946v3
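
The bias-variance decomposition the abstract leans on can be checked numerically for a consumer who blends two purchased models with fixed weights; the data process and both models below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy check of the bias-variance decomposition of a blended predictor.
n = 200_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)        # true outcome

pred_a = 1.5 * x - 0.4                             # biased model
pred_b = 2.0 * x + rng.normal(scale=0.3, size=n)   # unbiased, noisier model

w = 0.5
blend = w * pred_a + (1 - w) * pred_b              # consumer's weighted average

err = y - blend
mse = np.mean(err ** 2)
bias_sq = np.mean(err) ** 2
variance = np.var(err)
# The identity mse = bias_sq + variance holds exactly (up to float error).
```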

Not All Clients Are Equal: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients

As AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we propose FedMosaic, a method that jointly addresses data and model heterogeneity with a task-relevance-aware model aggregation strategy to reduce parameter interference, and a dimension-invariant module that enables knowledge sharing across heterogeneous architectures without huge computational cost. To mimic the real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. The empirical study shows that FedMosaic outperforms the state-of-the-art PFL methods, excelling in both personalization and generalization capabilities under challenging, realistic scenarios.

Updated: 2025-10-09 08:07:53

Categories: cs.LG,cs.AI,cs.DC

Download: http://arxiv.org/abs/2506.11024v2

Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation

Automatic grading of subjective questions remains a significant challenge in examination assessment due to the diversity in question formats and the open-ended nature of student responses. Existing works primarily focus on a specific type of subjective question and lack the generality to support comprehensive exams that contain diverse question types. In this paper, we propose a unified Large Language Model (LLM)-enhanced auto-grading framework that provides human-like evaluation for all types of subjective questions across various domains. Our framework integrates four complementary modules to holistically evaluate student answers. In addition to a basic text matching module that provides a foundational assessment of content similarity, we leverage the powerful reasoning and generative capabilities of LLMs to: (1) compare key knowledge points extracted from both student and reference answers, (2) generate a pseudo-question from the student answer to assess its relevance to the original question, and (3) simulate human evaluation by identifying content-related and non-content strengths and weaknesses. Extensive experiments on both general-purpose and domain-specific datasets show that our framework consistently outperforms traditional and LLM-based baselines across multiple grading metrics. Moreover, the proposed system has been successfully deployed in real-world training and certification exams at a major e-commerce enterprise.

Updated: 2025-10-09 08:05:39

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07912v1

MMM: Quantum-Chemical Molecular Representation Learning for Combinatorial Drug Recommendation

Drug recommendation is an essential task in machine learning-based clinical decision support systems. However, the risk of drug-drug interactions (DDI) between co-prescribed medications remains a significant challenge. Previous studies have used graph neural networks (GNNs) to represent drug structures. Yet their simplified discrete forms cannot fully capture molecular binding affinity and reactivity. Therefore, we propose Multimodal DDI Prediction with Molecular Electron Localization Function (ELF) Maps (MMM), a novel framework that integrates three-dimensional (3D) quantum-chemical information into drug representation learning. It generates 3D electron density maps using the ELF. To capture both therapeutic relevance and interaction risks, MMM combines ELF-derived features that encode global electronic properties with a bipartite graph encoder that models local substructure interactions. This design enables learning complementary characteristics of drug molecules. We evaluate MMM on the MIMIC-III dataset (250 drugs, 442 substructures), comparing it with several baseline models. In particular, a comparison with the GNN-based SafeDrug model demonstrates statistically significant improvements in the F1-score (p = 0.0387), Jaccard (p = 0.0112), and the DDI rate (p = 0.0386). These results demonstrate the potential of ELF-based 3D representations to enhance prediction accuracy and support safer combinatorial drug prescribing in clinical practice.

Updated: 2025-10-09 08:03:14

Categories: cs.LG,cs.AI,cs.CV,I.2.6; I.5.1

Download: http://arxiv.org/abs/2510.07910v1

Multi-level informed optimization via decomposed Kriging for large design problems under uncertainty

Engineering design involves demanding models encompassing many decision variables and uncontrollable parameters. In addition, unavoidable aleatoric and epistemic uncertainties can be very impactful and add further complexity. The state-of-the-art adopts two steps, uncertainty quantification and design optimization, to optimize systems under uncertainty by means of robust or stochastic metrics. However, conventional scenario-based, surrogate-assisted, and mathematical programming methods are not sufficiently scalable to be affordable and precise in large and complex cases. Here, a multi-level approach is proposed to accurately optimize resource-intensive, high-dimensional, and complex engineering problems under uncertainty with minimal resources. A non-intrusive, fast-scaling, Kriging-based surrogate is developed to map the combined design/parameter domain efficiently. Multiple surrogates are adaptively updated by hierarchical and orthogonal decomposition to leverage the fewer and most uncertainty-informed data. The proposed method is statistically compared to the state-of-the-art via an analytical testbed and is shown to be concurrently faster and more accurate by orders of magnitude.

Updated: 2025-10-09 07:59:16

Categories: eess.SY,cs.LG,cs.SY,stat.ML,60G15 (Primary) 68T37, 90C26 (Secondary),C.4; G.1.6; G.3; G.4; I.2.3; I.5.1; J.2

Download: http://arxiv.org/abs/2510.07904v1

Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. However, these metrics depend on hardware and runtime choices (e.g., parallelism, batch size), and often fail to account for model size, making them difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose two FLOPs-based metrics for LLM-based rerankers (code: https://github.com/zhiyuanpeng/EER-FLOPs): RPP (ranking metrics per PetaFLOP), measuring how much ranking quality (e.g., NDCG or MRR) a method achieves per PetaFLOP, and QPP (queries per PetaFLOP), measuring how many queries can be processed per PetaFLOP. Accompanied by the new metrics, an interpretable FLOPs estimator is developed to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.

Updated: 2025-10-09 07:56:11

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2507.06223v2
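
Both proposed metrics are plain ratios; a sketch following the abstract's wording (function names are mine, and the paper's exact definitions may differ):

```python
def rpp(ranking_metric: float, petaflops: float) -> float:
    """Ranking quality (e.g. NDCG or MRR) achieved per PetaFLOP."""
    return ranking_metric / petaflops

def qpp(num_queries: int, petaflops: float) -> float:
    """Number of queries processed per PetaFLOP."""
    return num_queries / petaflops

# Hypothetical reranker: NDCG@10 = 0.45 over 6,000 queries at 1.5 PFLOPs.
print(rpp(0.45, 1.5))   # 0.3
print(qpp(6000, 1.5))   # 4000.0
```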

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.

Updated: 2025-10-09 07:55:28

Categories: eess.AS,cs.AI,cs.SD

Download: http://arxiv.org/abs/2509.16622v2
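
Of the masking strategies mentioned, low-confidence masking is the easiest to sketch: tokens whose confidence falls below a threshold are masked for the diffusion model to re-predict. The transcript and confidences below are made up:

```python
import numpy as np

# Hypothetical first-pass transcript with per-token confidences.
tokens = np.array(["the", "cat", "sad", "on", "the", "mat"])
conf = np.array([0.99, 0.97, 0.42, 0.95, 0.98, 0.96])

threshold = 0.5
masked = np.where(conf < threshold, "[MASK]", tokens)
# A denoiser such as LLaDA would then fill the masked slots, conditioned
# on the audio embeddings, over one or more diffusion steps.
```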

Decentralised Blockchain Management Through Digital Twins

The necessity of blockchain systems to remain decentralised limits current solutions to blockchain governance and dynamic management, forcing a trade-off between control and decentralisation. In light of the above, this work proposes a dynamic and decentralised blockchain management mechanism based on digital twins. To ensure decentralisation, the proposed mechanism utilises multiple digital twins that the system's stakeholders control. To facilitate decentralised decision-making, the twins are organised in a secondary blockchain system that orchestrates agreement on, and propagation of decisions to the managed blockchain. This enables the management of blockchain systems without centralised control. A preliminary evaluation of the performance and impact of the overheads introduced by the proposed mechanism is conducted through simulation. The results demonstrate the proposed mechanism's ability to reach consensus on decisions quickly and reconfigure the primary blockchain with minimal overhead.

Updated: 2025-10-09 07:54:26

Categories: cs.CR,cs.DC

Download: http://arxiv.org/abs/2510.07901v1

MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography

We present MInDI-3D (Medical Inversion by Direct Iteration in 3D), the first 3D conditional diffusion-based model for real-world sparse-view Cone Beam Computed Tomography (CBCT) artefact removal, aiming to reduce imaging radiation exposure. A key contribution is extending the "InDI" concept from 2D to a full 3D volumetric approach for medical images, implementing an iterative denoising process that refines the CBCT volume directly from sparse-view input. A further contribution is the generation of a large pseudo-CBCT dataset (16,182) from chest CT volumes of the CT-RATE public dataset to robustly train MInDI-3D. We performed a comprehensive evaluation, including quantitative metrics, scalability analysis, generalisation tests, and a clinical assessment by 11 clinicians. Our results show MInDI-3D's effectiveness, achieving a 12.96 (6.10) dB PSNR gain over uncorrected scans with only 50 projections on the CT-RATE pseudo-CBCT (independent real-world) test set and enabling an 8x reduction in imaging radiation exposure. We demonstrate its scalability by showing that performance improves with more training data. Importantly, MInDI-3D matches the performance of a 3D U-Net on real-world scans from 16 cancer patients across distortion and task-based metrics. It also generalises to new CBCT scanner geometries. Clinicians rated our model as sufficient for patient positioning across all anatomical sites and found it preserved lung tumour boundaries well.

Updated: 2025-10-09 07:53:47

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.09616v2

Teaching Your Models to Understand Code via Focal Preference Alignment

Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success rates, with the candidate demonstrating a higher pass rate being labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships. As a result, the model is unable to learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To facilitate it, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, model and datasets are in: https://github.com/JieWu02/Target-DPO.

Updated: 2025-10-09 07:51:19

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.02783v4
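
A hedged sketch of the core idea, a DPO-style margin computed only over tokens in the located error region; all numbers are illustrative and the paper's tailored algorithm may differ in detail:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = 0.1
# Per-token log-probs for the chosen (fixed) and rejected (buggy) snippets
# under the policy and the frozen reference model; values are made up.
logp_pol_w = np.array([-0.2, -0.1, -0.5, -0.3])
logp_ref_w = np.array([-0.3, -0.2, -0.6, -0.4])
logp_pol_l = np.array([-0.2, -0.9, -1.2, -0.3])
logp_ref_l = np.array([-0.3, -0.5, -0.7, -0.4])
err_mask = np.array([0.0, 1.0, 1.0, 0.0])   # tokens inside the error region

# Sequence margins restricted to the error-region tokens.
margin_w = np.sum(err_mask * (logp_pol_w - logp_ref_w))
margin_l = np.sum(err_mask * (logp_pol_l - logp_ref_l))
loss = -np.log(sigmoid(beta * (margin_w - margin_l)))
```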

CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D

3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs) and projecting these into 3D. However, these methods often produce fragmented masks and inaccurate semantic assignments due to the direct use of raw masks, limiting their effectiveness in complex environments. To address this, we leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks, mitigating the over-segmentation commonly observed in mask generation models such as vanilla SAM, and improving downstream 3D semantic segmentation. To further enhance semantic context, we employ a context-aware CLIP encoding strategy that integrates multiple contextual views of each mask using empirically determined weighting, providing much richer visual context. We evaluate our approach on multiple 3D scene understanding tasks, including 3D semantic segmentation and object retrieval from language queries, across several benchmark datasets. Experimental results demonstrate significant improvements over existing methods, highlighting the effectiveness of our approach.

Updated: 2025-10-09 07:45:58

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2509.24528v2

Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression Filtering Method for SEM Images

Scanning Electron Microscopy (SEM) images often suffer from noise contamination, which degrades image quality and affects further analysis. This research presents a complete approach to estimate their Signal-to-Noise Ratio (SNR) and noise variance (NV), and to enhance image quality using an NV-guided Wiener filter. The main idea of this study is to pair a strong SNR estimation technique with a machine learning model that estimates the NV of the SEM image, which then guides the Wiener filter to remove the noise, providing a more robust and accurate SEM image filtering pipeline. First, we investigate five different SNR estimation techniques, namely the Nearest Neighbourhood (NN) method, First-Order Linear Interpolation (FOL) method, Nearest Neighbourhood with First-Order Linear Interpolation (NN+FOL) method, Non-Linear Least Squares Regression (NLLSR) method, and Linear Least Squares Regression (LSR) method. The LSR method is shown to perform better than the rest. Then, Support Vector Machines (SVM) and Gaussian Process Regression (GPR) are tested by pairing each with LSR. In this test, the Optimizable GPR model shows the highest accuracy and stands as the most effective solution for NV estimation. Combining these results leads to the proposed Adaptive Optimizable Gaussian Process Regression Linear Least Squares Regression (AO-GPRLLSR) filtering pipeline. The AO-GPRLLSR method generates an estimated noise variance which serves as input to the NV-guided Wiener filter for improving the quality of SEM images. The proposed method achieves notable success in estimating the SNR and NV of SEM images and leads to a lower Mean Squared Error (MSE) after the filtering process.

Updated: 2025-10-09 07:45:35

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07895v1
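
The NV-guided Wiener step can be sketched as a locally adaptive (Lee-style) filter whose shrinkage is driven by the supplied noise variance; the AO-GPRLLSR estimator itself is not reproduced here, so `noise_var` is passed in directly:

```python
import numpy as np

def box_mean(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Local mean over a k x k window (reflect-padded)."""
    r = k // 2
    p = np.pad(img, r, mode="reflect")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(k) for j in range(k)) / (k * k)

def nv_guided_wiener(img: np.ndarray, noise_var: float, k: int = 3) -> np.ndarray:
    """Adaptive Wiener filter: shrink each pixel toward the local mean
    in proportion to how much of the local variance is noise."""
    mu = box_mean(img, k)
    local_var = np.maximum(box_mean(img ** 2, k) - mu ** 2, 0.0)
    gain = np.maximum(local_var - noise_var, 0.0) / np.maximum(local_var, 1e-12)
    return mu + gain * (img - mu)

rng = np.random.default_rng(0)
clean = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))  # smooth stand-in image
noisy = clean + rng.normal(scale=0.1, size=clean.shape)
denoised = nv_guided_wiener(noisy, noise_var=0.1 ** 2)
```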

Towards Meaningful Transparency in Civic AI Systems

Artificial intelligence has become a part of the provision of governmental services, from making decisions about benefits to issuing fines for parking violations. However, AI systems rarely live up to the promise of neutral optimisation, creating biased or incorrect outputs and reducing the agency of both citizens and civic workers to shape the way decisions are made. Transparency is a principle that can both help subjects understand decisions made about them and shape the processes behind those decisions. However, transparency as practiced around AI systems tends to focus on the production of technical objects that represent algorithmic aspects of decision making. These are often difficult for publics to understand, do not connect to potential for action, and do not give insight into the wider socio-material context of decision making. In this paper, we build on existing approaches that take a human-centric view on AI transparency, combined with a socio-technical systems view, to develop the concept of meaningful transparency for civic AI systems: transparencies that allow publics to engage with AI systems that affect their lives, connecting understanding with potential for action.

Updated: 2025-10-09 07:43:01

Categories: cs.AI,cs.CY,cs.HC

Download: http://arxiv.org/abs/2510.07889v1

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. When scaled to ReasonMed-14B, it remains highly competitive, underscoring consistent scaling potential. The codes and datasets are available at https://github.com/YuSun-Work/ReasonMed.

Updated: 2025-10-09 07:42:50

Categories: cs.CL,cs.AI,cs.MA

Download: http://arxiv.org/abs/2506.09513v3

EpiCoder: Encompassing Diversity and Complexity in Code Generation

Existing methods for code generation use code snippets as seed data, restricting the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, which captures and recognizes more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios. We fine-tuned widely-used base models to obtain EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential in the synthesizing of repository-level code data. Our code and data are publicly available at https://github.com/microsoft/EpiCoder.

Updated: 2025-10-09 07:39:50

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2501.04694v3

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including dedicated multilingual and long-form tracks. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.

Updated: 2025-10-09 07:39:28

Categories: cs.CL,cs.AI,cs.SD,eess.AS

Download: http://arxiv.org/abs/2510.06961v2
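
WER, the leaderboard's accuracy metric, is the word-level Levenshtein distance normalized by reference length. A self-contained implementation (the text normalization the leaderboard standardizes is omitted):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn the first i ref words into the first j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution (sat -> sit) and one deletion against 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```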

MoM: Linear Sequence Modeling with Mixture-of-Memories

Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training while keeping constant complexity during inference. Our experimental results show that MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.

Updated: 2025-10-09 07:38:50

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2502.13685v3
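
A minimal sketch of the routing idea, with top-1 routing into linear-attention-style outer-product memories; the projections and router below are random stand-ins, and the paper's update mechanisms are more general:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mem, seq_len = 8, 4, 16

W_router = rng.normal(size=(d, n_mem))  # router: token -> memory logits
S = np.zeros((n_mem, d, d))             # independent memory states

outputs = []
for _ in range(seq_len):
    x = rng.normal(size=d)
    q = k = v = x                     # identity projections, for brevity
    m = int(np.argmax(x @ W_router))  # router picks one memory state
    S[m] += np.outer(k, v)            # linear-complexity memory update
    outputs.append(q @ S[m])          # read from the selected memory

out = np.stack(outputs)
```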

Signal-to-Noise Ratio in Scanning Electron Microscopy: A Comprehensive Review

Scanning Electron Microscopy (SEM) is critical in nanotechnology, materials science, and biological imaging due to its high spatial resolution and depth of focus. Signal-to-noise ratio (SNR) is an essential parameter in SEM because it directly impacts the quality and interpretability of the images. SEM is widely used in various scientific disciplines, but its utility can be compromised by noise, which degrades image clarity. This review explores multiple aspects of the SEM imaging process, from the principal operation of SEM, sources of noise in SEM, methods for SNR measurement and estimations, to various aspects that affect the SNR measurement and approaches to enhance SNR, both from a hardware and software standpoint. We review traditional and emerging techniques, focusing on their applications, advantages, and limitations. The paper aims to provide a comprehensive understanding of SNR optimization in SEM for researchers and practitioners and to encourage further research in the field.

Updated: 2025-10-09 07:38:46

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07886v1

Contrastive Weak-to-strong Generalization

Weak-to-strong generalization provides a promising paradigm for scaling large language models (LLMs) by training stronger models on samples from aligned weaker ones, without requiring human feedback or explicit reward modeling. However, its robustness and generalization are hindered by the noise and biases in weak-model outputs, which limit its applicability in practice. To address this challenge, we leverage implicit rewards, which approximate explicit rewards through log-likelihood ratios, and reveal their structural equivalence with Contrastive Decoding (CD), a decoding strategy shown to reduce noise in LLM generation. Building on this connection, we propose Contrastive Weak-to-Strong Generalization (ConG), a framework that employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples. This approach enables more reliable capability transfer, denoising, and improved robustness, substantially mitigating the limitations of traditional weak-to-strong methods. Empirical results across different model families confirm consistent improvements, demonstrating the generality and effectiveness of ConG. Taken together, our findings highlight the potential of ConG to advance weak-to-strong generalization and provide a promising pathway toward AGI.
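A minimal sketch of the contrastive-decoding scoring the abstract builds on, using toy next-token logits for the pre- and post-alignment weak models. The `alpha` plausibility filter, `beta` weight, and exact score form are illustrative, not ConG's precise formulation; the key quantity is the log-likelihood ratio, which is also what implicit rewards approximate:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

# Toy next-token logits over a 5-token vocabulary (hypothetical numbers):
logits_aligned = np.array([2.0, 1.0, 0.5, 0.2, -1.0])  # post-alignment weak model
logits_base    = np.array([1.5, 1.4, 0.4, 0.3, -0.5])  # pre-alignment weak model

def contrastive_scores(lp_expert, lp_amateur, alpha=0.1, beta=1.0):
    """Keep tokens the expert finds plausible, then boost by the
    log-likelihood ratio -- the same quantity an implicit reward uses."""
    plausible = lp_expert >= np.log(alpha) + lp_expert.max()
    ratio = lp_expert - lp_amateur
    return np.where(plausible, lp_expert + beta * ratio, -np.inf)

scores = contrastive_scores(log_softmax(logits_aligned), log_softmax(logits_base))
best = int(np.argmax(scores))
```

Tokens whose probability rose most after alignment are promoted, while tokens the aligned model already considers implausible are masked out entirely.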

Updated: 2025-10-09 07:37:23

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07884v1

Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information

Language agents powered by large language models (LLMs) face significant deployment challenges in resource-constrained environments, particularly for specialized domains and less-common languages. This paper presents Tox-chat, a Korean chemical toxicity information agent devised within these limitations. We propose two key innovations: a context-efficient architecture that reduces token consumption through hierarchical section search, and a scenario-based dialogue generation methodology that effectively distills tool-using capabilities from larger models. Experimental evaluations demonstrate that our fine-tuned 8B parameter model substantially outperforms both untuned models and baseline approaches, in terms of DB faithfulness and preference. Our work offers valuable insights for researchers developing domain-specific language agents under practical constraints.

Updated: 2025-10-09 07:36:34

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2503.17753v3

Learning General Causal Structures with Hidden Dynamic Process for Climate Analysis

Understanding climate dynamics requires going beyond correlations in observational data to uncover their underlying causal process. Latent drivers, such as atmospheric processes, play a critical role in temporal dynamics, while direct causal influences also exist among geographically proximate observed variables. Traditional Causal Representation Learning (CRL) typically focuses on latent factors but overlooks such observable-to-observable causal relations, limiting its applicability to climate analysis. In this paper, we introduce a unified framework that jointly uncovers (i) causal relations among observed variables and (ii) latent driving forces together with their interactions. We establish conditions under which both the hidden dynamic processes and the causal structure among observed variables are simultaneously identifiable from time-series data. Remarkably, our guarantees hold even in the nonparametric setting, leveraging contextual information to recover latent variables and causal relations. Building on these insights, we propose CaDRe (Causal Discovery and Representation learning), a time-series generative model with structural constraints that integrates CRL and causal discovery. Experiments on synthetic datasets validate our theoretical results. On real-world climate datasets, CaDRe not only delivers competitive forecasting accuracy but also recovers visualized causal graphs aligned with domain expertise, thereby offering interpretable insights into climate systems.

Updated: 2025-10-09 07:34:57

Categories: cs.LG,stat.ME

Download: http://arxiv.org/abs/2501.12500v2

AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents

Large Language Model (LLM) Agents have demonstrated remarkable capabilities in task automation and intelligent decision-making, driving the widespread adoption of agent development frameworks such as LangChain and AutoGen. However, these frameworks predominantly serve developers with extensive technical expertise, a significant limitation considering that only 0.03% of the global population possesses the necessary programming skills. This stark accessibility gap raises a fundamental question: Can we enable everyone, regardless of technical background, to build their own LLM agents using natural language alone? To address this challenge, we introduce AutoAgent, a Fully-Automated and highly Self-Developing framework that enables users to create and deploy LLM agents through natural language alone. Operating as an autonomous Agent Operating System, AutoAgent comprises four key components: i) Agentic System Utilities, ii) LLM-powered Actionable Engine, iii) Self-Managing File System, and iv) Self-Play Agent Customization module. This lightweight yet powerful system enables efficient and dynamic creation and modification of tools, agents, and workflows without coding requirements or manual intervention. Beyond its code-free agent development capabilities, AutoAgent also serves as a versatile multi-agent system for General AI Assistants. Comprehensive evaluations on the GAIA benchmark demonstrate AutoAgent's effectiveness in generalist multi-agent tasks, surpassing existing state-of-the-art methods. Furthermore, AutoAgent's Retrieval-Augmented Generation (RAG)-related capabilities have shown consistently superior performance compared to many alternative LLM-based solutions.

Updated: 2025-10-09 07:27:04

Categories: cs.AI,cs.CL

Download: http://arxiv.org/abs/2502.05957v3

Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge

LLMs are remarkable artifacts that have revolutionized a range of NLP and AI tasks. A significant contributor is their factual knowledge, which, to date, remains poorly understood, and is usually analyzed from biased samples. In this paper, we take a deep tour into the factual knowledge (or beliefs) of a frontier LLM, based on GPTKB v1.5 (Hu et al., 2025a), a recursively elicited set of 100 million beliefs of one of the strongest currently available frontier LLMs, GPT-4.1. We find that the models' factual knowledge differs quite significantly from established knowledge bases, and that its accuracy is significantly lower than indicated by previous benchmarks. We also find that inconsistency, ambiguity and hallucinations are major issues, shedding light on future research opportunities concerning factual LLM knowledge.

Updated: 2025-10-09 07:23:03

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07024v2

Team Xiaomi EV-AD VLA: Learning to Navigate Socially Through Proactive Risk Perception - Technical Report for IROS 2025 RoboSense Challenge Social Navigation Track

In this report, we describe the technical details of our submission to the IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on developing RGBD-based perception and navigation systems that enable autonomous agents to navigate safely, efficiently, and socially compliantly in dynamic human-populated indoor environments. The challenge requires agents to operate from an egocentric perspective using only onboard sensors including RGB-D observations and odometry, without access to global maps or privileged information, while maintaining social norm compliance such as safe distances and collision avoidance. Building upon the Falcon model, we introduce a Proactive Risk Perception Module to enhance social navigation performance. Our approach augments Falcon with collision risk understanding that learns to predict distance-based collision risk scores for surrounding humans, which enables the agent to develop more robust spatial awareness and proactive collision avoidance behaviors. The evaluation on the Social-HM3D benchmark demonstrates that our method improves the agent's ability to maintain personal space compliance while navigating toward goals in crowded indoor scenes with dynamic human agents, achieving 2nd place among 16 participating teams in the challenge.

Updated: 2025-10-09 07:22:12

Categories: cs.RO,cs.AI,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.07871v1

On the Optimality of the Median-of-Means Estimator under Adversarial Contamination

The Median-of-Means (MoM) is a robust estimator widely used in machine learning that is known to be (minimax) optimal in scenarios where samples are i.i.d. In more grave scenarios, samples are contaminated by an adversary that can inspect and modify the data. Previous work has theoretically shown the suitability of the MoM estimator in certain contaminated settings. However, the (minimax) optimality of MoM and its limitations under adversarial contamination remain unknown beyond the Gaussian case. In this paper, we present upper and lower bounds for the error of MoM under adversarial contamination for multiple classes of distributions. In particular, we show that MoM is (minimax) optimal in the class of distributions with finite variance, as well as in the class of distributions with infinite variance and finite absolute $(1+r)$-th moment. We also provide lower bounds for MoM's error that match the order of the presented upper bounds, and show that MoM is sub-optimal for light-tailed distributions.
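The estimator itself is short enough to sketch. The block count `k` and the contamination pattern below are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def median_of_means(x, k):
    """Split x into k equal-size blocks, average each block, take the median.
    A few corrupted blocks shift some block means, not the median."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // k) * k          # drop the remainder so blocks are equal
    block_means = x[:n].reshape(k, -1).mean(axis=1)
    return float(np.median(block_means))

rng = np.random.default_rng(2)
sample = rng.standard_t(df=2.5, size=10_000)   # heavy-tailed, true mean 0
sample[:50] = 1e6                              # adversarial contamination
naive = sample.mean()
robust = median_of_means(sample, k=20)
```

Here the 50 corrupted points all land in one of the 20 blocks, so the sample mean is pulled to the thousands while the MoM estimate stays near the true mean of 0.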

Updated: 2025-10-09 07:17:09

Categories: stat.ML,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2510.07867v1

DM1: MeanFlow with Dispersive Regularization for 1-Step Robotic Manipulation

The ability to learn multi-modal action distributions is indispensable for robotic manipulation policies to perform precise and robust control. Flow-based generative models have recently emerged as a promising solution to learning distributions of actions, offering one-step action generation and thus achieving much higher sampling efficiency compared to diffusion-based methods. However, existing flow-based policies suffer from representation collapse, the inability to distinguish similar visual representations, leading to failures in precise manipulation tasks. We propose DM1 (MeanFlow with Dispersive Regularization for One-Step Robotic Manipulation), a novel flow matching framework that integrates dispersive regularization into MeanFlow to prevent collapse while maintaining one-step efficiency. DM1 employs multiple dispersive regularization variants across different intermediate embedding layers, encouraging diverse representations across training batches without introducing additional network modules or specialized training procedures. Experiments on RoboMimic benchmarks show that DM1 achieves 20-40 times faster inference (0.07s vs. 2-3.5s) and improves success rates by 10-20 percentage points, with the Lift task reaching 99% success versus the baseline's 85%. Real-robot deployment on a Franka Panda further validates that DM1 transfers effectively from simulation to the physical world. To the best of our knowledge, this is the first work to leverage representation regularization to enable flow-based policies to achieve strong performance in robotic manipulation, establishing a simple yet powerful approach for efficient and robust manipulation.
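One plausible form of the dispersive regularizer is the InfoNCE-style variant without positive pairs: the log-mean-exp of negative squared pairwise distances within a batch. The paper evaluates multiple variants, so treat this as a sketch of the general idea rather than DM1's exact loss:

```python
import numpy as np

def dispersive_loss(z, tau=1.0):
    """InfoNCE-style dispersion term with no positive pair: log-mean-exp of
    negative squared pairwise distances. Equals 0 when the batch collapses
    to a single point; grows more negative as embeddings spread apart."""
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    off_diag = ~np.eye(len(z), dtype=bool)
    return float(np.log(np.exp(-d2[off_diag] / tau).mean()))

collapsed = np.zeros((4, 3))           # degenerate, indistinguishable embeddings
spread = 5.0 * np.eye(4, 3)            # well-separated embeddings
```

Minimizing this term pushes intermediate embeddings apart, which is the mechanism the abstract credits with preventing representation collapse.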

Updated: 2025-10-09 07:12:20

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.07865v1

On the Optimality of Tracking Fisher Information in Adaptive Testing with Stochastic Binary Responses

We study the problem of estimating a continuous ability parameter from sequential binary responses by actively asking questions with varying difficulties, a setting that arises naturally in adaptive testing and online preference learning. Our goal is to certify that the estimate lies within a desired margin of error, using as few queries as possible. We propose a simple algorithm that adaptively selects questions to maximize Fisher information and updates the estimate using a method-of-moments approach, paired with a novel test statistic to decide when the estimate is accurate enough. We prove that this Fisher-tracking strategy achieves optimal performance in both fixed-confidence and fixed-budget regimes, which are commonly studied in the best-arm identification literature. Our analysis overcomes a key technical challenge in the fixed-budget setting -- handling the dependence between the evolving estimate and the query distribution -- by exploiting a structural symmetry in the model and combining large deviation tools with Ville's inequality. Our results provide rigorous theoretical support for simple and efficient adaptive testing procedures.
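Under a Rasch-style response model, the Fisher-tracking selection rule is easy to sketch: information is maximized by the question whose difficulty is closest to the current ability estimate. The grid of difficulties is hypothetical, and the stochastic-approximation update below is a simple stand-in for the paper's method-of-moments update and stopping statistic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_information(theta, b):
    """Rasch model: P(correct) = sigmoid(theta - b), I(theta) = p * (1 - p),
    maximized by the item whose difficulty b is nearest theta."""
    p = sigmoid(theta - b)
    return p * (1.0 - p)

rng = np.random.default_rng(3)
true_theta, theta_hat = 1.2, 0.0
difficulties = np.linspace(-3.0, 3.0, 61)     # hypothetical question pool

for step in range(200):
    # Ask the question maximizing Fisher information at the current estimate.
    b = difficulties[np.argmax(fisher_information(theta_hat, difficulties))]
    y = float(rng.random() < sigmoid(true_theta - b))  # stochastic binary response
    # Robbins-Monro style update toward the MLE (a stand-in for the paper's
    # method-of-moments update):
    theta_hat += 5.0 / (step + 1) * (y - sigmoid(theta_hat - b))
```

Because the selected difficulty tracks `theta_hat`, each query is answered correctly with probability near 1/2, which is exactly the most informative regime.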

Updated: 2025-10-09 07:10:00

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.07862v1

Understanding DeepResearch via Reports

DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, which are capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions: quality, redundancy, and factuality, using an innovative LLM-as-a-Judge methodology achieving strong expert concordance. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.

Updated: 2025-10-09 07:03:43

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07861v1

Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models

Large language models (LLMs) have emerged as a promising avenue for time series forecasting, offering the potential to integrate multimodal data. However, existing LLM-based approaches face notable limitations, such as a marginalized role in model architectures, reliance on coarse statistical text prompts, and lack of interpretability. In this work, we introduce Augur, a fully LLM-driven time series forecasting framework that exploits LLM causal reasoning to discover and use directed causal associations among covariates. Augur uses a two-stage teacher-student architecture where a powerful teacher LLM infers a directed causal graph from time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and fine-tunes on high-confidence causal associations that are encoded as rich textual prompts to perform forecasting. This design improves predictive accuracy while yielding transparent, traceable reasoning about variable interactions. Extensive experiments on real-world datasets with 25 baselines demonstrate that Augur achieves competitive performance and robust zero-shot generalization.

Updated: 2025-10-09 06:59:15

Categories: cs.AI,cs.LG,62M10,I.2.7

Download: http://arxiv.org/abs/2510.07858v1

Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling

Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness, particularly when processing queries exceeding their knowledge boundaries. While existing mitigation strategies employ uncertainty estimation or query rejection mechanisms, they suffer from computational inefficiency and sacrificed helpfulness. To address these issues, we propose the Explicit Knowledge Boundary Modeling (EKBM) framework, integrating fast and slow reasoning systems to harmonize reliability and usability. The framework first employs a fast-thinking model to generate confidence-labeled responses, enabling immediate utilization of high-confidence outputs, whereas uncertain predictions trigger a slow refinement model for accuracy improvement. To align model behavior with our proposed objective, we propose a hybrid training pipeline, enhancing self-awareness without degrading task performance. Evaluations on dialogue state tracking tasks demonstrate that EKBM achieves superior model reliability over uncertainty-based baselines. Further analysis reveals that refinement substantially boosts accuracy while maintaining low computational overhead. The framework establishes a scalable paradigm for deploying reliable LLMs in error-sensitive applications, effectively balancing accuracy and practical utility.
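The fast/slow routing can be sketched with stub models; the lookup table, confidence values, and threshold below are placeholders for the framework's confidence-labeled responses, not EKBM's actual components:

```python
def fast_model(query):
    """Stub fast-thinking model returning (answer, confidence label)."""
    table = {
        "capital of France": ("Paris", 0.97),
        "population of Lyon in 2031": ("unknown", 0.31),  # beyond its knowledge
    }
    return table.get(query, ("unknown", 0.0))

def slow_model(query):
    """Stub slow refinement model, invoked only on uncertain predictions."""
    return "needs verification: " + query

def answer(query, threshold=0.8):
    """EKBM-style routing: use high-confidence fast output directly;
    escalate low-confidence predictions to the slow refiner."""
    ans, conf = fast_model(query)
    return ans if conf >= threshold else slow_model(query)
```

The efficiency argument falls out of the structure: the expensive slow path only runs on the minority of queries the fast model flags as uncertain.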

Updated: 2025-10-09 06:57:43

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2503.02233v4

Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials

High-throughput toxicity testing offers a fast and cost-effective way to test large amounts of compounds. A key component for such systems is the automated evaluation via machine learning models. In this paper, we address critical challenges in this domain and demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. We provide a proof-of-concept that utilizes the publicly available EmbryoNet dataset, which contains ten zebrafish embryo phenotypes elicited by various chemical compounds targeting different processes in early embryonic development. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds. Finally, we discuss the integration of machine learning models in a physical toxicity testing device in the context of the TOXBOX project.

Updated: 2025-10-09 06:51:12

Categories: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.07853v1

Multiple Memory Systems for Enhancing the Long-term Memory of Agent

Agents powered by large language models have achieved impressive results, but effectively handling the vast amounts of historical data generated during interactions remains a challenge. The current approach is to design a memory module for the agent to process these data. However, existing methods, such as MemoryBank and A-MEM, have poor quality of stored memory content, which affects recall performance and response quality. In order to better construct high-quality long-term memory content, we have designed a multiple memory system (MMS) inspired by cognitive psychology theory. The system processes short-term memory into multiple long-term memory fragments, and constructs retrieval memory units and contextual memory units based on these fragments, with a one-to-one correspondence between the two. During the retrieval phase, MMS matches the most relevant retrieval memory units based on the user's query. The corresponding contextual memory units are then retrieved as context for the response stage to enhance knowledge, thereby effectively utilizing historical data. Experiments on the LoCoMo dataset compared our method with three others, proving its effectiveness. Ablation studies confirmed the rationality of our memory units. We also analyzed the robustness regarding the number of selected memory segments and the storage overhead, demonstrating its practical value.
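A toy version of the paired retrieval/context units described above, with word-overlap similarity standing in for whatever retriever the real system uses (the memory entries and matching function are illustrative, not from the paper):

```python
import re

# One-to-one paired units, as in the MMS design: a compact retrieval key
# plus the richer context unit handed to the response stage.
memory = [
    {"retrieval": "user lives in Berlin",
     "context": "In March the user said they had moved to Berlin for a new job."},
    {"retrieval": "user is allergic to peanuts",
     "context": "While planning dinner, the user mentioned a severe peanut allergy."},
]

def similarity(a, b):
    """Word-overlap (Jaccard) similarity; a stand-in for a real retriever."""
    wa = set(re.findall(r"[a-z]+", a.lower()))
    wb = set(re.findall(r"[a-z]+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def recall(query, top_k=1):
    """Match the query against retrieval units; return the paired contexts."""
    ranked = sorted(memory, key=lambda m: -similarity(query, m["retrieval"]))
    return [m["context"] for m in ranked[:top_k]]

hit = recall("Which city does the user live in?")[0]
```

The design point is the separation of concerns: matching happens against short, focused retrieval units, while the richer paired context is what actually conditions the response.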

Updated: 2025-10-09 06:50:57

Categories: cs.AI,cs.CL,cs.MA,I.2.7

Download: http://arxiv.org/abs/2508.15294v2

PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization

Multimodal Large Language Models (MLLMs), which integrate vision and other modalities into Large Language Models (LLMs), significantly enhance AI capabilities but also introduce new security vulnerabilities. By exploiting the vulnerabilities of the visual modality and the long-tail distribution characteristic of code training data, we present PiCo, a novel jailbreaking framework designed to progressively bypass multi-tiered defense mechanisms in advanced MLLMs. PiCo employs a tier-by-tier jailbreak strategy, using token-level typographic attacks to evade input filtering and embedding harmful intent within programming context instructions to bypass runtime monitoring. To comprehensively assess the impact of attacks, a new evaluation metric is further proposed to assess both the toxicity and helpfulness of model outputs post-attack. By embedding harmful intent within code-style visual instructions, PiCo achieves an average Attack Success Rate (ASR) of 84.13% on Gemini-Pro Vision and 52.66% on GPT-4, surpassing previous methods. Experimental results highlight the critical gaps in current defenses, underscoring the need for more robust strategies to secure advanced MLLMs.

Updated: 2025-10-09 06:50:04

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2504.01444v4

FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning

Multimodal Large Language Models (MLLMs) have made substantial progress in recent years. However, their rigorous evaluation within specialized domains like finance is hindered by the absence of datasets characterized by professional-level knowledge intensity, detailed annotations, and advanced reasoning complexity. To address this critical gap, we introduce FinMR, a high-quality, knowledge-intensive multimodal dataset explicitly designed to evaluate expert-level financial reasoning capabilities at a professional analyst's standard. FinMR comprises over 3,200 meticulously curated and expertly annotated question-answer pairs across 15 diverse financial topics, ensuring broad domain diversity and integrating sophisticated mathematical reasoning, advanced financial knowledge, and nuanced visual interpretation tasks across multiple image types. Through comprehensive benchmarking with leading closed-source and open-source MLLMs, we highlight significant performance disparities between these models and professional financial analysts, uncovering key areas for model advancement, such as precise image analysis, accurate application of complex financial formulas, and deeper contextual financial understanding. By providing richly varied visual content and thorough explanatory annotations, FinMR establishes itself as an essential benchmark tool for assessing and advancing multimodal financial reasoning toward professional analyst-level competence.

Updated: 2025-10-09 06:49:55

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07852v1

Formalizing Style in Personal Narratives

Personal narratives are stories authors construct to make meaning of their experiences. Style, the distinctive way authors use language to express themselves, is fundamental to how these narratives convey subjective experiences. Yet there is a lack of a formal framework for systematically analyzing these stylistic choices. We present a novel approach that formalizes style in personal narratives as patterns in the linguistic choices authors make when communicating subjective experiences. Our framework integrates three domains: functional linguistics establishes language as a system of meaningful choices, computer science provides methods for automatically extracting and analyzing sequential patterns, and these patterns are linked to psychological observations. Using language models, we automatically extract linguistic features such as processes, participants, and circumstances. We apply our framework to hundreds of dream narratives, including a case study on a war veteran with post-traumatic stress disorder. Analysis of his narratives uncovers distinctive patterns, particularly how verbal processes dominate over mental ones, illustrating the relationship between linguistic choices and psychological states.

Updated: 2025-10-09 06:48:06

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.08649v1

Leveraging Personalized PageRank and Higher-Order Topological Structures for Heterophily Mitigation in Graph Neural Networks

Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN's effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN's ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at https://github.com/streetcorner/HPGNN.
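
The higher-order operator HPGNN builds on, Personalized PageRank propagated over multiple hops, can be sketched with a truncated power series in pure Python. The graph, restart probability, and hop count below are toy choices for illustration, not values from the paper:

```python
# Hedged sketch: truncated personalized PageRank (PPR), the kind of multi-hop
# propagation HPGNN approximates efficiently. All names and parameters here
# are illustrative.

def ppr_scores(adj, seed, alpha=0.5, num_hops=10):
    """Approximate PPR via a truncated series:
    pi ≈ sum_k alpha * (1 - alpha)^k * (P^T)^k e_seed,
    where P is the row-normalized adjacency and e_seed the seed indicator."""
    n = len(adj)
    # Row-normalize the adjacency into a transition matrix P.
    P = []
    for row in adj:
        deg = sum(row)
        P.append([v / deg if deg else 0.0 for v in row])
    # Start all probability mass on the seed node.
    x = [0.0] * n
    x[seed] = 1.0
    pi = [0.0] * n
    weight = alpha
    for _ in range(num_hops):
        for i in range(n):
            pi[i] += weight * x[i]
        # One propagation step: push mass along edges (x <- P^T x).
        x = [sum(P[j][i] * x[j] for j in range(n)) for i in range(n)]
        weight *= (1.0 - alpha)
    return pi

# Tiny 4-node path graph 0-1-2-3: scores should decay with distance from seed 0.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
scores = ppr_scores(adj, seed=0)
```

Truncating the series at a fixed hop count is what keeps the computation cheap; the discarded tail mass is bounded by (1 - alpha)^num_hops.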

Updated: 2025-10-09 06:45:15

Categories: cs.LG,cs.AI,I.2.6

Download: http://arxiv.org/abs/2507.16347v2

Meta-Learning Based Few-Shot Graph-Level Anomaly Detection

Graph-level anomaly detection aims to identify anomalous graphs or subgraphs within graph datasets, playing a vital role in various fields such as fraud detection, review classification, and biochemistry. While Graph Neural Networks (GNNs) have made significant progress in this domain, existing methods rely heavily on large amounts of labeled data, which is often unavailable in real-world scenarios. Additionally, few-shot anomaly detection methods based on GNNs are prone to noise interference, resulting in poor embedding quality and reduced model robustness. To address these challenges, we propose a novel meta-learning-based graph-level anomaly detection framework (MA-GAD), incorporating a graph compression module that reduces the graph size, mitigating noise interference while retaining essential node information. We also leverage meta-learning to extract meta-anomaly information from similar networks, enabling the learning of an initialization model that can rapidly adapt to new tasks with limited samples. This improves the anomaly detection performance on target graphs, and a bias network is used to enhance the distinction between anomalous and normal nodes. Our experimental results, based on four real-world biochemical datasets, demonstrate that MA-GAD outperforms existing state-of-the-art methods in graph-level anomaly detection under few-shot conditions. Experiments on both graph anomaly and subgraph anomaly detection tasks validate the framework's effectiveness on real-world datasets.

Updated: 2025-10-09 06:45:07

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.07847v1

LogAction: Consistent Cross-system Anomaly Detection through Logs via Active Domain Adaptation

Log-based anomaly detection is an essential task for ensuring the reliability and performance of software systems. However, the performance of existing anomaly detection methods heavily relies on labeling, while labeling a large volume of logs is highly challenging. To address this issue, many approaches based on transfer learning and active learning have been proposed. Nevertheless, their effectiveness is hindered by issues such as the gap between source and target system data distributions and cold-start problems. In this paper, we propose LogAction, a novel log-based anomaly detection model based on active domain adaptation. LogAction integrates transfer learning and active learning techniques. On one hand, it uses labeled data from a mature system to train a base model, mitigating the cold-start issue in active learning. On the other hand, LogAction utilizes free-energy-based sampling and uncertainty-based sampling to select logs located at the distribution boundaries for manual labeling, thus addressing the data distribution gap in transfer learning with minimal human labeling effort. Experimental results on six different combinations of datasets demonstrate that LogAction achieves an average 93.01% F1 score with only 2% of manual labels, outperforming some state-of-the-art methods by 26.28%. Website: https://logaction.github.io
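
The two sampling criteria LogAction combines can be illustrated with standard formulas: free energy over the logits as an out-of-distribution signal, and the top-2 probability margin as a decision-boundary signal. The exact scores and the way they are combined below are assumptions for illustration, not the paper's formulation:

```python
import math

# Hedged sketch of active-sample selection in the spirit of LogAction:
# high free energy suggests a log is far from the source distribution,
# and a small top-2 margin suggests it sits near the decision boundary.

def free_energy(logits, T=1.0):
    # F(x) = -T * log(sum_c exp(z_c / T)); higher values ~ less familiar input.
    return -T * math.log(sum(math.exp(z / T) for z in logits))

def margin_uncertainty(probs):
    # Small gap between the top-2 class probabilities = uncertain prediction.
    top2 = sorted(probs, reverse=True)[:2]
    return 1.0 - (top2[0] - top2[1])

def select_for_labeling(batch, budget):
    """Rank candidate logs by a combined score and return indices to label."""
    scored = []
    for i, (logits, probs) in enumerate(batch):
        scored.append((free_energy(logits) + margin_uncertainty(probs), i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:budget]]

batch = [
    ([4.0, -2.0], [0.95, 0.05]),  # confident, in-distribution log
    ([0.1, 0.0],  [0.52, 0.48]),  # boundary case: should be selected
    ([5.0, -3.0], [0.97, 0.03]),  # confident, in-distribution log
]
picked = select_for_labeling(batch, budget=1)
```

With a small labeling budget, only the boundary case is sent to the human annotator, which is the behavior the abstract describes.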

Updated: 2025-10-09 06:43:25

Categories: cs.LG,cs.AI,cs.DC,cs.SE

Download: http://arxiv.org/abs/2510.03288v2

AdaSwitch: Adaptive Switching Generation for Knowledge Distillation

Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
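
The token-level switching rule can be sketched as a generation loop that keeps the student's token when it passes a quality check and otherwise falls back to the teacher. The quality test and the toy "models" below are stand-ins for the real-time assessment AdaSwitch describes:

```python
# Hedged sketch of adaptive on-policy/off-policy switching at the token level.
# student_step / teacher_step / quality_ok are illustrative callables, not the
# paper's components.

def adaswitch_generate(student_step, teacher_step, quality_ok, max_len=8):
    """Build a sequence token by token: keep the student's candidate when its
    quality passes, otherwise take the teacher's token instead."""
    tokens, sources = [], []
    for _ in range(max_len):
        cand = student_step(tokens)
        if quality_ok(tokens, cand):
            tokens.append(cand)
            sources.append("student")
        else:
            tokens.append(teacher_step(tokens))
            sources.append("teacher")
    return tokens, sources

# Toy setup: the "student" proposes i*2, the "teacher" proposes i*2 + 1,
# and the quality check rejects the student at even positions.
student = lambda toks: len(toks) * 2
teacher = lambda toks: len(toks) * 2 + 1
ok = lambda toks, cand: len(toks) % 2 == 1

tokens, sources = adaswitch_generate(student, teacher, ok, max_len=4)
```

Because the switch happens per token rather than per sequence, the student still conditions on its own partial outputs (consistency) while low-quality steps are corrected by teacher supervision.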

Updated: 2025-10-09 06:38:37

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07842v1

Self-Improving LLM Agents at Test-Time

One paradigm of language model (LM) fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this work, we explore a new test-time self-improvement method to create more effective and generalizable agentic LMs on-the-fly. The proposed algorithm can be summarized in three steps: (i) it first identifies the samples that the model struggles with (self-awareness), (ii) then generates similar examples from the detected uncertain samples (self-data augmentation), and (iii) uses these newly generated samples for test-time fine-tuning (self-improvement). We study two variants of this approach: Test-Time Self-Improvement (TT-SI), where the same model generates additional training examples from its own uncertain cases and then learns from them, and contrast this approach with Test-Time Distillation (TT-D), where a stronger model generates similar examples for uncertain cases, enabling the student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT-SI improves performance with a +5.48% absolute accuracy gain on average across all benchmarks and surpasses other standard learning methods, while using 68x fewer training samples. Our findings highlight the promise of TT-SI, demonstrating the potential of self-improvement algorithms at test-time as a new paradigm for building more capable agents toward self-evolution.
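
The three steps can be mimicked on a deliberately tiny stand-in "model" (a 1-D threshold classifier instead of an LM), purely to make the loop concrete. The confidence measure, neighborhood augmentation, and threshold nudge are all illustrative assumptions:

```python
# Hedged sketch of the self-awareness -> self-augmentation -> self-improvement
# loop on a toy threshold classifier. Nothing here is the paper's actual
# training procedure; it only mirrors the three-step structure.

def confidence(model_threshold, x):
    # Confidence grows with distance from the decision threshold.
    return min(1.0, abs(x - model_threshold))

def tt_self_improve(model_threshold, test_xs, labels_fn, conf_min=0.3):
    # (i) Self-awareness: find inputs the model is uncertain about.
    uncertain = [x for x in test_xs if confidence(model_threshold, x) < conf_min]
    # (ii) Self-data augmentation: generate neighbors of uncertain inputs.
    augmented = [x + d for x in uncertain for d in (-0.1, 0.1)]
    # (iii) Self-improvement: "fine-tune" = nudge the threshold so the
    # augmented samples are classified according to their labels.
    for x in augmented:
        y = labels_fn(x)
        if y == 1 and x < model_threshold:
            model_threshold = x - 0.05
        elif y == 0 and x >= model_threshold:
            model_threshold = x + 0.05
    return model_threshold, len(uncertain), len(augmented)

# True boundary at 0.5; the model starts at 0.8, so points near 0.8 look
# uncertain and drive the update.
true_label = lambda x: 1 if x >= 0.5 else 0
new_t, n_unc, n_aug = tt_self_improve(0.8, [0.05, 0.75, 0.9, 1.9], true_label)
```

Only the uncertain cases (here, two of four test inputs) generate training signal, which is the sample-efficiency argument the abstract makes.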

Updated: 2025-10-09 06:37:35

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2510.07841v1

Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.

Updated: 2025-10-09 06:29:26

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2505.13527v2

MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation

This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful queries disguised by unseen attack templates, despite LLMs being capable of distinguishing disguised harmful queries in the embedding space. Based on these insights, we propose a two-stage defense approach: (i) pre-generation defense that detects harmful queries before response generation begins, and (ii) mid-generation defense that monitors partial responses during generation to prevent outputting more harmful content. Our MetaDefense trains the LLM to predict the harmfulness of both queries and partial responses using specialized prompts, enabling early termination of potentially harmful interactions. Extensive experiments across multiple LLM architectures (LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct) demonstrate that MetaDefense significantly outperforms existing defense mechanisms, achieving robust defense against harmful queries with seen and unseen attack templates while maintaining competitive performance on benign tasks. Code is available at https://github.com/ws-jiang/MetaDefense.
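
The two defense stages compose naturally as a pre-check plus a monitored generation loop. The harmfulness predictors below are keyword stubs standing in for the prompted LLM judgments the paper trains; all names are illustrative:

```python
# Hedged sketch of MetaDefense's two checkpoints. In the paper, the LLM itself
# predicts harmfulness of the query and of partial responses; here simple
# keyword stubs play that role.

def pre_generation_check(query, is_harmful_query):
    # Stage 1: refuse before any response tokens are produced.
    return "REFUSED" if is_harmful_query(query) else None

def generate_with_mid_check(chunks, is_harmful_partial):
    # Stage 2: monitor the partial response and terminate generation early.
    out = []
    for chunk in chunks:
        out.append(chunk)
        if is_harmful_partial(" ".join(out)):
            return " ".join(out[:-1]) + " [TERMINATED]"
    return " ".join(out)

harmful_q = lambda q: "bomb" in q        # stub query classifier
harmful_p = lambda p: "step 3" in p      # stub partial-response classifier

r1 = pre_generation_check("how to build a bomb", harmful_q)
r0 = pre_generation_check("how to bake bread", harmful_q)
resp = generate_with_mid_check(["step 1", "step 2", "step 3", "step 4"],
                               harmful_p)
```

The mid-generation check is what catches queries that slip past the pre-check: generation stops as soon as the partial response turns harmful, rather than after the full answer is emitted.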

Updated: 2025-10-09 06:27:34

Categories: cs.LG,cs.AI,cs.CL,cs.CR

Download: http://arxiv.org/abs/2510.07835v1

Surrogate Graph Partitioning for Spatial Prediction

Spatial prediction refers to the estimation of unobserved values from spatially distributed observations. Although recent advances have improved the capacity to model diverse observation types, adoption in practice remains limited in industries that demand interpretability. To mitigate this gap, surrogate models that explain black-box predictors provide a promising path toward interpretable decision making. In this study, we propose a graph partitioning problem to construct spatial segments that minimize the sum of within-segment variances of individual predictions. The assignment of data points to segments can be formulated as a mixed-integer quadratic programming problem. While this formulation potentially enables the identification of exact segments, its computational complexity becomes prohibitive as the number of data points increases. Motivated by this challenge, we develop an approximation scheme that leverages the structural properties of graph partitioning. Experimental results demonstrate the computational efficiency of this approximation in identifying spatial segments.
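
The objective being minimized, the sum of within-segment variances of the black-box predictions, is easy to state in code. The exhaustive comparison of contiguous splits below is a toy stand-in for the paper's MIQP formulation and its approximation scheme:

```python
# Hedged sketch of the surrogate-partitioning objective: choose segments that
# minimize the summed within-segment variance of individual predictions.
# The enumeration here is illustrative; the paper solves this at scale.

def within_var(preds):
    m = sum(preds) / len(preds)
    return sum((p - m) ** 2 for p in preds)

def objective(segments, preds):
    return sum(within_var([preds[i] for i in seg]) for seg in segments)

# Four points on a line (edges between consecutive points); predictions jump
# between points 1 and 2, so the best 2-segment split is {0,1} | {2,3}.
preds = [1.0, 1.1, 5.0, 5.2]
splits = [
    [[0], [1, 2, 3]],
    [[0, 1], [2, 3]],
    [[0, 1, 2], [3]],
]
best = min(splits, key=lambda s: objective(s, preds))
```

Restricting candidates to contiguous (graph-connected) splits is what makes the segments spatially interpretable; enumerating them is exactly what becomes infeasible as the point count grows, motivating the approximation.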

Updated: 2025-10-09 06:24:49

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.07832v1

The Rise of the Knowledge Sculptor: A New Archetype for Knowledge Work in the Age of Generative AI

In the Generative Age, the nature of knowledge work is transforming. Traditional models that emphasise the organisation and retrieval of pre-existing information are increasingly inadequate in the face of generative AI (GenAI) systems capable of autonomous content creation. This paper introduces the Knowledge Sculptor (KS), a new professional archetype for Human-GenAI collaboration that transforms raw AI output into trustworthy, actionable knowledge. Grounded in a socio-technical perspective, the KS is conceptualised through a framework of competencies, including architecting a vision, iterative dialogue, information sculpting, and curiosity-driven synthesis. A practice-based vignette illustrates the KS role in action, and in a self-referential approach, the paper itself serves as an artefact of the sculpting process it describes.

Updated: 2025-10-09 06:19:17

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2510.07829v1

GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement

Document Image Enhancement (DIE) serves as a critical component in Document AI systems, where its performance substantially determines the effectiveness of downstream tasks. To address the limitations of existing methods confined to single-degradation restoration or grayscale image processing, we present Global with Local Parametric Generation Enhancement Network (GL-PGENet), a novel architecture designed for multi-degraded color document images, ensuring both efficiency and robustness in real-world scenarios. Our solution incorporates three key innovations: First, a hierarchical enhancement framework that integrates global appearance correction with local refinement, enabling coarse-to-fine quality improvement. Second, a Dual-Branch Local-Refine Network with parametric generation mechanisms that replaces conventional direct prediction, producing enhanced outputs through learned intermediate parametric representations rather than pixel-wise mapping. This approach enhances local consistency while improving model generalization. Finally, a modified NestUNet architecture incorporating dense blocks to effectively fuse low-level pixel features and high-level semantic features, specifically adapted for document image characteristics. In addition, to enhance generalization performance, we adopt a two-stage training strategy: large-scale pretraining on a synthetic dataset of 500,000+ samples followed by task-specific fine-tuning. Extensive experiments demonstrate the superiority of GL-PGENet, achieving state-of-the-art SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE. The model also exhibits remarkable cross-domain adaptability and maintains computational efficiency for high-resolution images without performance degradation, confirming its practical utility in real-world scenarios.

Updated: 2025-10-09 06:16:02

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2505.22021v2

An LLM-Powered Cooperative Framework for Large-Scale Multi-Vehicle Navigation

The rise of Internet of Vehicles (IoV) technologies is transforming traffic management from isolated control to a collective, multi-vehicle process. At the heart of this shift is multi-vehicle dynamic navigation, which requires simultaneously routing large fleets under evolving traffic conditions. Existing path search algorithms and reinforcement learning methods struggle to scale to city-wide networks, often failing to capture the nonlinear, stochastic, and coupled dynamics of urban traffic. To address these challenges, we propose CityNav, a hierarchical, LLM-powered framework for large-scale multi-vehicle navigation. CityNav integrates a global traffic allocation agent, which coordinates strategic traffic flow distribution across regions, with local navigation agents that generate locally adaptive routes aligned with global directives. To enable effective cooperation, we introduce a cooperative reasoning optimization mechanism, in which agents are jointly trained with a dual-reward structure: individual rewards promote per-vehicle efficiency, while shared rewards encourage network-wide coordination and congestion reduction. Extensive experiments on four real-world road networks of varying scales (up to 1.6 million roads and 430,000 intersections) and traffic datasets demonstrate that CityNav consistently outperforms nine classical path search and RL-based baselines in city-scale travel efficiency and congestion mitigation. Our results highlight the potential of LLMs to enable scalable, adaptive, and cooperative city-wide traffic navigation, providing a foundation for intelligent, large-scale vehicle routing in complex urban environments. Our project is available at https://github.com/usail-hkust/CityNav.

Updated: 2025-10-09 06:14:29

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07825v1

The Role of Model Confidence on Bias Effects in Measured Uncertainties for Vision-Language Models

With the growing adoption of Large Language Models (LLMs) for open-ended tasks, accurately assessing epistemic uncertainty, which reflects a model's lack of knowledge, has become crucial to ensuring reliable outcomes. However, quantifying epistemic uncertainty in such tasks is challenging due to the presence of aleatoric uncertainty, which arises from multiple valid answers. While bias can introduce noise into epistemic uncertainty estimation, it may also reduce noise from aleatoric uncertainty. To investigate this trade-off, we conduct experiments on Visual Question Answering (VQA) tasks and find that mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. Building on prior work showing that LLMs tend to copy input information when model confidence is low, we further analyze how these prompt biases affect measured epistemic and aleatoric uncertainty across varying bias-free confidence levels with GPT-4o and Qwen2-VL. We find that all considered biases have greater effects in both uncertainties when bias-free model confidence is lower. Moreover, lower bias-free model confidence is associated with greater bias-induced underestimation of epistemic uncertainty, resulting in overconfident estimates, whereas it has no significant effect on the direction of bias effect in aleatoric uncertainty estimation. These distinct effects deepen our understanding of bias mitigation for uncertainty quantification and potentially inform the development of more advanced techniques.

Updated: 2025-10-09 06:08:47

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.16724v2

SIMU: Selective Influence Machine Unlearning

The undesired memorization of sensitive information by Large Language Models (LLMs) has emphasized the need for safety mechanisms that can regulate model behavior. This has led to the development of machine unlearning techniques that enable models to precisely forget sensitive and unwanted information. For machine unlearning, first-order and second-order optimizer-based methods have shown significant progress in enabling LLMs to forget targeted information. However, in doing so, these approaches often compromise the model's original capabilities, resulting in unlearned models that struggle to retain their prior knowledge and overall utility. To address this, we propose Selective Influence Machine Unlearning (SIMU), a two-step framework that enhances second-order optimizer-based unlearning by selectively updating only the critical neurons responsible for encoding the forget-set. By constraining updates to these targeted neurons, SIMU achieves comparable unlearning efficacy while substantially outperforming current methods in retaining the model's original knowledge.
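
The selective-update idea can be sketched as a two-step procedure: score neurons by their influence on the forget-set, then apply the unlearning update only to the top-scoring ones. The influence scores, update magnitudes, and keep ratio below are toy numbers, not the paper's second-order quantities:

```python
# Hedged sketch of SIMU-style selective unlearning: mask the update so only
# neurons critical to the forget-set are modified, preserving the rest of the
# model's knowledge. All values are illustrative.

def select_critical(influence, keep_ratio=0.25):
    """Return the indices of the top fraction of neurons by |influence|."""
    k = max(1, int(len(influence) * keep_ratio))
    order = sorted(range(len(influence)), key=lambda i: -abs(influence[i]))
    return set(order[:k])

def masked_unlearn_step(weights, updates, critical):
    # Only neurons responsible for encoding the forget-set are touched;
    # everything else keeps its original weight.
    return [w - u if i in critical else w
            for i, (w, u) in enumerate(zip(weights, updates))]

weights = [1.0, 2.0, 3.0, 4.0]
influence = [0.01, 0.9, 0.05, 0.02]   # neuron 1 dominates the forget-set
updates = [0.5, 0.5, 0.5, 0.5]        # unmasked step would move every weight
critical = select_critical(influence, keep_ratio=0.25)
new_w = masked_unlearn_step(weights, updates, critical)
```

Note how three of the four weights are untouched after the step: that constraint is the mechanism by which utility on the retain-set is preserved.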

Updated: 2025-10-09 06:03:15

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.07822v1

IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs

Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR, a quality-constrained Intelligent Prompt Routing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter τ ∈ [0,1] that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset, IPRBench (to be released upon legal approval), a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency. The deployed system and additional product details are publicly available at https://aws.amazon.com/bedrock/intelligent-prompt-routing/
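
The tolerance-controlled routing rule admits a compact reading: among models whose predicted quality is within τ of the best predicted quality, pick the cheapest. The model names, costs, and quality scores below are made up for illustration:

```python
# Hedged sketch of quality-constrained routing with a user tolerance tau,
# in the spirit of IPR. Candidate tuples are illustrative, not real pricing.

def route(candidates, tau):
    """candidates: list of (name, predicted_quality in [0,1], cost_per_1k).
    Keep models within tau of the best predicted quality; return the
    cheapest of those."""
    best_q = max(q for _, q, _ in candidates)
    eligible = [c for c in candidates if c[1] >= best_q - tau]
    return min(eligible, key=lambda c: c[2])[0]

models = [
    ("small",  0.78, 0.2),
    ("medium", 0.88, 1.0),
    ("large",  0.93, 5.0),
]

strict = route(models, tau=0.0)    # quality-first: only the best qualifies
relaxed = route(models, tau=0.20)  # cost-first within a 0.20 quality band
```

Sweeping τ from 0 to 1 traces the whole quality-cost frontier, which is the explicit user control the abstract highlights.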

Updated: 2025-10-09 05:51:50

Categories: cs.LG

Download: http://arxiv.org/abs/2509.06274v4

Strategic Communication under Threat: Learning Information Trade-offs in Pursuit-Evasion Games

Adversarial environments require agents to navigate a key strategic trade-off: acquiring information enhances situational awareness, but may simultaneously expose them to threats. To investigate this tension, we formulate a Pursuit-Evasion-Exposure-Concealment Game (PEEC) in which a pursuer agent must decide when to communicate in order to obtain the evader's position. Each communication reveals the pursuer's location, increasing the risk of being targeted. Both agents learn their movement policies via reinforcement learning, while the pursuer additionally learns a communication policy that balances observability and risk. We propose SHADOW (Strategic-communication Hybrid Action Decision-making under partial Observation for Warfare), a multi-headed sequential reinforcement learning framework that integrates continuous navigation control, discrete communication actions, and opponent modeling for behavior prediction. Empirical evaluations show that SHADOW pursuers achieve higher success rates than six competitive baselines. Our ablation study confirms that temporal sequence modeling and opponent modeling are critical for effective decision-making. Finally, our sensitivity analysis reveals that the learned policies generalize well across varying communication risks and physical asymmetries between agents.

Updated: 2025-10-09 05:44:00

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07813v1

Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning

Continual learning (CL) with large pre-trained models is challenged by catastrophic forgetting and task interference. Existing LoRA-based Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and freezing task-specific adapters, but their coarse adapter-level selection introduces three key challenges: 1) Interference: Activating full LoRA experts per input leads to subspace interference and prevents selective reuse of useful components across tasks. 2) Redundancy: Newly added experts often duplicate or contradict existing knowledge due to unnecessary activation of unrelated ranks and insufficient reuse of relevant ones. 3) Ambiguity: Overlapping features across tasks confuse the router, resulting in unstable expert assignments. As more experts accumulate, earlier task routing degrades, accelerating forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with self-activated and sparse rank activation for CL. Unlike mixing multiple low-rank matrices, MoRA decomposes each rank-r update into r rank-one components, each treated as an independent expert, enabling fine-grained rank-one expert utilization while mitigating interference and redundancy. To avoid ambiguous routing, we propose that each rank-one expert can infer its own relevance via intermediate activations. Coupled with our proposed rank pruning and activation budgets, MoRA adaptively selects a sparse mixture of ranks per input. We validate MoRA on continual learning benchmarks using CLIP and language models, analyzing both in-domain learning and out-of-domain forgetting/generalization during fine-tuning. MoRA shows significant effectiveness in enhancing CL with PTMs, and improving generalization while mitigating forgetting.
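
The core decomposition is linear algebra: a rank-r LoRA update B·A is a sum of r rank-one terms b_i (a_i · x), so each term can be gated independently. The relevance score (here, |a_i · x|) and the activation budget below are illustrative stand-ins for MoRA's self-activated routing:

```python
# Hedged sketch of rank-one expert activation in the spirit of MoRA, using
# pure-Python matrices. Scoring and budget are illustrative assumptions.

def mora_delta(A, B, x, budget=1):
    """A: rows a_i (rank-one input directions), B: rows b_i (rank-one output
    directions), x: input vector. Activate only the `budget` rank-one experts
    with the largest |a_i . x| and sum their contributions b_i (a_i . x)."""
    acts = [sum(a * v for a, v in zip(A[i], x)) for i in range(len(A))]
    chosen = sorted(range(len(A)), key=lambda i: -abs(acts[i]))[:budget]
    out = [0.0] * len(B[0])
    for i in chosen:
        for j in range(len(B[i])):
            out[j] += B[i][j] * acts[i]   # rank-one contribution
    return out, chosen

A = [[1.0, 0.0], [0.0, 1.0]]   # two rank-one experts (input directions)
B = [[1.0, 0.0], [0.0, 1.0]]   # their output directions b_1 = e1, b_2 = e2
x = [0.2, 0.9]                 # input aligned mostly with expert 2

delta, chosen = mora_delta(A, B, x, budget=1)
```

With a budget of one, only the expert whose input direction matches x fires; unrelated ranks stay inactive, which is the interference/redundancy argument in miniature.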

Updated: 2025-10-09 05:43:44

Categories: cs.LG

Download: http://arxiv.org/abs/2506.21035v2

Attention based End to end network for Offline Writer Identification on Word level data

Writer identification, owing to its widespread application in various fields, has gained popularity over the years. In scenarios where optimum handwriting samples are available, whether they be in the form of a single line, a sentence, or an entire page, writer identification algorithms have demonstrated noteworthy levels of accuracy. However, in scenarios where only a limited number of handwritten samples are available, particularly in the form of word images, there is significant scope for improvement. In this paper, we propose a writer identification system based on an attention-driven Convolutional Neural Network (CNN). The system is trained utilizing image segments, known as fragments, extracted from word images, employing a pyramid-based strategy. This methodology enables the system to capture a comprehensive representation of the data, encompassing both fine-grained details and coarse features across various levels of abstraction. These extracted fragments serve as the training data for the convolutional network, enabling it to learn a more robust representation compared to traditional convolution-based networks trained on word images. Additionally, the paper explores the integration of an attention mechanism to enhance the representational power of the learned features. The efficacy of the proposed algorithm is evaluated on three benchmark databases, demonstrating its proficiency in writer identification tasks, particularly in scenarios with limited access to handwriting data.
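The pyramid-based fragment extraction can be illustrated with a small sketch. The fragment sizes and the width-only splitting below are hypothetical choices for illustration; the paper's exact fragment geometry may differ.

```python
def extract_fragments(image, levels=(1, 2, 4)):
    """image: 2-D list (H rows x W columns). At pyramid level k the word image
    is split along its width into k equal fragments, so the network sees both
    coarse (whole-word) and fine (sub-word) pieces."""
    w = len(image[0])
    fragments = []
    for k in levels:
        step = w // k
        for i in range(k):
            fragments.append([row[i * step:(i + 1) * step] for row in image])
    return fragments

img = [list(range(8)) for _ in range(4)]  # toy 4x8 "word image"
frags = extract_fragments(img)
print(len(frags), [len(f[0]) for f in frags])
```

The seven fragments (one full-width, two half-width, four quarter-width) would then all be fed to the CNN as independent training samples.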

Updated: 2025-10-09 05:43:00

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2404.07602v2

Personalized Federated Fine-Tuning for LLMs via Data-Driven Heterogeneous Model Architectures

Large language models (LLMs) are increasingly powering web-based applications, whose effectiveness relies on fine-tuning with large-scale instruction data. However, such data often contains valuable or sensitive information that limits its public sharing among business organizations. Federated learning (FL) enables collaborative fine-tuning of LLMs without accessing raw data. Existing approaches to federated LLM fine-tuning usually adopt a uniform model architecture, making it challenging to fit highly heterogeneous client-side data in varying domains and tasks, e.g., hospitals and financial institutions conducting federated fine-tuning may require different LLM architectures due to the distinct nature of their domains and tasks. To address this, we propose FedAMoLE, a lightweight personalized FL framework that enables data-driven heterogeneous model architectures. It features a heterogeneous mixture of low-rank adaptation (LoRA) experts module to aggregate architecturally heterogeneous models and a reverse selection-based expert assignment strategy to tailor model architectures for each client based on data distributions. Experiments across seven scenarios demonstrate that FedAMoLE improves client-side performance by an average of 5.97% over existing approaches while maintaining practical memory, communication, and computation overhead.

Updated: 2025-10-09 05:40:31

Categories: cs.LG

Download: http://arxiv.org/abs/2411.19128v4

Adaptive Execution Scheduler for DataDios SmartDiff

We present an adaptive scheduler for a single differencing engine (SmartDiff) with two execution modes: (i) in-memory threads and (ii) Dask based parallelism. The scheduler continuously tunes batch size and worker/thread count within fixed CPU and memory budgets to minimize p95 latency. A lightweight preflight profiler estimates bytes/row and I/O rate; an online cost/memory model prunes unsafe actions; and a guarded hill-climb policy favors lower latency with backpressure and straggler mitigation. Backend selection is gated by a conservative working-set estimate so that in-memory execution is chosen when safe, otherwise Dask is used. Across synthetic and public tabular benchmarks, the scheduler reduces p95 latency by 23 to 28 percent versus a tuned warm-up heuristic (and by 35 to 40 percent versus fixed grid baselines), while lowering peak memory by 16 to 22 percent (25 to 32 percent vs. fixed) with zero OOMs and comparable throughput.
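The guarded hill-climb over batch size can be sketched in a few lines. This is a simplification under stated assumptions: the real scheduler also tunes worker/thread count and uses an online cost model, and `estimate_mem` here is a hypothetical stand-in for the preflight profiler's bytes/row estimate.

```python
def estimate_mem(batch, bytes_per_row):
    # Preflight-style working-set estimate for one batch.
    return batch * bytes_per_row

def guarded_hill_climb(latency_of, batch, mem_budget, bytes_per_row, steps=10):
    """Double or halve the batch size, accepting moves that lower p95 latency
    and pruning any candidate whose memory estimate exceeds the budget."""
    best = latency_of(batch)
    for _ in range(steps):
        for cand in (batch * 2, batch // 2):
            if cand < 1 or estimate_mem(cand, bytes_per_row) > mem_budget:
                continue  # unsafe or degenerate action: pruned by the cost model
            lat = latency_of(cand)
            if lat < best:
                batch, best = cand, lat
    return batch, best

# Toy latency curve: per-batch overhead shrinks, spill cost grows with batch size.
lat = lambda b: 1e6 / b + 0.01 * b
print(guarded_hill_climb(lat, 256, mem_budget=10_000_000, bytes_per_row=100))
```

The climb settles near the curve's sweet spot while the budget check guarantees no out-of-memory candidate is ever tried, mirroring the paper's zero-OOM goal.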

Updated: 2025-10-09 05:40:16

Categories: cs.DC,cs.LG

Download: http://arxiv.org/abs/2510.07811v1

LogicMP: A Neuro-symbolic Approach for Encoding First-order Logic Constraints

Integrating first-order logic constraints (FOLCs) with neural networks is a crucial but challenging problem since it involves modeling intricate correlations to satisfy the constraints. This paper proposes a novel neural layer, LogicMP, which performs mean-field variational inference over a Markov logic network (MLN). It can be plugged into any off-the-shelf neural network to encode FOLCs while retaining modularity and efficiency. By exploiting the structure and symmetries in MLNs, we theoretically demonstrate that our well-designed, efficient mean-field iterations effectively mitigate the difficulty of MLN inference, reducing the inference from sequential calculation to a series of parallel tensor operations. Empirical results in three kinds of tasks over graphs, images, and text show that LogicMP outperforms advanced competitors in both performance and efficiency.

Updated: 2025-10-09 05:36:43

Categories: cs.AI,cs.SC

Download: http://arxiv.org/abs/2309.15458v4

Long Chain-of-Thought Reasoning Across Languages

While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development--scaling, pretraining, post-training, and inference--to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.

Updated: 2025-10-09 05:36:20

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.14828v2

Effective and Stealthy One-Shot Jailbreaks on Deployed Mobile Vision-Language Agents

Large vision-language models (LVLMs) enable autonomous mobile agents to operate smartphone user interfaces, yet vulnerabilities to UI-level attacks remain critically understudied. Existing research often depends on conspicuous UI overlays, elevated permissions, or impractical threat models, limiting stealth and real-world applicability. In this paper, we present a practical and stealthy one-shot jailbreak attack that leverages in-app prompt injections: malicious applications embed short prompts in UI text that remain inert during human interaction but are revealed when an agent drives the UI via ADB (Android Debug Bridge). Our framework comprises three crucial components: (1) low-privilege perception-chain targeting, which injects payloads into malicious apps as the agent's visual inputs; (2) stealthy user-invisible activation, a touch-based trigger that discriminates agent from human touches using physical touch attributes and exposes the payload only during agent operation; and (3) one-shot prompt efficacy, a heuristic-guided, character-level iterative-deepening search algorithm (HG-IDA*) that performs one-shot, keyword-level detoxification to evade on-device safety filters. We evaluate across multiple LVLM backends, including closed-source services and representative open-source models within three Android applications, and we observe high planning and execution hijack rates in single-shot scenarios (e.g., GPT-4o: 82.5% planning / 75.0% execution). These findings expose a fundamental security vulnerability in current mobile agents with immediate implications for autonomous smartphone operation.

Updated: 2025-10-09 05:34:57

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.07809v1

ANCORA: Accurate Intrusion Recovery for Web Applications

Modern web application recovery presents a critical dilemma. Coarse-grained snapshot rollbacks cause unacceptable data loss for legitimate users. Surgically removing an attack's impact is hindered by a fundamental challenge in high-concurrency environments: it is difficult to attribute resulting file and database modifications to a specific attack-related request. We present ANCORA, a system for precise intrusion recovery in web applications without invasive instrumentation. ANCORA first isolates the full sequence of syscalls triggered by a single malicious request. Based on this sequence, ANCORA addresses file and database modifications separately. To trace file changes, it builds a provenance graph that reveals all modifications, including those by exploit-spawned processes. To attribute database operations, a more difficult challenge due to connection pooling, ANCORA introduces a novel spatiotemporal anchor. This anchor uses the request's network connection tuple and active time window to pinpoint exact database operations. With all malicious file and database operations precisely identified, ANCORA performs a unified rewind and selective replay recovery. It reverts the system to a clean snapshot taken before the attack, then selectively re-applies only legitimate operations to both the file system and database. This completely removes the attack's effects while preserving concurrent legitimate data. We evaluated ANCORA on 10 web applications and 20 CVE-based attack scenarios with concurrency up to 150 connections. Experiments demonstrate ANCORA achieves 99.9% recovery accuracy with manageable overhead: up to 19.8% response latency increase and 17.8% QPS decrease in worst cases, and recovery throughput of 110.7 database operations per second and 27.2 affected files per second, effectively preserving legitimate data.
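The spatiotemporal anchor reduces to a simple filter once database operations are logged with timestamps and connection tuples. A hedged sketch with hypothetical field names (ANCORA derives these records from syscall traces, not from a ready-made list):

```python
def attribute_ops(db_ops, conn_tuple, t_start, t_end):
    """Attribute pooled DB operations to one request: keep only operations on
    the request's connection 4-tuple that fall inside its active time window.

    db_ops: list of dicts with 'conn' (client ip/port, server ip/port),
    'ts' (timestamp), and 'sql' fields (illustrative schema)."""
    return [op for op in db_ops
            if op["conn"] == conn_tuple and t_start <= op["ts"] <= t_end]

ops = [
    {"conn": ("10.0.0.5", 52311, "10.0.0.9", 3306), "ts": 100.2, "sql": "UPDATE users ..."},
    {"conn": ("10.0.0.6", 40100, "10.0.0.9", 3306), "ts": 100.3, "sql": "SELECT ..."},
    {"conn": ("10.0.0.5", 52311, "10.0.0.9", 3306), "ts": 105.0, "sql": "INSERT ..."},
]
# The malicious request used connection ("10.0.0.5", 52311, ...) during t=[100, 101].
malicious = attribute_ops(ops, ("10.0.0.5", 52311, "10.0.0.9", 3306), 100.0, 101.0)
print([op["sql"] for op in malicious])
```

Note how the concurrent operation on another connection and the later operation on the same pooled connection are both excluded, which is exactly the attribution problem connection pooling creates.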

Updated: 2025-10-09 05:33:09

Categories: cs.CR

Download: http://arxiv.org/abs/2510.07806v1

Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.
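The proxy-guided iterative construction can be caricatured as a greedy loop. This is a deliberate simplification: GTD uses a conditional discrete graph diffusion model, not greedy search, and the `proxy` scoring function below is a made-up stand-in for the learned multi-objective reward predictor.

```python
def build_topology(n_agents, proxy_score, max_edges):
    """Iteratively add the edge the proxy model scores highest, stopping when
    no candidate improves the predicted reward (gradient-free steering)."""
    edges, best = set(), proxy_score(set())
    candidates = {(i, j) for i in range(n_agents) for j in range(i + 1, n_agents)}
    while len(edges) < max_edges:
        gains = [(proxy_score(edges | {e}), e) for e in candidates - edges]
        score, e = max(gains)
        if score <= best:
            break  # no edge improves the predicted multi-objective reward
        edges.add(e)
        best = score
    return edges

# Hypothetical proxy: reward links to agent 0 (a coordinator), penalize edge cost.
proxy = lambda es: sum(1.0 for e in es if 0 in e) - 0.4 * len(es)
print(sorted(build_topology(4, proxy, max_edges=6)))
```

Under this toy reward the loop converges to a sparse star around agent 0 and then stops, illustrating how a cost term in the proxy yields sparse, task-adaptive topologies instead of a fully connected graph.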

Updated: 2025-10-09 05:28:28

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07799v1

Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models' uncertainty regarding their own knowledge boundaries, where response accuracy correlates with the model's uncertainty in its search decisions. To address this, we propose $\beta$-GRPO, a reinforcement learning-based training method that incorporates a confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that $\beta$-GRPO enables a 3B model to achieve better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.
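The confidence-threshold idea can be sketched as a certainty-gated outcome reward. This is a simplification under stated assumptions: the paper integrates the threshold $\beta$ into GRPO training rather than exposing it as a standalone function, and the scalar confidences here are illustrative.

```python
def beta_grpo_reward(correct, search_confidences, beta=0.8):
    """Outcome reward gated on certainty: credit is given only when the answer
    is correct AND every search decision cleared the confidence threshold."""
    certain = all(c >= beta for c in search_confidences)
    return 1.0 if correct and certain else 0.0

print(beta_grpo_reward(True, [0.9, 0.95]))  # confident and correct
print(beta_grpo_reward(True, [0.9, 0.55]))  # correct but one uncertain search
```

Withholding reward from low-certainty trajectories pushes the policy away from the hesitant searching that correlates with over-search and under-search.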

Updated: 2025-10-09 05:24:13

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2505.17281v2

Knowledge-Driven Federated Graph Learning on Model Heterogeneity

Federated graph learning (FGL) has emerged as a promising paradigm for collaborative graph representation learning, enabling multiple parties to jointly train models while preserving data privacy. However, most existing approaches assume homogeneous client models and largely overlook the challenge of model-centric heterogeneous FGL (MHtFGL), which frequently arises in practice when organizations employ graph neural networks (GNNs) of different scales and architectures. Such architectural diversity not only undermines smooth server-side aggregation, which presupposes a unified representation space shared across clients' updates, but also further complicates the transfer and integration of structural knowledge across clients. To address this issue, we propose the Federated Graph Knowledge Collaboration (FedGKC) framework. FedGKC introduces a lightweight Copilot Model on each client to facilitate knowledge exchange while local architectures are heterogeneous across clients, and employs two complementary mechanisms: Client-side Self-Mutual Knowledge Distillation, which transfers effective knowledge between local and copilot models through bidirectional distillation with multi-view perturbation; and Server-side Knowledge-Aware Model Aggregation, which dynamically assigns aggregation weights based on knowledge provided by clients. Extensive experiments on eight benchmark datasets demonstrate that FedGKC achieves an average accuracy gain of 3.74% over baselines in MHtFGL scenarios, while maintaining excellent performance in homogeneous settings.
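The server-side knowledge-aware aggregation can be sketched as a score-weighted average. Assumptions are loud here: the scalar "knowledge scores" and the softmax weighting are illustrative stand-ins for whatever FedGKC derives from client-provided knowledge, and aggregation is shown on the shared copilot parameters (since local architectures differ across clients).

```python
import math

def aggregation_weights(knowledge_scores, temperature=1.0):
    # Softmax over per-client knowledge scores: better-informed clients
    # contribute more to the aggregate.
    exps = [math.exp(s / temperature) for s in knowledge_scores]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate(client_params, weights):
    # Weighted average of same-shape parameter vectors (the shared copilot
    # model, which all clients hold regardless of local GNN architecture).
    return [sum(w * p[j] for w, p in zip(weights, client_params))
            for j in range(len(client_params[0]))]

w = aggregation_weights([2.0, 1.0, 1.0])  # client 0 contributed more knowledge
agg = aggregate([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], w)
print([round(x, 3) for x in w], [round(x, 3) for x in agg])
```

The temperature controls how aggressively high-knowledge clients dominate the update; at high temperature the scheme degrades to plain FedAvg-style uniform averaging.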

Updated: 2025-10-09 05:20:16

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2501.12624v3

Task Vector Bases: A Unified and Scalable Framework for Compressed Task Arithmetic

Task arithmetic, representing downstream tasks through linear operations on task vectors, has emerged as a simple yet powerful paradigm for transferring knowledge across diverse settings. However, maintaining a large collection of task vectors introduces scalability challenges in both storage and computation. We propose Task Vector Bases, a framework compressing $T$ task vectors into $M < T$ basis vectors while preserving the functionality of task arithmetic. By representing each task vector as a structured linear combination of basis atoms, our approach supports standard operations such as addition, negation, as well as more advanced arithmetic ones. The framework is orthogonal to other efficiency-oriented improvements in task arithmetic and can be used in combination with them. We provide theoretical analysis showing that basis compression retains addition generalization guarantees and enables principled unlearning, with error bounds depending on reconstruction quality. Empirically, our proposed basis construction methods consistently outperform heuristic basis construction baselines and, in some cases, even surpass the performance of full task vector collections across diverse downstream applications while reducing storage and computational requirements. The code is available at https://github.com/uiuctml/TaskVectorBasis.
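The compressed representation and its arithmetic can be sketched with a toy orthonormal basis. Assumption: the paper learns basis atoms and structured coefficients; here the basis is fixed and orthonormal, so coefficients are plain projections.

```python
def project(tv, basis):
    # Coefficients of a task vector in the basis; exact as-is only because
    # the toy basis below is orthonormal.
    return [sum(t * b for t, b in zip(tv, bvec)) for bvec in basis]

def reconstruct(coeffs, basis):
    dim = len(basis[0])
    out = [0.0] * dim
    for c, bvec in zip(coeffs, basis):
        for j in range(dim):
            out[j] += c * bvec[j]
    return out

# Two basis atoms in R^3: store 2 coefficients per task instead of 3 weights.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
task_a, task_b = [2.0, 0.0, 0.0], [0.0, 3.0, 0.0]
ca, cb = project(task_a, basis), project(task_b, basis)
# Task addition in compressed form: just add the coefficient vectors.
merged = reconstruct([x + y for x, y in zip(ca, cb)], basis)
print(merged)
```

Storing $M$ basis atoms plus $T$ small coefficient vectors replaces $T$ full task vectors, and operations like addition and negation act on the coefficients alone; reconstruction error appears only for task vectors outside the basis span.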

Updated: 2025-10-09 05:18:04

Categories: cs.LG

Download: http://arxiv.org/abs/2502.01015v4

HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs

The extraction and standardization of pharmacokinetic (PK) information from scientific literature remain significant challenges in computational pharmacology, which limits the reliability of data-driven models in drug development. Large language models (LLMs) have achieved remarkable progress in text understanding and reasoning, yet their adaptation to structured biomedical data, such as PK tables, remains constrained by heterogeneity, noise, and domain shift. To address these limitations, we propose HySim-LLM, a unified mathematical and computational framework that integrates embedding-weighted fine-tuning and manifold-aware denoising to enhance the robustness and interpretability of LLMs. We establish two theoretical results: (1) a similarity-weighted generalization bound that quantifies adaptation performance under embedding divergence, and (2) a manifold-based denoising guarantee that bounds loss contributions from noisy or off-manifold samples. These theorems provide a principled foundation for fine-tuning LLMs in structured biomedical settings. The framework offers a mathematically grounded pathway toward reliable and interpretable LLM adaptation for biomedical and data-intensive scientific domains.

Updated: 2025-10-09 05:16:46

Categories: cs.LG,cs.IR

Download: http://arxiv.org/abs/2510.07796v1

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.
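The shape of a hierarchical reward can be sketched in a few lines. The weights and the boolean per-step optimality judgments below are simplified stand-ins for the paper's knowledge-grounded checks, not its actual reward specification.

```python
def hiprag_reward(correct, well_formatted, steps):
    """steps: list of booleans, True when the search/non-search decision at
    that step was optimal (neither over-search nor under-search)."""
    outcome = 1.0 if correct else 0.0          # commonly used outcome reward
    fmt = 0.2 if well_formatted else 0.0       # commonly used format reward
    # Hierarchical bonus: proportional to the share of optimal decisions.
    process_bonus = 0.5 * (sum(steps) / len(steps)) if steps else 0.0
    return outcome + fmt + process_bonus

# Correct, well-formatted trajectory where 3 of 4 search decisions were optimal:
print(hiprag_reward(True, True, [True, True, False, True]))
```

Because the bonus is computed per decision rather than per answer, two trajectories with identical final answers are still ranked by how cleanly they searched, which is what lets training push the over-search rate down.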

Updated: 2025-10-09 05:13:10

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.07794v1

LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology

Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into six families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.

Updated: 2025-10-09 05:12:09

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07793v1

GCPO: When Contrast Fails, Go Gold

Reinforcement learning has been widely applied to enhance the reasoning capabilities of large language models. Extending the inference limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the upper bound of a model's rollout responses is entirely determined by the model itself, preventing the acquisition of knowledge from samples that are either all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. This approach offers two main advantages: (1) it improves training efficiency by fully utilizing every sample; (2) it enables the model to emulate the problem solving strategy of the reference answer during training, thereby enhancing generalization in reasoning. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model. Our code is available at: https://github.com/AchoWu/GCPO.
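The "go gold" mechanism can be sketched at the level of group advantages. This is a toy under stated assumptions: real GCPO operates on token-level policy gradients over sampled rollouts, whereas here rewards are scalars and advantages are simply group-normalized.

```python
def group_advantages(rewards):
    # GRPO-style normalization: advantage of each rollout relative to its group.
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    sd = var ** 0.5 or 1.0  # guard: an all-equal group has zero spread
    return [(r - mu) / sd for r in rewards]

def gcpo_group(rollout_rewards, gold_reward=1.0):
    """If every rollout failed, GRPO's contrastive signal vanishes (all
    advantages are zero). GCPO-style fix: substitute one rollout with the
    gold reference answer so the group yields a usable update direction."""
    if all(r == 0.0 for r in rollout_rewards):
        rollout_rewards = rollout_rewards[:-1] + [gold_reward]
    return group_advantages(rollout_rewards)

print(gcpo_group([0.0, 0.0, 0.0, 0.0]))  # gold answer restores a learning signal
```

The failed rollouts now carry negative advantages and the gold answer a positive one, so even a problem the model cannot yet solve contributes gradient signal instead of being wasted.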

Updated: 2025-10-09 05:09:06

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07790v1

Weak Form Learning for Mean-Field Partial Differential Equations: an Application to Insect Movement

Insect species subject to infection, predation, and anisotropic environmental conditions may exhibit preferential movement patterns. Given the innate stochasticity of exogenous factors driving these patterns over short timescales, individual insect trajectories typically obey overdamped stochastic dynamics. In practice, data-driven modeling approaches designed to learn the underlying Fokker-Planck equations from observed insect distributions serve as ideal tools for understanding and predicting such behavior. Understanding dispersal dynamics of crop and silvicultural pests can lead to a better forecasting of outbreak intensity and location, which can result in better pest management. In this work, we extend weak-form equation learning techniques, coupled with kernel density estimation, to learn effective models for lepidopteran larval population movement from highly sparse experimental data. Galerkin methods such as the Weak form Sparse Identification of Nonlinear Dynamics (WSINDy) algorithm have recently proven useful for learning governing equations in several scientific contexts. We demonstrate the utility of the method on a sparse dataset of position measurements of fall armyworms (Spodoptera frugiperda) obtained in simulated agricultural conditions with varied plant resources and infection status.
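For intuition, the "weak form" moves derivatives off the unknown density and onto a smooth, compactly supported test function $\varphi$. For a generic 1-D Fokker-Planck equation (an illustrative example, not the paper's specific larval-movement model), integration by parts gives:

```latex
% Strong form for the density p(x,t) of an overdamped stochastic process:
%   \partial_t p = -\partial_x(\mu p) + \tfrac{1}{2}\,\partial_x^2(\sigma^2 p)
% Multiplying by a compactly supported test function \varphi(x,t) and
% integrating by parts (all boundary terms vanish) yields the weak form:
-\iint p\,\partial_t\varphi \,dx\,dt
  = \iint \mu\,p\,\partial_x\varphi \,dx\,dt
  + \tfrac{1}{2}\iint \sigma^2 p\,\partial_x^2\varphi \,dx\,dt
```

Every derivative now acts on $\varphi$, so the equation can be evaluated directly from kernel density estimates of $p$ without differentiating sparse, noisy position data, which is the property WSINDy-style regression exploits.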

Updated: 2025-10-09 05:04:32

Categories: cs.LG,math.DS,q-bio.PE,60J70, 62FXX, 92-08

Download: http://arxiv.org/abs/2510.07786v1

PLUM: Adapting Pre-trained Language Models for Industrial-scale Generative Recommendations

Large Language Models (LLMs) pose a new paradigm of modeling and computation for information tasks. Recommendation systems are a critical application domain poised to benefit significantly from the sequence modeling capabilities and world knowledge inherent in these large models. In this paper, we introduce PLUM, a framework designed to adapt pre-trained LLMs for industry-scale recommendation tasks. PLUM consists of item tokenization using Semantic IDs, continued pre-training (CPT) on domain-specific data, and task-specific fine-tuning for recommendation objectives. For fine-tuning, we focus particularly on generative retrieval, where the model is directly trained to generate Semantic IDs of recommended items based on user context. We conduct comprehensive experiments on large-scale internal video recommendation datasets. Our results demonstrate that PLUM achieves substantial improvements for retrieval compared to a heavily-optimized production model built with large embedding tables. We also present a scaling study for the model's retrieval performance, our learnings about CPT, a few enhancements to Semantic IDs, along with an overview of the training and inference methods that enable launching this framework to billions of users in YouTube.

Updated: 2025-10-09 05:01:05

Categories: cs.IR,cs.LG

Download: http://arxiv.org/abs/2510.07784v1

AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Prompt injection attacks pose a significant challenge to the safe deployment of Large Language Models (LLMs) in real-world applications. While prompt-based detection offers a lightweight and interpretable defense strategy, its effectiveness has been hindered by the need for manual prompt engineering. To address this issue, we propose AEGIS, an Automated co-Evolutionary framework for Guarding prompt Injections Schema. Both attack and defense prompts are iteratively optimized against each other using a gradient-like natural language prompt optimization technique. This framework enables both attackers and defenders to autonomously evolve via a Textual Gradient Optimization (TGO) module, leveraging feedback from an LLM-guided evaluation loop. We evaluate our system on a real-world assignment grading dataset of prompt injection attacks and demonstrate that our method consistently outperforms existing baselines, achieving superior robustness in both attack success and detection. Specifically, the attack success rate (ASR) reaches 1.0, representing an improvement of 0.26 over the baseline. For detection, the true positive rate (TPR) improves by 0.23 compared to the previous best work, reaching 0.84, and the true negative rate (TNR) remains comparable at 0.89. Ablation studies confirm the importance of co-evolution, gradient buffering, and multi-objective optimization. We also confirm that this framework is effective across different LLMs. Our results highlight the promise of adversarial training as a scalable and effective approach for guarding against prompt injections.

Updated: 2025-10-09 04:58:46

Categories: cs.CR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2509.00088v2

Bridged Clustering for Representation Learning: Semi-Supervised Sparse Bridging

We introduce Bridged Clustering, a semi-supervised framework to learn predictors from any unpaired input $X$ and output $Y$ dataset. Our method first clusters $X$ and $Y$ independently, then learns a sparse, interpretable bridge between clusters using only a few paired examples. At inference, a new input $x$ is assigned to its nearest input cluster, and the centroid of the linked output cluster is returned as the prediction $\hat{y}$. Unlike traditional SSL, Bridged Clustering explicitly leverages output-only data, and unlike dense transport-based methods, it maintains a sparse and interpretable alignment. Through theoretical analysis, we show that with bounded mis-clustering and mis-bridging rates, our algorithm becomes an effective and efficient predictor. Empirically, our method is competitive with SOTA methods while remaining simple, model-agnostic, and highly label-efficient in low-supervision settings.
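
The pipeline above can be sketched end to end: cluster $X$ and $Y$ independently, link clusters using a handful of paired examples, then predict the linked output centroid. The toy k-means and the one-pair-per-cluster bridge below are illustrative stand-ins for whatever clustering method the framework actually wraps:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(data, k, iters=50):
    """Minimal Lloyd's k-means (any clustering method works here)."""
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        labels = ((data[:, None] - centers) ** 2).sum(-1).argmin(1)
        centers = np.array([data[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers

# Unpaired X and Y datasets: two well-separated clusters in each space.
X = np.concatenate([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
Y = np.concatenate([rng.normal(10, 0.3, (60, 2)), rng.normal(-10, 0.3, (40, 2))])
cx, cy = kmeans(X, 2), kmeans(Y, 2)

# Bridge: a few paired (x, y) examples determine which Y-cluster each
# X-cluster maps to -- the sparse, interpretable alignment.
pairs = [(np.array([0.1, 0.0]), np.array([9.9, 10.1])),
         (np.array([5.1, 4.9]), np.array([-9.8, -10.2]))]
bridge = {}
for x, y in pairs:
    i = int(((cx - x) ** 2).sum(1).argmin())
    j = int(((cy - y) ** 2).sum(1).argmin())
    bridge[i] = j

def predict(x):
    """Nearest input cluster -> centroid of the linked output cluster."""
    i = int(((cx - x) ** 2).sum(1).argmin())
    return cy[bridge[i]]

print(predict(np.array([0.2, -0.1])))  # close to the (10, 10) centroid
```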

Updated: 2025-10-09 04:57:59

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07182v2

IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction

Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform the implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms $\pi_0$, achieving 18% higher success rates with direct instructions and 28% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with a 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.

Updated: 2025-10-09 04:49:46

Categories: cs.RO,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.07778v1

Drift No More? Context Equilibria in Multi-Turn LLM Interactions

Large Language Models (LLMs) excel at single-turn tasks such as instruction following and summarization, yet real-world deployments require sustained multi-turn interactions where user goals and conversational context persist and evolve. A recurring challenge in this setting is context drift: the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. In this work, we present a study of context drift in multi-turn interactions and propose a simple dynamical framework to interpret its behavior. We formalize drift as the turn-wise KL divergence between the token-level predictive distributions of the test model and a goal-consistent reference model, and propose a recurrence model that interprets its evolution as a bounded stochastic process with restoring forces and controllable interventions. We instantiate this framework in both synthetic long-horizon rewriting tasks and realistic user-agent simulations such as in $\tau$-Bench, measuring drift for several open-weight LLMs that are used as user simulators. Our experiments consistently reveal stable, noise-limited equilibria rather than runaway degradation, and demonstrate that simple reminder interventions reliably reduce divergence in line with theoretical predictions. Together, these results suggest that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay, providing a foundation for studying and mitigating context drift in extended interactions.
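
The drift measure is concrete: at each turn, compute the KL divergence between the test model's token-level distribution and the goal-consistent reference distribution. A toy sketch with hand-picked distributions (the drift-toward-uniform schedule and the turn-3 "reminder" are invented purely to illustrate the restoring-force picture):

```python
import numpy as np

def turn_kl(p, q, eps=1e-12):
    """KL(p || q) between two token-level predictive distributions
    for a single turn (both are probability vectors over the vocab)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Illustrative drift curve: the test model starts at the goal-consistent
# reference, drifts toward uniform over turns 1-2, then a reminder
# intervention at turn 3 pulls it back toward the reference.
ref = np.array([0.7, 0.2, 0.1])
uniform = np.ones(3) / 3
mix = [0.0, 0.3, 0.6, 0.1]  # per-turn mixing weight toward uniform
drift = [turn_kl((1 - m) * ref + m * uniform, ref) for m in mix]
print(drift)  # rises over turns 0-2, drops after the turn-3 reminder
```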

Updated: 2025-10-09 04:48:49

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07777v1

Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection

Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.
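
The core idea, that relation strength between instances directly drives multi-label prediction, can be sketched with cosine similarity standing in for the learned instance-relation network (the embeddings, labels, and threshold below are invented for illustration):

```python
import numpy as np

def relation_strength(support, query):
    """Cosine similarity as a stand-in for the learned instance relation."""
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    return q @ s.T

# Support set: 3 labeled utterances over 2 intents (multi-label rows).
support_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
support_labels = np.array([[1, 0], [0, 1], [1, 1]])

# A query shares an intent with a support instance iff their relation
# strength is high; the threshold tau is the multi-label decision rule.
query_emb = np.array([[0.9, 0.1]])
strength = relation_strength(support_emb, query_emb)[0]
tau = 0.9
pred = [bool((strength * support_labels[:, j]).max() >= tau)
        for j in range(support_labels.shape[1])]
print(pred)  # query matches intent 0 but not intent 1
```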

Updated: 2025-10-09 04:47:06

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.07776v1

Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to multi-turn distribution shifting. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy updates, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool-call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of environment feedback with an upward entropy trend. As training progresses, self-imitation is strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift. Regularizations, such as clipping tokens with high covariance between probability and advantage, are introduced into trajectory-level entropy control to curb over-confidence.
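
Two of the ingredients named above, a replay buffer that keeps only self-generated promising trajectories and an auxiliary tool-call reward annealed over the curriculum, can be sketched as follows (the class, threshold, and annealing schedule are illustrative stand-ins, not the paper's implementation):

```python
import random

random.seed(0)

class SILBuffer:
    """Replay buffer keeping only promising self-generated trajectories
    (return above a threshold), as in vanilla self-imitation learning."""
    def __init__(self, threshold=0.5, capacity=100):
        self.threshold, self.capacity, self.data = threshold, capacity, []

    def add(self, trajectory, ret):
        if ret >= self.threshold:
            self.data.append((trajectory, ret))
            self.data = self.data[-self.capacity:]  # drop oldest entries

    def sample(self):
        return random.choice(self.data) if self.data else None

def intrinsic_bonus(step, n_tool_calls, anneal_steps=1000):
    """Curriculum stage 1: an auxiliary tool-call reward encourages
    skill-level exploration, annealed to zero so that later training
    relies on self-imitation instead."""
    weight = max(0.0, 1.0 - step / anneal_steps)
    return weight * 0.1 * n_tool_calls

buf = SILBuffer()
buf.add(["call_search", "answer"], ret=1.0)  # kept: promising trajectory
buf.add(["answer"], ret=0.0)                 # dropped: unpromising
print(len(buf.data), intrinsic_bonus(0, 2), intrinsic_bonus(2000, 2))
```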

Updated: 2025-10-09 04:27:07

Categories: cs.LG,cs.AI,cs.CL,cs.CV,cs.MA

Download: http://arxiv.org/abs/2509.22601v2

Trajectory Conditioned Cross-embodiment Skill Transfer

Learning manipulation skills from human demonstration videos presents a promising yet challenging problem, primarily due to the significant embodiment gap between the human body and robot manipulators. Existing methods rely on paired datasets or hand-crafted rewards, which limit scalability and generalization. We propose TrajSkill, a framework for Trajectory Conditioned Cross-embodiment Skill Transfer, enabling robots to acquire manipulation skills directly from human demonstration videos. Our key insight is to represent human motions as sparse optical flow trajectories, which serve as embodiment-agnostic motion cues by removing morphological variations while preserving essential dynamics. Conditioned on these trajectories together with visual and textual inputs, TrajSkill jointly synthesizes temporally consistent robot manipulation videos and translates them into executable actions, thereby achieving cross-embodiment skill transfer. Extensive experiments are conducted, and the results on simulation data (MetaWorld) show that TrajSkill reduces FVD by 39.6% and KVD by 36.6% compared with the state-of-the-art, and improves the cross-embodiment success rate by up to 16.7%. Real-robot experiments in kitchen manipulation tasks further validate the effectiveness of our approach, demonstrating practical human-to-robot skill transfer across embodiments.

Updated: 2025-10-09 04:26:06

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2510.07773v1

An approach for systematic decomposition of complex LLM tasks

Large Language Models (LLMs) suffer from reliability issues on complex tasks, as existing decomposition methods are heuristic and rely on agents or manual decomposition. This work introduces a novel, systematic decomposition framework that we call Analysis of CONstraint-Induced Complexity (ACONIC), which models the task as a constraint problem and leverages formal complexity measures to guide decomposition. On combinatorial (SATBench) and LLM database querying tasks (Spider), we find that by decomposing tasks according to the measure of complexity, agents can perform considerably better (by 10-40 percentage points).

Updated: 2025-10-09 04:24:47

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07772v1

From Correction to Mastery: Reinforced Distillation of Large Language Model Agents

Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use, but typically depend on ultra-large, costly backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student can cause compounding errors. We propose SCoRe, a student-centered framework in which the student generates training trajectories and the teacher corrects only the earliest error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix preceding the earliest error, with target rewards assigned at that step. This design encourages autonomous problem-solving beyond imitation and enhances training stability. On 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe matches the agentic performance of a 72B-parameter teacher.
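
The student-centered loop is simple to sketch: verify the student's steps in order, keep the verified prefix, and splice in a teacher correction at the earliest error; short-horizon RL then restarts from that prefix. The `verify` lambda and the `corrected(...)` string below are hypothetical stand-ins for the paper's verifier and teacher models:

```python
def correct_earliest_error(student_steps, verify):
    """SCoRe-style supervision sketch: return the verified prefix, the
    index of the earliest failing step, and the trajectory with only
    that step replaced by a teacher correction."""
    for i, step in enumerate(student_steps):
        if not verify(step):
            teacher_fix = f"corrected({step})"  # hypothetical teacher output
            return student_steps[:i], i, student_steps[:i] + [teacher_fix]
    return student_steps, None, student_steps   # trajectory fully correct

# Toy trajectory: step 2 is the earliest error; RL later restarts from
# the verified prefix (steps 0-1) with target rewards assigned there.
steps = ["parse task", "call tool", "bad arithmetic", "answer"]
prefix, err_idx, training_traj = correct_earliest_error(
    steps, verify=lambda s: s != "bad arithmetic")
print(err_idx, prefix)
```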

Updated: 2025-10-09 04:22:47

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.14257v2

Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models

In this paper, we investigate knowledge forgetting in large language models with a focus on its generalisation, ensuring that models forget not only specific training samples but also related implicit knowledge. To this end, we begin by identifying a broader unlearning scope that includes both target data and logically associated samples, including rephrased, subject-replaced, relation-reversed, and one-hop reasoned data. We then conduct a rigorous evaluation of 15 state-of-the-art methods across three datasets, revealing that unlearned models still recall paraphrased answers and retain target facts in their intermediate layers. This motivates us to take a preliminary step toward more generalised implicit knowledge forgetting by proposing PerMU, a novel probability perturbation-based unlearning paradigm. PerMU simulates adversarial unlearning samples to eliminate fact-related tokens from the logit distribution, collectively reducing the probabilities of all answer-associated tokens. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE, using models ranging from 1.3B to 13B in scale. The results demonstrate that PerMU delivers up to a 50.40% improvement in unlearning vanilla target data while maintaining a 40.73% boost in forgetting implicit knowledge. Our code can be found in https://github.com/MaybeLizzy/PERMU.
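
A minimal sketch of the probability-perturbation idea: pushing down the logits of fact-related tokens collectively reduces the probability of answer-associated continuations after renormalization (the toy vocabulary, logits, and penalty value are invented; PerMU's actual perturbation is learned from simulated adversarial samples):

```python
import numpy as np

def perturb_unlearn(logits, fact_token_ids, penalty=5.0):
    """Push down the logits of fact-related tokens, then renormalize,
    lowering the probability mass on every answer-associated token."""
    z = np.array(logits, dtype=float)
    z[fact_token_ids] -= penalty
    p = np.exp(z - z.max())
    return p / p.sum()

vocab = ["Paris", "London", "the", "capital"]  # toy vocabulary
logits = np.array([4.0, 1.0, 2.0, 2.5])
before = np.exp(logits - logits.max())
before /= before.sum()
after = perturb_unlearn(logits, fact_token_ids=[0])  # "Paris" = fact token

print(f"P({vocab[0]}) before={before[0]:.3f} after={after[0]:.3f}")
```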

Updated: 2025-10-09 04:22:34

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2502.19982v3

Solving Time-Fractional Partial Integro-Differential Equations Using Tensor Neural Network

In this paper, we propose a novel machine learning method based on adaptive tensor neural network subspace to solve linear time-fractional diffusion-wave equations and nonlinear time-fractional partial integro-differential equations. In this framework, the tensor neural network and Gauss-Jacobi quadrature are effectively combined to construct a universal numerical scheme for the temporal Caputo derivative with orders spanning $ (0,1)$ and $(1,2)$. Specifically, in order to effectively utilize Gauss-Jacobi quadrature to discretize Caputo derivatives, we design the tensor neural network function multiplied by the function $t^{\mu}$ where the power $\mu$ is selected according to the parameters of the equations at hand. Finally, some numerical examples are provided to validate the efficiency and accuracy of the proposed tensor neural network based machine learning method.
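
For $\alpha \in (0,1)$ the Caputo derivative is $D^\alpha f(t) = \frac{1}{\Gamma(1-\alpha)} \int_0^t f'(s)(t-s)^{-\alpha}\,ds$; substituting $s = t(x+1)/2$ turns the singular kernel into the Jacobi weight $(1-x)^{-\alpha}$ on $[-1,1]$, which is exactly what Gauss-Jacobi quadrature integrates. A minimal sketch of this discretization (just the quadrature step, not the paper's tensor neural network scheme), checked against the closed form $D^\alpha t^2 = \frac{\Gamma(3)}{\Gamma(3-\alpha)} t^{2-\alpha}$:

```python
import numpy as np
from math import gamma
from scipy.special import roots_jacobi

def caputo(fprime, t, alpha, n=8):
    """Caputo derivative of order alpha in (0,1) via Gauss-Jacobi
    quadrature with weight (1-x)^(-alpha) on [-1, 1]."""
    x, w = roots_jacobi(n, -alpha, 0.0)
    s = t * (x + 1) / 2
    return (t / 2) ** (1 - alpha) / gamma(1 - alpha) * np.sum(w * fprime(s))

alpha, t = 0.5, 2.0
approx = caputo(lambda s: 2 * s, t, alpha)          # f(t) = t^2
exact = gamma(3) / gamma(3 - alpha) * t ** (2 - alpha)
print(approx, exact)  # agree to machine precision: f' is a polynomial
```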

Updated: 2025-10-09 04:11:44

Categories: cs.LG

Download: http://arxiv.org/abs/2504.01440v3

ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific tools increases.

Updated: 2025-10-09 04:11:16

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07768v1

FedLAM: Low-latency Wireless Federated Learning via Layer-wise Adaptive Modulation

In wireless federated learning (FL), the clients need to transmit the high-dimensional deep neural network (DNN) parameters through bandwidth-limited channels, which causes the communication latency issue. In this paper, we propose a layer-wise adaptive modulation scheme to save the communication latency. Unlike existing works which assign the same modulation level for all DNN layers, we consider the layers' importance which provides more freedom to save the latency. The proposed scheme can automatically decide the optimal modulation levels for different DNN layers. Experimental results show that the proposed scheme can save up to 73.9% of communication latency compared with the existing schemes.
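
The latency mechanics behind the scheme are straightforward: an M-ary constellation carries log2(M) bits per symbol, so a layer with b bits occupies b/log2(M) symbol periods. A toy sketch of a layer-wise assignment (the layer sizes, importance scores, symbol rate, and the importance-to-modulation rule are invented stand-ins for the paper's automatic optimization):

```python
import math

# Per-layer payload in bits and a normalized importance score (assumed
# given, e.g. from a sensitivity analysis).
layers = [("conv1", 4.0e5, 0.9), ("conv2", 1.2e6, 0.6), ("fc", 8.0e6, 0.2)]
symbol_rate = 1.0e6  # symbols per second on the wireless channel

def modulation_order(importance):
    """Illustrative rule: important layers get robust low-order modulation
    (QPSK); unimportant layers get high-order modulation (64-QAM)."""
    return 4 if importance > 0.8 else 16 if importance > 0.4 else 64

def latency(bits, M):
    return bits / math.log2(M) / symbol_rate  # seconds on air

adaptive = sum(latency(b, modulation_order(imp)) for _, b, imp in layers)
uniform = sum(latency(b, 4) for _, b, _ in layers)  # QPSK everywhere
print(f"uniform {uniform:.3f}s vs layer-wise adaptive {adaptive:.3f}s")
```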

Updated: 2025-10-09 04:07:30

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07766v1

OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows

We present OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation. Unlike autoregressive models that enforce rigid causal ordering between text and image generation, OneFlow combines an insertion-based Edit Flow for discrete text tokens with Flow Matching for image latents. OneFlow enables concurrent text-image synthesis with hierarchical sampling that prioritizes content over grammar. Through controlled experiments across model sizes from 1B to 8B, we demonstrate that OneFlow outperforms autoregressive baselines on both generation and understanding tasks while using up to 50% fewer training FLOPs. OneFlow surpasses both autoregressive and diffusion-based approaches while unlocking new capabilities for concurrent generation, iterative refinement, and natural reasoning-like generation.
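
The Flow Matching half of the model can be sketched in isolation: sample noise $x_0$ and data $x_1$, interpolate $x_t = (1-t)x_0 + t x_1$, and regress a velocity model toward the displacement $x_1 - x_0$. Below, a closed-form linear least-squares fit stands in for the network (purely illustrative 1-D toy; OneFlow's latents, architecture, and the discrete Edit Flow branch are not modeled here):

```python
import numpy as np

rng = np.random.default_rng(1)

# 1-D "data" latents and Gaussian noise samples.
x1 = rng.normal(3.0, 0.1, 2000)   # data distribution
x0 = rng.normal(0.0, 1.0, 2000)   # noise
t = rng.uniform(0.0, 1.0, 2000)

xt = (1 - t) * x0 + t * x1        # interpolant
target = x1 - x0                  # conditional velocity target

# Fit v(x, t) = a*x + b*t + c by least squares (stand-in for a network).
A = np.stack([xt, t, np.ones_like(t)], axis=1)
coef, *_ = np.linalg.lstsq(A, target, rcond=None)

mse = np.mean((A @ coef - target) ** 2)
print(coef, mse)  # the fitted field transports noise toward the data mean
```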

Updated: 2025-10-09 04:05:49

Categories: cs.AI

Download: http://arxiv.org/abs/2510.03506v2

From Noisy to Native: LLM-driven Graph Restoration for Test-Time Graph Domain Adaptation

Graph domain adaptation (GDA) has achieved great attention due to its effectiveness in addressing the domain shift between train and test data. A significant bottleneck in existing graph domain adaptation methods is their reliance on source-domain data, which is often unavailable due to privacy or security concerns. This limitation has driven the development of Test-Time Graph Domain Adaptation (TT-GDA), which aims to transfer knowledge without accessing the source examples. Inspired by the generative power of large language models (LLMs), we introduce a novel framework that reframes TT-GDA as a generative graph restoration problem, "restoring the target graph to its pristine, source-domain-like state". There are two key challenges: (1) We need to construct a reasonable graph restoration process and design an effective encoding scheme that an LLM can understand, bridging the modality gap. (2) We need to devise a mechanism to ensure the restored graph acquires the intrinsic features of the source domain, even without access to the source data. To ensure the effectiveness of graph restoration, we propose GRAIL, that restores the target graph into a state that is well-aligned with the source domain. Specifically, we first compress the node representations into compact latent features and then use a graph diffusion process to model the graph restoration process. Then a quantization module encodes the restored features into discrete tokens. Building on this, an LLM is fine-tuned as a generative restorer to transform a "noisy" target graph into a "native" one. To further improve restoration quality, we introduce a reinforcement learning process guided by specialized alignment and confidence rewards. Extensive experiments demonstrate the effectiveness of our approach across various datasets.

Updated: 2025-10-09 04:00:42

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07762v1

Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation

This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems, aiming to reduce dependence on costly human annotations. We address the gap between retrieval relevance and generative utility by employing LLMs to annotate document utility. To effectively utilize multiple positive samples per query, we introduce a novel loss that maximizes their summed marginal likelihood. Using the Qwen-2.5-32B model, we annotate utility on the MS MARCO dataset and conduct retrieval experiments on MS MARCO and BEIR, as well as RAG experiments on MS MARCO QA, NQ, and HotpotQA. Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics. Furthermore, combining LLM annotations with just 20% of human labels achieves performance comparable to using full human annotations. Our study offers a comprehensive approach to utilizing LLM annotations for initializing QA systems on new corpora.
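
The proposed loss has a compact form: with positive set $P$ for a query, $L = -\log \sum_{d \in P} \mathrm{softmax}(s)_d$, i.e. the negative log of the *summed* probability mass on all positive documents. A small sketch (the scores are invented):

```python
import numpy as np

def multi_positive_loss(scores, positive_ids):
    """Negative log of the summed softmax probability over all
    positives: -log sum_{d in P} softmax(scores)_d."""
    z = scores - scores.max()          # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[positive_ids].sum())

scores = np.array([2.0, 1.5, 0.1, -1.0])  # retriever scores for 4 docs
loss_two_pos = multi_positive_loss(scores, [0, 1])
loss_one_pos = multi_positive_loss(scores, [0])
print(loss_two_pos, loss_one_pos)  # crediting both positives lowers the loss
```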

Updated: 2025-10-09 04:00:21

Categories: cs.IR,cs.AI,cs.CL

Download: http://arxiv.org/abs/2504.05220v5

A Unified Multi-Task Learning Framework for Generative Auto-Bidding with Validation-Aligned Optimization

In online advertising, heterogeneous advertiser requirements give rise to numerous customized bidding tasks that are typically optimized independently, resulting in extensive computation and limited data efficiency. Multi-task learning offers a principled framework to train these tasks jointly through shared representations. However, existing multi-task optimization strategies are primarily guided by training dynamics and often generalize poorly in volatile bidding environments. To this end, we present Validation-Aligned Multi-task Optimization (VAMO), which adaptively assigns task weights based on the alignment between per-task training gradients and a held-out validation gradient, thereby steering updates toward validation improvement and better matching deployment objectives. We further equip the framework with a periodicity-aware temporal module and couple it with an advanced generative auto-bidding backbone to enhance cross-task transfer of seasonal structure and strengthen bidding performance. Meanwhile, we provide theoretical insights into the proposed method, e.g., convergence guarantee and alignment analysis. Extensive experiments on both simulated and large-scale real-world advertising systems consistently demonstrate significant improvements over typical baselines, illuminating the effectiveness of the proposed approach.
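
The core weighting rule, aligning per-task training gradients with a held-out validation gradient, can be sketched in a few lines. Mapping cosine alignments through a softmax is an illustrative assumption; the paper's exact weighting scheme may differ:

```python
import numpy as np

def validation_aligned_weights(task_grads, val_grad, temperature=1.0):
    """Weight each task by how well its training gradient aligns with a
    held-out validation gradient (softmax over cosine similarities).

    task_grads: (n_tasks, dim) per-task gradients.
    val_grad: (dim,) validation gradient.
    """
    sims = np.array([
        g @ val_grad / (np.linalg.norm(g) * np.linalg.norm(val_grad) + 1e-12)
        for g in task_grads
    ])
    # Softmax turns alignments into positive weights that sum to one.
    w = np.exp(sims / temperature)
    return w / w.sum()

task_grads = np.array([[1.0, 0.0],    # aligned with validation
                       [0.0, 1.0],    # orthogonal
                       [-1.0, 0.0]])  # opposed
val_grad = np.array([1.0, 0.0])
weights = validation_aligned_weights(task_grads, val_grad)
```

Tasks whose gradients point toward validation improvement get up-weighted; tasks that would hurt the validation objective are suppressed.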

Updated: 2025-10-09 03:59:51

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.07760v1

Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization

Sharpness (of the loss minima) is a common measure to investigate the generalization of neural networks. Intuitively speaking, the flatter the landscape near the minima is, the better generalization might be. Unfortunately, the correlation between many existing sharpness measures and generalization is usually not strong, sometimes even weak. To close the gap between the intuition and the reality, we propose a novel sharpness measure, i.e., Rényi sharpness, which is defined as the negative Rényi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we realize that uniform (identical) eigenvalues of the loss Hessian are most desirable (while keeping the sum constant) to achieve good generalization; 2) we employ the Rényi entropy to concisely characterize the extent of the spread of the eigenvalues of the loss Hessian. Normally, the larger the spread, the smaller the (Rényi) entropy. To rigorously establish the relationship between generalization and (Rényi) sharpness, we provide several generalization bounds in terms of Rényi sharpness, by taking advantage of the reparametrization invariance property of Rényi sharpness, as well as the trick of translating the data discrepancy into a weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (specifically, Kendall rank correlation) between Rényi sharpness and generalization. Moreover, we propose to use a variant of Rényi sharpness as a regularizer during training, i.e., Rényi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worth noting that the test accuracy gain of our proposed RSAM method can be as high as nearly 2.5%, compared against the classical SAM method.
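
The core quantity, the negative Rényi entropy of the normalized Hessian spectrum, can be computed in a few lines. A minimal NumPy sketch, assuming the order-2 Rényi entropy and a positive spectrum normalized to sum to one (the paper's exact normalization is not reproduced here):

```python
import numpy as np

def renyi_sharpness(eigvals, alpha=2.0):
    """Negative Rényi entropy of the (normalized) Hessian eigenvalue spectrum.

    Uniform eigenvalues give maximal entropy, hence minimal sharpness;
    a spread-out spectrum gives lower entropy and higher sharpness.
    """
    p = np.asarray(eigvals, dtype=float)
    p = p / p.sum()                                       # normalize to a distribution
    h_alpha = np.log((p ** alpha).sum()) / (1.0 - alpha)  # Rényi entropy of order alpha
    return -h_alpha

# Same eigenvalue sum (4.0), different spread:
uniform = renyi_sharpness([1.0, 1.0, 1.0, 1.0])
spread = renyi_sharpness([3.7, 0.1, 0.1, 0.1])
```

With the sum held fixed, the concentrated spectrum yields a larger sharpness value than the uniform one, matching the intuition stated above.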

Updated: 2025-10-09 03:58:21

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07758v1

Is Supervised Learning Really That Different from Unsupervised?

We demonstrate how supervised learning can be decomposed into a two-stage procedure, where (1) all model parameters are selected in an unsupervised manner, and (2) the outputs y are added to the model, without changing the parameter values. This is achieved by a new model selection criterion that, in contrast to cross-validation, can be used also without access to y. For linear ridge regression, we bound the asymptotic out-of-sample risk of our method in terms of the optimal asymptotic risk. We also demonstrate on real and synthetic data that versions of linear and kernel ridge regression, smoothing splines, and neural networks, which are trained without access to y, perform similarly to their standard y-based counterparts. Hence, our results suggest that the difference between supervised and unsupervised learning is less fundamental than it may appear.
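
For linear ridge regression, the two-stage decomposition is easy to see: the smoother matrix S = X(XᵀX + λI)⁻¹Xᵀ depends only on X, so everything except the final linear application of y can be fixed without seeing y. A minimal sketch, with λ fixed by assumption rather than by the paper's y-free selection criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + 0.1 * rng.normal(size=50)

# Stage 1 (unsupervised): the ridge smoother matrix is built from X alone.
# The paper's y-free criterion for choosing lam is not reproduced here;
# we simply fix lam as an assumption.
lam = 1.0
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(3), X.T)

# Stage 2 (supervised): y enters only linearly, with no parameter re-tuning.
y_hat = S @ y
```

Nothing about S changes when y arrives, which is the sense in which all "model parameters" here are selected unsupervised.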

Updated: 2025-10-09 03:52:05

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2505.11006v4

FedBook: A Unified Federated Graph Foundation Codebook with Intra-domain and Inter-domain Knowledge Modeling

Foundation models have shown remarkable cross-domain generalization in language and vision, inspiring the development of graph foundation models (GFMs). However, existing GFMs typically assume centralized access to multi-domain graphs, which is often infeasible due to privacy and institutional constraints. Federated Graph Foundation Models (FedGFMs) address this limitation, but their effectiveness fundamentally hinges on constructing a robust global codebook that achieves intra-domain coherence by consolidating mutually reinforcing semantics within each domain, while also maintaining inter-domain diversity by retaining heterogeneous knowledge across domains. To this end, we propose FedBook, a unified federated graph foundation codebook that systematically aggregates clients' local codebooks during server-side federated pre-training. FedBook follows a two-phase process: (1) Intra-domain Collaboration, where low-frequency tokens are refined by referencing more semantically reliable high-frequency tokens across clients to enhance domain-specific coherence; and (2) Inter-domain Integration, where client contributions are weighted by the semantic distinctiveness of their codebooks during the aggregation of the global GFM, thereby preserving cross-domain diversity. Extensive experiments on 8 benchmarks across multiple domains and tasks demonstrate that FedBook consistently outperforms 21 baselines, including isolated supervised learning, FL/FGL, federated adaptations of centralized GFMs, and FedGFM techniques.
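
The inter-domain integration step, weighting client codebooks by semantic distinctiveness during aggregation, can be sketched as follows. Using mean centroid distance as the distinctiveness measure is an assumption; the paper's exact measure is not reproduced here:

```python
import numpy as np

def aggregate_codebooks(codebooks):
    """Server-side aggregation: weight each client codebook by how
    distinct its centroid is from the other clients' centroids (a
    stand-in for the paper's semantic-distinctiveness measure)."""
    centroids = np.array([cb.mean(axis=0) for cb in codebooks])
    dist = np.array([
        np.mean([np.linalg.norm(c - o) for j, o in enumerate(centroids) if j != i])
        for i, c in enumerate(centroids)
    ])
    w = dist / dist.sum()
    # Global codebook: distinctiveness-weighted average of client codebooks.
    return sum(wi * cb for wi, cb in zip(w, codebooks)), w

rng = np.random.default_rng(0)
# Three clients, each with an 8-entry codebook of dimension 4.
codebooks = [rng.normal(loc=i, size=(8, 4)) for i in range(3)]
global_cb, weights = aggregate_codebooks(codebooks)
```

Clients carrying heterogeneous, domain-specific knowledge contribute more to the global codebook, preserving cross-domain diversity.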

Updated: 2025-10-09 03:50:30

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07755v1

Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering

Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency becomes an effective technique for mitigating the communication overhead. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives. Nevertheless, current designs fail to simultaneously optimize for all of those features. To address the issue, we propose FlashOverlap, which utilizes a novel signaling mechanism: when part of the output finishes, the computation kernel sends a signal to trigger the communication of that part, while continuing the computation of the remaining part (interference-free computation). Consequently, the communication of the finished part and the computation of the remaining part can be overlapped. On top of the signaling mechanism, FlashOverlap comprises two key components: (1) the determination of the signaling timing to boost the overlap efficiency (tile-wise overlapping), and (2) a pre-communication reordering to create the contiguous address for finished data, enabling communication by simply calling NCCL APIs (communication agnosticism), and a post-communication reordering to correct the data order. Experiments show that FlashOverlap achieves up to 1.65x speedup through overlap, outperforming existing works in most cases. Code is available at https://github.com/infinigence/FlashOverlap.
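
The signaling mechanism can be illustrated with a toy CPU-thread simulation: the compute worker signals each finished tile through a queue so its communication starts while later tiles are still being computed. This is only a sketch of the overlap idea, not the CUDA/NCCL implementation (the sleeps stand in for kernel and collective time):

```python
import queue
import threading
import time

def compute(tiles, done_q):
    """Compute tiles one by one; signal each finished tile immediately
    so its communication can overlap the remaining computation."""
    for t in tiles:
        time.sleep(0.01)     # stand-in for a GEMM tile
        done_q.put(t)        # the "signal": this tile's output is ready
    done_q.put(None)         # sentinel: no more tiles

def communicate(done_q, sent):
    """Send tiles as soon as they are signaled (stand-in for NCCL calls)."""
    while True:
        t = done_q.get()
        if t is None:
            break
        time.sleep(0.01)     # stand-in for the collective
        sent.append(t)

done_q, sent = queue.Queue(), []
workers = [
    threading.Thread(target=compute, args=(list(range(4)), done_q)),
    threading.Thread(target=communicate, args=(done_q, sent)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because the communication of tile k runs concurrently with the computation of tile k+1, total wall time approaches max(compute, communicate) rather than their sum.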

Updated: 2025-10-09 03:47:25

Categories: cs.DC,cs.CL,cs.LG

Download: http://arxiv.org/abs/2504.19519v2

MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling

Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, a multi-agent collaborative framework designed to assist in long-sequence video storytelling by efficiently translating ideas into visual narratives. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle -- Explore, Examine, and Enhance -- to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high-quality, complete long-sequence videos. To the best of our knowledge, MAViS is the only framework that provides multimodal design output -- videos with narratives and background music.

Updated: 2025-10-09 03:46:23

Categories: cs.CV,cs.AI,cs.MA

Download: http://arxiv.org/abs/2508.08487v4

From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes

AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters (prepared in accordance with the HIPAA safe harbor standard) from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms a baseline approach in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist's robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, our checklist offers a practical tool for flagging notes that may fall short of our defined quality standards.

Updated: 2025-10-09 03:45:01

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.17717v2

When Robustness Meets Conservativeness: Conformalized Uncertainty Calibration for Balanced Decision Making

Robust optimization safeguards decisions against uncertainty by optimizing against worst-case scenarios, yet its effectiveness hinges on a prespecified robustness level that is often chosen ad hoc, leading to either insufficient protection or overly conservative and costly solutions. Recent approaches using conformal prediction construct data-driven uncertainty sets with finite-sample coverage guarantees, but they still fix coverage targets a priori and offer little guidance for selecting robustness levels. We propose a new framework that provides distribution-free, finite-sample guarantees on both miscoverage and regret for any family of robust predict-then-optimize policies. Our method constructs valid estimators that trace out the miscoverage-regret Pareto frontier, enabling decision-makers to reliably evaluate and calibrate robustness levels according to their cost-risk preferences. The framework is simple to implement, broadly applicable across classical optimization formulations, and achieves sharper finite-sample performance than existing approaches. These results offer the first principled data-driven methodology for guiding robustness selection and empower practitioners to balance robustness and conservativeness in high-stakes decision-making.
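
Sweeping the robustness (coverage) level to trace a coverage-versus-conservativeness curve can be illustrated with plain split conformal prediction. This sketch uses synthetic residuals and does not reproduce the paper's regret estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
# Absolute residuals of some predictive model (synthetic stand-ins).
cal_resid = np.abs(rng.normal(size=500))    # calibration split
test_resid = np.abs(rng.normal(size=500))   # evaluation split

def conformal_radius(residuals, alpha):
    """Split-conformal uncertainty-set radius: the (1 - alpha) empirical
    quantile with the standard finite-sample correction."""
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(residuals)[min(k, n) - 1]

# Sweep robustness levels; each point pairs a set size (conservativeness
# proxy) with its realized coverage.
frontier = []
for alpha in [0.01, 0.05, 0.1, 0.2]:
    r = conformal_radius(cal_resid, alpha)
    coverage = float(np.mean(test_resid <= r))
    frontier.append((alpha, r, coverage))
```

Plotting radius against realized coverage over the sweep gives the kind of trade-off curve a decision-maker can read off to pick a robustness level matching their cost-risk preferences.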

Updated: 2025-10-09 03:38:17

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2510.07750v1

Haibu Mathematical-Medical Intelligent Agent: Enhancing Large Language Model Reliability in Medical Tasks via Verifiable Reasoning Chains

Large Language Models (LLMs) show promise in medicine but are prone to factual and logical errors, which is unacceptable in this high-stakes field. To address this, we introduce the "Haibu Mathematical-Medical Intelligent Agent" (MMIA), an LLM-driven architecture that ensures reliability through a formally verifiable reasoning process. MMIA recursively breaks down complex medical tasks into atomic, evidence-based steps. This entire reasoning chain is then automatically audited for logical coherence and evidence traceability, similar to theorem proving. A key innovation is MMIA's "bootstrapping" mode, which stores validated reasoning chains as "theorems." Subsequent tasks can then be efficiently solved using Retrieval-Augmented Generation (RAG), shifting from costly first-principles reasoning to a low-cost verification model. We validated MMIA across four healthcare administration domains, including DRG/DIP audits and medical insurance adjudication, using expert-validated benchmarks. Results showed MMIA achieved an error detection rate exceeding 98% with a false positive rate below 1%, significantly outperforming baseline LLMs. Furthermore, the RAG matching mode is projected to reduce average processing costs by approximately 85% as the knowledge base matures. In conclusion, MMIA's verifiable reasoning framework is a significant step toward creating trustworthy, transparent, and cost-effective AI systems, making LLM technology viable for critical applications in medicine.
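
The "bootstrapping" mode, storing validated chains as theorems and answering later tasks by retrieval plus cheap verification, can be sketched with a toy similarity-based store. The bag-of-words cosine, the 0.6 threshold, and the `TheoremStore` API are all illustrative assumptions:

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two task descriptions."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

class TheoremStore:
    """Validated reasoning chains stored as reusable 'theorems'."""
    def __init__(self, threshold=0.6):
        self.theorems = []          # (task_text, validated_chain)
        self.threshold = threshold

    def add(self, task, chain):
        self.theorems.append((task, chain))

    def solve(self, task):
        # RAG path: reuse the closest validated chain and merely verify it,
        # instead of re-deriving from first principles.
        best = max(self.theorems, key=lambda t: cosine(task, t[0]), default=None)
        if best and cosine(task, best[0]) >= self.threshold:
            return ("verify", best[1])
        return ("derive", None)

store = TheoremStore()
store.add("audit DRG coding for appendectomy claim", "chain-1")
mode, chain = store.solve("audit DRG coding for appendectomy claim 2024")
```

As the store fills up, more tasks take the cheap "verify" path, which is the mechanism behind the projected cost reduction.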

Updated: 2025-10-09 03:35:37

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07748v1

BRIGHT: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response

Disaster events occur around the world and cause significant damage to human life and property. Earth observation (EO) data enables rapid and comprehensive building damage assessment (BDA), an essential capability in the aftermath of a disaster to reduce human casualties and to inform disaster relief efforts. Recent research focuses on the development of AI models to achieve accurate mapping of unseen disaster events, mostly using optical EO data. However, solutions based on optical data are limited to clear skies and daylight hours, preventing a prompt response to disasters. Integrating multimodal (MM) EO data, particularly the combination of optical and SAR imagery, makes it possible to provide all-weather, day-and-night disaster responses. Despite this potential, the development of robust multimodal AI models has been constrained by the lack of suitable benchmark datasets. In this paper, we present a BDA dataset using veRy-hIGH-resoluTion optical and SAR imagery (BRIGHT) to support AI-based all-weather disaster response. To the best of our knowledge, BRIGHT is the first open-access, globally distributed, event-diverse MM dataset specifically curated to support AI-based disaster response. It covers five types of natural disasters and two types of man-made disasters across 14 regions worldwide, with a particular focus on developing countries where external assistance is most needed. The optical and SAR imagery in BRIGHT, with a spatial resolution between 0.3-1 meters, provides detailed representations of individual buildings, making it ideal for precise BDA. In our experiments, we have tested seven advanced AI models trained with our BRIGHT to validate the transferability and robustness. The dataset and code are available at https://github.com/ChenHongruixuan/BRIGHT. BRIGHT also serves as the official dataset for the 2025 IEEE GRSS Data Fusion Contest.

Updated: 2025-10-09 03:34:50

Categories: cs.CV,cs.AI,eess.IV,eess.SP

Download: http://arxiv.org/abs/2501.06019v4

t-SNE Exaggerates Clusters, Provably

Central to the widespread use of t-distributed stochastic neighbor embedding (t-SNE) is the conviction that it produces visualizations whose structure roughly matches that of the input. To the contrary, we prove that (1) the strength of the input clustering, and (2) the extremity of outlier points, cannot be reliably inferred from the t-SNE output. We demonstrate the prevalence of these failure modes in practice as well.

Updated: 2025-10-09 03:34:36

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07746v1

Parallel Test-Time Scaling for Latent Reasoning Models

Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at https://github.com/YRYangang/LatentTTS.
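
The additive-Gaussian-noise sampling strategy, paired with reward-based trajectory selection, can be sketched in a few lines. The toy reward function stands in for the trained LatentRM, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent_trajectories(h, n_samples, sigma=0.1):
    """Additive-Gaussian-noise sampling: perturb a latent reasoning state
    to obtain diverse parallel trajectories in continuous space."""
    return h + sigma * rng.normal(size=(n_samples, h.shape[0]))

def select_best(samples, reward_fn):
    """Stand-in for the latent reward model: score each trajectory
    and keep the highest-scoring one."""
    scores = np.array([reward_fn(s) for s in samples])
    return samples[scores.argmax()], scores.max()

h = np.zeros(8)                                   # current latent state
target = np.full(8, 0.05)                         # toy 'good' direction
reward = lambda s: -np.linalg.norm(s - target)    # toy reward signal

samples = sample_latent_trajectories(h, n_samples=16)
best, best_score = select_best(samples, reward)
```

This replaces token-level voting with score-and-select over continuous states, which is why a reward model, rather than output probabilities, has to supply the aggregation signal.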

Updated: 2025-10-09 03:33:00

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.07745v1

UltraLED: Learning to See Everything in Ultra-High Dynamic Range Scenes

Ultra-high dynamic range (UHDR) scenes exhibit significant exposure disparities between bright and dark regions. Such conditions are commonly encountered in nighttime scenes with light sources. Even with standard exposure settings, a bimodal intensity distribution with boundary peaks often emerges, making it difficult to preserve both highlight and shadow details simultaneously. RGB-based bracketing methods can capture details at both ends using short-long exposure pairs, but are susceptible to misalignment and ghosting artifacts. We found that a short-exposure image already retains sufficient highlight detail. The main challenge of UHDR reconstruction lies in denoising and recovering information in dark regions. In comparison to the RGB images, RAW images, thanks to their higher bit depth and more predictable noise characteristics, offer greater potential for addressing this challenge. This raises a key question: can we learn to see everything in UHDR scenes using only a single short-exposure RAW image? In this study, we rely solely on a single short-exposure frame, which inherently avoids ghosting and motion blur, making it particularly robust in dynamic scenes. To achieve that, we introduce UltraLED, a two-stage framework that performs exposure correction via a ratio map to balance dynamic range, followed by a brightness-aware RAW denoiser to enhance detail recovery in dark regions. To support this setting, we design a 9-stop bracketing pipeline to synthesize realistic UHDR images and contribute a corresponding dataset based on diverse scenes, using only the shortest exposure as input for reconstruction. Extensive experiments show that UltraLED significantly outperforms existing single-frame approaches. Our code and dataset are made publicly available at https://srameo.github.io/projects/ultraled.
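
The ratio-map idea, applying a per-pixel gain that lifts shadows far more than highlights before denoising, can be sketched with a simple analytic gain. The target mean, gain cap, and `ratio_map_correction` helper are assumptions standing in for the learned ratio map:

```python
import numpy as np

def ratio_map_correction(raw, target_mean=0.25, max_gain=64.0):
    """Per-pixel ratio map that brightens dark regions far more than
    bright ones, compressing a UHDR frame into a workable range
    (toy analytic gain; the paper learns this map)."""
    eps = 1e-6
    gain = np.clip(target_mean / (raw + eps), 1.0, max_gain)
    return np.clip(raw * gain, 0.0, 1.0), gain

# Linear RAW intensities from deep shadow to bright highlight.
raw = np.array([0.001, 0.01, 0.2, 0.9])
corrected, gain = ratio_map_correction(raw)
```

After this balancing step, the brightness-aware denoiser only has to recover detail in the amplified (and hence noisier) dark regions; the short-exposure highlights are already intact.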

Updated: 2025-10-09 03:29:39

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.07741v1

Towards Urban Planning AI Agent in the Age of Agentic AI

Generative AI, large language models, and agentic AI have emerged separately from urban planning. However, the convergence between AI and urban planning presents an interesting opportunity towards AI urban planners. Existing studies conceptualize urban planning as a generative AI task, where AI synthesizes land-use configurations under geospatial, social, and human-centric constraints and reshapes automated urban design. We further identify critical gaps in existing generative urban planning studies: 1) the generative structure has to be predefined under strong assumptions: adversarial generator-discriminator, forward and inverse diffusion, and hierarchical zone-POI generative structures are all predefined by humans; 2) they ignore the power of tools developed by domain experts: practicing urban planners have developed various tools, guided by urban theory, for use throughout the planning process, while existing purely neural-network-based generation ignores them. To address these limitations, we outline a future research direction, the agentic urban AI planner, calling for a new synthesis of agentic AI and participatory urbanism.

Updated: 2025-10-09 03:28:38

Categories: cs.AI

Download: http://arxiv.org/abs/2507.14730v4

AppForge: From Assistant to Independent Developer -- Are GPTs Ready for Software Development?

Large language models (LLMs) have demonstrated remarkable capability in function-level code generation tasks. Unlike isolated functions, real-world applications demand reasoning over the entire software system: developers must orchestrate how different components interact, maintain consistency across states over time, and ensure the application behaves correctly within the lifecycle and framework constraints. Yet, no existing benchmark adequately evaluates whether LLMs can bridge this gap and construct entire software systems from scratch. To address this gap, we propose APPFORGE, a benchmark consisting of 101 software development problems drawn from real-world Android apps. Given a natural language specification detailing the app functionality, a language model is tasked with implementing the functionality into an Android app from scratch. Developing an Android app from scratch requires understanding and coordinating app states, lifecycle management, and asynchronous operations, calling for LLMs to generate context-aware, robust, and maintainable code. To construct APPFORGE, we design a multi-agent system to automatically summarize the main functionalities from app documents and navigate the app to synthesize test cases validating the functional correctness of app implementation. Following rigorous manual verification by Android development experts, APPFORGE incorporates the test cases within an automated evaluation framework that enables reproducible assessment without human intervention, making it easily adoptable for future research. Our evaluation on 12 flagship LLMs shows that all evaluated models achieve low effectiveness, with the best-performing model (GPT-5) developing only 18.8% functionally correct applications, highlighting fundamental limitations in current models' ability to handle complex, multi-component software engineering challenges.

Updated: 2025-10-09 03:26:05

Categories: cs.SE, cs.AI

Download: http://arxiv.org/abs/2510.07740v1

MeSH: Memory-as-State-Highways for Recursive Transformers

Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth. However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts. By probing hidden states, we trace this performance gap to two primary bottlenecks: undifferentiated computation, where the core is forced to adopt a similar computational pattern at every iteration, and information overload, where long-lived and transient information must coexist in a single hidden state. To address these issues, we introduce a Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations. Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-1.4B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperform their larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models.

Updated: 2025-10-09 03:23:38

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2510.07739v1

Propagation-Based Vulnerability Impact Assessment for Software Supply Chains

Identifying the impact scope and scale is critical for software supply chain vulnerability assessment. However, existing studies face substantial limitations. First, prior studies either work at coarse package-level granularity, producing many false positives, or fail to accomplish whole-ecosystem vulnerability propagation analysis. Second, although vulnerability assessment indicators like CVSS characterize individual vulnerabilities, no metric exists to specifically quantify the dynamic impact of vulnerability propagation across software supply chains. To address these limitations and enable accurate and comprehensive vulnerability impact assessment, we propose a novel approach: (i) a hierarchical worklist-based algorithm for whole-ecosystem and call-graph-level vulnerability propagation analysis and (ii) the Vulnerability Propagation Scoring System (VPSS), a dynamic metric to quantify the scope and evolution of vulnerability impacts in software supply chains. We implement a prototype of our approach in the Java Maven ecosystem and evaluate it on 100 real-world vulnerabilities. Experimental results demonstrate that our approach enables effective ecosystem-wide vulnerability propagation analysis, and provides a practical, quantitative measure of vulnerability impact through VPSS.
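At its core, worklist-based propagation over a call graph is a reverse-reachability computation. As a rough illustration only (not the paper's algorithm, which operates hierarchically across a whole ecosystem; the graph and function names below are hypothetical), a minimal sketch:

```python
from collections import deque

def propagate(call_graph, vulnerable_roots):
    """Worklist propagation: collect every function transitively
    reachable from a vulnerable function via reverse call edges.

    call_graph: dict mapping a callee to the set of its callers
    vulnerable_roots: initially vulnerable functions
    """
    affected = set(vulnerable_roots)
    worklist = deque(vulnerable_roots)
    while worklist:
        fn = worklist.popleft()
        for caller in call_graph.get(fn, ()):
            if caller not in affected:  # visit each node at most once
                affected.add(caller)
                worklist.append(caller)
    return affected
```

The set of affected callers is what a propagation-aware metric like VPSS would then summarize; the VPSS scoring itself is not shown here.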

Updated: 2025-10-09 03:22:47

Categories: cs.SE, cs.CR

Download: http://arxiv.org/abs/2506.01342v2

ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs

Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations: (1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples (those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations; (2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01). Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.

Updated: 2025-10-09 03:20:13

Categories: cs.CL, cs.LG

Download: http://arxiv.org/abs/2510.07737v1

Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback

We introduce an incremental summarization system for customer support agents that intelligently determines when to generate concise bullet notes during conversations, reducing agents' context-switching effort and redundant review. Our approach combines a fine-tuned Mixtral-8x7B model for continuous note generation with a DeBERTa-based classifier to filter trivial content. Agent edits refine online note generation and regularly inform offline model retraining, closing the agent-edit feedback loop. Deployed in production, our system achieved a 3% reduction in case handling time compared to bulk summarization (with reductions of up to 9% in highly complex cases), alongside high agent satisfaction ratings from surveys. These results demonstrate that incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale.

Updated: 2025-10-09 03:15:43

Categories: cs.CL, cs.AI, cs.LG

Download: http://arxiv.org/abs/2510.06677v2

GeoGen: A Two-stage Coarse-to-Fine Framework for Fine-grained Synthetic Location-based Social Network Trajectory Generation

Location-Based Social Network (LBSN) check-in trajectory data are important for many practical applications, like POI recommendation, advertising, and pandemic intervention. However, the high collection costs and ever-increasing privacy concerns prevent us from accessing large-scale LBSN trajectory data. The recent advances in synthetic data generation provide us with a new opportunity to achieve this, which utilizes generative AI to generate synthetic data that preserves the characteristics of real data while ensuring privacy protection. However, generating synthetic LBSN check-in trajectories remains challenging due to their spatially discrete, temporally irregular nature and the complex spatio-temporal patterns caused by sparse activities and uncertain human mobility. To address this challenge, we propose GeoGen, a two-stage coarse-to-fine framework for large-scale LBSN check-in trajectory generation. In the first stage, we reconstruct spatially continuous, temporally regular latent movement sequences from the original LBSN check-in trajectories and then design a Sparsity-aware Spatio-temporal Diffusion model (S$^2$TDiff) with an efficient denoising network to learn their underlying behavioral patterns. In the second stage, we design Coarse2FineNet, a Transformer-based Seq2Seq architecture equipped with a dynamic context fusion mechanism in the encoder and a multi-task hybrid-head decoder, which generates fine-grained LBSN trajectories based on coarse-grained latent movement sequences by modeling semantic relevance and behavioral uncertainty. Extensive experiments on four real-world datasets show that GeoGen outperforms state-of-the-art models on both fidelity and utility evaluation, e.g., it improves the distance and radius metrics by over 69% and 55% on the FS-TKY dataset.

Updated: 2025-10-09 03:15:24

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07735v1

SurveyG: A Multi-Agent LLM Framework with Hierarchical Citation Graph for Automated Survey Generation

Large language models (LLMs) are increasingly adopted for automating survey paper generation \cite{wang2406autosurvey, liang2025surveyx, yan2025surveyforge,su2025benchmarking,wen2025interactivesurvey}. Existing approaches typically extract content from a large collection of related papers and prompt LLMs to summarize them directly. However, such methods often overlook the structural relationships among papers, resulting in generated surveys that lack a coherent taxonomy and a deeper contextual understanding of research progress. To address these shortcomings, we propose \textbf{SurveyG}, an LLM-based agent framework that integrates a \textit{hierarchical citation graph}, where nodes denote research papers and edges capture both citation dependencies and semantic relatedness between their contents, thereby embedding structural and contextual knowledge into the survey generation process. The graph is organized into three layers: \textbf{Foundation}, \textbf{Development}, and \textbf{Frontier}, to capture the evolution of research from seminal works to incremental advances and emerging directions. By combining horizontal search within layers and vertical depth traversal across layers, the agent produces multi-level summaries, which are consolidated into a structured survey outline. A multi-agent validation stage then ensures consistency, coverage, and factual accuracy in generating the final survey. Experiments, including evaluations by human experts and LLM-as-a-judge, demonstrate that SurveyG outperforms state-of-the-art frameworks, producing surveys that are more comprehensive and better aligned with the underlying knowledge taxonomy of a field.

Updated: 2025-10-09 03:14:20

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07733v1

oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in understanding chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.

Updated: 2025-10-09 03:13:31

Categories: cs.AI, cs.CL

Download: http://arxiv.org/abs/2510.07731v1

DEAS: DEtached value learning with Action Sequence for Scalable Offline RL

Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions and can be interpreted through the options framework via semi-Markov decision process Q-learning, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high return in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.
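The horizon-reduction argument can be made concrete with the standard semi-Markov (SMDP) Q-learning update over a length-$k$ action sequence, which the abstract invokes via the options framework. The notation below is the textbook form, not taken from the paper:

```latex
Q(s_t, a_{t:t+k-1}) \leftarrow Q(s_t, a_{t:t+k-1})
  + \alpha \Big[ \textstyle\sum_{i=0}^{k-1} \gamma^{i} r_{t+i}
  + \gamma^{k} \max_{a'} Q(s_{t+k}, a') - Q(s_t, a_{t:t+k-1}) \Big]
```

Each backup spans $k$ environment steps at once, so a horizon of $H$ steps needs only about $H/k$ backups, at the cost of the value-overestimation issue the detached learning component addresses.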

Updated: 2025-10-09 03:11:09

Categories: cs.LG, cs.AI, cs.RO

Download: http://arxiv.org/abs/2510.07730v1

Temporal Conformal Prediction (TCP): A Distribution-Free Statistical and Machine Learning Framework for Adaptive Risk Forecasting

We propose Temporal Conformal Prediction (TCP), a distribution-free framework for constructing well-calibrated prediction intervals in nonstationary time series. TCP couples a modern quantile forecaster with a split-conformal calibration layer on a rolling window and, in its TCP-RM variant, augments the conformal threshold with a single online Robbins-Monro (RM) offset to steer coverage toward a target level in real time. We benchmark TCP against GARCH, Historical Simulation, and a rolling Quantile Regression (QR) baseline across equities (S&P 500), cryptocurrency (Bitcoin), and commodities (Gold). Three results are consistent across assets. First, rolling QR yields the sharpest intervals but is materially under-calibrated (e.g., S&P 500: 83.2% vs. 95% target). Second, TCP (and TCP-RM) achieves near-nominal coverage across assets, with intervals that are wider than Historical Simulation in this evaluation (e.g., S&P 500: 5.21 vs. 5.06). Third, the RM update changes calibration and width only marginally at our default hyperparameters. Crisis-window visualizations around March 2020 show TCP/TCP-RM expanding and then contracting their interval bands promptly as volatility spikes and recedes, with red dots marking days where realized returns fall outside the reported 95% interval (miscoverage). A sensitivity study confirms robustness to window size and step-size choices. Overall, TCP provides a practical, theoretically grounded solution to calibrated uncertainty quantification under distribution shift, bridging statistical inference and machine learning for risk forecasting.
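The two ingredients can be sketched generically: a split-conformal quantile over a rolling window of residuals, plus a single online offset updated in Robbins-Monro fashion toward the target coverage. This is a minimal illustration under standard conformal/adaptive-coverage conventions, not the authors' implementation; function names and hyperparameters are ours:

```python
import numpy as np

def tcp_interval(scores, alpha=0.05, offset=0.0):
    """Split-conformal half-width from a rolling calibration window.

    scores: recent nonconformity scores, e.g. |y - y_hat| residuals
    Returns q such that y_hat +/- q is the prediction interval.
    """
    n = len(scores)
    # finite-sample conformal quantile level: ceil((n+1)(1-alpha))/n
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher") + offset

def rm_update(offset, covered, alpha=0.05, eta=0.01):
    """Robbins-Monro step: widen after a miss, shrink after a hit,
    steering long-run coverage toward 1 - alpha."""
    return offset + eta * ((1 - alpha) - (1.0 if covered else 0.0))
```

At each step one would compute the interval, observe whether the realized value fell inside it, and feed that indicator back through `rm_update` before the next forecast.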

Updated: 2025-10-09 03:07:59

Categories: stat.ML, cs.LG, 62G08, 62M10, 62P05, 91G70, 68T05

Download: http://arxiv.org/abs/2507.05470v4

WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning

Automating the conversion of UI images into web code is a critical task for front-end development and rapid prototyping. Advances in multimodal large language models (MLLMs) have made WebUI-to-Code increasingly feasible, yet existing benchmarks remain limited in data diversity and evaluation reliability. To address these issues, we present WebRenderBench, a large-scale benchmark of 45.1k webpages collected from real-world portal sites, offering greater diversity, complexity, and realism than prior benchmarks. We further propose a novel evaluation metric that measures layout and style consistency from the final rendered pages. Unlike vision-based methods that rely on costly LLM reasoning or structure-based comparisons vulnerable to noise and asymmetry, our approach enables more efficient, objective, and reliable UI quality assessment. Finally, we introduce the Automated Layout and Style Inspection Agent (ALISA), which integrates this metric into reinforcement learning as a reward signal to enhance training on crawled asymmetric webpages. Experiments show that ALISA significantly boosts generation performance, achieving state-of-the-art results across multiple metrics.

Updated: 2025-10-09 03:04:27

Categories: cs.AI

Download: http://arxiv.org/abs/2510.04097v2

Golden Ratio Weighting Prevents Model Collapse

Recent studies identified an intriguing phenomenon in recursive generative model training known as model collapse, where models trained on data generated by previous models exhibit severe performance degradation. Addressing this issue and developing more effective training strategies have become central challenges in generative model research. In this paper, we investigate this phenomenon within a novel framework, where generative models are iteratively trained on a combination of newly collected real data and synthetic data from the previous training step. To develop an optimal training strategy for integrating real and synthetic data, we evaluate the performance of a weighted training scheme in various scenarios, including Gaussian distribution estimation, generalized linear models, and nonparametric estimation. We theoretically characterize the impact of the mixing proportion and weighting scheme of synthetic data on the final model's performance. Our key finding is that, across different settings, the optimal weighting scheme under different proportions of synthetic data asymptotically follows a unified expression, revealing a fundamental trade-off between leveraging synthetic data and model performance. In some cases, the optimal weight assigned to real data corresponds to the reciprocal of the golden ratio. Finally, we validate our theoretical results on extensive simulated datasets and a real tabular dataset.
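For reference, the golden ratio and its reciprocal, which the abstract identifies as the optimal real-data weight in some cases (the reciprocal identity $1/\varphi = \varphi - 1$ is a standard property of $\varphi$):

```latex
\varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618,
\qquad
w_{\text{real}} = \frac{1}{\varphi} = \varphi - 1 \approx 0.618
```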

Updated: 2025-10-09 03:00:19

Categories: stat.ML, cs.LG

Download: http://arxiv.org/abs/2502.18049v4

Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling

3D structure modeling is essential across scales, enabling applications from fluid simulation and 3D reconstruction to protein folding and molecular docking. Yet, despite shared 3D spatial patterns, current approaches remain fragmented, with models narrowly specialized for specific domains and unable to generalize across tasks or scales. We propose Uni-3DAR, a unified autoregressive framework for cross-scale 3D generation and understanding. At its core is a coarse-to-fine tokenizer based on octree data structures, which compresses diverse 3D structures into compact 1D token sequences. We further propose a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. To address the challenge of dynamically varying token positions introduced by compression, we introduce a masked next-token prediction strategy that ensures accurate positional modeling, significantly boosting model performance. Extensive experiments across multiple 3D generation and understanding tasks, including small molecules, proteins, polymers, crystals, and macroscopic 3D objects, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster.

Updated: 2025-10-09 02:59:45

Categories: cs.LG, cond-mat.mtrl-sci, q-bio.BM

Download: http://arxiv.org/abs/2503.16278v3

SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

We propose SDAR, a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. Instead of costly end-to-end diffusion training, SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. During inference, SDAR generates sequences autoregressively across blocks for global coherence while decoding all tokens within each block in parallel via a discrete diffusion process. Extensive experiments show that AR models remain substantially more compute-efficient than masked diffusion models, providing a strong foundation for adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Scaling studies across dense and Mixture-of-Experts architectures confirm that SDAR scales without compromise: larger models exhibit stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. Beyond efficiency, SDAR demonstrates enhanced reasoning and domain adaptability. Our 30B MoE model surpasses its AR counterpart on challenging scientific reasoning benchmarks such as GPQA and ChemBench, and gains further improvements under test-time scaling methods like majority voting and pass@k. Together, these results establish SDAR as a practical paradigm that combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning.

Updated: 2025-10-09 02:55:18

Categories: cs.LG, cs.AI

Download: http://arxiv.org/abs/2510.06303v2

Computationally-efficient Graph Modeling with Refined Graph Random Features

We propose refined GRFs (GRFs++), a new class of Graph Random Features (GRFs) for efficient and accurate computations involving kernels defined on the nodes of a graph. GRFs++ resolve some of the long-standing limitations of regular GRFs, including difficulty modeling relationships between more distant nodes. They reduce dependence on sampling long graph random walks via a novel walk-stitching technique, concatenating several shorter walks without breaking unbiasedness. By applying these techniques, GRFs++ inherit the approximation quality provided by longer walks but with greater efficiency, trading sequential, inefficient sampling of a long walk for parallel computation of short walks and matrix-matrix multiplication. Furthermore, GRFs++ extend the simplistic GRFs walk termination mechanism (Bernoulli schemes with fixed halting probabilities) to a broader class of strategies, applying general distributions on the walks' lengths. This improves the approximation accuracy of graph kernels, without incurring extra computational cost. We provide empirical evaluations to showcase all our claims and complement our results with theoretical analysis.

Updated: 2025-10-09 02:53:26

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07716v1

Control Synthesis of Cyber-Physical Systems for Real-Time Specifications through Causation-Guided Reinforcement Learning

In real-time and safety-critical cyber-physical systems (CPSs), control synthesis must guarantee that generated policies meet stringent timing and correctness requirements under uncertain and dynamic conditions. Signal temporal logic (STL) has emerged as a powerful formalism for expressing real-time constraints, with its semantics enabling quantitative assessment of system behavior. Meanwhile, reinforcement learning (RL) has become an important method for solving control synthesis problems in unknown environments. Recent studies incorporate STL-based reward functions into RL to automatically synthesize control policies. However, the automatically inferred rewards obtained by these methods represent the global assessment of a whole or partial path but do not accumulate the rewards of local changes accurately, so the sparse global rewards may lead to non-convergence and unstable training performance. In this paper, we propose an online reward generation method guided by the online causation monitoring of STL. Our approach continuously monitors system behavior against an STL specification at each control step, computing the quantitative distance toward satisfaction or violation and thereby producing rewards that reflect instantaneous state dynamics. Additionally, we provide a smooth approximation of the causation semantics to overcome the discontinuity of the causation semantics and make it differentiable for using deep-RL methods. We have implemented a prototype tool and evaluated it in the Gym environment on a variety of continuously controlled benchmarks. Experimental results show that our proposed STL-guided RL method with online causation semantics outperforms existing relevant STL-guided RL methods, providing a more robust and efficient reward generation framework for deep-RL.
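For context, STL's quantitative semantics assigns each specification a real-valued robustness, positive iff the specification is satisfied. A minimal discrete-time sketch for a threshold predicate and the "globally"/"eventually" operators (standard robustness semantics, not the paper's causation monitoring, which refines this per control step):

```python
def rho_pred(signal, c):
    # robustness of the predicate x > c at each time step
    return [x - c for x in signal]

def rho_globally(rob, a, b):
    # robustness of G_[a,b] phi at time 0: worst case over the window
    return min(rob[a:b + 1])

def rho_eventually(rob, a, b):
    # robustness of F_[a,b] phi at time 0: best case over the window
    return max(rob[a:b + 1])
```

A margin-style reward can then be read off these values; the paper's contribution is to localize and smooth such quantities so each control step receives an informative, differentiable signal.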

Updated: 2025-10-09 02:49:28

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07715v1

Multimodal Safety Evaluation in Generative Agent Social Simulations

Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models - Claude, GPT-4o mini, and Qwen-VL - five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.

Updated: 2025-10-09 02:42:57

Categories: cs.AI,cs.CL,cs.CY,cs.MA

Download: http://arxiv.org/abs/2510.07709v1

Causality Guided Representation Learning for Cross-Style Hate Speech Detection

The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language -- making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.

Updated: 2025-10-09 02:41:37

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2510.07707v1

Large Language Models Meet Virtual Cell: A Survey

Large language models (LLMs) are transforming cellular biology by enabling the development of "virtual cells"--computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks--cellular representation, perturbation prediction, and gene regulation inference--and review their associated models, datasets, evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.

Updated: 2025-10-09 02:41:30

Categories: cs.CL,cs.CE,cs.LG,q-bio.CB

Download: http://arxiv.org/abs/2510.07706v1

More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty

We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines, such as Math-Shepherd PRM and Omega PRM, and EDU-PRM achieves comparable results with SOTA models while only using 1.5% training data. Furthermore, by leveraging our proposed EDU sampling strategy, we observe accuracy boosts from 64.7% to 67.3% for generative reasoning tasks, accompanied by a reduction of 32% in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
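The entropy-anchored segmentation idea can be illustrated with a toy sketch. This is a hedged example, not the authors' code: the entropy function is standard, but the fixed threshold and the four-way toy distributions are assumptions for illustration.

```python
import math

def token_entropy(probs):
    """Predictive entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_step_boundaries(per_token_probs, threshold=1.0):
    """Anchor step boundaries at token positions whose predictive entropy
    exceeds a threshold (the threshold value here is hypothetical)."""
    return [i for i, probs in enumerate(per_token_probs)
            if token_entropy(probs) > threshold]

dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident continuation
    [0.90, 0.05, 0.03, 0.02],  # confident continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain -> step boundary
    [0.95, 0.03, 0.01, 0.01],  # confident continuation
]
boundaries = entropy_step_boundaries(dists)  # [2]
```

Only the uniform distribution at position 2 exceeds 1 nat of entropy, so that token becomes the segment boundary, mirroring how the paper anchors steps at intrinsic logical transitions.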

Updated: 2025-10-09 02:37:01

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2503.22233v3

Rethinking Reasoning: A Survey on Reasoning-based Backdoors in LLMs

With the rise of advanced reasoning capabilities, large language models (LLMs) are receiving increasing attention. However, although reasoning improves LLMs' performance on downstream tasks, it also introduces new security risks, as adversaries can exploit these capabilities to conduct backdoor attacks. Existing surveys on backdoor attacks and reasoning security offer comprehensive overviews but lack in-depth analysis of backdoor attacks and defenses targeting LLMs' reasoning abilities. In this paper, we take the first step toward providing a comprehensive review of reasoning-based backdoor attacks in LLMs by analyzing their underlying mechanisms, methodological frameworks, and unresolved challenges. Specifically, we introduce a new taxonomy that offers a unified perspective for summarizing existing approaches, categorizing reasoning-based backdoor attacks into associative, passive, and active. We also present defense strategies against such attacks and discuss current challenges alongside potential directions for future research. This work offers a novel perspective, paving the way for further exploration of secure and trustworthy LLM communities.

Updated: 2025-10-09 02:35:37

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2510.07697v1

LLM Applications: Current Paradigms and the Next Frontier

The development of large language models (LLMs) has given rise to four major application paradigms: LLM app stores, LLM agents, self-hosted LLM services, and LLM-powered devices. Each has its advantages but also shares common challenges. LLM app stores lower the barrier to development but lead to platform lock-in; LLM agents provide autonomy but lack a unified communication mechanism; self-hosted LLM services enhance control but increase deployment complexity; and LLM-powered devices improve privacy and real-time performance but are limited by hardware. This paper reviews and analyzes these paradigms, covering architecture design, application ecosystem, research progress, as well as the challenges and open problems they face. Based on this, we outline the next frontier of LLM applications, characterizing them through three interconnected layers: infrastructure, protocol, and application. We describe their responsibilities and roles of each layer and demonstrate how to mitigate existing fragmentation limitations and improve security and scalability. Finally, we discuss key future challenges, identify opportunities such as protocol-driven cross-platform collaboration and device integration, and propose a research roadmap for openness, security, and sustainability.

Updated: 2025-10-09 02:34:39

Categories: cs.SE,cs.AI

Download: http://arxiv.org/abs/2503.04596v2

The Curious Case of In-Training Compression of State Space Models

State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for the measure of energy for each state, as well as the balanced truncation of the original system down to a smaller representation with performance guarantees. Leveraging the eigenvalue stability properties of Hankel matrices, we apply this lens to SSMs during training, where only dimensions of high influence are identified and preserved. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models. Experiments show that in-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure lost by models trained directly at smaller dimension. In other words, SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance. Project code is available at github.com/camail-official/compressm.
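The Hankel-singular-value criterion behind balanced truncation can be sketched as follows. This is a minimal NumPy illustration for a stable discrete-time LTI system, not the CompreSSM training code; the 1e-2 energy cutoff and the toy system matrices are assumptions.

```python
import numpy as np

def solve_discrete_lyapunov(A, Q):
    """Solve A X A^T - X + Q = 0 by vectorization (row-major):
    vec(X) = (I - A kron A)^{-1} vec(Q)."""
    n = A.shape[0]
    x = np.linalg.solve(np.eye(n * n) - np.kron(A, A), Q.reshape(-1))
    return x.reshape(n, n)

def hankel_singular_values(A, B, C):
    """Hankel singular values of a stable discrete-time LTI system:
    square roots of the eigenvalues of the Gramian product Wc Wo."""
    Wc = solve_discrete_lyapunov(A, B @ B.T)    # controllability Gramian
    Wo = solve_discrete_lyapunov(A.T, C.T @ C)  # observability Gramian
    eigs = np.linalg.eigvals(Wc @ Wo).real
    return np.sort(np.sqrt(np.clip(eigs, 0.0, None)))[::-1]

# Toy 2-state system: one high-energy state, one nearly unreachable one
A = np.diag([0.9, 0.2])
B = np.array([[1.0], [0.05]])
C = np.array([[1.0, 0.05]])
sigma = hankel_singular_values(A, B, C)
keep = sigma > 1e-2 * sigma[0]  # states that survive balanced truncation
```

Each Hankel singular value measures the input-output energy carried by one balanced state; truncating states below the cutoff is what gives balanced truncation its performance guarantee.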

Updated: 2025-10-09 02:32:54

Categories: cs.LG,68T07,I.2.0; I.2.7

Download: http://arxiv.org/abs/2510.02823v2

Want to train KANS at scale? Now UKAN!

Kolmogorov-Arnold Networks (KANs) have recently emerged as a powerful alternative to traditional multilayer perceptrons. However, their reliance on predefined, bounded grids restricts their ability to approximate functions on unbounded domains. To address this, we present Unbounded Kolmogorov-Arnold Networks (UKANs), a method that removes the need for bounded grids in traditional Kolmogorov-Arnold Networks (KANs). The key innovation of this method is a coefficient-generator (CG) model that produces, on the fly, only the B-spline coefficients required locally on an unbounded symmetric grid. UKANs couple multilayer perceptrons with KANs by feeding the positional encoding of grid groups into the CG model, enabling function approximation on unbounded domains without requiring data normalization. To reduce the computational cost of both UKANs and KANs, we introduce a GPU-accelerated library that lowers B-spline evaluation complexity by a factor proportional to the grid size, enabling large-scale learning by leveraging efficient memory management, in line with recent software advances such as FlashAttention and FlashFFTConv. Performance benchmarking confirms the superior memory and computational efficiency of our accelerated KAN (warpKAN), and UKANs, showing a 3-30x speed-up and up to 1000x memory reduction compared to vanilla KANs. Experiments on regression, classification, and generative tasks demonstrate the effectiveness of UKANs to match or surpass KAN accuracy. Finally, we use both accelerated KAN and UKAN in a molecular property prediction task, establishing the feasibility of large-scale end-to-end training with our optimized implementation.
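For context, the B-spline bases whose coefficients the CG model generates follow the Cox-de Boor recursion. The following is a minimal pure-Python sketch of that recursion on a uniform knot vector, unrelated to the paper's GPU kernels; the knot vector and evaluation point are illustrative.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: value of the i-th degree-k B-spline basis at t."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = ((t - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(i, k - 1, t, knots))
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - t) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, t, knots))
    return left + right

knots = [0, 1, 2, 3, 4, 5]  # uniform knot vector
# Degree-2 bases: at t = 2.5 three bases overlap and sum to 1
vals = [bspline_basis(i, 2, 2.5, knots) for i in range(3)]
total = sum(vals)  # partition of unity inside the common support
```

Because each basis has compact support, only a handful of coefficients matter at any t, which is what makes generating coefficients locally (as UKANs do on an unbounded grid) feasible.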

Updated: 2025-10-09 02:32:06

Categories: cs.LG

Download: http://arxiv.org/abs/2408.11200v4

PO-Flow: Flow-based Generative Models for Sampling Potential Outcomes and Counterfactuals

Predicting potential and counterfactual outcomes from observational data is central to clinical decision-making, where physicians must weigh treatments for an individual patient rather than relying solely on average effects at the population level. We propose PO-Flow, a continuous normalizing flow (CNF) framework for causal inference that jointly models potential outcomes and counterfactuals. Trained via flow matching, PO-Flow provides a unified approach to average treatment effect estimation, individualized potential outcome prediction, and counterfactual prediction. Besides, PO-Flow directly learns the densities of potential outcomes, enabling likelihood-based evaluation of predictions. Furthermore, PO-Flow explores counterfactual outcome generation conditioned on the observed factual in general observational datasets, with a supporting recovery result under certain assumptions. PO-Flow outperforms modern baselines across diverse datasets and causal tasks in the potential outcomes framework.

Updated: 2025-10-09 02:28:01

Categories: stat.ML,cs.LG

Download: http://arxiv.org/abs/2505.16051v2

LLMs on a Budget? Say HOLA

Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands posing a barrier for real-time applications in sectors like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and retrieval-augmented generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with LoBi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: 17.6% EMA on GSM8K, 10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano--proving both scalable and production-ready.

Updated: 2025-10-09 02:27:57

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2506.18952v2

Stress-Testing Model Specs Reveals Character Differences among Language Models

Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous example issues in current model specs such as direct contradiction and interpretive ambiguities of several principles. Additionally, our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we also provide value prioritization patterns and differences of these models.

Updated: 2025-10-09 02:24:37

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07686v1

PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

Large language models (LLMs) have achieved impressive success in single-turn function calling, yet real-world applications such as travel planning or multi-stage data analysis typically unfold across multi-turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long-horizon task execution. Existing approaches, however, either reduce multi-turn training to isolated single-turn samples, which neglects task-level planning, or employ end-to-end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce PARL-MT, a framework that explicitly incorporates progress awareness into LLM training for multi-turn function calling. PARL-MT combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that PARL-MT significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi-turn function calling.

Updated: 2025-10-09 02:23:18

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.23206v3

Advancing Automated Urban Planning: Exploring Algorithmic Approaches with Generative Artificial Intelligence

The two fields of urban planning and artificial intelligence (AI) arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we introduce the importance of urban planning from the sustainability, living, economic, disaster, and environmental perspectives. We review the fundamental concepts of urban planning and relate these concepts to crucial open problems of machine learning, including adversarial learning, generative neural networks, deep encoder-decoder networks, conversational AI, and geospatial and temporal machine learning, thereby assaying how AI can contribute to modern urban planning. Thus, a central problem is automated land-use configuration, which is formulated as the generation of land uses and building configuration for a target area from surrounding geospatial, human mobility, social media, environment, and economic activities. Finally, we delineate some implications of AI for urban planning and propose key research areas at the intersection of both topics.

Updated: 2025-10-09 02:19:03

Categories: cs.AI,cs.CY,cs.LG

Download: http://arxiv.org/abs/2304.03892v2

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: Using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug and play: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at https://github.com/Jinyeop3110/KG-R1.

Updated: 2025-10-09 02:18:28

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2509.26383v3

Truth, Trust, and Trouble: Medical AI on the Edge

Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models -- Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.

Updated: 2025-10-09 02:17:59

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.02983v2

LiveThinking: Enabling Real-Time Efficient Reasoning for AI-Powered Livestreaming via Reinforcement Learning

In AI-powered e-commerce livestreaming, digital avatars require real-time responses to drive engagement, a task for which high-latency Large Reasoning Models (LRMs) are ill-suited. We introduce LiveThinking, a practical two-stage optimization framework to bridge this gap. First, we address computational cost by distilling a 670B teacher LRM into a lightweight 30B Mixture-of-Experts (MoE) model (3B active) using Rejection Sampling Fine-Tuning (RFT). This reduces deployment overhead but preserves the teacher's verbose reasoning, causing latency. To solve this, our second stage employs reinforcement learning with Group Relative Policy Optimization (GRPO) to compress the model's reasoning path, guided by a multi-objective reward function balancing correctness, helpfulness, and brevity. LiveThinking achieves a 30-fold reduction in computational cost, enabling sub-second latency. In real-world application on Taobao Live, it improved response correctness by 3.3% and helpfulness by 21.8%. Tested by hundreds of thousands of viewers, our system led to a statistically significant increase in Gross Merchandise Volume (GMV), demonstrating its effectiveness in enhancing user experience and commercial performance in live, interactive settings.
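A multi-objective reward of the kind described can be sketched as a weighted combination of correctness, helpfulness, and a brevity penalty. This is a hedged toy sketch; the function name, weights, and token counts are illustrative assumptions, not the paper's actual reward.

```python
def multi_objective_reward(correct, helpfulness, n_reasoning_tokens,
                           w_c=1.0, w_h=0.5, w_b=0.001):
    """Illustrative scalar reward: reward correct, helpful answers and
    penalize long reasoning paths (all weights are hypothetical)."""
    return (w_c * float(correct)
            + w_h * helpfulness
            - w_b * n_reasoning_tokens)

# Same answer quality, shorter reasoning path -> higher reward
r_short = multi_objective_reward(True, 0.8, 200)
r_long = multi_objective_reward(True, 0.8, 900)
```

Under a reward shaped this way, policy optimization is pushed toward compressing the reasoning path whenever the extra tokens do not buy correctness or helpfulness, which is the latency lever the paper exploits.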

Updated: 2025-10-09 02:17:20

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2510.07685v1

On the Convergence of Moral Self-Correction in Large Language Models

Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.

Updated: 2025-10-09 02:09:12

Categories: cs.CL,cs.LG

Download: http://arxiv.org/abs/2510.07290v2

Curriculum Learning with Synthetic Data for Enhanced Pulmonary Nodule Detection in Chest Radiographs

This study evaluates whether integrating curriculum learning with diffusion-based synthetic augmentation can enhance the detection of difficult pulmonary nodules in chest radiographs, particularly those with low size, brightness, and contrast, which often challenge conventional AI models due to data imbalance and limited annotation. A Faster R-CNN with a Feature Pyramid Network (FPN) backbone was trained on a hybrid dataset comprising expert-labeled NODE21 (1,213 patients; 52.4 percent male; mean age 63.2 +/- 11.5 years), VinDr-CXR, CheXpert, and 11,206 DDPM-generated synthetic images. Difficulty scores based on size, brightness, and contrast guided curriculum learning. Performance was compared to a non-curriculum baseline using mean average precision (mAP), Dice score, and area under the curve (AUC). Statistical tests included bootstrapped confidence intervals, DeLong tests, and paired t-tests. The curriculum model achieved a mean AUC of 0.95 versus 0.89 for the baseline (p < 0.001), with improvements in sensitivity (70 percent vs. 48 percent) and accuracy (82 percent vs. 70 percent). Stratified analysis demonstrated consistent gains across all difficulty bins (Easy to Very Hard). Grad-CAM visualizations confirmed more anatomically focused attention under curriculum learning. These results suggest that curriculum-guided synthetic augmentation enhances model robustness and generalization for pulmonary nodule detection.
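Curriculum learning driven by difficulty scores can be sketched as a simple easy-to-hard ordering. This is a hedged toy example: the score weights and the normalized size/brightness/contrast features are hypothetical, not the study's actual scoring function.

```python
def difficulty_score(nodule, w_size=0.4, w_brightness=0.3, w_contrast=0.3):
    """Illustrative composite score: smaller, dimmer, lower-contrast nodules
    score as harder (features assumed normalized to [0, 1])."""
    return (w_size * (1.0 - nodule["size"])
            + w_brightness * (1.0 - nodule["brightness"])
            + w_contrast * (1.0 - nodule["contrast"]))

def curriculum_order(samples):
    """Easy-to-hard ordering used to schedule training batches."""
    return sorted(samples, key=difficulty_score)

samples = [
    {"id": "hard", "size": 0.1, "brightness": 0.2, "contrast": 0.1},
    {"id": "easy", "size": 0.9, "brightness": 0.8, "contrast": 0.9},
    {"id": "medium", "size": 0.5, "brightness": 0.5, "contrast": 0.5},
]
order = [s["id"] for s in curriculum_order(samples)]
```

Binning the sorted scores (Easy to Very Hard) then yields the stratified difficulty levels along which the study reports its gains.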

Updated: 2025-10-09 02:06:13

Categories: eess.IV,cs.AI,cs.CV

Download: http://arxiv.org/abs/2510.07681v1

Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support

We introduce an Agent-in-the-Loop (AITL) framework that implements a continuous data flywheel for iteratively improving an LLM-based customer support system. Unlike standard offline approaches that rely on batch annotations, AITL integrates four key types of annotations directly into live customer operations: (1) pairwise response preferences, (2) agent adoption and rationales, (3) knowledge relevance checks, and (4) identification of missing knowledge. These feedback signals feed seamlessly back into model updates, reducing retraining cycles from months to weeks. Our production pilot involving US-based customer support agents demonstrated significant improvements in retrieval accuracy (+11.7% recall@75, +14.8% precision@8), generation quality (+8.4% helpfulness), and agent adoption rates (+4.5%). These results underscore the effectiveness of embedding human feedback loops directly into operational workflows to continuously refine LLM-based customer support systems.

Updated: 2025-10-09 02:01:51

Categories: cs.AI

Download: http://arxiv.org/abs/2510.06674v2

Mitigating Noise Detriment in Differentially Private Federated Learning with Model Pre-training

Differentially Private Federated Learning (DPFL) strengthens privacy protection by perturbing model gradients with noise, though at the cost of reduced accuracy. Although prior empirical studies indicate that initializing from pre-trained rather than random parameters can alleviate noise disturbance, the problem of optimally fine-tuning pre-trained models in DPFL remains unaddressed. In this paper, we propose Pretrain-DPFL, a framework that systematically evaluates the three most representative fine-tuning strategies: full-tuning (FT), head-tuning (HT), and unified-tuning (UT), which applies HT followed by FT. Through convergence analysis under smooth non-convex loss, we establish theoretical conditions for identifying the optimal fine-tuning strategy in Pretrain-DPFL, thereby maximizing the benefits of pre-trained models in mitigating noise disturbance. Extensive experiments across multiple datasets demonstrate Pretrain-DPFL's superiority, achieving 25.22% higher accuracy than training from scratch and outperforming the second-best baseline by 8.19%, significantly improving the privacy-utility trade-off in DPFL.
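The gradient perturbation that DPFL applies, and whose accuracy cost Pretrain-DPFL aims to mitigate, follows the standard clip-then-add-Gaussian-noise recipe. The clip norm and noise multiplier below are illustrative defaults, not the paper's settings.

```python
import math
import random

def dp_perturb(grad, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0)):
    """Clip a client gradient to clip_norm, then add Gaussian noise
    with std = noise_multiplier * clip_norm (standard DP-SGD recipe)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]
    sigma = noise_multiplier * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]

g = [3.0, 4.0]       # norm 5 -> rescaled to norm 1 before noising
noisy = dp_perturb(g)
print(noisy)
```

Larger noise multipliers give stronger privacy but noisier updates, which is exactly the trade-off that pre-trained initialization helps absorb.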

Updated: 2025-10-09 01:53:00

Categories: cs.LG,cs.CR

Download: http://arxiv.org/abs/2408.09478v2

Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations

Evaluating domain generalization (DG) for foundational models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP `forget' some domains as an approximation. We observe that CLIP's performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization. We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features. CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.

Updated: 2025-10-09 01:50:32

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.21769v2

Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study

Retrieval-augmented generation (RAG) is increasingly recognized as an effective approach to mitigating the hallucination of large language models (LLMs) through the integration of external knowledge. Despite numerous efforts, most studies focus on a single type of external knowledge source. However, in real-world applications, most situations involve diverse knowledge from various sources, yet this area remains underexplored. The main obstacle is the lack of a suitable dataset containing multiple knowledge sources, and of prior exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse and complementary domains. Based on this dataset, we further develop a plug-and-play RAG framework, PruningRAG, whose main characteristic is the use of multi-granularity pruning strategies to optimize the integration of relevant information while minimizing misleading context. It consistently improves performance across various existing RAG variants, demonstrating its robustness and broad applicability. Building upon the standardized dataset and PruningRAG, we also report a series of experimental results and insightful findings. Our dataset and code are publicly available at https://github.com/USTCAGI/PruningRAG, with the aim of advancing future research in the RAG community.

Updated: 2025-10-09 01:48:50

Categories: cs.CL,cs.AI,cs.IR

Download: http://arxiv.org/abs/2409.13694v4

Controllable Video Synthesis via Variational Inference

Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.

Updated: 2025-10-09 01:48:16

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.07670v1

XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and a heavy reliance on sparse rewards. This paper presents XRPO (eXplore-eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy's reach beyond sparse rewards. Experiments across diverse math and coding benchmarks, on both reasoning and non-reasoning models, demonstrate that XRPO outperforms existing methods (e.g., GRPO and GSPO) by up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7X.
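The group-relative, novelty-aware sharpening can be sketched as a standard GRPO-style advantage rescaled by a factor that grows as the sequence likelihood shrinks. The exact form of the novelty factor below is an assumption; the abstract only states that low-probability yet correct responses are amplified.

```python
import math

def sharpened_advantages(rewards, seq_logprobs, alpha=0.5):
    """Group-relative advantages (GRPO-style: (r - mean) / std over the
    rollout group), scaled up for low-likelihood responses.
    alpha controls the hypothetical novelty bonus."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) or 1.0
    out = []
    for r, lp in zip(rewards, seq_logprobs):
        base = (r - mean) / std
        # Novelty factor in (1, 1 + alpha): grows as log-prob lp drops.
        novelty = 1.0 + alpha * (-lp) / (1.0 + abs(lp))
        # Amplify only positive (correct) advantages, per the stated goal.
        out.append(base * novelty if base > 0 else base)
    return out

# Two correct rollouts; the rarer one (log-prob -10) gets a larger advantage.
print(sharpened_advantages([1.0, 0.0, 0.0, 1.0], [-1.0, -2.0, -3.0, -10.0]))
```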

Updated: 2025-10-09 01:39:47

Categories: cs.LG

Download: http://arxiv.org/abs/2510.06672v2

TCIP: Threshold-Controlled Iterative Pyramid Network for Deformable Medical Image Registration

Although pyramid networks have demonstrated superior performance in deformable medical image registration, their decoder architectures are inherently prone to propagating and accumulating anatomical structure misalignments. Moreover, most existing models do not adaptively determine the number of optimization iterations under the varying deformation requirements of different images, resulting in either premature termination or excessive iterations that degrade registration accuracy. To effectively mitigate the accumulation of anatomical misalignments, we propose the Feature-Enhanced Residual Module (FERM) as the core component of each decoding layer in the pyramid network. FERM comprises three sequential blocks that extract anatomical semantic features, learn to suppress irrelevant features, and estimate the final deformation field, respectively. To adaptively determine the number of iterations for varying images, we propose the dual-stage Threshold-Controlled Iterative (TCI) strategy. In the first stage, TCI assesses registration stability; once stability is asserted, it proceeds to the second stage to evaluate convergence. We coin the model that integrates FERM and TCI the Threshold-Controlled Iterative Pyramid (TCIP). Extensive experiments on three public brain MRI datasets and one abdomen CT dataset demonstrate that TCIP outperforms state-of-the-art (SOTA) registration networks in terms of accuracy, while maintaining comparable inference speed and a compact model parameter size. Finally, we assess the generalizability of FERM and TCI by integrating them with existing registration networks, and further conduct ablation studies to validate the effectiveness of the two proposed methods.
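The dual-stage stopping rule can be sketched as a loop that first waits for the change in the deformation field to fall below a stability threshold, then below a tighter convergence threshold. The thresholds and the 1-D toy "refinement" are illustrative, not the paper's values.

```python
def tci_iterate(step, field, stab_tol=1e-2, conv_tol=1e-4, max_iter=50):
    """Dual-stage Threshold-Controlled Iteration (sketch).
    Stage 1 runs until successive updates are stable (change < stab_tol);
    stage 2 then runs until convergence (change < conv_tol)."""
    stage = 1
    for it in range(1, max_iter + 1):
        new_field = step(field)
        change = abs(new_field - field)
        field = new_field
        if stage == 1 and change < stab_tol:
            stage = 2          # stability asserted; now check convergence
        elif stage == 2 and change < conv_tol:
            return field, it   # adaptive stopping point
    return field, max_iter

# Toy 1-D "registration": each refinement halves the residual toward 1.0.
result, iters = tci_iterate(lambda f: f + 0.5 * (1.0 - f), 0.0)
print(result, iters)
```

Because the stopping point depends on the observed changes, easy cases terminate early while hard cases get more refinement steps, which is the adaptivity the abstract describes.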

Updated: 2025-10-09 01:38:40

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2510.07666v1

Product of Experts for Visual Generation

Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources -- including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators -- remains under-explored. We propose a Product of Experts (PoE) framework that performs inference-time knowledge composition from heterogeneous models. This training-free approach samples from the product distribution across experts via Annealed Importance Sampling (AIS). Our framework shows practical benefits in image and video synthesis tasks, yielding better controllability than monolithic methods and additionally providing flexible user interfaces for specifying visual generation goals.
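The product-of-experts target can be made concrete with two Gaussian experts, whose normalized product is again Gaussian in closed form. Real visual experts admit no such closed form, which is why the paper samples from the product via Annealed Importance Sampling instead.

```python
# Product of two Gaussian "experts" N(m1, v1) and N(m2, v2):
# the normalized product is N(m, v) with precision-weighted mean.
def gaussian_product(m1, v1, m2, v2):
    v = 1.0 / (1.0 / v1 + 1.0 / v2)   # combined variance (precisions add)
    m = v * (m1 / v1 + m2 / v2)       # precision-weighted mean
    return m, v

m, v = gaussian_product(0.0, 1.0, 4.0, 1.0)
print(m, v)  # two equally confident experts -> midpoint mean, halved variance
```

The sample must satisfy both experts at once, so the product is sharper than either factor; AIS reaches it by annealing from an easy distribution toward this product.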

Updated: 2025-10-09 01:37:47

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2506.08894v2

Emotionally Vulnerable Subtype of Internet Gaming Disorder: Measuring and Exploring the Pathology of Problematic Generative AI Use

Concerns over the potential over-pathologization of generative AI (GenAI) use and the lack of conceptual clarity surrounding GenAI addiction call for empirical tools and theoretical refinement. This study developed and validated the PUGenAIS-9 (Problematic Use of Generative Artificial Intelligence Scale-9 items) and examined whether problematic GenAI use reflects addiction-like patterns under the Internet Gaming Disorder (IGD) framework. Using samples from China and the United States (N = 1,508), we conducted confirmatory factor analysis and identified a robust 31-item structure across nine IGD-based dimensions. We then derived the PUGenAIS-9 by selecting the highest-loading items from each dimension and validated its structure in an independent sample (N = 1,426). Measurement invariance tests confirmed its stability across nationality and gender. Person-centered (latent profile analysis) and variable-centered (network analysis) approaches revealed a 5-10% prevalence rate, a symptom network structure similar to IGD, and predictive factors related to psychological distress and functional impairment. These findings indicate that problematic GenAI use shares features of the emotionally vulnerable subtype of IGD rather than the competence-based type. These results support using the PUGenAIS-9 to identify problematic GenAI use and show the need to rethink digital addiction with an ICD (infrastructures, content, and device) model. This keeps addiction research responsive to new media while avoiding over-pathologization.

Updated: 2025-10-09 01:37:06

Categories: cs.HC,cs.AI,cs.CY

Download: http://arxiv.org/abs/2510.06908v2

FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning

Federated learning (FL) enables collaborative model training across multiple parties without sharing raw data, with semi-asynchronous FL (SAFL) emerging as a balanced approach between synchronous and asynchronous FL. However, SAFL faces significant challenges in optimizing both gradient-based (e.g., FedSGD) and model-based (e.g., FedAvg) aggregation strategies, which exhibit distinct trade-offs in accuracy, convergence speed, and stability. While gradient aggregation achieves faster convergence and higher accuracy, it suffers from pronounced fluctuations, whereas model aggregation offers greater stability but slower convergence and suboptimal accuracy. This paper presents FedQS, the first framework to theoretically analyze and address these disparities in SAFL. FedQS introduces a divide-and-conquer strategy to handle client heterogeneity by classifying clients into four distinct types and adaptively optimizing their local training based on data distribution characteristics and available computational resources. Extensive experiments on computer vision, natural language processing, and real-world tasks demonstrate that FedQS achieves the highest accuracy, attains the lowest loss, and ranks among the fastest in convergence speed, outperforming state-of-the-art baselines. Our work bridges the gap between aggregation strategies in SAFL, offering a unified solution for stable, accurate, and efficient federated learning. The code and datasets are available at https://anonymous.4open.science/r/FedQS-EDD6.
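The two aggregation families that FedQS reconciles can be sketched side by side: FedSGD averages client gradients and takes one server step, while FedAvg averages the locally trained models directly. Uniform client weights are assumed here for simplicity.

```python
def fed_sgd(global_w, client_grads, lr=0.1):
    """Gradient aggregation: average client gradients, one server step."""
    avg = [sum(gs) / len(gs) for gs in zip(*client_grads)]
    return [w - lr * g for w, g in zip(global_w, avg)]

def fed_avg(client_weights):
    """Model aggregation: average the locally trained models directly."""
    return [sum(ws) / len(ws) for ws in zip(*client_weights)]

w = fed_sgd([1.0, 1.0], [[0.2, 0.4], [0.6, 0.0]])   # toy 2-parameter model
m = fed_avg([[0.9, 1.1], [1.1, 0.9]])
print(w, m)
```

The per-step gradient averaging reacts quickly (fast convergence, more fluctuation), whereas averaging whole models smooths over local drift (stability, slower convergence), which is the trade-off the abstract describes.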

Updated: 2025-10-09 01:32:19

Categories: cs.LG,cs.DC

Download: http://arxiv.org/abs/2510.07664v1

BEVTrack: A Simple and Strong Baseline for 3D Single Object Tracking in Bird's-Eye View

3D Single Object Tracking (SOT) is a fundamental task in computer vision and plays a critical role in applications like autonomous driving. However, existing algorithms often involve complex designs and multiple loss functions, making model training and deployment challenging. Furthermore, their reliance on fixed probability distribution assumptions (e.g., Laplacian or Gaussian) hinders their ability to adapt to diverse target characteristics such as varying sizes and motion patterns, ultimately affecting tracking precision and robustness. To address these issues, we propose BEVTrack, a simple yet effective motion-based tracking method. BEVTrack directly estimates object motion in Bird's-Eye View (BEV) using a single regression loss. To enhance accuracy for targets with diverse attributes, it learns adaptive likelihood functions tailored to individual targets, avoiding the limitations of fixed distribution assumptions in previous methods. This approach provides valuable priors for tracking and significantly boosts performance. Comprehensive experiments on KITTI, NuScenes, and Waymo Open Dataset demonstrate that BEVTrack achieves state-of-the-art results while operating at 200 FPS, enabling real-time applicability. The code will be released at https://github.com/xmm-prio/BEVTrack.
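The adaptive-likelihood idea can be sketched as a regression loss in which the network predicts a per-target scale alongside the offset. A Laplace form with a learned scale is used here purely for illustration; the paper's point is precisely that the likelihood is learned per target rather than fixed.

```python
import math

def adaptive_nll(pred_offset, pred_log_scale, target):
    """Laplace negative log-likelihood with a learned per-target scale
    b = exp(pred_log_scale): loss = |t - mu| / b + log(2b).
    Predicting b lets easy targets use a sharp likelihood and
    hard targets a broad one (illustrative stand-in)."""
    b = math.exp(pred_log_scale)
    return abs(target - pred_offset) / b + math.log(2.0 * b)

sharp = adaptive_nll(1.0, math.log(0.1), 1.0)   # confident and accurate
broad = adaptive_nll(1.0, math.log(1.0), 3.0)   # uncertain and off-target
print(sharp, broad)
```

An accurate prediction is rewarded for claiming a small scale, while an inaccurate one is penalized less if it admits uncertainty, so the scale adapts to target difficulty.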

Updated: 2025-10-09 01:31:44

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2309.02185v8

Incremental Hybrid Ensemble with Graph Attention and Frequency-Domain Features for Stable Long-Term Credit Risk Modeling

Predicting long-term loan defaults is hard because borrower behavior often changes and data distributions shift over time. This paper presents HYDRA-EI, a hybrid ensemble incremental learning framework. It uses several stages of feature processing and combines multiple models. The framework builds relational, cross, and frequency-based features. It uses graph attention, automatic cross-feature creation, and transformations from the frequency domain. HYDRA-EI updates weekly using new data and adjusts the model weights with a simple performance-based method. It works without frequent manual changes or fixed retraining. HYDRA-EI improves model stability and generalization, which makes it useful for long-term credit risk tasks.
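The "simple performance-based method" for adjusting model weights can be sketched as normalizing each ensemble member's recent validation score. The scoring window and the normalization below are assumptions, since the abstract does not specify them.

```python
def update_weights(recent_scores, floor=1e-6):
    """Weight each ensemble member by its recent performance,
    normalized to sum to 1 (hypothetical concrete form)."""
    clipped = [max(s, floor) for s in recent_scores]   # avoid zero weights
    total = sum(clipped)
    return [s / total for s in clipped]

# e.g., last week's validation AUC for three ensemble members
weights = update_weights([0.80, 0.60, 0.40])
print(weights)
```

Recomputing these weights on each weekly refresh lets the ensemble track distribution shift without manual retuning, which is the stability property the abstract claims.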

Updated: 2025-10-09 01:31:04

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07663v1

IKNet: Interpretable Stock Price Prediction via Keyword-Guided Integration of News and Technical Indicators

The increasing influence of unstructured external information, such as news articles, on stock prices has attracted growing attention in financial markets. Despite recent advances, most existing news-based forecasting models represent all articles using sentiment scores or average embeddings that capture the general tone but fail to provide quantitative, context-aware explanations of the impact of public sentiment on predictions. To address this limitation, we propose the interpretable keyword-guided network (IKNet), an explainable forecasting framework that models the semantic association between individual news keywords and stock price movements. IKNet identifies salient keywords via FinBERT-based contextual analysis, processes each embedding through a separate nonlinear projection layer, and integrates their representations with the time-series data of technical indicators to forecast next-day closing prices. By applying Shapley Additive Explanations, the model generates quantifiable and interpretable attributions for the contribution of each keyword to predictions. Empirical evaluations of S&P 500 data from 2015 to 2024 demonstrate that IKNet outperforms baselines, including recurrent neural networks and transformer models, reducing RMSE by up to 32.9% and improving cumulative returns by 18.5%. Moreover, IKNet enhances transparency by offering contextualized explanations of volatility events driven by public sentiment.

Updated: 2025-10-09 01:30:30

Categories: cs.CE,cs.AI

Download: http://arxiv.org/abs/2510.07661v1

A Survey of Foundation Models for IoT: Taxonomy and Criteria-Based Analysis

Foundation models have gained growing interest in the IoT domain due to their reduced reliance on labeled data and strong generalizability across tasks, which address key limitations of traditional machine learning approaches. However, most existing foundation-model-based methods are developed for specific IoT tasks, making it difficult to compare approaches across IoT domains and limiting guidance for applying them to new tasks. This survey aims to bridge this gap by providing a comprehensive overview of current methodologies and organizing them around four performance objectives shared across domains: efficiency, context-awareness, safety, and security & privacy. For each objective, we review representative works and summarize commonly used techniques and evaluation metrics. This objective-centric organization enables meaningful cross-domain comparisons and offers practical insights for selecting and designing foundation-model-based solutions for new IoT tasks. We conclude with key directions for future research to guide both practitioners and researchers in advancing the use of foundation models in IoT applications.

Updated: 2025-10-09 01:28:19

Categories: cs.LG,cs.AI,cs.SY,eess.SY

Download: http://arxiv.org/abs/2506.12263v3

FairSHAP: Preprocessing for Fairness Through Attribution-Based Data Augmentation

Ensuring fairness in machine learning models is critical, particularly in high-stakes domains where biased decisions can lead to serious societal consequences. Existing preprocessing approaches generally lack transparent mechanisms for identifying which features or instances are responsible for unfairness. This obscures the rationale behind data modifications. We introduce FairSHAP, a novel pre-processing framework that leverages Shapley value attribution to improve both individual and group fairness. FairSHAP identifies fairness-critical instances in the training data using an interpretable measure of feature importance, and systematically modifies them through instance-level matching across sensitive groups. This process reduces discriminative risk - an individual fairness metric - while preserving data integrity and model accuracy. We demonstrate that FairSHAP significantly improves demographic parity and equality of opportunity across diverse tabular datasets, achieving fairness gains with minimal data perturbation and, in some cases, improved predictive performance. As a model-agnostic and transparent method, FairSHAP integrates seamlessly into existing machine learning pipelines and provides actionable insights into the sources of bias. Our code is available at https://github.com/youlei202/FairSHAP.
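The Shapley attribution underlying FairSHAP can be illustrated with an exact computation over all feature orderings, feasible for a handful of features. The toy score function and feature names are hypothetical, and the subsequent instance-matching step is omitted here.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley attribution: average each feature's marginal
    contribution over all orderings (tractable only for few features)."""
    phi = {f: 0.0 for f in features}
    perms = list(permutations(features))
    for order in perms:
        coalition = set()
        for f in order:
            before = value(coalition)
            coalition.add(f)
            phi[f] += value(coalition) - before
    return {f: phi[f] / len(perms) for f in features}

# Hypothetical additive score: a large attribution on a sensitive feature
# ("gender") would flag the instance as fairness-critical.
contrib = {"gender": 0.5, "income": 0.3, "age": 0.0}
vals = shapley_values(list(contrib), lambda s: sum(contrib[f] for f in s))
print(vals)
```

For an additive score like this toy one, each Shapley value recovers the feature's own contribution exactly; real models require sampling-based approximations.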

Updated: 2025-10-09 01:12:31

Categories: cs.LG,cs.AI,cs.CY

Download: http://arxiv.org/abs/2505.11111v2

MAHL: Multi-Agent LLM-Guided Hierarchical Chiplet Design with Adaptive Debugging

As program workloads (e.g., AI) increase in size and algorithmic complexity, the primary challenge lies in their high dimensionality, encompassing computing cores, array sizes, and memory hierarchies. To overcome these obstacles, innovative approaches are required. Agile chip design has already benefited from machine learning integration at various stages, including logic synthesis, placement, and routing. With Large Language Models (LLMs) recently demonstrating impressive proficiency in Hardware Description Language (HDL) generation, it is promising to extend their abilities to 2.5D integration, an advanced technique that saves area overhead and development costs. However, LLM-driven chiplet design faces challenges such as flattened (non-hierarchical) designs, high validation costs, and imprecise parameter optimization, which limit its chiplet design capability. To address this, we propose MAHL, a hierarchical LLM-based chiplet design generation framework featuring six agents that collaboratively enable AI algorithm-hardware mapping, including hierarchical description generation, retrieval-augmented code generation, diverse-flow-based validation, and multi-granularity design space exploration. Together, these components enable the efficient generation of chiplet designs with optimized Power, Performance and Area (PPA). Experiments show that MAHL not only significantly improves the generation accuracy of simple RTL designs, but also increases the generation accuracy of real-world chiplet designs, evaluated by Pass@5, from 0 to 0.72 compared to conventional LLMs in the best-case scenario. Compared to the state-of-the-art, expert-based CLARIE, MAHL achieves comparable or even superior PPA results under certain optimization objectives.

Updated: 2025-10-09 01:12:14

Categories: cs.AR,cs.AI,cs.MA

Download: http://arxiv.org/abs/2508.14053v2

HiVeGen -- Hierarchical LLM-based Verilog Generation for Scalable Chip Design

With Large Language Models (LLMs) recently demonstrating impressive proficiency in code generation, it is promising to extend their abilities to Hardware Description Language (HDL). However, LLMs tend to generate single HDL code blocks rather than hierarchical structures for hardware designs, leading to hallucinations, particularly in complex designs like Domain-Specific Accelerators (DSAs). To address this, we propose HiVeGen, a hierarchical LLM-based Verilog generation framework that decomposes generation tasks into LLM-manageable hierarchical submodules. HiVeGen further harnesses the advantages of such hierarchical structures by integrating automatic Design Space Exploration (DSE) into hierarchy-aware prompt generation, introducing weight-based retrieval to enhance code reuse, and enabling real-time human-computer interaction to lower error-correction cost, significantly improving the quality of generated designs.

Updated: 2025-10-09 01:09:16

Categories: cs.LG,cs.AI,cs.AR

Download: http://arxiv.org/abs/2412.05393v2

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Large language models (LLMs) with extended context windows enable powerful downstream applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy.
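
The contrast between heuristic and output-aware scoring can be sketched as follows. This is a simplified illustration, not the paper's closed-form OBD-derived scores: it weights each cached token's attention mass by the norm of its value vector, approximating its contribution to the attention outputs.

```python
import numpy as np

def heuristic_scores(attn):
    # accumulated attention weight per key token (rows: queries, cols: keys),
    # the kind of score used by prior eviction heuristics
    return attn.sum(axis=0)

def output_aware_value_scores(attn, values):
    # evicting token j perturbs each query's output by roughly
    # attn[q, j] * ||values[j]||, so weight attention mass by value norms
    return attn.sum(axis=0) * np.linalg.norm(values, axis=1)
```

A token with large attention weight but a near-zero value vector contributes little to the output, so the two rankings can disagree, which is exactly the gap output-aware scoring closes.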

Updated: 2025-10-09 00:58:28

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07651v1

Value Flows

While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates. Website: https://pd-perry.github.io/value-flows Code: https://github.com/chongyi-zheng/value-flows

Updated: 2025-10-09 00:57:40

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.07650v1

Efficient Multi Subject Visual Reconstruction from fMRI Using Aligned Representations

This work introduces a novel approach to fMRI-based visual image reconstruction using a subject-agnostic common representation space. We show that the brain signals of the subjects can be aligned in this common space during training to form a semantically aligned common brain. This is leveraged to demonstrate that aligning subject-specific lightweight modules to a reference subject is significantly more efficient than traditional end-to-end training methods. Our approach excels in low-data scenarios. We evaluate our methods on different datasets, demonstrating that the common space is subject and dataset-agnostic.
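
One way such subject-specific alignment modules are commonly realized is a linear map fit by ridge regression into a reference space; the following is a hedged sketch of that idea (the paper's lightweight modules may differ), with `X_subj` and `X_ref` as hypothetical response matrices.

```python
import numpy as np

def ridge_align(X_subj, X_ref, lam=1e-3):
    # fit W minimizing ||X_subj @ W - X_ref||^2 + lam * ||W||^2,
    # mapping one subject's responses into the shared reference space
    d = X_subj.shape[1]
    A = X_subj.T @ X_subj + lam * np.eye(d)
    return np.linalg.solve(A, X_subj.T @ X_ref)
```

Because only this small map is trained per subject, adding a new subject is far cheaper than end-to-end retraining.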

Updated: 2025-10-09 00:49:56

Categories: eess.IV,cs.CV,cs.LG

Download: http://arxiv.org/abs/2505.01670v2

High-Fidelity Synthetic ECG Generation via Mel-Spectrogram Informed Diffusion Training

The development of machine learning for cardiac care is severely hampered by privacy restrictions on sharing real patient electrocardiogram (ECG) data. Although generative AI offers a promising solution, the real-world use of existing model-synthesized ECGs is limited by persistent gaps in trustworthiness and clinical utility. In this work, we address two major shortcomings of current generative ECG methods: insufficient morphological fidelity and the inability to generate personalized, patient-specific physiological signals. To address these gaps, we build on a conditional diffusion-based Structured State Space Model (SSSD-ECG) with two principled innovations: (1) MIDT-ECG (Mel-Spectrogram Informed Diffusion Training), a novel training paradigm with time-frequency domain supervision to enforce physiological structural realism, and (2) multi-modal demographic conditioning to enable patient-specific synthesis. We comprehensively evaluate our approach on the PTB-XL dataset, assessing the synthesized ECG signals on fidelity, clinical coherence, privacy preservation, and downstream task utility. MIDT-ECG achieves substantial gains: it improves morphological coherence, preserves strong privacy guarantees with all metrics evaluated exceeding the baseline by 4-8%, and notably reduces the interlead correlation error by an average of 74%, while demographic conditioning enhances signal-to-noise ratio and personalization. In critical low-data regimes, a classifier trained on datasets supplemented with our synthetic ECGs achieves performance comparable to a classifier trained solely on real data. Together, we demonstrate that ECG synthesizers, trained with the proposed time-frequency structural regularization scheme, can serve as personalized, high-fidelity, privacy-preserving surrogates when real data are scarce, advancing the responsible use of generative AI in healthcare.
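
The shape of such time-frequency supervision can be sketched as an auxiliary spectrogram-matching term. This is a simplified stand-in (a plain magnitude STFT instead of a full mel filterbank) and not the paper's exact training objective.

```python
import numpy as np

def stft_mag(sig, n_fft=64, hop=16):
    # magnitude spectrogram via a Hann-windowed short-time FFT
    frames = [sig[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(sig) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def time_frequency_loss(x, x_hat):
    # auxiliary supervision comparing spectrograms of the real and
    # synthesized signals; a mel filterbank would be applied here in full
    return float(np.mean(np.abs(stft_mag(x) - stft_mag(x_hat))))
```

Added to the diffusion training loss, a term of this form pushes synthesized ECGs to match the frequency content of real beats, not just their sample-wise shape.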

Updated: 2025-10-09 00:47:14

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2510.05492v2

A Honest Cross-Validation Estimator for Prediction Performance

Cross-validation is a standard tool for obtaining an honest assessment of the performance of a prediction model. The commonly used version repeatedly splits data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the model performance across different data splits. A well-known criticism is that such a cross-validation procedure does not directly estimate the performance of the particular model recommended for future use. In this paper, we propose a new method to estimate the performance of a model trained on a specific (random) training set. A naive estimator can be obtained by applying the model to a disjoint testing set. Surprisingly, cross-validation estimators computed from other random splits can be used to improve this naive estimator within a random-effects model framework. We develop two estimators -- a hierarchical Bayesian estimator and an empirical Bayes estimator -- that perform similarly to or better than both the conventional cross-validation estimator and the naive single-split estimator. Simulations and a real-data example demonstrate the superior performance of the proposed method.
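
The random-effects intuition can be sketched with a toy shrinkage rule: treat each split's estimate as that split's true performance plus noise, and shrink the single-split estimate toward the cross-validation mean. This is an illustrative moment-based sketch, not the paper's hierarchical Bayesian or empirical Bayes estimator, and it assumes the split-noise variance `noise_var` is known.

```python
import numpy as np

def shrink_to_cv(naive, cv_estimates, noise_var):
    # random-effects view: split estimate = true split performance + noise.
    # Between-split spread minus noise estimates tau^2, the variance of the
    # true per-split performances; shrink by tau^2 / (tau^2 + noise_var).
    cv_mean = float(np.mean(cv_estimates))
    tau2 = max(float(np.var(cv_estimates, ddof=1)) - noise_var, 0.0)
    lam = tau2 / (tau2 + noise_var) if (tau2 + noise_var) > 0 else 0.0
    return lam * naive + (1.0 - lam) * cv_mean
```

With noiseless splits the rule trusts the single-split estimate; with very noisy splits it falls back to the conventional CV average, interpolating between the two extremes the paper contrasts.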

Updated: 2025-10-09 00:45:03

Categories: stat.ML,cs.LG,stat.AP,stat.ME

Download: http://arxiv.org/abs/2510.07649v1

Continual Learning for Adaptive AI Systems

Continual learning, the ability of a neural network to learn multiple sequential tasks without losing previously acquired knowledge, remains a significant obstacle to developing truly adaptive artificial intelligence. Deep learning models have achieved remarkable results in various applications, but overfitting remains a common issue. Regularization techniques can help prevent overfitting by adding constraints to the model's parameters. To prevent catastrophic forgetting, in this paper we introduce a novel regularization technique based on inter-cluster separation (ICS) in the loss function, which penalizes the model for producing outputs that are far away from the centroids of the clusters formed by the data from previous tasks. We also performed hyperparameter tuning to find the optimal weighting of the proposed regularization term. This ensures clearer separation between tasks in the neural network's internal representation, reducing overlap and mitigating forgetting. Using the standard 5-task Split CIFAR-10 benchmark and a ResNet-18 architecture, we demonstrate ICS's effectiveness in maintaining strong performance on initial tasks. However, our results also highlight limitations in long-term knowledge retention, particularly when the number of tasks increases. This underscores the complexity and trade-offs inherent in continual learning and points toward avenues for further research.
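
A regularizer of the described shape can be sketched as follows; this is one plausible reading of the ICS term (nearest-centroid squared distance, added to the task loss with a tuned weight), not the paper's exact formulation.

```python
import numpy as np

def ics_penalty(outputs, prev_centroids):
    # mean squared distance from each output to the nearest centroid of the
    # clusters formed by previous-task data
    d2 = ((outputs[:, None, :] - prev_centroids[None, :, :]) ** 2).sum(-1)
    return float(d2.min(axis=1).mean())

def total_loss(task_loss, outputs, prev_centroids, weight):
    # the regularizer joins the task loss with a hyperparameter-tuned weight
    return task_loss + weight * ics_penalty(outputs, prev_centroids)
```

The `weight` here plays the role of the tuned weighting mentioned in the abstract: too small and forgetting returns, too large and new-task learning stalls.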

Updated: 2025-10-09 00:44:32

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07648v1

Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

A class of generative models that unifies flow-based and diffusion-based methods is introduced. These models extend the framework proposed in Albergo and Vanden-Eijnden (2023), enabling the use of a broad class of continuous-time stochastic processes called stochastic interpolants to bridge any two probability density functions exactly in finite time. These interpolants are built by combining data from the two prescribed densities with an additional latent variable that shapes the bridge in a flexible way. The time-dependent density function of the interpolant is shown to satisfy a transport equation as well as a family of forward and backward Fokker-Planck equations with tunable diffusion coefficient. Upon consideration of the time evolution of an individual sample, this viewpoint leads to both deterministic and stochastic generative models based on probability flow equations or stochastic differential equations with an adjustable level of noise. The drift coefficients entering these models are time-dependent velocity fields characterized as the unique minimizers of simple quadratic objective functions, one of which is a new objective for the score. We show that minimization of these quadratic objectives leads to control of the likelihood for generative models built upon stochastic dynamics, while likelihood control for deterministic dynamics is more stringent. We also construct estimators for the likelihood and the cross entropy of interpolant-based generative models, and we discuss connections with other methods such as score-based diffusion models, stochastic localization, probabilistic denoising, and rectifying flows. In addition, we demonstrate that stochastic interpolants recover the Schr\"odinger bridge between the two target densities when explicitly optimizing over the interpolant. Finally, algorithmic aspects are discussed and the approach is illustrated on numerical examples.
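
One common member of the interpolant family, and the time derivative that serves as the velocity-field regression target, can be written down directly. This sketch uses the linear bridge with a Brownian-bridge-style latent term, one simple choice among the broad class the paper allows.

```python
import numpy as np

def interpolant(x0, x1, z, t, gamma=1.0):
    # stochastic interpolant bridging x0 ~ rho0 and x1 ~ rho1 exactly at
    # t = 0 and t = 1, with latent z ~ N(0, I) shaping the bridge
    return (1 - t) * x0 + t * x1 + gamma * np.sqrt(t * (1 - t)) * z

def velocity(x0, x1, z, t, gamma=1.0):
    # d/dt of the interpolant (defined for 0 < t < 1): the quantity the
    # drift field b(t, x) is trained to match in quadratic objectives
    return x1 - x0 + gamma * (1 - 2 * t) / (2 * np.sqrt(t * (1 - t))) * z
```

Minimizing the expected squared error between a network b(t, x_t) and this velocity over sampled (x0, x1, z, t) yields the deterministic probability-flow generator; the tunable diffusion coefficient then converts it into the stochastic variants.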

Updated: 2025-10-09 00:43:44

Categories: cs.LG,cond-mat.dis-nn,math.PR

Download: http://arxiv.org/abs/2303.08797v4

Design-Based Bandits Under Network Interference: Trade-Off Between Regret and Statistical Inference

In multi-armed bandits with network interference (MABNI), the action taken by one node can influence the rewards of others, creating complex interdependence. While existing research on MABNI largely concentrates on minimizing regret, it often overlooks the crucial concern that an excessive emphasis on the optimal arm can undermine the inference accuracy for sub-optimal arms. Although initial efforts have been made to address this trade-off in single-unit scenarios, these challenges have become more pronounced in the context of MABNI. In this paper, we establish, for the first time, a theoretical Pareto frontier characterizing the trade-off between regret minimization and inference accuracy in adversarial (design-based) MABNI. We further introduce an anytime-valid asymptotic confidence sequence along with a corresponding algorithm, $\texttt{EXP3-N-CS}$, specifically designed to balance the trade-off between regret minimization and inference accuracy in this setting.
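
For context, the base algorithm the proposed $\texttt{EXP3-N-CS}$ builds on is vanilla EXP3; the sketch below shows only that adversarial-bandit core, without the network-interference handling or the anytime-valid confidence sequence the paper adds.

```python
import numpy as np

def exp3(reward_fn, K, T, eta=0.1, seed=0):
    # vanilla EXP3: sample an arm from the exponential-weights distribution,
    # observe only that arm's reward, apply an importance-weighted update
    rng = np.random.default_rng(seed)
    logw = np.zeros(K)
    pulls = np.zeros(K, dtype=int)
    for _ in range(T):
        p = np.exp(logw - logw.max())  # stabilized softmax over log-weights
        p /= p.sum()
        a = rng.choice(K, p=p)
        pulls[a] += 1
        logw[a] += eta * reward_fn(a) / p[a]  # importance weighting
    return pulls
```

The regret-inference tension is visible even here: the more the play distribution concentrates on the best arm, the fewer samples the sub-optimal arms receive for inference.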

Updated: 2025-10-09 00:38:10

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07646v1

Intention-Conditioned Flow Occupancy Models

Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across time is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). Comparing with alternative methods for pre-training, our experiments on $36$ state-based and $4$ image-based benchmark tasks demonstrate that the proposed method achieves $1.8 \times$ median improvement in returns and increases success rates by $36\%$. Website: https://chongyi-zheng.github.io/infom Code: https://github.com/chongyi-zheng/infom

Updated: 2025-10-09 00:36:50

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2506.08902v2

Banking Done Right: Redefining Retail Banking with Language-Centric AI

This paper presents Ryt AI, an LLM-native agentic framework that powers Ryt Bank to enable customers to execute core financial transactions through natural language conversation. This represents the first regulator-approved deployment worldwide in which conversational AI functions as the primary banking interface, in contrast to prior assistants that have been limited to advisory or support roles. Built entirely in-house, Ryt AI is powered by ILMU, a closed-source LLM developed internally, and replaces rigid multi-screen workflows with a single dialogue orchestrated by four LLM-powered agents (Guardrails, Intent, Payment, and FAQ). Each agent attaches a task-specific LoRA adapter to ILMU, which is hosted within the bank's infrastructure to ensure consistent behavior with minimal overhead. Deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture provide defense-in-depth for security and compliance. The result is Banking Done Right: demonstrating that regulator-approved natural-language interfaces can reliably support core financial operations under strict governance.

Updated: 2025-10-09 00:35:08

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2510.07645v1

Property Classification of Vacation Rental Properties during Covid-19

This study advocates for employing clustering techniques to classify vacation rental properties active during the Covid pandemic to identify inherent patterns and behaviours. The dataset, a collaboration between the ESRC-funded Consumer Data Research Centre (CDRC) and AirDNA, encompasses data for over a million properties and hosts. Utilising K-means and K-medoids clustering techniques, we identify homogeneous groups and their common characteristics. Our findings enhance comprehension of the intricacies of vacation rental evaluations and could potentially be utilised in the creation of targeted, cluster-specific policies.
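
The K-means step can be sketched from scratch (in practice a library such as scikit-learn would be used, and K-medoids differs by restricting centers to data points). The feature matrix `X` below is a hypothetical stand-in for per-property features.

```python
import numpy as np

def kmeans(X, k, iters=50):
    # farthest-point initialization, then Lloyd's iterations:
    # assign each point to its nearest center, recompute centers as means
    centers = [X[0].astype(float)]
    for _ in range(1, k):
        d2 = np.min([np.square(X - c).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()].astype(float))
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.square(X[:, None, :] - centers[None, :, :]).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Each resulting cluster's centroid summarizes the common characteristics of its group of properties, which is what cluster-specific policies would be built on.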

Updated: 2025-10-09 00:24:59

Categories: cs.LG

Download: http://arxiv.org/abs/2510.07639v1

Controllable Hybrid Captioner for Improved Long-form Video Understanding

Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To this end, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log composed solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with a LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using the LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, a controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signal scene changes detected in the video.

Updated: 2025-10-09 00:12:07

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2507.17047v3

Safely Exploring Novel Actions in Recommender Systems via Deployment-Efficient Policy Learning

In many real recommender systems, novel items are added frequently over time. The importance of sufficiently presenting novel actions has widely been acknowledged for improving long-term user engagement. A recent line of work builds on Off-Policy Learning (OPL), which trains a policy from logged data alone; however, the existing methods can be unsafe in the presence of novel actions. Our goal is to develop a framework to enforce exploration of novel actions with a safety guarantee. To this end, we first develop Safe Off-Policy Policy Gradient (Safe OPG), which is a model-free safe OPL method based on a high-confidence off-policy evaluation. In our first experiment, we observe that Safe OPG almost always satisfies a safety requirement, even when existing methods violate it greatly. However, the result also reveals that Safe OPG tends to be too conservative, suggesting a difficult tradeoff between guaranteeing safety and exploring novel actions. To overcome this tradeoff, we also propose a novel framework called Deployment-Efficient Policy Learning for Safe User Exploration, which leverages a safety margin and gradually relaxes safety regularization over multiple (but not many) deployments. Our framework thus enables exploration of novel actions while guaranteeing safe implementation of recommender systems.
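
A standard ingredient of high-confidence off-policy evaluation can be sketched as a clipped inverse-propensity estimate with a Hoeffding-style lower bound; this is a generic illustration of the technique, not the paper's exact estimator, and the clipping cap `w_max` is an assumed parameter.

```python
import numpy as np

def hcope_lower_bound(rewards, pi_new, pi_log, w_max=10.0, delta=0.05):
    # clipped importance weights keep each term in [0, w_max] for rewards
    # in [0, 1], so Hoeffding gives a value lower bound holding with
    # probability at least 1 - delta (up to clipping bias)
    w = np.minimum(np.asarray(pi_new) / np.asarray(pi_log), w_max)
    vals = w * np.asarray(rewards)
    n = len(vals)
    return float(vals.mean() - w_max * np.sqrt(np.log(1.0 / delta) / (2 * n)))
```

A safety requirement of the kind Safe OPG enforces would then demand that this lower bound exceed a baseline policy's value before the new policy is deployed.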

Updated: 2025-10-09 00:10:07

Categories: cs.AI

Download: http://arxiv.org/abs/2510.07635v1

SAGE: Streaming Agreement-Driven Gradient Sketches for Representative Subset Selection

Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N \times N$ pairwise similarities and explicit $N \times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.
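
The two core pieces, a Frequent Directions sketch of the gradient matrix and an agreement score against its principal direction, can be sketched as follows. This is a minimal illustration of the mechanism, not SAGE's full two-pass pipeline, and it takes the per-example gradient matrix `G` as given.

```python
import numpy as np

def frequent_directions(G, ell):
    # streaming sketch B (ell x D) with B.T @ B approximating G.T @ G,
    # using O(ell * D) memory regardless of the number of rows of G
    B = np.zeros((ell, G.shape[1]))
    for g in G:
        zero = np.flatnonzero(~B.any(axis=1))
        if zero.size == 0:
            # shrink: subtract the smallest squared singular value,
            # which zeroes out at least the last row of the sketch
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            B = np.sqrt(np.maximum(s**2 - s[-1]**2, 0.0))[:, None] * Vt
            zero = np.flatnonzero(~B.any(axis=1))
        B[zero[0]] = g
    return B

def agreement_scores(G, B):
    # score each example by the alignment of its gradient with the sketch's
    # principal ("consensus") direction
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    return G @ Vt[0]
```

Examples with the largest agreement scores are the ones whose gradients reinforce the dominant descent direction, which is the basis for keeping them in the selected subset.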

Updated: 2025-10-09 00:04:51

Categories: cs.LG

Download: http://arxiv.org/abs/2510.02470v2

Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
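
The idea of a group-level matching score can be illustrated on a Winoground-style group, where n images and n captions share a similarity matrix and the correct pairs sit on the diagonal. This is a hedged sketch of one such score (best one-to-one assignment by brute force), not necessarily the paper's exact definition.

```python
import numpy as np
from itertools import permutations

def group_matched(S):
    # S[i, j]: similarity between image i and caption j. The group counts
    # as solved when the highest-scoring one-to-one assignment recovers
    # the identity permutation (brute force; fine for small groups).
    n = S.shape[0]
    best = max(permutations(range(n)),
               key=lambda p: sum(S[i, p[i]] for i in range(n)))
    return best == tuple(range(n))
```

The matrix in the test below shows why this can reveal hidden capability: image 0's single best caption is the wrong one, so per-item argmax metrics fail, yet the globally best matching still recovers both correct pairs.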

Updated: 2025-10-09 00:00:49

Categories: cs.AI,cs.CL,cs.CV,cs.LG

Download: http://arxiv.org/abs/2510.07632v1

By Xinhai (Sean) Zou.