A Survey of State Representation Learning for Deep Reinforcement Learning
Representation learning methods are an important tool for addressing the challenges posed by complex observation spaces in sequential decision-making problems. Recently, a wide variety of approaches have been used to learn meaningful state representations in reinforcement learning, enabling better sample efficiency, generalization, and performance. This survey provides a broad categorization of these methods within the model-free online setting, exploring how they tackle the learning of state representations differently. We categorize the methods into six main classes, detailing their mechanisms, benefits, and limitations. Through this taxonomy, we aim to enhance the understanding of the field and provide a guide for new researchers. We also discuss techniques for assessing the quality of representations and detail relevant future directions.
Updated: 2025-06-20 23:47:04
Categories: cs.LG,cs.AI,stat.ML
Kaleidoscopic Teaming in Multi-Agent Simulations
Warning: This paper contains content that may be inappropriate or offensive. AI agents have gained significant recent attention due to their autonomous tool usage capabilities and their integration in various real-world applications. This autonomy poses novel challenges for the safety of such systems, in both single- and multi-agent scenarios. We argue that existing red teaming or safety evaluation frameworks fall short in evaluating safety risks in the complex behaviors, thought processes, and actions taken by agents. Moreover, they fail to consider risks in multi-agent setups, where various vulnerabilities can be exposed when agents engage in complex behaviors and interactions with each other. To address this shortcoming, we introduce the term kaleidoscopic teaming, which seeks to capture the complex and wide range of vulnerabilities that can arise in agents in both single-agent and multi-agent scenarios. We also present a new kaleidoscopic teaming framework that generates a diverse array of scenarios modeling real-world human societies. Our framework evaluates the safety of agents in both single-agent and multi-agent setups. In the single-agent setup, an agent is given a scenario that it needs to complete using the tools it has access to. In the multi-agent setup, multiple agents either compete or cooperate to complete a task in the scenario, through which we capture existing safety vulnerabilities in agents. We introduce new in-context optimization techniques that can be used in our kaleidoscopic teaming framework to generate better scenarios for safety analysis. Lastly, we present appropriate metrics that can be used along with our framework to measure the safety of agents. Utilizing our kaleidoscopic teaming framework, we identify vulnerabilities in various models with respect to their safety in agentic use-cases.
Updated: 2025-06-20 23:37:17
Categories: cs.AI
Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
Reinforcement Learning from Verifiable Rewards (RLVR) has been widely adopted as the de facto method for enhancing the reasoning capabilities of large language models and has demonstrated notable success in verifiable domains like math and competitive programming tasks. However, the efficacy of RLVR diminishes significantly when applied to agentic environments. These settings, characterized by multi-step, complex problem solving, lead to high failure rates even for frontier LLMs, as the reward landscape is too sparse for effective model training via conventional RLVR. In this work, we introduce Agent-RLVR, a framework that makes RLVR effective in challenging agentic settings, with an initial focus on software engineering tasks. Inspired by human pedagogy, Agent-RLVR introduces agent guidance, a mechanism that actively steers the agent towards successful trajectories by leveraging diverse informational cues. These cues, ranging from high-level strategic plans to dynamic feedback on the agent's errors and environmental interactions, emulate a teacher's guidance, enabling the agent to navigate difficult solution spaces and promoting active self-improvement via additional environment exploration. In the Agent-RLVR training loop, agents first attempt to solve tasks to produce initial trajectories, which are then validated by unit tests and supplemented with agent guidance. Agents then reattempt with guidance, and the agent policy is updated with RLVR based on the rewards of these guided trajectories. Agent-RLVR elevates the pass@1 performance of Qwen-2.5-72B-Instruct from 9.4% to 22.4% on SWE-Bench Verified. We find that our guidance-augmented RLVR data is additionally useful for test-time reward model training, shown by further boosting pass@1 to 27.8%. Agent-RLVR lays the groundwork for training agents with RLVR in complex, real-world environments where conventional RL methods struggle.
Updated: 2025-06-20 23:32:06
Categories: cs.CL,cs.AI
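The attempt/verify/guide/reattempt loop described in the abstract can be sketched roughly as follows. Note that `run_agent`, `unit_tests_pass`, and `make_guidance` are hypothetical stand-ins for the paper's agent, test harness, and guidance generator, not its actual implementation:

```python
# Hypothetical sketch of the Agent-RLVR training loop: attempt, verify with
# unit tests, add guidance on failure, reattempt, and collect rewards for a
# subsequent RLVR policy update. All helpers below are illustrative stubs.

def run_agent(task, guidance=None):
    # Stand-in policy: in this toy stub, the agent succeeds only when
    # guidance is provided.
    return {"task": task, "solved": guidance is not None}

def unit_tests_pass(trajectory):
    # Stand-in for validating a trajectory against the task's unit tests.
    return trajectory["solved"]

def make_guidance(task, failed_trajectory):
    # Stand-in for a guidance cue, e.g. a high-level plan or error feedback.
    return f"plan for {task}"

def agent_rlvr_step(tasks):
    """One iteration: returns (trajectory, reward) pairs for the RLVR update."""
    batch = []
    for task in tasks:
        traj = run_agent(task)
        if unit_tests_pass(traj):
            batch.append((traj, 1.0))
            continue
        guidance = make_guidance(task, traj)
        retry = run_agent(task, guidance=guidance)
        batch.append((retry, 1.0 if unit_tests_pass(retry) else 0.0))
    return batch

batch = agent_rlvr_step(["fix-issue-101", "fix-issue-102"])
rewards = [r for _, r in batch]
```

In this toy run, both tasks fail their first attempt, receive guidance, and succeed on the retry, so both guided trajectories enter the batch with reward 1.0.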
Validating Mechanistic Interpretations: An Axiomatic Approach
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem.
Updated: 2025-06-20 23:29:25
Categories: cs.LG
Semantic-Aware Parsing for Security Logs
Security analysts struggle to quickly and efficiently query and correlate log data due to the heterogeneity and lack of structure in real-world logs. Existing AI-based parsers focus on learning syntactic log templates but lack the semantic interpretation needed for querying. Directly querying large language models on raw logs is impractical at scale and vulnerable to prompt injection attacks. In this paper, we introduce Matryoshka, the first end-to-end system leveraging LLMs to automatically generate semantically-aware structured log parsers. Matryoshka combines a novel syntactic parser (employing precise regular expressions rather than wildcards) with a completely new semantic parsing layer that clusters variables and maps them into a queryable, contextually meaningful schema. This approach provides analysts with queryable and semantically rich data representations, facilitating rapid and precise log querying without the traditional burden of manual parser construction. Additionally, Matryoshka can map the newly created fields to recognized attributes within the Open Cybersecurity Schema Framework (OCSF), enabling interoperability. We evaluate Matryoshka on a newly curated real-world log benchmark, introducing novel metrics to assess how consistently fields are named and mapped across logs. Matryoshka's syntactic parser outperforms prior work, and the semantic layer achieves an F1 score of 0.95 on realistic security queries. Although mapping fields to the extensive OCSF taxonomy remains challenging, Matryoshka significantly reduces manual effort by automatically extracting and organizing valuable fields, moving us closer to fully automated, AI-driven log analytics.
Updated: 2025-06-20 23:24:09
Categories: cs.CR
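As a rough illustration of the two layers described above (a precise-regex syntactic parser followed by a semantic mapping onto a queryable schema), consider this sketch. The log template, field names, and OCSF-style attribute names are invented for the example and are not Matryoshka's actual templates:

```python
import re

# Syntactic layer: one precise-regex template (rather than wildcards) for a
# hypothetical sshd failed-login line.
SYNTACTIC = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"sshd\[(?P<pid>\d+)\]: Failed password for (?P<user>\w+) "
    r"from (?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
)

# Semantic layer: map template variables to contextually meaningful names,
# loosely following OCSF-style attribute naming (illustrative, not official).
SEMANTIC = {"ts": "time", "pid": "process.pid", "user": "actor.user.name",
            "ip": "src_endpoint.ip"}

def parse(line):
    """Return a queryable record for a matching line, else None."""
    m = SYNTACTIC.match(line)
    if m is None:
        return None
    return {SEMANTIC[k]: v for k, v in m.groupdict().items()}

record = parse("2025-06-20 23:24:09 sshd[4242]: Failed password for root from 10.0.0.5")
```

An analyst can then query the structured record (e.g. filter on `actor.user.name`) instead of grepping raw text.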
IQFM: A Wireless Foundational Model for I/Q Streams in AI-Native 6G
Foundational models have shown remarkable potential in natural language processing and computer vision, yet remain in their infancy in wireless communications. While a few efforts have explored image-based modalities such as channel state information (CSI) and frequency spectrograms, foundational models that operate directly on raw I/Q data remain largely unexplored. This paper presents IQFM, the first I/Q signal foundational model for wireless communications. IQFM supports diverse tasks: modulation classification, angle-of-arrival (AoA) estimation, beam prediction, and RF fingerprinting, without heavy preprocessing or handcrafted features. We also introduce a task-aware augmentation strategy that categorizes transformations into core augmentations, such as cyclic time shifting, and task-specific augmentations. This strategy forms the basis for structured, task-dependent representation learning within a contrastive self-supervised learning (SSL) framework. Using this strategy, the lightweight encoder, pre-trained via SSL on over-the-air multi-antenna I/Q data, achieves up to 99.67% and 65.45% accuracy on modulation and AoA classification, respectively, using only one labeled sample per class, outperforming supervised baselines by up to 7x and 145x. The model also generalizes to out-of-distribution tasks; when adapted to new tasks using only 500 samples per class and minimal parameter updates via LoRA, the same frozen encoder achieves 94.15% on beam prediction (vs. 89.53% supervised), 50.00% on RML2016a modulation classification (vs. 49.30%), and 96.05% on RF fingerprinting (vs. 96.64%). These results demonstrate the potential of raw I/Q-based foundational models as efficient, reusable encoders for multi-task learning in AI-native 6G systems.
Updated: 2025-06-20 23:14:19
Categories: eess.SP,cs.LG
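The core augmentation named above, cyclic time shifting, is straightforward to sketch on a raw I/Q stream. The positive-pair construction below is an illustrative guess at how a contrastive SSL pipeline would use it, not the paper's exact recipe:

```python
import numpy as np

# A raw I/Q stream is represented here as a 2 x T array (rows: I and Q).
# Cyclic time shifting is a "core augmentation": it changes the waveform's
# alignment while preserving its content, so two shifted copies of the same
# capture can serve as a positive pair for contrastive SSL.

def cyclic_time_shift(iq, shift):
    """Circularly shift the stream along the time axis."""
    return np.roll(iq, shift, axis=-1)

def positive_pair(iq, rng):
    """Two independently shifted views of the same capture (illustrative)."""
    s1, s2 = rng.integers(0, iq.shape[-1], size=2)
    return cyclic_time_shift(iq, int(s1)), cyclic_time_shift(iq, int(s2))

rng = np.random.default_rng(0)
iq = rng.standard_normal((2, 128))      # stand-in over-the-air capture
v1, v2 = positive_pair(iq, rng)
# A cyclic shift permutes samples, so total signal energy is preserved exactly.
```

Task-specific augmentations (not shown) would be layered on top of this core transform depending on the downstream task.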
$L^*LM$: Learning Automata from Examples using Natural Language Oracles
Expert demonstrations have proven to be an easy way to indirectly specify complex tasks. Recent algorithms even support extracting unambiguous formal specifications, e.g., deterministic finite automata (DFAs), from demonstrations. Unfortunately, these techniques are generally not sample efficient. In this work, we introduce $L^*LM$, an algorithm for learning DFAs from both demonstrations and natural language. Due to the expressivity of natural language, we observe a significant improvement in the data efficiency of learning DFAs from expert demonstrations. Technically, $L^*LM$ leverages large language models to answer membership queries about the underlying task. This is then combined with recent techniques for transforming learning from demonstrations into a sequence of labeled-example learning problems. In our experiments, we observe that the two modalities complement each other, yielding a powerful few-shot learner.
Updated: 2025-06-20 23:11:55
Categories: cs.LG,cs.AI,cs.CL,cs.FL
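The two information sources combined above (labeled examples derived from demonstrations, plus membership queries answered by a language model) can be sketched as a single oracle. The "LLM" here is a stub that happens to know the target language (strings over {a, b} with an even number of a's); in the real system this answer would come from a prompted model:

```python
# Minimal sketch of an L*-style membership oracle that prefers labels derived
# from demonstrations and falls back to natural-language (LLM) queries.
# llm_membership_oracle is a stub standing in for an actual LLM call.

def llm_membership_oracle(word):
    # Stub: target language is "even number of a's".
    return word.count("a") % 2 == 0

def make_oracle(demonstrations):
    """demonstrations: dict word -> bool labels extracted from expert traces."""
    cache = dict(demonstrations)
    def oracle(word):
        if word not in cache:                 # unseen word: ask the "LLM"
            cache[word] = llm_membership_oracle(word)
        return cache[word]
    return oracle

oracle = make_oracle({"aa": True, "ab": False})
labels = [oracle(w) for w in ["aa", "ab", "ba", "aab"]]
```

An L*-style learner would issue many such membership queries while building a DFA hypothesis; caching keeps repeated queries cheap and demonstration labels authoritative.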
A Smart Contract-based Non-Transferable Signature Verification System using Nominative Signatures
Nominative signatures allow us to indicate who can verify a signature, and they can be employed to construct a non-transferable signature verification system that prevents signature verification by a third party in unexpected situations. For example, this system can prevent IOU/loan certificate verification in unexpected situations. However, nominative signatures themselves do not allow the verifier to check whether the funds will be transferred in the future or have already been transferred. It would be desirable to verify this fact simultaneously when the system involves a monetary transfer such as cryptocurrencies/cryptoassets. In this paper, we propose a smart contract-based non-transferable signature verification system using nominative signatures. We draw on the fact that invisibility, a security requirement for nominative signatures, allows us to publish nominative signatures on the blockchain. Our system can verify whether a money transfer will actually take place, in addition to indicating who can verify a signature. We transform the Hanaoka-Schuldt nominative signature scheme (ACNS 2011, IEICE Trans. 2016), which is constructed over a symmetric pairing, into a scheme constructed over an asymmetric pairing, and evaluate the gas cost when a smart contract runs the verification algorithm of the modified Hanaoka-Schuldt nominative signature scheme.
Updated: 2025-06-20 22:54:13
Categories: cs.CR
Modeling Neural Networks with Privacy Using Neural Stochastic Differential Equations
In this work, we study the feasibility of using neural ordinary differential equations (NODEs) to model systems with intrinsic privacy properties. Unlike conventional feedforward neural networks, which have unlimited expressivity and can represent arbitrary mappings between inputs and outputs, NODEs constrain their learning to the solution of a system of differential equations. We first examine whether this constraint reduces memorization and, consequently, the membership inference risks associated with NODEs. We conduct a comprehensive evaluation of NODEs under membership inference attacks and show that they exhibit twice the resistance compared to conventional models such as ResNets. By analyzing the variance in membership risks across different NODE models, we find that their limited expressivity leads to reduced overfitting to the training data. We then demonstrate, both theoretically and empirically, that membership inference risks can be further mitigated by utilizing a stochastic variant of NODEs: neural stochastic differential equations (NSDEs). We show that NSDEs are differentially-private (DP) learners that provide the same provable privacy guarantees as DP-SGD, the de facto mechanism for training private models. NSDEs are also effective in mitigating membership inference attacks, achieving risk levels comparable to private models trained with DP-SGD while offering an improved privacy-utility trade-off. Moreover, we propose a drop-in replacement strategy that efficiently integrates NSDEs into conventional feedforward architectures to enhance their privacy.
Updated: 2025-06-20 22:28:35
Categories: cs.CR,cs.LG
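A toy Euler-Maruyama rollout illustrates the NODE/NSDE distinction above: with the diffusion term switched off (sigma = 0) the dynamics reduce to a deterministic NODE, while sigma > 0 injects the noise that underlies the privacy behavior discussed. The linear-tanh drift network is invented for this sketch, not the paper's architecture:

```python
import numpy as np

def nsde_forward(x, W, sigma, steps=10, dt=0.1, rng=None):
    """Euler-Maruyama integration of dx = f(x) dt + sigma dB_t,
    with a toy learned drift f(x) = tanh(W x)."""
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(steps):
        drift = np.tanh(W @ x)                              # drift term
        noise = sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + drift * dt + noise                          # E-M step
    return x

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4)) / 2
x0 = rng.standard_normal(4)

deterministic = nsde_forward(x0, W, sigma=0.0)   # sigma=0: reduces to a NODE
stochastic = nsde_forward(x0, W, sigma=0.5)      # sigma>0: NSDE with noise
```

The per-step Gaussian noise plays a role loosely analogous to the noise DP-SGD adds to gradients, which is the intuition behind the DP guarantee claimed above.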
Episode-specific Fine-tuning for Metric-based Few-shot Learners with Optimization-based Training
In few-shot classification tasks (so-called episodes), a small set of labeled support samples is provided during inference to aid the classification of unlabeled query samples. Metric-based models typically operate by computing similarities between query and support embeddings within a learned metric space, followed by nearest-neighbor classification. However, these labeled support samples are often underutilized: they are only used for similarity comparison, despite their potential to fine-tune and adapt the metric space itself to the classes in the current episode. To address this, we propose a series of simple yet effective episode-specific, during-inference fine-tuning methods for metric-based models, including Rotational Division Fine-Tuning (RDFT) and its two variants, Iterative Division Fine-Tuning (IDFT) and Augmented Division Fine-Tuning (ADFT). These methods construct pseudo support-query pairs from the given support set to enable fine-tuning even for non-parametric models. Nevertheless, the severely limited amount of data in each task poses a substantial risk of overfitting when applying such fine-tuning strategies. To mitigate this, we further propose to train the metric-based model within an optimization-based meta-learning framework. With the combined efforts of episode-specific fine-tuning and optimization-based meta-training, metric-based models are equipped with the ability to rapidly adapt to the limited support samples during inference while avoiding overfitting. We validate our approach on three audio datasets from diverse domains, namely ESC-50 (environmental sounds), Speech Commands V2 (spoken keywords), and Medley-solos-DB (musical instruments). Experimental results demonstrate that our approach consistently improves performance for all evaluated metric-based models (especially for attention-based models) and generalizes well across different audio domains.
Updated: 2025-06-20 22:24:38
Categories: cs.LG,cs.MM,cs.SD
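The pseudo support-query construction described above can be sketched in the spirit of Rotational Division Fine-Tuning: each "rotation" holds out one support sample per class as a pseudo-query and keeps the rest as a pseudo support set. This is an illustrative reading of the idea; the paper's exact variants (IDFT, ADFT) differ in how the divisions are iterated and augmented:

```python
# Sketch: carve pseudo support-query pairs out of the labeled support set so
# a metric-based model can be fine-tuned during inference, even without
# extra labeled data. Assumes an equal number of shots per class.

def rotational_divisions(support):
    """support: dict class -> list of samples (equal length per class).
    Yields (pseudo_support, pseudo_query) pairs, one per rotation."""
    shots = len(next(iter(support.values())))
    for r in range(shots):
        pseudo_query = {c: xs[r] for c, xs in support.items()}
        pseudo_support = {c: xs[:r] + xs[r + 1:] for c, xs in support.items()}
        yield pseudo_support, pseudo_query

# Toy 2-way 3-shot episode (sample identifiers stand in for embeddings).
support = {"dog": ["d0", "d1", "d2"], "cat": ["c0", "c1", "c2"]}
divisions = list(rotational_divisions(support))
```

Each division yields a labeled mini-task on which the metric space can take a gradient step before classifying the real queries.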
From Generality to Mastery: Composer-Style Symbolic Music Generation via Large-Scale Pre-training
Despite progress in controllable symbolic music generation, data scarcity remains a challenge for certain control modalities. Composer-style music generation is a prime example, as only a few pieces per composer are available, limiting the modeling of both styles and fundamental music elements (e.g., melody, chord, rhythm). In this paper, we investigate how general music knowledge learned from a broad corpus can enhance the mastery of specific composer styles, with a focus on piano piece generation. Our approach follows a two-stage training paradigm. First, we pre-train a REMI-based music generation model on a large corpus of pop, folk, and classical music. Then, we fine-tune it on a small, human-verified dataset from four renowned composers, namely Bach, Mozart, Beethoven, and Chopin, using a lightweight adapter module to condition the model on style indicators. To evaluate the effectiveness of our approach, we conduct both objective and subjective evaluations on style accuracy and musicality. Experimental results demonstrate that our method outperforms ablations and baselines, achieving more precise composer-style modeling and better musical aesthetics. Additionally, we provide observations on how the model builds music concepts from the generality pre-training and refines its stylistic understanding through the mastery fine-tuning.
Updated: 2025-06-20 22:20:59
Categories: cs.SD,cs.AI,cs.LG,eess.AS
Exploring Strategies for Personalized Radiation Therapy, Part II: Predicting Tumor Drift Patterns with Diffusion Models
Radiation therapy outcomes are decided by two key parameters, dose and timing, whose optimal values vary substantially across patients. This variability is especially critical in the treatment of brain cancer, where fractionated or staged stereotactic radiosurgery improves safety compared to single-fraction approaches, but complicates the ability to predict treatment response. To address this challenge, we employ Personalized Ultra-fractionated Stereotactic Adaptive Radiotherapy (PULSAR), a strategy that dynamically adjusts treatment based on how each tumor evolves over time. The success of PULSAR and other adaptive approaches, however, depends on predictive tools that can guide early treatment decisions and avoid both overtreatment and undertreatment. Current radiomics and dosiomics models offer limited insight into the evolving spatial and temporal patterns of tumor response. To overcome these limitations, we propose a novel framework using Denoising Diffusion Implicit Models (DDIM), which learns data-driven mappings from pre- to post-treatment imaging. In this study, we developed single-step and iterative denoising strategies and compared their performance. The results show that diffusion models can effectively simulate patient-specific tumor evolution and localize regions associated with treatment response. The proposed strategy provides a promising foundation for modeling heterogeneous treatment response and enabling early, adaptive interventions, paving the way toward more personalized and biologically informed radiotherapy.
Updated: 2025-06-20 21:58:42
Categories: physics.med-ph,cs.AI
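The single-step denoising strategy mentioned above can be written as the standard DDIM x0 estimate: given a noisy image $x_t$, a predicted noise $\hat\epsilon$, and a cumulative signal level $\bar\alpha_t$, the estimate is $\hat{x}_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\hat\epsilon)/\sqrt{\bar\alpha_t}$. In the sketch below the noise predictor is replaced by the true noise, so recovery is exact; a trained network would only approximate it, and the imaging setup is a stand-in:

```python
import numpy as np

def ddim_x0_estimate(x_t, eps_hat, alpha_bar_t):
    """Single-step DDIM estimate of the clean image from a noisy one."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))      # stand-in "post-treatment" image
eps = rng.standard_normal((8, 8))     # noise (a network would predict this)
alpha_bar = 0.6

# Forward noising: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
recovered = ddim_x0_estimate(x_t, eps, alpha_bar)
```

The iterative strategy would instead apply many smaller denoising steps, trading compute for robustness when the noise prediction is imperfect.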
Online Adaptation for Flying Quadrotors in Tight Formations
The task of flying in tight formations is challenging for teams of quadrotors because the complex aerodynamic wake interactions can destabilize individual team members as well as the team. Furthermore, these aerodynamic effects are highly nonlinear and fast-paced, making them difficult to model and predict. To overcome these challenges, we present L1 KNODE-DW MPC, an adaptive, mixed-expert learning-based control framework that allows individual quadrotors to accurately track trajectories while adapting to time-varying aerodynamic interactions during formation flights. We evaluate L1 KNODE-DW MPC in two different three-quadrotor formations and show that it outperforms several MPC baselines. Our results show that the proposed framework is capable of enabling the three-quadrotor team to remain vertically aligned in close proximity throughout the flight. These findings show that the L1 adaptive module compensates for unmodeled disturbances most effectively when paired with an accurate dynamics model. A video showcasing our framework and the physical experiments is available here: https://youtu.be/9QX1Q5Ut9Rs
Updated: 2025-06-20 21:49:17
Categories: cs.RO,cs.LG,cs.SY,eess.SY
Distilling On-device Language Models for Robot Planning with Minimal Human Intervention
Large language models (LLMs) provide robots with powerful contextual reasoning abilities and a natural human interface. Yet, current LLM-enabled robots typically depend on cloud-hosted models, limiting their usability in environments with unreliable communication infrastructure, such as outdoor or industrial settings. We present PRISM, a framework for distilling small language model (SLM)-enabled robot planners that run on-device with minimal human supervision. Starting from an existing LLM-enabled planner, PRISM automatically synthesizes diverse tasks and environments, elicits plans from the LLM, and uses this synthetic dataset to distill a compact SLM as a drop-in replacement of the source model. We apply PRISM to three LLM-enabled planners for mapping and exploration, manipulation, and household assistance, and we demonstrate that PRISM improves the performance of Llama-3.2-3B from 10-20% of GPT-4o's performance to over 93% - using only synthetic data. We further demonstrate that the distilled planners generalize across heterogeneous robotic platforms (ground and aerial) and diverse environments (indoor and outdoor). We release all software, trained models, and datasets at https://zacravichandran.github.io/PRISM.
Updated: 2025-06-20 21:44:27
Categories: cs.RO,cs.AI,cs.LG
From Unstructured Communication to Intelligent RAG: Multi-Agent Automation for Supply Chain Knowledge Bases
Supply chain operations generate vast amounts of operational data; however, critical knowledge such as system usage practices, troubleshooting workflows, and resolution techniques often remains buried within unstructured communications like support tickets, emails, and chat logs. While RAG systems aim to leverage such communications as a knowledge base, their effectiveness is limited by raw data challenges: support tickets are typically noisy, inconsistent, and incomplete, making direct retrieval suboptimal. Unlike existing RAG approaches that focus on runtime optimization, we introduce a novel offline-first methodology that transforms these communications into a structured knowledge base. Our key innovation is a LLMs-based multi-agent system orchestrating three specialized agents: Category Discovery for taxonomy creation, Categorization for ticket grouping, and Knowledge Synthesis for article generation. Applying our methodology to real-world support tickets with resolution notes and comments, our system creates a compact knowledge base - reducing total volume to just 3.4% of original ticket data while improving quality. Experiments demonstrate that our prebuilt knowledge base in RAG systems significantly outperforms traditional RAG implementations (48.74% vs. 38.60% helpful answers) and achieves a 77.4% reduction in unhelpful responses. By automating institutional knowledge capture that typically remains siloed in experts' heads, our solution translates to substantial operational efficiency: reducing support workload, accelerating resolution times, and creating self-improving systems that automatically resolve approximately 50% of future supply chain tickets. Our approach addresses a key gap in knowledge management by transforming transient communications into structured, reusable knowledge through intelligent offline processing rather than latency-inducing runtime architectures.
Updated: 2025-06-20 21:38:06
Categories: cs.AI
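The three specialized agents named above (Category Discovery, Categorization, Knowledge Synthesis) can be sketched as a stub pipeline. Each "agent" is a plain function here; in the real system each step would be an LLM call, and the ticket fields and output format are invented for illustration:

```python
# Offline-first sketch: transform raw tickets into a compact, structured
# knowledge base via three stages, mirroring the multi-agent design above.

def category_discovery(tickets):
    """Taxonomy creation (stub: derive categories from a ticket field)."""
    return sorted({t["product"] for t in tickets})

def categorize(tickets, taxonomy):
    """Ticket grouping: assign each ticket to a taxonomy category."""
    groups = {c: [] for c in taxonomy}
    for t in tickets:
        groups[t["product"]].append(t)
    return groups

def knowledge_synthesis(groups):
    """Article generation (stub: one synthesized article per category)."""
    return {c: f"{c}: {len(ts)} resolved case(s) summarized"
            for c, ts in groups.items() if ts}

tickets = [{"product": "wms", "resolution": "restart sync job"},
           {"product": "tms", "resolution": "fix carrier mapping"},
           {"product": "wms", "resolution": "clear stale lock"}]

kb = knowledge_synthesis(categorize(tickets, category_discovery(tickets)))
```

A downstream RAG system would then retrieve from `kb` instead of the noisy raw tickets, which is the source of the volume reduction and answer-quality gains reported above.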
Disentangle and Regularize: Sign Language Production with Articulator-Based Disentanglement and Channel-Aware Regularization
In this work, we propose DARSLP, a simple gloss-free, transformer-based sign language production (SLP) framework that directly maps spoken-language text to sign pose sequences. We first train a pose autoencoder that encodes sign poses into a compact latent space using an articulator-based disentanglement strategy, where features corresponding to the face, right hand, left hand, and body are modeled separately to promote structured and interpretable representation learning. Next, a non-autoregressive transformer decoder is trained to predict these latent representations from sentence-level text embeddings. To guide this process, we apply channel-aware regularization by aligning predicted latent distributions with priors extracted from the ground-truth encodings using a KL-divergence loss. The contribution of each channel to the loss is weighted according to its associated articulator region, enabling the model to account for the relative importance of different articulators during training. Our approach does not rely on gloss supervision or pretrained models, and achieves state-of-the-art results on the PHOENIX14T and CSL-Daily datasets.
Updated: 2025-06-20 21:17:30
Subjects: cs.LG,cs.CV
A geometric framework for momentum-based optimizers for low-rank training
Low-rank pre-training and fine-tuning have recently emerged as promising techniques for reducing the computational and storage costs of large neural networks. Training low-rank parameterizations typically relies on conventional optimizers such as heavy ball momentum methods or Adam. In this work, we identify and analyze potential difficulties that these training methods encounter when used to train low-rank parameterizations of weights. In particular, we show that classical momentum methods can struggle to converge to a local optimum due to the geometry of the underlying optimization landscape. To address this, we introduce novel training strategies derived from dynamical low-rank approximation, which explicitly account for the underlying geometric structure. Our approach leverages and combines tools from dynamical low-rank approximation and momentum-based optimization to design optimizers that respect the intrinsic geometry of the parameter space. We validate our methods through numerical experiments, demonstrating faster convergence, and stronger validation metrics at given parameter budgets.
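The geometric idea can be made concrete with the standard tangent-space projection from dynamical low-rank approximation: a raw gradient or momentum matrix generally points off the rank-r manifold, so geometry-aware optimizers project it onto the tangent space at the current factorization. The projection formula below is the textbook one; its placement inside any particular optimizer here is illustrative, not the paper's algorithm.

```python
import numpy as np

def project_to_tangent(U, V, G):
    """Project a full matrix G onto the tangent space of the rank-r
    manifold at W = U @ V.T, where U and V have orthonormal columns:
        P_T(G) = U U^T G + G V V^T - U U^T G V V^T
    """
    UUtG = U @ (U.T @ G)          # component along the column space
    GVVt = (G @ V) @ V.T          # component along the row space
    overlap = U @ (U.T @ G @ V) @ V.T  # doubly-counted core, removed once
    return UUtG + GVVt - overlap
```

A momentum buffer kept in this tangent space cannot push the iterate off the manifold, which is one way to avoid the convergence difficulties classical heavy-ball updates face here.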
Updated: 2025-06-20 20:46:01
Subjects: cs.LG
Fed-pilot: Optimizing LoRA Allocation for Efficient Federated Fine-Tuning with Heterogeneous Clients
Federated Learning enables the fine-tuning of foundation models (FMs) across distributed clients for specific tasks; however, its scalability is limited by the heterogeneity of client memory capacities. In this work, we propose Fed-pilot, a memory-efficient federated fine-tuning framework. It enables memory-constrained clients to participate in Low-Rank Adaptation (LoRA)-based fine-tuning by training only a subset of LoRA modules locally. Fed-pilot identifies the optimal selection of trainable LoRA modules as a knapsack optimization problem, maximizing model performance under memory constraints for each client. To mitigate inconsistencies arising from heterogeneous module allocations and Non-IID data, Fed-pilot employs a novel aggregation rule that dynamically compensates for under-updated layers. Extensive experiments on five diverse datasets across various heterogeneous data settings demonstrate Fed-pilot's effectiveness and efficiency compared to state-of-the-art methods. To the best of our knowledge, this is the first study on federated fine-tuning of FMs that integrates memory-constrained optimization. The code will be publicly available.
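The module-selection step can be illustrated with a plain 0/1 knapsack dynamic program: each LoRA module has a memory cost and an estimated utility, and a client keeps the subset that maximizes utility within its budget. The utility scores and integer costs below are hypothetical stand-ins; Fed-pilot's actual objective and its compensation-based aggregation rule are more involved.

```python
def select_lora_modules(utilities, costs, budget):
    """0/1 knapsack DP: choose the subset of LoRA modules that maximizes
    total (hypothetical) utility without exceeding the client's integer
    memory budget. Returns (best_utility, chosen_module_indices)."""
    # best[b] = (max utility achievable with budget b, chosen module set)
    best = [(0.0, frozenset())] * (budget + 1)
    for i, (u, c) in enumerate(zip(utilities, costs)):
        new_best = list(best)
        for b in range(c, budget + 1):
            cand = best[b - c][0] + u
            if cand > new_best[b][0]:
                new_best[b] = (cand, best[b - c][1] | {i})
        best = new_best
    return best[budget]
```

Each client would solve this locally against its own memory limit, so heterogeneous devices end up training different module subsets.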
Updated: 2025-06-20 20:43:04
Subjects: cs.LG,cs.DC
Distributional Training Data Attribution
Randomness is an unavoidable part of training deep learning models, yet something that traditional training data attribution algorithms fail to rigorously account for. They ignore the fact that, due to stochasticity in the initialisation and batching, training on the same dataset can yield different models. In this paper, we address this shortcoming by introducing distributional training data attribution (d-TDA), the goal of which is to predict how the distribution of model outputs (over training runs) depends upon the dataset. We demonstrate the practical significance of d-TDA in experiments, e.g. by identifying training examples that drastically change the distribution of some target measurement without necessarily changing the mean. Intriguingly, we also find that influence functions (IFs), a popular but poorly-understood data attribution tool, emerge naturally from our distributional framework as the limit of unrolled differentiation, without requiring restrictive convexity assumptions. This provides a new mathematical motivation for their efficacy in deep learning, and helps to characterise their limitations.
Updated: 2025-06-20 20:38:39
Subjects: cs.LG,cs.AI,stat.ML
Modeling the Human Visual System: Comparative Insights from Response-Optimized and Task-Optimized Vision Models, Language Models, and different Readout Mechanisms
Over the past decade, predictive modeling of neural responses in the primate visual system has advanced significantly, largely driven by various DNN approaches. These include models optimized directly for visual recognition, cross-modal alignment through contrastive objectives, neural response prediction from scratch, and large language model embeddings. Likewise, different readout mechanisms, ranging from fully linear to spatial-feature factorized methods, have been explored for mapping network activations to neural responses. Despite the diversity of these approaches, it remains unclear which method performs best across different visual regions. In this study, we systematically compare these approaches for modeling the human visual system and investigate alternative strategies to improve response predictions. Our findings reveal that for early to mid-level visual areas, response-optimized models with visual inputs offer superior prediction accuracy, while for higher visual regions, embeddings from LLMs based on detailed contextual descriptions of images and task-optimized models pretrained on large vision datasets provide the best fit. Through comparative analysis of these modeling approaches, we identified three distinct regions in the visual cortex: one sensitive primarily to perceptual features of the input that are not captured by linguistic descriptions, another attuned to fine-grained visual details representing semantic information, and a third responsive to abstract, global meanings aligned with linguistic content. We also highlight the critical role of readout mechanisms, proposing a novel scheme that modulates receptive fields and feature maps based on semantic content, resulting in an accuracy boost of 3-23% over existing state-of-the-art methods for all models and brain regions. Together, these findings offer key insights into building more precise models of the visual system.
Updated: 2025-06-20 20:17:21
Subjects: cs.NE,cs.LG
SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving
Regardless of the advancements in device capabilities, efficient inference of advanced large language models (LLMs) at the edge remains challenging due to limited device memory and power constraints. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new approach that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a method that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server efficiently batches and verifies the tokens utilizing a more precise target model. This approach supports device heterogeneity and reduces server-side memory footprint by avoiding the need to deploy multiple target models. Our initial experiments with Jetson Orin Nano, Raspberry Pi 4B/5, and an edge server equipped with 4 Nvidia A100 GPUs indicate substantial benefits: significantly increased system throughput, capacity, and better cost efficiency, all without sacrificing model accuracy.
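The core speculative-decoding loop can be sketched in a few lines. This is a greedy toy variant for intuition only, not SLED's batched multi-device protocol: a cheap draft model proposes k tokens, the target model verifies them, and the longest agreeing prefix is kept, plus one token supplied by the target.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding step. `draft_next` and `target_next`
    map a token context to the next token; in SLED the draft would run on
    an edge device and the target on the shared edge server."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # Verify phase: keep the agreeing prefix; the first disagreement is
    # replaced by the target's token and ends the step.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when all k match
    return accepted
```

Every step emits at least one target-verified token, so output quality matches the target model while the expensive model is invoked in verification batches rather than token by token.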
Updated: 2025-06-20 20:15:09
Subjects: cs.DC,cs.AI,cs.LG,cs.NI,68T07, 68M14,I.2.6; C.2.4; C.1.4
Computational Approaches to Understanding Large Language Model Impact on Writing and Information Ecosystems
Large language models (LLMs) have shown significant potential to change how we write, communicate, and create, leading to rapid adoption across society. This dissertation examines how individuals and institutions are adapting to and engaging with this emerging technology through three research directions. First, I demonstrate how the institutional adoption of AI detectors introduces systematic biases, particularly disadvantaging writers of non-dominant language varieties, highlighting critical equity concerns in AI governance. Second, I present novel population-level algorithmic approaches that measure the increasing adoption of LLMs across writing domains, revealing consistent patterns of AI-assisted content in academic peer reviews, scientific publications, consumer complaints, corporate communications, job postings, and international organization press releases. Finally, I investigate LLMs' capability to provide feedback on research manuscripts through a large-scale empirical analysis, offering insights into their potential to support researchers who face barriers in accessing timely manuscript feedback, particularly early-career researchers and those from under-resourced settings.
Updated: 2025-06-20 20:15:09
Subjects: cs.CL,cs.AI,cs.CY,cs.HC,cs.LG
FedNAMs: Performing Interpretability Analysis in Federated Learning Context
Federated learning continues to evolve but faces challenges in interpretability and explainability. To address these challenges, we introduce a novel approach that employs Neural Additive Models (NAMs) within a federated learning framework. This new Federated Neural Additive Models (FedNAMs) approach merges the advantages of NAMs, where individual networks concentrate on specific input features, with the decentralized approach of federated learning, ultimately producing interpretable analysis results. This integration enhances privacy by training on local data across multiple devices, thereby minimizing the risks associated with data centralization and improving model robustness and generalizability. FedNAMs maintain detailed, feature-specific learning, making them especially valuable in sectors such as finance and healthcare. They facilitate the training of client-specific models to integrate local updates, preserve privacy, and mitigate concerns related to centralization. Our studies on various text and image classification tasks, using datasets such as OpenFetch ML Wine, UCI Heart Disease, and Iris, show that FedNAMs deliver strong interpretability with minimal accuracy loss compared to traditional federated deep neural networks (DNNs). The research yields notable findings, including the identification of critical predictive features at both client and global levels: volatile acidity, sulfates, and chlorides for wine quality; chest pain type, maximum heart rate, and number of vessels for heart disease; and petal length and width for iris classification. This approach strengthens privacy and model efficiency while improving interpretability and robustness across diverse datasets. Finally, FedNAMs generate insights into the causes of highly and weakly interpretable features.
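The additive structure that makes NAMs interpretable is simple to state: the prediction is a sum of independent per-feature contributions, so each feature's effect can be read off directly. A toy sketch follows, with plain callables standing in for the per-feature subnetworks; FedNAMs additionally trains these under federated aggregation, which is omitted here.

```python
def nam_predict(feature_nets, x, bias=0.0):
    """Neural Additive Model output: each feature i is routed through its
    own subnetwork f_i, and the prediction is bias + sum_i f_i(x_i).
    Returning the contributions is what enables feature-level explanation."""
    contributions = [net(xi) for net, xi in zip(feature_nets, x)]
    return bias + sum(contributions), contributions
```

Because no term mixes features, plotting each f_i over its input range fully characterizes the model, which is why NAMs suit regulated domains like finance and healthcare.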
Updated: 2025-06-20 20:14:13
Subjects: cs.LG,cs.AI
General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting
Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed data flows, limiting generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge suitable for reasoning and planning. Yet, prior LVLM-robot integrations typically depend on pre-mapped spaces, hard-coded representations, and myopic exploration. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose navigation framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools available within modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query the robotic modules, reason over multimodal inputs, and select appropriate navigation actions. This approach enables robust navigation and reasoning in previously unmapped environments, providing a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA achieves state-of-the-art performance, demonstrating effective exploration, navigation, and embodied question answering without relying on handcrafted plans, fixed input representations, or pre-existing maps.
Updated: 2025-06-20 20:06:14
Subjects: cs.RO,cs.AI,cs.CV
LieDetect: Detection of representation orbits of compact Lie groups from point clouds
We suggest a new algorithm to estimate representations of compact Lie groups from finite samples of their orbits. Different from other reported techniques, our method allows the retrieval of the precise representation type as a direct sum of irreducible representations. Moreover, the knowledge of the representation type permits the reconstruction of its orbit, which is useful for identifying the Lie group that generates the action, from a finite list of candidates. Our algorithm is general for any compact Lie group, but only instantiations for SO(2), T^d, SU(2), and SO(3) are considered. Theoretical guarantees of robustness in terms of Hausdorff and Wasserstein distances are derived. Our tools are drawn from geometric measure theory, computational geometry, and optimization on matrix manifolds. The algorithm is tested for synthetic data up to dimension 32, as well as real-life applications in image analysis, harmonic analysis, density estimation, equivariant neural networks, chemical conformational spaces, and classical mechanics systems, achieving very accurate results.
Updated: 2025-06-20 20:03:34
Subjects: math.OC,cs.LG,cs.NA,math.NA,math.RT,68U05, 49Q12, 15B30, 49Q22, 49Q15, 68T07
Directional Gradient Projection for Robust Fine-Tuning of Foundation Models
Robust fine-tuning aims to adapt large foundation models to downstream tasks while preserving their robustness to distribution shifts. Existing methods primarily focus on constraining and projecting current model towards the pre-trained initialization based on the magnitudes between fine-tuned and pre-trained weights, which often require extensive hyper-parameter tuning and can sometimes result in underfitting. In this work, we propose Directional Gradient Projection (DiGraP), a novel layer-wise trainable method that incorporates directional information from gradients to bridge regularization and multi-objective optimization. Besides demonstrating our method on image classification, as another contribution we generalize this area to the multi-modal evaluation settings for robust fine-tuning. Specifically, we first bridge the uni-modal and multi-modal gap by performing analysis on Image Classification reformulated Visual Question Answering (VQA) benchmarks and further categorize ten out-of-distribution (OOD) VQA datasets by distribution shift types and degree (i.e. near versus far OOD). Experimental results show that DiGraP consistently outperforms existing baselines across Image Classification and VQA tasks with discriminative and generative backbones, improving both in-distribution (ID) generalization and OOD robustness.
Updated: 2025-06-20 19:54:55
Subjects: cs.LG,cs.AI,cs.CL,cs.CV
A Comparative Analysis of Distributed Linear Solvers under Data Heterogeneity
We consider the problem of solving a large-scale system of linear equations in a distributed or federated manner by a taskmaster and a set of machines, each possessing a subset of the equations. We provide a comprehensive comparison of two well-known classes of algorithms used to solve this problem: projection-based methods and optimization-based methods. First, we introduce a novel geometric notion of data heterogeneity called angular heterogeneity and discuss its generality. Using this notion, we characterize the optimal convergence rates of the most prominent algorithms from each class, capturing the effects of the number of machines, the number of equations, and that of both cross-machine and local data heterogeneity on these rates. Our analysis establishes the superiority of Accelerated Projected Consensus in realistic scenarios with significant data heterogeneity and offers several insights into how angular heterogeneity affects the efficiency of the methods studied. Additionally, we develop distributed algorithms for the efficient computation of the proposed angular heterogeneity metrics. Our extensive numerical analyses validate and complement our theoretical results.
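One concrete way to quantify an angular notion of heterogeneity between machines is via principal angles between the row spaces of their local equation blocks. The sketch below illustrates the geometric idea using the standard SVD-based computation; it is not necessarily the paper's exact definition of angular heterogeneity.

```python
import numpy as np

def max_principal_angle(A, B):
    """Largest principal angle (radians) between the row spaces of two
    machines' local equation blocks A and B. Identical subspaces give 0;
    orthogonal subspaces give pi/2."""
    Qa, _ = np.linalg.qr(A.T)          # orthonormal basis of row space of A
    Qb, _ = np.linalg.qr(B.T)          # orthonormal basis of row space of B
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Smallest singular value = cosine of the largest principal angle.
    return float(np.arccos(np.clip(s.min(), -1.0, 1.0)))
```

Intuitively, nearly aligned local subspaces (small angles) mean each machine's projection step makes similar progress, while nearly orthogonal ones force more rounds of coordination.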
Updated: 2025-06-20 19:44:13
Subjects: cs.DC,cs.LG,cs.NA,math.NA,G.1.3; I.2.11; I.2.6
UT-GraphCast Hindcast Dataset: A Global AI Forecast Archive from UT Austin for Weather and Climate Applications
The UT GraphCast Hindcast Dataset from 1979 to 2024 is a comprehensive global weather forecast archive generated using the Google DeepMind GraphCast Operational model. Developed by researchers at The University of Texas at Austin under the WCRP umbrella, this dataset provides daily 15-day deterministic forecasts at 00UTC on an approximately 25 km global grid for a 45-year period. GraphCast is a physics-informed graph neural network that was trained on ECMWF ERA5 reanalysis. It predicts more than a dozen key atmospheric and surface variables on 37 vertical levels, delivering a full medium-range forecast in under one minute on modern hardware.
Updated: 2025-06-20 19:42:36
Subjects: physics.geo-ph,cs.LG,physics.ao-ph
OmniReflect: Discovering Transferable Constitutions for LLM agents via Neuro-Symbolic Reflections
Efforts to improve Large Language Model (LLM) agent performance on complex tasks have largely focused on fine-tuning and iterative self-correction. However, these approaches often lack generalizable mechanisms for long-term learning and remain inefficient in dynamic environments. We introduce OmniReflect, a hierarchical, reflection-driven framework that constructs a constitution, a compact set of guiding principles distilled from task experiences, to enhance the effectiveness and efficiency of an LLM agent. OmniReflect operates in two modes: Self-sustaining, where a single agent periodically curates its own reflections during task execution, and Co-operative, where a Meta-advisor derives a constitution from a small calibration set to guide another agent. To construct these constitutional principles, we employ Neural, Symbolic, and NeuroSymbolic techniques, offering a balance between contextual adaptability and computational efficiency. Empirical results averaged across models show major improvements in task success, with absolute gains of +10.3% on ALFWorld, +23.8% on BabyAI, and +8.3% on PDDL in the Self-sustaining mode. Similar gains are seen in the Co-operative mode, where a lightweight Qwen3-4B ReAct agent outperforms all Reflexion baselines on BabyAI. These findings highlight the robustness and effectiveness of OmniReflect across environments and backbones.
Updated: 2025-06-20 19:38:21
Subjects: cs.AI
Scalable Unit Harmonization in Medical Informatics via Bayesian-Optimized Retrieval and Transformer-Based Re-ranking
Objective: To develop and evaluate a scalable methodology for harmonizing inconsistent units in large-scale clinical datasets, addressing a key barrier to data interoperability. Materials and Methods: We designed a novel unit harmonization system combining BM25, sentence embeddings, Bayesian optimization, and a bidirectional transformer based binary classifier for retrieving and matching laboratory test entries. The system was evaluated using the Optum Clinformatics Datamart dataset (7.5 billion entries). We implemented a multi-stage pipeline: filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation. Performance was assessed using Mean Reciprocal Rank (MRR) and other standard information retrieval metrics. Results: Our hybrid retrieval approach combining BM25 and sentence embeddings (MRR: 0.8833) significantly outperformed both lexical-only (MRR: 0.7985) and embedding-only (MRR: 0.5277) approaches. The transformer-based reranker further improved performance (absolute MRR improvement: 0.10), bringing the final system MRR to 0.9833. The system achieved 83.39% precision at rank 1 and 94.66% recall at rank 5. Discussion: The hybrid architecture effectively leverages the complementary strengths of lexical and semantic approaches. The reranker addresses cases where initial retrieval components make errors due to complex semantic relationships in medical terminology. Conclusion: Our framework provides an efficient, scalable solution for unit harmonization in clinical datasets, reducing manual effort while improving accuracy. Once harmonized, data can be reused seamlessly in different analyses, ensuring consistency across healthcare systems and enabling more reliable multi-institutional studies and meta-analyses.
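Both the score fusion and the evaluation metric are easy to make concrete. Below, `hybrid_score` is a hypothetical linear fusion of normalized BM25 and embedding similarities (the paper tunes its retrieval with Bayesian optimization rather than a fixed alpha), while `mean_reciprocal_rank` is the standard MRR reported in the results.

```python
def hybrid_score(bm25, cosine, alpha=0.5):
    """Hypothetical linear fusion of a normalized lexical (BM25) score and
    an embedding cosine similarity for one candidate entry."""
    return alpha * bm25 + (1 - alpha) * cosine

def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR over queries: average of 1/rank of the first relevant item.
    `ranked_lists[i]` is the ranking for query i; `relevant[i]` its answer."""
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for rank, doc in enumerate(ranking, start=1):
            if doc == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

An MRR near 1.0, as the final system reports, means the correct harmonization target is almost always the top-ranked candidate.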
Updated: 2025-06-20 19:38:08
Subjects: cs.LG
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or specific to particular types of OOD shift, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA.
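The shift quantification reduces to the standard Mahalanobis distance between an embedding and an in-distribution reference. A minimal numpy sketch follows: the formula is the textbook one; which embeddings feed it (uni-modal vs. multi-modal, from which model) is the paper's design choice and is abstracted away here.

```python
import numpy as np

def fit_reference(embeddings):
    """Mean and covariance of in-distribution embeddings (one per row)."""
    E = np.asarray(embeddings, dtype=float)
    return E.mean(axis=0), np.cov(E, rowvar=False)

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of embedding x from N(mean, cov):
    sqrt((x - mean)^T cov^{-1} (x - mean))."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

Larger distances for a test set's embeddings indicate a stronger shift away from the ID data, which is how near-OOD and far-OOD datasets can be separated quantitatively.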
Updated: 2025-06-20 19:32:29
Subjects: cs.CV,cs.AI,cs.CL,cs.LG
Open Sky, Open Threats: Replay Attacks in Space Launch and Re-entry Phases
This paper examines the effects of replay attacks on the integrity of both uplink and downlink communications during critical phases of spacecraft communication. By combining software-defined radios (SDRs) with a real-time channel emulator, we replicate realistic attack conditions on the Orion spacecraft's communication systems in both launch and reentry. Our evaluation shows that, under replay attacks, the attacker's signal can overpower legitimate transmissions, leading to a Signal to Noise Ratio (SNR) difference of up to -7.8 dB during reentry and -6.5 dB during launch. To mitigate these threats, we propose a more secure receiver design incorporating a phase-coherency-dependent decision-directed (DD) equalizer with a narrowed phase-locked loop (PLL) bandwidth. This configuration enhances resilience by making synchronization more sensitive to phase distortions caused by replay interference.
Updated: 2025-06-20 19:27:16
Subjects: cs.CR,cs.SY,eess.SY
LayerZero
In this paper, we present the first intrinsically secure and semantically universal omnichain interoperability protocol: LayerZero. Utilizing an immutable endpoint, append-only verification modules, and fully-configurable verification infrastructure, LayerZero provides the security, configurability, and extensibility necessary to achieve omnichain interoperability. LayerZero enforces strict application-exclusive ownership of protocol security and cost through its novel trust-minimized modular security framework which is designed to universally support all blockchains and use cases. Omnichain applications (OApps) built on the LayerZero protocol achieve frictionless blockchain-agnostic interoperation through LayerZero's universal network semantics.
Updated: 2025-06-20 19:24:56
Subjects: cs.NI,cs.CR,cs.DC
Keeping Medical AI Healthy: A Review of Detection and Correction Methods for System Degradation
Artificial intelligence (AI) is increasingly integrated into modern healthcare, offering powerful support for clinical decision-making. However, in real-world settings, AI systems may experience performance degradation over time, due to factors such as shifting data distributions, changes in patient characteristics, evolving clinical protocols, and variations in data quality. These factors can compromise model reliability, posing safety concerns and increasing the likelihood of inaccurate predictions or adverse outcomes. This review presents a forward-looking perspective on monitoring and maintaining the "health" of AI systems in healthcare. We highlight the urgent need for continuous performance monitoring, early degradation detection, and effective self-correction mechanisms. The paper begins by reviewing common causes of performance degradation at both data and model levels. We then summarize key techniques for detecting data and model drift, followed by an in-depth look at root cause analysis. Correction strategies are further reviewed, ranging from model retraining to test-time adaptation. Our survey spans both traditional machine learning models and state-of-the-art large language models (LLMs), offering insights into their strengths and limitations. Finally, we discuss ongoing technical challenges and propose future research directions. This work aims to guide the development of reliable, robust medical AI systems capable of sustaining safe, long-term deployment in dynamic clinical settings.
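Data-level drift detection of the kind surveyed here is often as simple as comparing a monitored feature's live distribution against its training-time reference; a two-sample Kolmogorov-Smirnov check is one minimal sketch (a generic illustration with an arbitrary alarm threshold, not a method from this review):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    def ecdf(sorted_xs, x):
        return bisect.bisect_right(sorted_xs, x) / len(sorted_xs)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

def drift_alarm(reference, live, threshold=0.2):
    """Flag a monitored feature for review when its distribution drifts."""
    return ks_statistic(reference, live) > threshold

reference = [0.1, 0.2, 0.3, 0.4, 0.5]        # training-time feature values
shifted = [x + 0.5 for x in reference]       # post-deployment shift
assert drift_alarm(reference, shifted)       # drift detected
assert not drift_alarm(reference, reference) # stable distribution passes
```

In practice the threshold would be calibrated per feature; flagged features then feed the root-cause analysis and correction strategies the review describes.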
Updated: 2025-06-20 19:22:07
Subjects: cs.AI,cs.ET,cs.LG
Memorization to Generalization: Emergence of Diffusion Models from Associative Memory
Hopfield networks are associative memory (AM) systems, designed for storing and retrieving patterns as local minima of an energy landscape. In the classical Hopfield model, an interesting phenomenon occurs when the amount of training data reaches its critical memory load: $spurious\,\,states$, or unintended stable points, emerge at the end of the retrieval dynamics, leading to incorrect recall. In this work, we examine diffusion models, commonly used in generative modeling, from the perspective of AMs. The training phase of the diffusion model is conceptualized as memory encoding (training data is stored in the memory). The generation phase is viewed as an attempt at memory retrieval. In the small data regime the diffusion model exhibits a strong memorization phase, where the network creates distinct basins of attraction around each sample in the training set, akin to the Hopfield model below the critical memory load. In the large data regime, a different phase appears where an increase in the size of the training set fosters the creation of new attractor states that correspond to manifolds of the generated samples. Spurious states appear at the boundary of this transition and correspond to emergent attractor states, which are absent from the training set but, at the same time, have distinct basins of attraction around them. Our findings provide a novel perspective on the memorization-generalization phenomenon in diffusion models via the lens of AMs, a theoretical prediction of the existence of spurious states, and empirical validation of this prediction in commonly-used diffusion models.
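The below-capacity regime the paper draws on can be sketched with a tiny Hebbian Hopfield network (a toy illustration, not the paper's diffusion setup; the pattern sizes and tie-breaking rule are arbitrary choices here):

```python
def train_hopfield(patterns):
    """Hebbian learning: W[i][j] = sum over patterns of p_i * p_j / n, zero diagonal."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / n
    return W

def retrieve(W, state, steps=10):
    """Synchronous sign updates; converges to a nearby energy minimum."""
    n = len(state)
    for _ in range(steps):
        fields = [sum(W[i][j] * state[j] for j in range(n)) for i in range(n)]
        state = [1 if h >= 0 else -1 for h in fields]
    return state

patterns = [[1, -1, 1, -1, 1, -1],
            [1, 1, -1, -1, 1, 1]]
W = train_hopfield(patterns)
noisy = [-1, -1, 1, -1, 1, -1]            # pattern 0 with its first bit flipped
assert retrieve(W, noisy) == patterns[0]  # the stored pattern acts as an attractor
```

Above the critical memory load, retrieval from such a network can instead terminate in spurious states, the phenomenon the paper identifies at the memorization-generalization boundary of diffusion models.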
Updated: 2025-06-20 19:20:01
Subjects: cs.LG,cond-mat.dis-nn,cs.CV,q-bio.NC,stat.ML
Resource Rational Contractualism Should Guide AI Alignment
AI systems will soon have to navigate human environments and make decisions that affect people and other AI agents whose goals and values diverge. Contractualist alignment proposes grounding those decisions in agreements that diverse stakeholders would endorse under the right conditions, yet securing such agreement at scale remains costly and slow -- even for advanced AI. We therefore propose Resource-Rational Contractualism (RRC): a framework where AI systems approximate the agreements rational parties would form by drawing on a toolbox of normatively-grounded, cognitively-inspired heuristics that trade effort for accuracy. An RRC-aligned agent would not only operate efficiently, but also be equipped to dynamically adapt to and interpret the ever-changing human social world.
Updated: 2025-06-20 18:57:13
Subjects: cs.AI
A Systems Thinking Approach to Algorithmic Fairness
Systems thinking provides us with a way to model the algorithmic fairness problem by allowing us to encode prior knowledge and assumptions about where we believe bias might exist in the data generating process. We can then encode these beliefs as a series of causal graphs, enabling us to link AI/ML systems to politics and the law. This allows us to combine techniques from machine learning, causal inference, and system dynamics in order to capture different emergent aspects of the fairness problem. We can use systems thinking to help policymakers on both sides of the political aisle to understand the complex trade-offs that exist from different types of fairness policies, providing a sociotechnical foundation for designing AI policy that is aligned to their political agendas and with society's shared democratic values.
Updated: 2025-06-20 18:50:07
Subjects: cs.AI,cs.CY
Trans${^2}$-CBCT: A Dual-Transformer Framework for Sparse-View CBCT Reconstruction
Cone-beam computed tomography (CBCT) using only a few X-ray projection views enables faster scans with lower radiation dose, but the resulting severe under-sampling causes strong artifacts and poor spatial coverage. We address these challenges in a unified framework. First, we replace conventional UNet/ResNet encoders with TransUNet, a hybrid CNN-Transformer model. Convolutional layers capture local details, while self-attention layers enhance global context. We adapt TransUNet to CBCT by combining multi-scale features, querying view-specific features per 3D point, and adding a lightweight attenuation-prediction head. This yields Trans-CBCT, which surpasses prior baselines by 1.17 dB PSNR and 0.0163 SSIM on the LUNA16 dataset with six views. Second, we introduce a neighbor-aware Point Transformer to enforce volumetric coherence. This module uses 3D positional encoding and attention over k-nearest neighbors to improve spatial consistency. The resulting model, Trans$^2$-CBCT, provides an additional gain of 0.63 dB PSNR and 0.0117 SSIM. Experiments on LUNA16 and ToothFairy show consistent gains from six to ten views, validating the effectiveness of combining CNN-Transformer features with point-based geometry reasoning for sparse-view CBCT reconstruction.
Updated: 2025-06-20 18:45:12
Subjects: eess.IV,cs.AI,cs.CV
Sequence-to-Sequence Models with Attention Mechanistically Map to the Architecture of Human Memory Search
Past work has long recognized the important role of context in guiding how humans search their memory. While context-based memory models can explain many memory phenomena, it remains unclear why humans develop such architectures over possible alternatives in the first place. In this work, we demonstrate that foundational architectures in neural machine translation -- specifically, recurrent neural network (RNN)-based sequence-to-sequence models with attention -- exhibit mechanisms that directly correspond to those specified in the Context Maintenance and Retrieval (CMR) model of human memory. Since neural machine translation models have evolved to optimize task performance, their convergence with human memory models provides a deeper understanding of the functional role of context in human memory, as well as presenting new ways to model human memory. Leveraging this convergence, we implement a neural machine translation model as a cognitive model of human memory search that is both interpretable and capable of capturing complex dynamics of learning. We show that our model accounts for both averaged and optimal human behavioral patterns as effectively as context-based memory models. Further, we demonstrate additional strengths of the proposed model by evaluating how memory search performance emerges from the interaction of different model components.
Updated: 2025-06-20 18:43:15
Subjects: q-bio.NC,cs.LG
UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making
As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios, e.g., LLM agentic systems, underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) internal uncertainty intrinsic to the current decision, which is the focus of existing UQ methods, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions. We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated over extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Codes will be available at https://github.com/jinhaoduan/UProp.
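The extrinsic term is a Mutual-Information quantity. For discrete toy decisions, MI between consecutive steps can be computed directly from observed pairs (illustration only; UProp instead estimates PMI over trajectory-dependent decision processes):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """MI (bits) between two discrete variables from observed (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Second decision fully determined by the first: maximal inherited uncertainty.
dependent = [("a", "A"), ("b", "B")] * 50
# Second decision independent of the first: nothing is inherited.
independent = [("a", "A"), ("a", "B"), ("b", "A"), ("b", "B")] * 25
assert abs(mutual_information(dependent) - 1.0) < 1e-9
assert abs(mutual_information(independent)) < 1e-9
```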
Updated: 2025-06-20 18:34:04
Subjects: cs.CL,cs.AI,cs.LG,stat.ML
Second Opinion Matters: Towards Adaptive Clinical AI via the Consensus of Expert Model Ensemble
Despite the growing clinical adoption of large language models (LLMs), current approaches heavily rely on single model architectures. To overcome risks of obsolescence and rigid dependence on single model systems, we present a novel framework, termed the Consensus Mechanism. Mimicking clinical triage and multidisciplinary clinical decision-making, the Consensus Mechanism implements an ensemble of specialized medical expert agents enabling improved clinical decision making while maintaining robust adaptability. This architecture enables the Consensus Mechanism to be optimized for cost, latency, or performance, purely based on its interior model configuration. To rigorously evaluate the Consensus Mechanism, we employed three medical evaluation benchmarks: MedMCQA, MedQA, and MedXpertQA Text, and the differential diagnosis dataset, DDX+. On MedXpertQA, the Consensus Mechanism achieved an accuracy of 61.0% compared to 53.5% and 45.9% for OpenAI's O3 and Google's Gemini 2.5 Pro. Improvement was consistent across benchmarks with an increase in accuracy on MedQA ($\Delta\mathrm{Accuracy}_{\mathrm{consensus\text{-}O3}} = 3.4\%$) and MedMCQA ($\Delta\mathrm{Accuracy}_{\mathrm{consensus\text{-}O3}} = 9.1\%$). These accuracy gains extended to differential diagnosis generation, where our system demonstrated improved recall and precision (F1$_\mathrm{consensus}$ = 0.326 vs. F1$_{\mathrm{O3\text{-}high}}$ = 0.2886) and a higher top-1 accuracy for DDX (Top1$_\mathrm{consensus}$ = 52.0% vs. Top1$_{\mathrm{O3\text{-}high}}$ = 45.2%).
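The abstract does not spell out the aggregation rule; a minimal sketch of expert-ensemble consensus for a multiple-choice item is a (optionally weighted) majority vote, with weights standing in for trust in a domain specialist (names and weights here are illustrative assumptions):

```python
from collections import Counter

def consensus_answer(expert_answers, weights=None):
    """Weighted majority vote over expert predictions; ties go to the first seen."""
    weights = weights or [1.0] * len(expert_answers)
    tally = Counter()
    for ans, w in zip(expert_answers, weights):
        tally[ans] += w
    return tally.most_common(1)[0][0]

# Three hypothetical specialist agents answer a multiple-choice question.
assert consensus_answer(["B", "B", "C"]) == "B"
# Up-weighting a specialist can overturn the unweighted majority.
assert consensus_answer(["B", "B", "C"], weights=[1.0, 1.0, 2.5]) == "C"
```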
Updated: 2025-06-20 18:24:46
Subjects: cs.AI,cs.LG
Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning
Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques for generating synthetic state transition data and thereby enhancing the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process -- the backbone of Dyna-style algorithms -- significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.
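The Dyna-style backbone the paper stress-tests, branching short synthetic rollouts from observed states into the training data, can be sketched as follows (a generic illustration; `model`, `policy`, and the mixing ratio are assumptions, not the paper's exact algorithms):

```python
import random

def dyna_augment(real_buffer, model, policy, ratio=0.5, horizon=1):
    """Append short model rollouts branched from real states (Dyna-style).

    `model(s, a)` -> (next_state, reward) is a learned dynamics model and
    `policy(s)` -> action; both are assumed interfaces for this sketch.
    """
    synthetic = []
    n_rollouts = int(len(real_buffer) * ratio)
    for _ in range(n_rollouts):
        s = random.choice(real_buffer)[0]   # branch from an observed state
        for _ in range(horizon):
            a = policy(s)
            s2, r = model(s, a)
            synthetic.append((s, a, r, s2))
            s = s2
    return real_buffer + synthetic

# Toy 1-D chain: the true dynamics add the action to the state.
real = [(0, +1, 0.0, 1), (1, +1, 0.0, 2)]
fake_model = lambda s, a: (s + a, 1.0)      # learned model (here: exact)
policy = lambda s: +1
mixed = dyna_augment(real, fake_model, policy, ratio=1.0, horizon=2)
assert len(mixed) == len(real) + 2 * 2      # two branched rollouts of length two
```

The paper's finding is that this augmentation step, helpful in OpenAI Gym, degrades performance across most DMC environments.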
Updated: 2025-06-20 18:24:06
Subjects: cs.LG
Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?
Recent advances in large language models (LLMs) have demonstrated that inference-time computation techniques, such as decoding-time scaling and self-refinement, can significantly enhance reasoning capabilities without relying on external knowledge. A key driver of this success is the emergence of self-correction and self-verification behaviors, often elicited through reinforcement learning (RL). In this paper, we investigate whether these inference-time techniques extend effectively to vision-language models (VLMs), particularly those trained with RL. We find that while decoding strategies such as majority voting and best-of-N selection with self-verification all improve VLM reasoning performance, generation-reliant methods such as the former achieve significantly higher gains than verification-reliant methods such as the latter. Additionally, the self-correction behavior often associated with RL-tuned models, such as the aha moment, does not lead to measurable gains. Via extensive experimentation within the inference-time scaling framework, we identify a key root cause: RL-trained VLMs still lack robust self-verification capabilities across both visual and textual modalities.
Updated: 2025-06-20 18:23:48
Subjects: cs.LG
Adaptive Control Attention Network for Underwater Acoustic Localization and Domain Adaptation
Localizing acoustic sound sources in the ocean is a challenging task due to the complex and dynamic nature of the environment. Factors such as high background noise, irregular underwater geometries, and varying acoustic properties make accurate localization difficult. To address these obstacles, we propose a multi-branch network architecture designed to accurately predict the distance between a moving acoustic source and a receiver, tested on real-world underwater signal arrays. The network leverages Convolutional Neural Networks (CNNs) for robust spatial feature extraction and integrates Conformers with self-attention mechanism to effectively capture temporal dependencies. Log-mel spectrogram and generalized cross-correlation with phase transform (GCC-PHAT) features are employed as input representations. To further enhance the model performance, we introduce an Adaptive Gain Control (AGC) layer, that adaptively adjusts the amplitude of input features, ensuring consistent energy levels across varying ranges, signal strengths, and noise conditions. We assess the model's generalization capability by training it in one domain and testing it in a different domain, using only a limited amount of data from the test domain for fine-tuning. Our proposed method outperforms state-of-the-art (SOTA) approaches in similar settings, establishing new benchmarks for underwater sound localization.
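The AGC idea, keeping input energy consistent across ranges and signal strengths, can be illustrated with a fixed-target variant (the paper's layer is learned and adaptive; this sketch only shows the normalization effect):

```python
def adaptive_gain_control(frame, target_rms=1.0, eps=1e-8):
    """Scale a feature frame so its RMS energy matches a target level.

    A minimal fixed-target interpretation of an AGC layer, for illustration
    only; the target and epsilon values are arbitrary choices here.
    """
    rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
    gain = target_rms / (rms + eps)
    return [x * gain for x in frame]

quiet = [0.01, -0.02, 0.015, -0.005]   # e.g. a distant, weak source
loud = [3.0, -6.0, 4.5, -1.5]          # the same waveform at close range
# After AGC, both frames sit at the same energy level, so downstream
# layers see range-invariant inputs.
```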
Updated: 2025-06-20 18:13:30
Subjects: cs.SD,cs.LG,eess.AS,eess.SP
Exploring the Potential of Encoder-free Architectures in 3D LMMs
Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses, and present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To this end, we present the first encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.10%, 50.98%, and 43.10% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
Updated: 2025-06-20 18:06:38
Subjects: cs.CV,cs.AI,cs.CL
Zero-Shot NAS via the Suppression of Local Entropy Decrease
Architecture performance evaluation is the most time-consuming part of neural architecture search (NAS). Zero-Shot NAS accelerates the evaluation by utilizing zero-cost proxies instead of training. Though effective, existing zero-cost proxies require invoking backpropagations or running networks on input data, making it difficult to further accelerate the computation of proxies. To alleviate this issue, architecture topologies are used to evaluate the performance of networks in this study. We prove that particular architectural topologies decrease the local entropy of feature maps, which degrades specific features to a bias, thereby reducing network performance. Based on this proof, architectural topologies are utilized to quantify the suppression of local entropy decrease (SED) as a data-free and running-free proxy. Experimental results show that SED outperforms most state-of-the-art proxies in terms of architecture selection on five benchmarks, with computation time reduced by three orders of magnitude. We further compare the SED-based NAS with state-of-the-art proxies. SED-based NAS selects the architecture with higher accuracy and fewer parameters in only one second. The theoretical analyses of local entropy and experimental results demonstrate that the suppression of local entropy decrease facilitates selecting optimal architectures in Zero-Shot NAS.
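Local entropy of a feature-map patch can be computed as plain Shannon entropy over binned activation values; a patch collapsed toward a constant bias has zero entropy (an illustrative definition only; the paper's proxy is derived from architecture topologies without running data through the network):

```python
import math
from collections import Counter

def local_entropy(patch, bins=8):
    """Shannon entropy (bits) of the value distribution in a feature-map patch."""
    lo, hi = min(patch), max(patch)
    width = (hi - lo) / bins or 1.0   # guard against a constant patch
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in patch)
    n = len(patch)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

diverse = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4]
collapsed = [0.5] * 8   # features degraded toward a constant bias
assert local_entropy(diverse) > local_entropy(collapsed)
assert local_entropy(collapsed) == 0.0
```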
Updated: 2025-06-20 18:01:50
Subjects: cs.LG,cs.CV,cs.NE
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery (the internal construction and manipulation of visual cues), we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to ``think visually'', it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
Updated: 2025-06-20 17:59:31
Subjects: cs.CV,cs.AI
Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation
An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for initial agents in a scene. This is problematic for long-term simulation. Agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation mode. It enables stable long-term rollout simulation. InfGen performs at the state-of-the-art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at https://orangesodahub.github.io/InfGen
Updated: 2025-06-20 17:59:21
Subjects: cs.CV,cs.AI,cs.RO
Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part$^{2}$GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part$^{2}$GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving motion coherence over baselines. Extensive evaluations on both synthetic and real-world datasets show that Part$^{2}$GS consistently outperforms state-of-the-art methods by up to 10$\times$ in Chamfer Distance for movable parts.
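The reported gains are measured in Chamfer Distance, which for two point sets can be sketched as (squared-L2 form; a brute-force illustration rather than the accelerated implementations used in practice):

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: average nearest-neighbor squared
    distance from each set to the other, summed over both directions."""
    def sq(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))
    a_to_b = sum(min(sq(p, q) for q in points_b) for p in points_a) / len(points_a)
    b_to_a = sum(min(sq(p, q) for q in points_a) for p in points_b) / len(points_b)
    return a_to_b + b_to_a

cube = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
assert chamfer_distance(cube, cube) == 0.0       # identical reconstructions
shifted = [(x + 1, y, z) for x, y, z in cube]
assert chamfer_distance(cube, shifted) > 0.0     # geometric error is penalized
```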
Updated: 2025-06-20 17:59:12
Subjects: cs.CV,cs.AI,cs.LG,cs.RO
BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning
Small language models (SLMs) struggle to learn complex reasoning behaviors, especially when high-quality traces are scarce or difficult to learn from. The standard training approach combines a supervised fine-tuning (SFT) stage, often to distill capabilities of a larger model, followed by a reinforcement learning (RL) stage such as Group Relative Policy Optimization (GRPO). In this paper, we investigate the fundamental limitations of this SFT + RL paradigm and propose methods to overcome them. Under a suitable theoretical model, we demonstrate that the SFT + RL strategy can fail completely when (1) the expert's traces are too difficult for the small model to express, or (2) the small model's initialization has exponentially small likelihood of success. To address these, we introduce BREAD: a GRPO variant that unifies the SFT and RL stages via partial expert guidance and branched rollouts. When self-generated traces fail, BREAD adaptively inserts short expert prefixes/hints, allowing the small model to complete the rest of the reasoning path, and ensuring that each update includes at least one successful trace. This mechanism both densifies the reward signal and induces a natural learning curriculum. BREAD requires fewer than 40% of ground-truth traces, consistently outperforming standard GRPO while speeding up the training by about 3 times. Importantly, we demonstrate that BREAD helps the model solve problems that are otherwise unsolvable by the SFT + RL strategy, highlighting how branched rollouts and expert guidance can substantially boost SLM reasoning.
Updated: 2025-06-20 17:59:07
Subjects: cs.LG
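The branched-rollout mechanism described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `policy`, `verifier`, and the prefix schedule are stand-ins, and the toy policy only succeeds once enough of the expert trace is supplied as a hint.

```python
def bread_rollout(problem, expert_trace, policy, verifier, n_samples=4):
    """If unguided rollouts fail, branch from progressively longer
    expert prefixes until at least one successful trace is found."""
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):  # 0.0 = plain GRPO rollout
        prefix = expert_trace[: int(frac * len(expert_trace))]
        traces = [prefix + policy(problem, prefix) for _ in range(n_samples)]
        successes = [t for t in traces if verifier(problem, t)]
        if successes:
            return successes[0], frac  # guarantees a reward-bearing trace
    return expert_trace, 1.0  # fall back to the full expert trace

# Toy instantiation: the hypothetical "policy" solves the problem only
# when enough of the expert's reasoning is already present as a hint.
expert = "ABCD"
weak_policy = lambda p, prefix: "CD" if prefix.startswith("AB") else "??"
check = lambda p, t: t == "ABCD"
trace, frac_used = bread_rollout("p", expert, weak_policy, check)  # → "ABCD", 0.5
```

Because every update batch then contains at least one successful trace, the advantage signal never collapses to zero, and shrinking the needed prefix over training acts as a natural curriculum.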
AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability
This paper introduces AQA-Bench, a novel benchmark to assess the sequential reasoning capabilities of large language models (LLMs) in algorithmic contexts, such as depth-first search (DFS). The key feature of our evaluation benchmark lies in its interactive evaluation protocol - for example, in DFS, the availability of each node's connected edges is contingent upon the model's traversal to that node, thereby necessitating the LLM's ability to effectively remember visited nodes and strategize subsequent moves considering the possible environmental feedback in the future steps. We comprehensively build AQA-Bench with three different algorithms, namely binary search, depth-first search, and breadth-first search, and use it to evaluate the sequential reasoning ability of 14 different LLMs. Our investigations reveal several interesting findings: (1) Closed-source models like GPT-4 and Gemini generally show much stronger sequential reasoning ability, significantly outperforming open-source LLMs. (2) Naively providing in-context examples may inadvertently hurt few-shot performance in an interactive environment due to over-fitting to examples. (3) Instead of using optimal steps from another test case as the in-context example, a very limited number of predecessor steps in the current test case following the optimal policy can substantially boost small models' performance. (4) The performance gap between weak models and strong models is largely due to the inability of weak models to start well. (5) The scaling correlation between performance and model size is not always significant, sometimes even showcasing an inverse trend. We hope our study can catalyze future work on advancing the understanding and enhancement of LLMs' capabilities in sequential reasoning. The code is available at https://github.com/UCSC-VLAA/AQA-Bench.
Updated: 2025-06-20 17:57:43
Subjects: cs.CL,cs.AI,cs.LG
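The interactive protocol can be made concrete with a toy environment that reveals a node's edges only when the agent visits it. The scripted `dfs_agent` below is a hypothetical stand-in for the LLM under test; this sketches the benchmark's interface, not its actual code.

```python
class InteractiveGraphEnv:
    """Edges of a node are revealed only when the agent visits it,
    so the agent must track visited nodes itself."""
    def __init__(self, adj, start):
        self.adj, self.pos, self.revealed = adj, start, []
    def observe(self):                      # feedback for the current node
        return sorted(self.adj[self.pos])
    def move(self, node):
        self.revealed.append(node)
        self.pos = node
        return self.observe()

def dfs_agent(env, start):
    """Scripted DFS: remembers visited nodes and picks the next
    unvisited neighbor, backtracking when none remain."""
    visited, stack, order = {start}, [start], [start]
    neighbors = {start: env.observe()}
    while stack:
        cur = stack[-1]
        nxt = next((n for n in neighbors[cur] if n not in visited), None)
        if nxt is None:
            stack.pop()                     # backtrack
            continue
        visited.add(nxt)
        order.append(nxt)
        neighbors[nxt] = env.move(nxt)      # edges revealed only now
        stack.append(nxt)
    return order

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
env = InteractiveGraphEnv(adj, 0)
order = dfs_agent(env, 0)  # → [0, 1, 3, 2]
```

An LLM replacing `dfs_agent` would receive `observe()` outputs as messages and emit `move()` calls, which is where forgetting visited nodes or mis-planning shows up.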
Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems
The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (68 entries) and Verified (79 entries) leaderboards, analyzing 67 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5/3.7), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.
Updated: 2025-06-20 17:57:08
Subjects: cs.SE,cs.AI,cs.CL
DreamCube: 3D Panorama Generation via Multi-plane Synchronization
3D panorama synthesis is a promising yet challenging task that demands high-quality and diverse visual appearance and geometry of the generated omnidirectional content. Existing methods leverage rich image priors from pre-trained 2D foundation models to circumvent the scarcity of 3D panoramic data, but the incompatibility between 3D panoramas and 2D single views limits their effectiveness. In this work, we demonstrate that by applying multi-plane synchronization to the operators from 2D foundation models, their capabilities can be seamlessly extended to the omnidirectional domain. Based on this design, we further introduce DreamCube, a multi-plane RGB-D diffusion model for 3D panorama generation, which maximizes the reuse of 2D foundation model priors to achieve diverse appearances and accurate geometry while maintaining multi-view consistency. Extensive experiments demonstrate the effectiveness of our approach in panoramic image generation, panoramic depth estimation, and 3D scene generation.
Updated: 2025-06-20 17:55:06
Subjects: cs.GR,cs.CV,cs.LG
Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning
Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motivating various targeted interventions such as periodic reset and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond their dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
Updated: 2025-06-20 17:54:24
Subjects: cs.LG,cs.AI
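One-shot random pruning as described is straightforward to sketch. This NumPy version is an illustration, not the paper's code: a fixed fraction of weights is removed once, before training, and the binary masks stay static thereafter.

```python
import numpy as np

def one_shot_random_prune(params, sparsity, seed=0):
    """Randomly zero out a predetermined fraction of each weight
    tensor once; the resulting masks are reapplied after every
    gradient step so pruned weights stay at zero."""
    rng = np.random.default_rng(seed)
    masks = {}
    for name, w in params.items():
        mask = (rng.random(w.shape) >= sparsity).astype(w.dtype)
        params[name] = w * mask      # pruned weights start (and stay) at zero
        masks[name] = mask
    return params, masks

# Hypothetical two-layer network at 90% sparsity.
layers = {"fc1": np.random.randn(256, 128), "fc2": np.random.randn(128, 10)}
layers, masks = one_shot_random_prune(layers, sparsity=0.9)
```

In a training loop, multiplying each gradient update (or the updated weights) by the stored mask keeps the sparsity pattern fixed, which is all the intervention requires.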
DAL: A Practical Prior-Free Black-Box Framework for Non-Stationary Bandit Environments
We introduce a practical, black-box framework termed Detection Augmenting Learning (DAL) for the problem of non-stationary bandits without prior knowledge of the underlying non-stationarity. DAL is modular, accepting any stationary bandit algorithm as input and augmenting it with a change detector. Our approach is applicable to all common parametric and non-parametric bandit variants. Extensive experimentation demonstrates that DAL consistently surpasses current state-of-the-art methods across diverse non-stationary scenarios, including synthetic benchmarks and real-world datasets, underscoring its versatility and scalability. We provide theoretical insights into DAL's strong empirical performance on piecewise stationary and drift settings, complemented by thorough experimental validation.
Updated: 2025-06-20 17:52:06
Subjects: cs.LG,stat.ML
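The black-box construction can be sketched as a wrapper that restarts any stationary learner when a detector fires. The mean-shift detector and epsilon-greedy base learner below are illustrative choices, not the specific components studied in the paper.

```python
import random
import statistics

class EpsGreedy:
    """Any stationary bandit algorithm works here; epsilon-greedy
    is just a simple stand-in."""
    def __init__(self, n_arms=2, eps=0.1):
        self.n_arms, self.eps = n_arms, eps
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
    def select(self):
        if random.random() < self.eps:
            return random.randrange(self.n_arms)
        return max(range(self.n_arms), key=lambda a: self.values[a])
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

class DAL:
    """Detection Augmenting Learning: augment a stationary learner
    with a change detector and restart it on alarms."""
    def __init__(self, make_bandit, window=50, threshold=0.5):
        self.make_bandit, self.window, self.threshold = make_bandit, window, threshold
        self.reset()
    def reset(self):
        self.bandit = self.make_bandit()              # fresh stationary learner
        self.history = {a: [] for a in range(self.bandit.n_arms)}
    def select(self):
        return self.bandit.select()
    def update(self, arm, reward):
        self.bandit.update(arm, reward)
        h = self.history[arm]
        h.append(reward)
        if len(h) >= 2 * self.window:                 # simple mean-shift test
            old = statistics.mean(h[-2 * self.window:-self.window])
            new = statistics.mean(h[-self.window:])
            if abs(new - old) > self.threshold:       # change detected
                self.reset()

agent = DAL(lambda: EpsGreedy(n_arms=2), window=10, threshold=0.5)
for _ in range(20):
    agent.update(0, 1.0)      # stationary phase: rewards stay high
learner_before = agent.bandit
for _ in range(20):
    agent.update(0, 0.0)      # abrupt change: detector triggers a restart
restarted = agent.bandit is not learner_before
```

The wrapper is prior-free in the sense that it needs no knowledge of when or how often the environment changes; only the detector's window and threshold are tuned.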
Schrödinger Bridge Matching for Tree-Structured Costs and Entropic Wasserstein Barycentres
Recent advances in flow-based generative modelling have provided scalable methods for computing the Schrödinger Bridge (SB) between distributions, a dynamic form of entropy-regularised Optimal Transport (OT) for the quadratic cost. The successful Iterative Markovian Fitting (IMF) procedure solves the SB problem via sequential bridge-matching steps, presenting an elegant and practical approach with many favourable properties over the more traditional Iterative Proportional Fitting (IPF) procedure. Beyond the standard setting, optimal transport can be generalised to the multi-marginal case in which the objective is to minimise a cost defined over several marginal distributions. Of particular importance are costs defined over a tree structure, from which Wasserstein barycentres can be recovered as a special case. In this work, we extend the IMF procedure to solve for the tree-structured SB problem. Our resulting algorithm inherits the many advantages of IMF over IPF approaches in the tree-based setting. In the specific case of Wasserstein barycentres, our approach can be viewed as extending fixed-point approaches for barycentre computation to the case of flow-based entropic OT solvers.
Updated: 2025-06-20 17:47:47
Subjects: stat.ML,cs.LG
Facial Landmark Visualization and Emotion Recognition Through Neural Networks
Emotion recognition from facial images is a crucial task in human-computer interaction, enabling machines to learn human emotions through facial expressions. Previous studies have shown that facial images can be used to train deep learning models; however, most of these studies do not include a thorough dataset analysis. Visualizing facial landmarks can be challenging when extracting meaningful dataset insights; to address this issue, we propose facial landmark box plots, a visualization technique designed to identify outliers in facial datasets. Additionally, we compare two sets of facial landmark features: (i) the landmarks' absolute positions and (ii) their displacements from a neutral expression to the peak of an emotional expression. Our results indicate that a neural network achieves better performance than a random forest classifier.
Updated: 2025-06-20 17:45:34
Subjects: cs.CV,cs.AI
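A minimal sketch of the two ingredients on synthetic landmark data: displacement features, and box-plot outlier flagging via the Tukey fences that a box plot's whiskers are drawn from. The shapes and thresholds here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def displacement_features(neutral, apex):
    """Feature set (ii): per-landmark displacement from the neutral
    frame to the emotional apex, flattened for a classifier."""
    return (apex - neutral).reshape(len(apex), -1)

def boxplot_outliers(values, k=1.5):
    """Flag samples outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy check: one sample with an implausibly large landmark displacement.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(8, 68, 2))           # 8 faces, 68 (x, y) landmarks
apex = neutral + rng.normal(scale=0.1, size=neutral.shape)
apex[3] += 5.0                                  # corrupted annotation
magnitudes = np.linalg.norm(apex - neutral, axis=(1, 2))
flags = boxplot_outliers(magnitudes)            # flags[3] is True
```

Plotting one box per landmark coordinate (rather than one aggregate magnitude, as above) gives the per-landmark view the paper's visualization aims at.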
Towards AI Search Paradigm
In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex multi-stage reasoning tasks. These agents collaborate dynamically through coordinated workflows to evaluate query complexity, decompose problems into executable plans, and orchestrate tool usage, task execution, and content synthesis. We systematically present key methodologies for realizing this paradigm, including task planning and tool integration, execution strategies, aligned and robust retrieval-augmented generation, and efficient LLM inference, spanning both algorithmic techniques and infrastructure-level optimizations. By providing an in-depth guide to these foundational components, this work aims to inform the development of trustworthy, adaptive, and scalable AI search systems.
Updated: 2025-06-20 17:42:13
Subjects: cs.CL,cs.AI,cs.IR
Optimal Implicit Bias in Linear Regression
Most modern learning problems are over-parameterized, where the number of learnable parameters is much greater than the number of training data points. In this over-parameterized regime, the training loss typically has infinitely many global optima that completely interpolate the data with varying generalization performance. The particular global optimum we converge to depends on the implicit bias of the optimization algorithm. The question we address in this paper is, "What is the implicit bias that leads to the best generalization performance?" To find the optimal implicit bias, we provide a precise asymptotic analysis of the generalization performance of interpolators obtained from the minimization of convex functions/potentials for over-parameterized linear regression with non-isotropic Gaussian data. In particular, we obtain a tight lower bound on the best generalization error possible among this class of interpolators in terms of the over-parameterization ratio, the variance of the noise in the labels, the eigenspectrum of the data covariance, and the underlying distribution of the parameter to be estimated. Finally, we find the optimal convex implicit bias that achieves this lower bound under certain sufficient conditions involving the log-concavity of the distribution of a Gaussian convolved with the prior of the true underlying parameter.
Updated: 2025-06-20 17:41:39
Subjects: cs.LG,stat.ML
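As a concrete instance of the setting: gradient descent from a zero initialization on over-parameterized least squares converges to the minimum l2-norm interpolator, one particular implicit bias, which NumPy's pseudoinverse computes directly. A toy sketch (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # over-parameterized: d >> n
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d) / np.sqrt(d)
y = X @ theta_star + 0.1 * rng.normal(size=n)

# The l2 implicit bias: the minimum-norm solution among all interpolators.
theta_min_norm = np.linalg.pinv(X) @ y
residual = np.linalg.norm(X @ theta_min_norm - y)   # numerically zero

# Any null-space perturbation gives another interpolator with larger norm,
# illustrating the infinitely many global optima the abstract refers to.
v = (np.eye(d) - np.linalg.pinv(X) @ X) @ rng.normal(size=d)
theta_other = theta_min_norm + v
residual_other = np.linalg.norm(X @ theta_other - y)
```

The paper's question is which convex potential, minimized over this interpolation set, yields the best test error; the l2 choice above is only the default induced by plain gradient descent.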
A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset
We investigate the contents of web-scraped data for training AI systems, at sizes where human dataset curators and compilers no longer manually annotate every sample. Building off of prior privacy concerns in machine learning models, we ask: What are the legal privacy implications of web-scraped machine learning datasets? In an empirical study of a popular training dataset, we find significant presence of personally identifiable information despite sanitization efforts. Our audit provides concrete evidence to support the concern that any large-scale web-scraped dataset may contain personal data. We use these findings of a real-world dataset to inform our legal analysis with respect to existing privacy and data protection laws. We surface various privacy risks of current data curation practices that may propagate personal information to downstream models. From our findings, we argue for reorientation of current frameworks of "publicly available" information to meaningfully limit the development of AI built upon indiscriminate scraping of the internet.
Updated: 2025-06-20 17:40:05
Subjects: cs.CR,cs.CY
Variational Learning of Disentangled Representations
Disentangled representations enable models to separate factors of variation that are shared across experimental conditions from those that are condition-specific. This separation is essential in domains such as biomedical data analysis, where generalization to new treatments, patients, or species depends on isolating stable biological signals from context-dependent effects. While extensions of the variational autoencoder (VAE) framework have been proposed to address this problem, they frequently suffer from leakage between latent representations, limiting their ability to generalize to unseen conditions. Here, we introduce DISCoVeR, a new variational framework that explicitly separates condition-invariant and condition-specific factors. DISCoVeR integrates three key components: (i) a dual-latent architecture that models shared and specific factors separately; (ii) two parallel reconstructions that ensure both representations remain informative; and (iii) a novel max-min objective that encourages clean separation without relying on handcrafted priors, while making only minimal assumptions. Theoretically, we show that this objective maximizes data likelihood while promoting disentanglement, and that it admits a unique equilibrium. Empirically, we demonstrate that DISCoVeR achieves improved disentanglement on synthetic datasets, natural images, and single-cell RNA-seq data. Together, these results establish DISCoVeR as a principled approach for learning disentangled representations in multi-condition settings.
Updated: 2025-06-20 17:36:12
Subjects: cs.LG,stat.ML
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
As Large Language Models (LLMs) become increasingly integrated into real-world, autonomous applications, relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. We propose Tool-Augmented LLM Evaluation (TALE), a framework to assess LLM outputs without predetermined ground-truth answers. Unlike conventional metrics that compare to fixed references or depend solely on LLM-as-a-judge knowledge, TALE employs an agent with tool-access capabilities that actively retrieves and synthesizes external evidence. It iteratively generates web queries, collects information, summarizes findings, and refines subsequent searches through reflection. By shifting away from static references, TALE aligns with free-form question-answering tasks common in real-world scenarios. Experimental results on multiple free-form QA benchmarks show that TALE not only outperforms standard reference-based metrics for measuring response accuracy but also achieves substantial to near-perfect agreement with human evaluations. TALE enhances the reliability of LLM evaluations in real-world, dynamic scenarios without relying on static references.
Updated: 2025-06-20 17:31:59
Subjects: cs.CL,cs.AI,I.2.7
BreastDCEDL: Curating a Comprehensive DCE-MRI Dataset and developing a Transformer Implementation for Breast Cancer Treatment Response Prediction
Breast cancer remains a leading cause of cancer-related mortality worldwide, making early detection and accurate treatment response monitoring critical priorities. We present BreastDCEDL, a curated, deep learning-ready dataset comprising pre-treatment 3D Dynamic Contrast-Enhanced MRI (DCE-MRI) scans from 2,070 breast cancer patients drawn from the I-SPY1, I-SPY2, and Duke cohorts, all sourced from The Cancer Imaging Archive. The raw DICOM imaging data were rigorously converted into standardized 3D NIfTI volumes with preserved signal integrity, accompanied by unified tumor annotations and harmonized clinical metadata including pathologic complete response (pCR), hormone receptor (HR), and HER2 status. Although DCE-MRI provides essential diagnostic information and deep learning offers tremendous potential for analyzing such complex data, progress has been limited by lack of accessible, public, multicenter datasets. BreastDCEDL addresses this gap by enabling development of advanced models, including state-of-the-art transformer architectures that require substantial training data. To demonstrate its capacity for robust modeling, we developed the first transformer-based model for breast DCE-MRI, leveraging Vision Transformer (ViT) architecture trained on RGB-fused images from three contrast phases (pre-contrast, early post-contrast, and late post-contrast). Our ViT model achieved state-of-the-art pCR prediction performance in HR+/HER2- patients (AUC 0.94, accuracy 0.93). BreastDCEDL includes predefined benchmark splits, offering a framework for reproducible research and enabling clinically meaningful modeling in breast cancer imaging.
Updated: 2025-06-20 17:29:37
Subjects: cs.CV,cs.AI,68T07, 68U10, 92C55,I.2.0; I.2.10; I.4.5; J.3
Convergent Linear Representations of Emergent Misalignment
Fine-tuning large language models on narrow datasets can cause them to develop broadly misaligned behaviours: a phenomenon known as emergent misalignment. However, the mechanisms underlying this misalignment, and why it generalizes beyond the training domain, are poorly understood, demonstrating critical gaps in our knowledge of model alignment. In this work, we train and study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct. Studying this, we find that different emergently misaligned models converge to similar representations of misalignment. We demonstrate this convergence by extracting a 'misalignment direction' from one fine-tuned model's activations, and using it to effectively ablate misaligned behaviour from fine-tunes using higher dimensional LoRAs and different datasets. Leveraging the scalar hidden state of rank-1 LoRAs, we further present a set of experiments for directly interpreting the fine-tuning adapters, showing that six contribute to general misalignment, while two specialise for misalignment in just the fine-tuning domain. Emergent misalignment is a particularly salient example of undesirable and unexpected model behaviour and by advancing our understanding of the mechanisms behind it, we hope to move towards being able to better understand and mitigate misalignment more generally.
Updated: 2025-06-20 17:23:55
Subjects: cs.LG,cs.AI
Deep generative models as the probability transformation functions
This paper introduces a unified theoretical perspective that views deep generative models as probability transformation functions. Despite the apparent differences in architecture and training methodologies among various types of generative models - autoencoders, autoregressive models, generative adversarial networks, normalizing flows, diffusion models, and flow matching - we demonstrate that they all fundamentally operate by transforming simple predefined distributions into complex target data distributions. This unifying perspective facilitates the transfer of methodological improvements between model architectures and provides a foundation for developing universal theoretical approaches, potentially leading to more efficient and effective generative modeling techniques.
Updated: 2025-06-20 17:22:23
Subjects: cs.LG,68T07
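The transformation view admits a closed-form one-dimensional example: push a standard normal sample through its own CDF to obtain a uniform variate, then through the inverse CDF of a target distribution (an exponential here). Every model family listed in the abstract approximates such a map in higher dimensions, where no closed form exists.

```python
from math import erf

import numpy as np

def gaussian_to_exponential(z, rate=1.0):
    """Transform N(0,1) samples into Exp(rate) samples:
    u = Phi(z) is uniform on (0,1), and F^{-1}(u) = -ln(1-u)/rate
    is the target's inverse CDF."""
    u = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))  # Phi(z)
    u = np.clip(u, 1e-12, 1 - 1e-12)                       # numerical safety
    return -np.log1p(-u) / rate

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)       # simple predefined distribution
x = gaussian_to_exponential(z, rate=2.0)  # complex(er) target distribution
```

A normalizing flow, diffusion model, or GAN generator plays exactly the role of `gaussian_to_exponential`: a learned, high-dimensional replacement for the composition of CDFs above.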
Challenges in Grounding Language in the Real World
A long-term goal of Artificial Intelligence is to build a language understanding system that allows a human to collaborate with a physical robot using language that is natural to the human. In this paper we highlight some of the challenges in doing this, and propose a solution that integrates the abilities of a cognitive agent capable of interactive task learning in a physical robot with the linguistic abilities of a large language model. We also point the way to an initial implementation of this approach.
Updated: 2025-06-20 17:17:53
Subjects: q-bio.NC,cs.AI
Continual Learning with Columnar Spiking Neural Networks
This study investigates columnar-organized spiking neural networks (SNNs) for continual learning and catastrophic forgetting. Using CoLaNET (Columnar Layered Network), we show that microcolumns adapt most efficiently to new tasks when they lack shared structure with prior learning. We demonstrate how CoLaNET hyperparameters govern the trade-off between retaining old knowledge (stability) and acquiring new information (plasticity). Our optimal configuration learns ten sequential MNIST tasks effectively, maintaining 92% accuracy on each. It shows low forgetting, with only 4% performance degradation on the first task after training on nine subsequent tasks.
Updated: 2025-06-20 17:13:38
Subjects: cs.NE,cs.AI
Proportional Sensitivity in Generative Adversarial Network (GAN)-Augmented Brain Tumor Classification Using Convolutional Neural Network
Generative Adversarial Networks (GAN) have shown potential in expanding limited medical imaging datasets. This study explores how different ratios of GAN-generated and real brain tumor MRI images impact the performance of a CNN in classifying healthy vs. tumorous scans. A DCGAN was used to create synthetic images which were mixed with real ones at various ratios to train a custom CNN. The CNN was then evaluated on a separate real-world test set. Our results indicate that the model maintains high sensitivity and precision in tumor classification, even when trained predominantly on synthetic data. When only a small portion of GAN data was added, such as 900 real images and 100 GAN images, the model achieved excellent performance, with test accuracy reaching 95.2%, and precision, recall, and F1-score all exceeding 95%. However, as the proportion of GAN images increased further, performance gradually declined. This study suggests that while GANs are useful for augmenting limited datasets especially when real data is scarce, too much synthetic data can introduce artifacts that affect the model's ability to generalize to real world cases.
Updated: 2025-06-20 17:12:03
Subjects: eess.IV,cs.AI,cs.CV
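The ratio experiment reduces to composing training sets with a controlled real/synthetic split. A hypothetical sketch (array shapes and the helper name are illustrative, not the study's code); `gan_fraction=0.1` over 1,000 samples reproduces the 900 real / 100 GAN split from the abstract.

```python
import numpy as np

def mix_training_set(real_imgs, real_labels, gan_imgs, gan_labels,
                     gan_fraction, n_total, seed=0):
    """Compose a shuffled training set with a fixed real/synthetic ratio."""
    rng = np.random.default_rng(seed)
    n_gan = int(round(gan_fraction * n_total))
    n_real = n_total - n_gan
    ri = rng.choice(len(real_imgs), n_real, replace=False)
    gi = rng.choice(len(gan_imgs), n_gan, replace=False)
    X = np.concatenate([real_imgs[ri], gan_imgs[gi]])
    y = np.concatenate([real_labels[ri], gan_labels[gi]])
    perm = rng.permutation(n_total)
    return X[perm], y[perm]

# Toy stand-ins: real scans as zero arrays, GAN scans as one arrays.
real = np.zeros((2000, 64, 64)); real_y = np.ones(2000)
fake = np.ones((500, 64, 64));   fake_y = np.ones(500)
X, y = mix_training_set(real, real_y, fake, fake_y,
                        gan_fraction=0.1, n_total=1000)
```

Sweeping `gan_fraction` while holding `n_total` fixed is what isolates the proportional-sensitivity effect the study reports.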
LLMs and Stack Overflow Discussions: Reliability, Impact, and Challenges
Since its release in November 2022, ChatGPT has shaken up Stack Overflow, the premier platform for developers' queries on programming and software development. Demonstrating an ability to generate instant, human-like responses to technical questions, ChatGPT has ignited debates within the developer community about the evolving role of human-driven platforms in the age of generative AI. Two months after ChatGPT's release, Meta released its answer with its own Large Language Model (LLM) called LLaMA: the race was on. We conducted an empirical study analyzing questions from Stack Overflow and using these LLMs to address them. This way, we aim to (i) quantify the reliability of LLMs' answers and their potential to replace Stack Overflow in the long term; (ii) identify and understand why LLMs fail; (iii) measure the evolution of user activity on Stack Overflow over time; and (iv) compare the LLMs with each other. Our empirical results are unequivocal: ChatGPT and LLaMA challenge human expertise, yet do not outperform it in some domains, while a significant decline in user posting activity has been observed. Furthermore, we also discuss the impact of our findings regarding the usage and development of new LLMs and provide guidelines for future challenges faced by users and researchers.
Updated: 2025-06-20 17:11:29
Subjects: cs.SE,cs.AI
EF21 with Bells & Whistles: Six Algorithmic Extensions of Modern Error Feedback
First proposed by Seide (2014) as a heuristic, error feedback (EF) is a very popular mechanism for enforcing convergence of distributed gradient-based optimization methods enhanced with communication compression strategies based on the application of contractive compression operators. However, existing theory of EF relies on very strong assumptions (e.g., bounded gradients), and provides pessimistic convergence rates (e.g., while the best known rate for EF in the smooth nonconvex regime, and when full gradients are compressed, is $O(1/T^{2/3})$, the rate of gradient descent in the same regime is $O(1/T)$). Recently, Richtárik et al. (2021) proposed a new error feedback mechanism, EF21, based on the construction of a Markov compressor induced by a contractive compressor. EF21 removes the aforementioned theoretical deficiencies of EF and at the same time works better in practice. In this work we propose six practical extensions of EF21, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum, and bidirectional compression. To the best of our knowledge, several of these techniques have not been previously analyzed in combination with EF, and in cases where prior analysis exists -- such as for bidirectional compression -- our theoretical convergence guarantees significantly improve upon existing results.
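The single-node form of the EF21 update can be sketched as follows. This is a minimal toy illustration with a top-k contractive compressor and a quadratic objective; the paper's distributed multi-node setting and its six extensions are not reproduced here:

```python
import numpy as np

def topk(v, k):
    """Contractive top-k compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21(grad, x0, lr=0.1, k=1, steps=500):
    """EF21: descend along g, a compressed (Markov) gradient estimate,
    then move g toward the true gradient by a compressed correction."""
    x = x0.astype(float)
    g = np.zeros_like(x)
    for _ in range(steps):
        x = x - lr * g
        g = g + topk(grad(x) - g, k)  # Markov compressor update
    return x

# Toy smooth problem f(x) = 0.5 * ||x||^2, so grad(x) = x; the minimizer is 0.
x_final = ef21(lambda x: x, np.array([1.0, -2.0]))
```

Even with aggressive compression (`k=1`), the iterates drift to the minimizer, which is the qualitative behavior EF21's theory guarantees for smooth problems.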
Updated: 2025-06-20 17:11:24
Subjects: cs.LG,math.OC
From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge
Efficient and accurate extraction of key information from 2D engineering drawings is essential for advancing digital manufacturing workflows. Such information includes geometric dimensioning and tolerancing (GD&T), measurements, material specifications, and textual annotations. Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text, leading to incomplete and unreliable outputs. To address these challenges, we propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-OBB) with a transformer-based vision-language parser. Our structured pipeline applies YOLOv11-OBB to localize annotations and extract oriented bounding box (OBB) patches, which are then parsed into structured outputs using a fine-tuned, lightweight vision-language model (VLM). We curate a dataset of 1,367 2D mechanical drawings annotated across nine key categories. YOLOv11-OBB is trained on this dataset to detect OBBs and extract annotation patches. These are parsed using two open-source VLMs, Donut and Florence-2, both lightweight and well-suited for specialized industrial tasks under limited computational overhead. After fine-tuning both models on the curated dataset of image patches paired with structured annotation labels, we conduct a comparative experiment to evaluate parsing performance across four key metrics. Donut outperforms Florence-2, achieving 88.5% precision, 99.2% recall, and a 93.5% F1-score, with a hallucination rate of 11.5%. Finally, a case study demonstrates how the extracted structured information supports downstream manufacturing tasks such as process and tool selection, showcasing the practical utility of the proposed framework in modernizing 2D drawing interpretation.
Updated: 2025-06-20 17:10:01
Subjects: cs.CV,cs.AI,cs.IR
The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making
Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format reflect diverging treatment selections between humans and LLMs. We find that LLMs are more sensitive to gender and style perturbations while human annotators are more sensitive to LLM-generated format perturbations such as clinical summaries. Our results highlight the need for evaluation frameworks that go beyond static benchmarks to assess the similarity between human clinician and LLM decisions under the variability characteristic of clinical settings.
Updated: 2025-06-20 17:09:27
Subjects: cs.AI
Analyzing PDFs like Binaries: Adversarially Robust PDF Malware Analysis via Intermediate Representation and Language Model
Malicious PDF files have emerged as a persistent threat and become a popular attack vector in web-based attacks. While machine learning-based PDF malware classifiers have shown promise, these classifiers are often susceptible to adversarial attacks, undermining their reliability. To address this issue, recent studies have aimed to enhance the robustness of PDF classifiers. Despite these efforts, the feature engineering underlying these studies remains outdated. Consequently, even with the application of cutting-edge machine learning techniques, these approaches fail to fundamentally resolve the issue of feature instability. To tackle this, we propose a novel approach for PDF feature extraction and PDF malware detection. We introduce PDFObj IR (PDF Object Intermediate Representation), an assembly-like language framework for PDF objects, from which we extract semantic features using a pretrained language model. Additionally, we construct an Object Reference Graph to capture structural features, drawing inspiration from program analysis. This dual approach enables us to analyze and detect PDF malware based on both semantic and structural features. Experimental results demonstrate that, compared to state-of-the-art PDF malware classifiers, our proposed classifier achieves strong adversarial robustness while maintaining an exceptionally low false positive rate of only 0.07% on the baseline dataset.
Updated: 2025-06-20 17:08:08
Subjects: cs.CR
A Minimalist Method for Fine-tuning Text-to-Image Diffusion Models
Recent work uses reinforcement learning (RL) to fine-tune text-to-image diffusion models, improving text-image alignment and sample quality. However, existing approaches introduce unnecessary complexity: they cache the full sampling trajectory, depend on differentiable reward models or large preference datasets, or require specialized guidance techniques. Motivated by the "golden noise" hypothesis -- that certain initial noise samples can consistently yield superior alignment -- we introduce Noise PPO, a minimalist RL algorithm that leaves the pre-trained diffusion model entirely frozen and learns a prompt-conditioned initial noise generator. Our approach requires no trajectory storage, reward backpropagation, or complex guidance tricks. Extensive experiments show that optimizing the initial noise distribution consistently improves alignment and sample quality over the original model, with the most significant gains at low inference steps. As the number of inference steps increases, the benefit of noise optimization diminishes but remains present. These findings clarify the scope and limitations of the golden noise hypothesis and reinforce the practical value of minimalist RL fine-tuning for diffusion models.
Updated: 2025-06-20 16:59:05
Subjects: cs.LG,cs.AI
Sparse-Reg: Improving Sample Complexity in Offline Reinforcement Learning using Sparsity
In this paper, we investigate the use of small datasets in the context of offline reinforcement learning (RL). While many common offline RL benchmarks employ datasets with over a million data points, many offline RL applications rely on considerably smaller datasets. We show that offline RL algorithms can overfit on small datasets, resulting in poor performance. To address this challenge, we introduce "Sparse-Reg": a regularization technique based on sparsity to mitigate overfitting in offline reinforcement learning, enabling effective learning in limited data settings and outperforming state-of-the-art baselines in continuous control.
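The abstract does not spell out the regularizer's exact form. One common way to impose sparsity, shown purely as an illustrative sketch (the function name, the L1 choice, and the coefficient are assumptions, not the paper's method), is an L1 penalty added to the critic's TD loss:

```python
import numpy as np

def sparse_td_loss(td_errors, weights, lam=0.1):
    """Mean-squared TD error plus an L1 sparsity penalty on the weights.
    A hypothetical stand-in for a sparsity-based regularizer like Sparse-Reg."""
    return float(np.mean(td_errors ** 2) + lam * np.sum(np.abs(weights)))

# MSE term: mean([1, 1]) = 1.0; L1 term: 0.1 * (2 + 2) = 0.4
loss = sparse_td_loss(np.array([1.0, -1.0]), np.array([2.0, -2.0]))
```

The intuition from the abstract is that shrinking many weights toward zero limits the hypothesis space, which counters overfitting when the offline dataset is small.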
Updated: 2025-06-20 16:57:59
Subjects: cs.LG,cs.AI
Global Microprocessor Correctness in the Presence of Transient Execution
Correctness for microprocessors is generally understood to be conformance with the associated instruction set architecture (ISA). This is the basis for one of the most important abstractions in computer science, allowing hardware designers to develop highly-optimized processors that are functionally "equivalent" to an ideal processor that executes instructions atomically. This specification is almost always informal, e.g., commercial microprocessors generally do not come with conformance specifications. In this paper, we advocate for the use of formal specifications, using the theory of refinement. We introduce notions of correctness that can be used to deal with transient execution attacks, including Meltdown and Spectre. Such attacks have shown that ubiquitous microprocessor optimizations, appearing in numerous processors for decades, are inherently buggy. Unlike alternative approaches that use non-interference properties, our notion of correctness is global, meaning it is a single specification that formalizes conformance, includes functional correctness, and is parameterized by a microarchitecture. We introduce action skipping refinement, a new type of refinement, and we describe how our notions of refinement can be decomposed into properties that are more amenable to automated verification using the concept of shared-resource commitment refinement maps. We do this in the context of formal, fully executable bit- and cycle-accurate models of an ISA and a microprocessor. Finally, we show how lightweight formal methods based on property-based testing can be used to identify transient execution bugs.
Updated: 2025-06-20 16:56:14
Subjects: cs.CR
A Technical Study into 0.5B Reasoning Language Models
The ongoing evolution of language models has led to the development of large-scale architectures that demonstrate exceptional performance across a wide range of tasks. However, these models come with significant computational and energy demands, as well as potential privacy implications. In this context, Small Reasoning Language Models (SRLMs) with approximately 0.5 billion parameters present a compelling alternative due to their remarkable computational efficiency and cost effectiveness, particularly in resource-constrained environments. Despite these advantages, the limited capacity of 0.5 billion parameter models poses challenges in handling complex tasks such as mathematical reasoning and code generation. This research investigates various training strategies, including supervised fine-tuning (SFT), knowledge distillation (KD), and reinforcement learning (RL), as well as their hybrid implementations, to enhance the performance of 0.5B SRLMs. We analyze effective methodologies to bridge the performance gap between SRLMs and larger models and present insights into optimal training pipelines tailored for these smaller architectures. Through extensive experimental validation and analysis, our work aims to provide actionable recommendations for maximizing the reasoning capabilities of 0.5B models.
Updated: 2025-06-20 16:50:22
Subjects: cs.AI
LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning
Chain-of-thought (CoT) prompting is a popular in-context learning (ICL) approach for large language models (LLMs), especially when tackling complex reasoning tasks. Traditional ICL approaches construct prompts using examples that contain questions similar to the input question. However, CoT prompting, which includes crucial intermediate reasoning steps (rationales) within its examples, necessitates selecting examples based on these rationales rather than the questions themselves. Existing methods require human experts or pre-trained LLMs to describe the skill, a high-level abstraction of rationales, to guide the selection. These methods, however, are often costly and difficult to scale. Instead, this paper introduces a new approach named Latent Reasoning Skills (LaRS) that employs unsupervised learning to create a latent space representation of rationales, with a latent variable called a reasoning skill. Concurrently, LaRS learns a reasoning policy to determine the required reasoning skill for a given question. Then the ICL examples are selected by aligning the reasoning skills between past examples and the question. This approach is theoretically grounded and compute-efficient, eliminating the need for auxiliary LLM inference or manual prompt design. Empirical results demonstrate that LaRS consistently outperforms SOTA skill-based selection methods, processing example banks four times faster, reducing LLM inferences during the selection stage by half, and showing greater robustness to sub-optimal example banks.
Updated: 2025-06-20 16:50:07
Subjects: cs.CL,cs.AI
Do We Need Large VLMs for Spotting Soccer Actions?
Traditional video-based tasks like soccer action spotting rely heavily on visual inputs, often requiring complex and computationally expensive models to process dense video data. In this work, we propose a shift from this video-centric approach to a text-based task, making it lightweight and scalable by utilizing Large Language Models (LLMs) instead of Vision-Language Models (VLMs). We posit that expert commentary, which provides rich, fine-grained descriptions and contextual cues such as excitement and tactical insights, contains enough information to reliably spot key actions in a match. To demonstrate this, we use the SoccerNet Echoes dataset, which provides timestamped commentary, and employ a system of three LLMs acting as judges specializing in outcome, excitement, and tactics. Each LLM evaluates sliding windows of commentary to identify actions like goals, cards, and substitutions, generating accurate timestamps for these events. Our experiments show that this language-centric approach performs effectively in detecting critical match events, providing a lightweight and training-free alternative to traditional video-based methods for action spotting.
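The three-judge sliding-window scheme can be sketched in plain Python. The judge functions below are keyword stubs standing in for the outcome, excitement, and tactics LLMs (which are not reproduced here); the window size and majority rule are illustrative assumptions:

```python
def spot_actions(comments, judges, size=3):
    """Slide a window over timestamped commentary; record an event whenever
    at least two judges agree on an action label for that window."""
    events = []
    for i in range(len(comments) - size + 1):
        window = comments[i:i + size]
        votes = [v for v in (judge(window) for judge in judges) if v is not None]
        if votes:
            top = max(set(votes), key=votes.count)
            if votes.count(top) >= 2:
                events.append((i, top))
    return events

# Stub judges: in the real system each would be an LLM call on the window.
outcome = lambda w: "goal" if any("goal" in c for c in w) else None
excitement = lambda w: "goal" if any("!" in c for c in w) else None
tactics = lambda w: None

commentary = ["kickoff", "a long ball forward", "what a goal!", "the crowd erupts"]
events = spot_actions(commentary, [outcome, excitement, tactics])
```

Overlapping windows flag the same action at adjacent offsets; a real system would merge nearby detections into one timestamped event.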
Updated: 2025-06-20 16:45:54
Subjects: cs.CV,cs.AI,cs.LG
MeDi: Metadata-Guided Diffusion Models for Mitigating Biases in Tumor Classification
Deep learning models have made significant advances in histological prediction tasks in recent years. However, for adaptation in clinical practice, their lack of robustness to varying conditions such as staining, scanner, hospital, and demographics is still a limiting factor: if trained on overrepresented subpopulations, models regularly struggle with less frequent patterns, leading to shortcut learning and biased predictions. Large-scale foundation models have not fully eliminated this issue. Therefore, we propose a novel approach explicitly modeling such metadata into a Metadata-guided generative Diffusion model framework (MeDi). MeDi allows for a targeted augmentation of underrepresented subpopulations with synthetic data, which balances limited training data and mitigates biases in downstream models. We experimentally show that MeDi generates high-quality histopathology images for unseen subpopulations in TCGA, boosts the overall fidelity of the generated images, and enables improvements in performance for downstream classifiers on datasets with subpopulation shifts. Our work is a proof-of-concept towards better mitigating data biases with generative models.
Updated: 2025-06-20 16:41:25
Subjects: eess.IV,cs.AI,cs.CV
Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models
Diffusion models have recently gained significant attention due to their effectiveness in various scientific domains, including biochemistry. When trained on equilibrium molecular distributions, diffusion models provide both a generative procedure for sampling equilibrium conformations and associated forces derived from the model's scores. However, using the forces for coarse-grained molecular dynamics simulations uncovers inconsistencies between the samples generated via classical diffusion inference and via simulation, despite both originating from the same model. Particularly at the small diffusion timesteps required for simulations, diffusion models fail to satisfy the Fokker-Planck equation, which governs how the score should evolve over time. We interpret this deviation as an indication of the observed inconsistencies and propose an energy-based diffusion model with a Fokker-Planck-derived regularization term enforcing consistency. We demonstrate the effectiveness of our approach on toy systems and alanine dipeptide, and introduce a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and demonstrates enhanced consistency and efficient sampling.
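The Fokker-Planck condition invoked above has a standard form, written here in generic forward-SDE notation (which may differ from the paper's exact formulation): for a forward SDE $\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w$, the marginal density $p_t$ must satisfy

```latex
% Fokker-Planck equation for the marginals of dx = f(x,t) dt + g(t) dw
\frac{\partial p_t(x)}{\partial t}
  = -\nabla \cdot \bigl( f(x,t)\, p_t(x) \bigr)
  + \frac{1}{2}\, g(t)^2 \, \Delta p_t(x)
```

Since the model's score $s_\theta(x,t)$ approximates $\nabla_x \log p_t(x)$, a consistency regularizer of the kind described can penalize the residual of this identity when expressed through $s_\theta$; the paper's exact regularization term is not reproduced here.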
Updated: 2025-06-20 16:38:29
Subjects: cs.LG,cs.AI,physics.chem-ph,physics.comp-ph,stat.ML
Robust Training with Data Augmentation for Medical Imaging Classification
Deep neural networks are increasingly being used to detect and diagnose medical conditions using medical imaging. Despite their utility, these models are highly vulnerable to adversarial attacks and distribution shifts, which can affect diagnostic reliability and undermine trust among healthcare professionals. In this study, we propose a robust training algorithm with data augmentation (RTDA) to mitigate these vulnerabilities in medical image classification. We benchmark the robustness of RTDA and six competing baseline techniques, including adversarial training and data augmentation approaches in isolation and in combination, against adversarial perturbations and natural variations, using experimental data sets with three different imaging technologies (mammograms, X-rays, and ultrasound). We demonstrate that RTDA achieves superior robustness against adversarial attacks and improved generalization performance in the presence of distribution shift in each image classification task while maintaining high clean accuracy.
Updated: 2025-06-20 16:36:39
Subjects: eess.IV,cs.AI,cs.CV,cs.LG
Chain-of-Trust: A Progressive Trust Evaluation Framework Enabled by Generative AI
In collaborative systems with complex tasks relying on distributed resources, trust evaluation of potential collaborators has emerged as an effective mechanism for task completion. However, due to the network dynamics and varying information gathering latencies, it is extremely challenging to observe and collect all trust attributes of a collaborating device concurrently for a comprehensive trust assessment. In this paper, a novel progressive trust evaluation framework, namely chain-of-trust, is proposed to make better use of misaligned device attribute data. This framework, designed for effective task completion, divides the trust evaluation process into multiple chained stages based on task decomposition. At each stage, based on the task completion process, the framework only gathers the latest device attribute data relevant to that stage, leading to reduced trust evaluation complexity and overhead. By leveraging advanced in-context learning, few-shot learning, and reasoning capabilities, generative AI is then employed to analyze and interpret the collected data to produce correct evaluation results quickly. Only devices deemed trustworthy at this stage proceed to the next round of trust evaluation. The framework ultimately determines devices that remain trustworthy across all stages. Experimental results demonstrate that the proposed framework achieves high accuracy in trust evaluation.
Updated: 2025-06-20 16:33:03
Subjects: cs.AI
Rapid and Continuous Trust Evaluation for Effective Task Collaboration Through Siamese Model
Trust is emerging as an effective tool to ensure the successful completion of collaborative tasks within collaborative systems. However, rapidly and continuously evaluating the trustworthiness of collaborators during task execution is a significant challenge due to distributed devices, complex operational environments, and dynamically changing resources. To tackle this challenge, this paper proposes a Siamese-enabled rapid and continuous trust evaluation framework (SRCTE) to facilitate effective task collaboration. First, the communication and computing resource attributes of the collaborator in a trusted state, along with historical collaboration data, are collected and represented using an attributed control flow graph (ACFG) that captures trust-related semantic information and serves as a reference for comparison with data collected during task execution. At each time slot of task execution, the collaborator's communication and computing resource attributes, as well as task completion effectiveness, are collected in real time and represented with an ACFG to convey their trust-related semantic information. A Siamese model, consisting of two shared-parameter Structure2vec networks, is then employed to learn the deep semantics of each pair of ACFGs and generate their embeddings. Finally, the similarity between the embeddings of each pair of ACFGs is calculated to determine the collaborator's trust value at each time slot. A real system is built using two Dell EMC 5200 servers and a Google Pixel 8 to test the effectiveness of the proposed SRCTE framework. Experimental results demonstrate that SRCTE converges rapidly with only a small amount of data and achieves a high anomaly trust detection rate compared to the baseline algorithm.
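The final similarity step can be sketched as follows. This is a minimal stand-in: real embeddings would come from the two shared-parameter Structure2vec networks over ACFGs, which are not reproduced here, and the threshold value is a hypothetical choice:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def trust_value(reference_emb, runtime_emb, threshold=0.8):
    """Trust score for one time slot: similarity between the trusted-state
    ACFG embedding and the ACFG embedding observed during task execution."""
    score = cosine_similarity(reference_emb, runtime_emb)
    return score, score >= threshold

# A runtime embedding close to the trusted reference yields a high score.
score, trusted = trust_value([1.0, 0.0, 1.0], [1.0, 0.1, 1.0])
```

Recomputing this score at each time slot gives the continuous trust signal the framework describes; a collaborator whose runtime ACFG drifts from the trusted reference falls below the threshold.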
Updated: 2025-06-20 16:30:59
Subjects: cs.LG,cs.AI
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. Particularly, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model especially for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.
Updated: 2025-06-20 16:30:09
Subjects: cs.CV,cs.AI
Watermarking Language Models through Language Models
Watermarking the outputs of large language models (LLMs) is critical for provenance tracing, content regulation, and model accountability. Existing approaches often rely on access to model internals or are constrained by static rules and token-level perturbations. Moreover, the idea of steering generative behavior via prompt-based instruction control remains largely underexplored. We introduce a prompt-guided watermarking framework that operates entirely at the input level and requires no access to model parameters or decoding logits. The framework comprises three cooperating components: a Prompting LM that synthesizes watermarking instructions from user prompts, a Marking LM that generates watermarked outputs conditioned on these instructions, and a Detecting LM trained to classify whether a response carries an embedded watermark. This modular design enables dynamic watermarking that adapts to individual prompts while remaining compatible with diverse LLM architectures, including both proprietary and open-weight models. We evaluate the framework over 25 combinations of Prompting and Marking LMs, such as GPT-4o, Mistral, LLaMA3, and DeepSeek. Experimental results show that watermark signals generalize across architectures and remain robust under fine-tuning, model distillation, and prompt-based adversarial attacks, demonstrating the effectiveness and robustness of the proposed approach.
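The three-component flow can be sketched with stub models. These lambdas are hypothetical stand-ins: the real framework uses separate LLMs for prompting and marking and a trained classifier for detection, none of which are reproduced here:

```python
def watermark_pipeline(user_prompt, prompting_lm, marking_lm, detecting_lm):
    """Prompt-level watermarking: operates entirely on inputs and outputs,
    with no access to model weights or decoding logits."""
    instruction = prompting_lm(user_prompt)          # synthesize watermark instruction
    response = marking_lm(user_prompt, instruction)  # generate watermarked output
    detected = detecting_lm(response)                # classify watermark presence
    return response, detected

# Toy stubs: here the "watermark" is simply an instructed marker phrase.
prompting_lm = lambda p: "end every answer with the phrase 'per the above'"
marking_lm = lambda p, instr: f"Answer to '{p}'. per the above"
detecting_lm = lambda r: "per the above" in r

response, detected = watermark_pipeline("What is 2+2?", prompting_lm,
                                        marking_lm, detecting_lm)
```

Because every component is driven through prompts, the same pipeline shape works for proprietary models exposed only through an API, which is the modularity the abstract emphasizes.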
Updated: 2025-06-20 16:24:13
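The three-component design can be pictured with hand-written stand-ins; a minimal sketch in which the watermark instruction is a simple lexical rule (all three "LMs" below are toy stubs, not the paper's trained models):

```python
def prompting_lm(user_prompt):
    """Stub Prompting LM: synthesize a watermarking instruction for the prompt."""
    return "Answer the question, then end the reply with the token [WM]."

def marking_lm(user_prompt, instruction):
    """Stub Marking LM: generate a reply conditioned on the instruction."""
    reply = f"A short answer to: {user_prompt}"
    if "[WM]" in instruction:
        reply += " [WM]"
    return reply

def detecting_lm(text):
    """Stub Detecting LM: classify whether the watermark is embedded."""
    return text.endswith("[WM]")

instruction = prompting_lm("What is entropy?")
marked = marking_lm("What is entropy?", instruction)
unmarked = "A reply produced without any watermark instruction."
```

The point of the sketch is the data flow, prompt to instruction to marked output to detection, which operates entirely at the input/output level with no access to model internals.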
Domain: cs.LG,cs.CL,cs.CR
Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement
The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue for automatically generating labeled datasets for various natural language processing (NLP) tasks, providing an alternative to this expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from noisy labels, a model's generalization is likely to be harmed as it is prone to overfitting the label noise. While previous studies on learning from noisy labels mainly focus on synthetic and real-world noise, LLM-generated label noise has received less attention. In this paper, we propose SiDyP: Simplex Label Diffusion with Dynamic Prior, which calibrates the classifier's prediction, thus enhancing its robustness towards LLM-generated noisy labels. SiDyP retrieves potential true label candidates via the neighborhood label distribution in text embedding space and iteratively refines noisy candidates using a simplex diffusion model. Our framework can increase the performance of a BERT classifier fine-tuned on zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30%, respectively. We demonstrate the effectiveness of SiDyP through extensive benchmarking of different LLMs over a variety of NLP tasks. Our code is available on GitHub.
Updated: 2025-06-20 16:24:07
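The candidate-retrieval step, reading a label distribution off the nearest neighbors in embedding space, can be sketched as follows (the 2-D embeddings and labels are toy stand-ins for the text embeddings SiDyP actually uses):

```python
from collections import Counter
from math import dist

def neighborhood_label_distribution(query, corpus, k=3):
    """Distribution over labels among the k nearest embeddings to `query`."""
    nearest = sorted(corpus, key=lambda item: dist(query, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / k for label, n in counts.items()}

# (embedding, possibly-noisy LLM label) pairs.
corpus = [((0.0, 0.0), "pos"), ((0.1, 0.0), "pos"),
          ((0.0, 0.1), "neg"), ((2.0, 2.0), "neg")]
dist_q = neighborhood_label_distribution((0.05, 0.05), corpus, k=3)
```

In the method described above, this distribution seeds the candidate set that the simplex diffusion model then refines; here it simply down-weights the far-away "neg" label.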
Domain: cs.CL,cs.AI
When Can Model-Free Reinforcement Learning be Enough for Thinking?
Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of "thinking" through model-free RL is interesting as thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to "thinking" as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a thought Markov decision process (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior. Finally, we hypothesize sufficient conditions that would enable thinking to be learned outside of language generation and introduce a toy domain where a combination of multi-task pre-training and designated thought actions enable more data-efficient RL compared to non-thinking agents.
Updated: 2025-06-20 16:23:46
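The formal equivalence claimed above, a thought action as one step of policy improvement taken before acting, can be illustrated in a tabular toy (the MDP, Q-values, and action names are illustrative assumptions):

```python
def improve(policy, Q):
    """One policy-improvement step: act greedily w.r.t. Q in every state."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

# Toy Q-function over two world states and two external actions.
Q = {"s0": {"left": 0.2, "right": 1.0},
     "s1": {"left": 0.9, "right": 0.1}}
policy = {"s0": "left", "s1": "left"}  # poorly initialized policy

def act(state, policy, think):
    # A "thought action" changes no external state and earns no reward;
    # it only improves the policy before the external action is taken.
    if think:
        policy = improve(policy, Q)
    return policy[state]

action_without_thinking = act("s0", policy, think=False)
action_with_thinking = act("s0", policy, think=True)
```

With a poor initialization, spending the extra step on "thinking" flips the chosen action to the higher-value one, which is the mechanism the theory says makes thinking worth its cost.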
Domain: cs.AI
A Homomorphic Encryption Framework for Privacy-Preserving Spiking Neural Networks
Machine learning (ML) is widely used today, especially through deep neural networks (DNNs); however, increasing computational load and resource requirements have led to cloud-based solutions. To address this problem, a new generation of networks called Spiking Neural Networks (SNNs) has emerged, which mimic the behavior of the human brain to improve efficiency and reduce energy consumption. These networks often process large amounts of sensitive information, such as confidential data, so privacy issues arise. Homomorphic encryption (HE) offers a solution, allowing calculations to be performed on encrypted data without decrypting it. This research compares traditional DNNs and SNNs using the Brakerski/Fan-Vercauteren (BFV) encryption scheme. Both the DNN and SNN models are based on LeNet-5, a widely used convolutional architecture, and the networks are trained and compared using the FashionMNIST dataset. The results show that SNNs using HE achieve up to 40% higher accuracy than DNNs for low values of the plaintext modulus t, although their execution time is longer due to their time-coding nature with multiple time-steps.
Updated: 2025-06-20 16:17:20
Domain: cs.CR,cs.NE
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
Updated: 2025-06-20 16:14:13
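The modality- and skill-aware selection step can be pictured as a routing table over expert models; a toy sketch with stub experts (the expert names, routing keys, and concatenation-based aggregation are illustrative assumptions, not MEXA's actual components):

```python
# Stub "experts": each maps an input to a textual reasoning output.
EXPERTS = {
    ("video", "temporal"): lambda x: f"video-temporal analysis of {x}",
    ("audio", "speech"):   lambda x: f"speech transcript of {x}",
    ("3d", "geometry"):    lambda x: f"3d layout description of {x}",
}

def select_experts(modality, skills):
    """Pick every expert matching the input modality and required skills."""
    return [fn for (m, s), fn in EXPERTS.items() if m == modality and s in skills]

def aggregate(outputs):
    """Stand-in for the Large Reasoning Model: here, plain concatenation."""
    return " | ".join(outputs)

chosen = select_experts("video", {"temporal"})
answer = aggregate(fn("clip_01") for fn in chosen)
```

The training-free property comes from exactly this structure: routing and aggregation are performed over frozen experts, so adding a modality means adding a table entry, not retraining.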
Domain: cs.CV,cs.AI,cs.CL
Are Bias Evaluation Methods Biased?
The creation of benchmarks to evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. These benchmarks allow models to be compared on different aspects of safety such as toxicity, bias, and harmful behavior. Independent benchmarks adopt different approaches with distinct data sets and evaluation methods. We investigate how robust such benchmarks are by using different approaches to rank a set of representative models for bias and comparing how similar the overall rankings are. We show that different but widely used bias evaluation methods result in disparate model rankings. We conclude with recommendations for the community on the usage of such benchmarks.
Updated: 2025-06-20 16:11:25
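The comparison of overall rankings described above can be made concrete with a rank-correlation measure such as Kendall's tau; a minimal sketch (the model names and rankings are hypothetical, not the paper's data):

```python
def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items."""
    items = list(rank_a)
    concordant = discordant = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            x, y = items[i], items[j]
            # Do the two rankings order the pair (x, y) the same way?
            sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs

# Two bias-evaluation methods ranking the same four (hypothetical) models.
method_1 = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
method_2 = {"model_a": 2, "model_b": 1, "model_c": 3, "model_d": 4}
tau = kendall_tau(method_1, method_2)  # one swapped pair out of six
```

A tau of 1.0 means the two methods agree on every pairwise ordering; values well below 1.0 are the "disparate model rankings" the study reports.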
Domain: cs.AI,cs.CL
Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving
Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B's low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs' mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.
Updated: 2025-06-20 16:09:56
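For context, an entry in a Lean 4 theorem dataset of the kind described might look like the following toy first-order statements (illustrative examples, not drawn from the paper's 447-theorem dataset):

```lean
-- A small first-order tautology stated and proved in Lean 4:
-- if p holds everywhere, then p holds somewhere (on an inhabited type).
theorem exists_of_forall {α : Type} [Inhabited α] (p : α → Prop)
    (h : ∀ x, p x) : ∃ x, p x :=
  ⟨default, h default⟩

-- A short FOL deduction: commutativity of conjunction.
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
  fun ⟨hp, hq⟩ => ⟨hq, hp⟩
```

Multi-step proofs of this kind are where the paper reports early mistakes compounding, which motivates the sub-proposition error feedback.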
Domain: cs.AI,cs.CL,cs.LO
TransDreamerV3: Implanting Transformer In DreamerV3
This paper introduces TransDreamerV3, a reinforcement learning model that enhances the DreamerV3 architecture by integrating a transformer encoder. The model is designed to improve memory and decision-making capabilities in complex environments. We conducted experiments on the Atari-Boxing, Atari-Freeway, Atari-Pong, and Crafter tasks, where TransDreamerV3 demonstrated improved performance over DreamerV3, particularly on Atari-Freeway and Crafter. Although issues arose in the Minecraft task and training was limited across all tasks, TransDreamerV3 represents an advance in world-model-based reinforcement learning, leveraging transformer architectures.
Updated: 2025-06-20 16:09:17
Domain: cs.LG,cs.AI
Multimodal Political Bias Identification and Neutralization
Due to the presence of political echo chambers, it becomes imperative to detect and remove subjective bias and emotionally charged language from both the text and images of political articles. However, prior work has focused solely on the text portion of the bias rather than both the text and image portions. This is a problem because images are just as powerful a medium for communicating information as text is. To that end, we present a model that addresses both text and image bias and consists of four steps. Image-Text Alignment semantically aligns images based on their bias through CLIP models. Image Bias Scoring determines the appropriate bias score of images via a ViT classifier. Text De-Biasing detects biased words and phrases and neutralizes them through BERT models. These three steps culminate in the final debiasing step, which replaces the text and the image with neutralized or reduced counterparts; for images, this is done by comparing bias scores. The results so far indicate that this approach is promising: the text debiasing strategy identifies many potentially biased words and phrases, the ViT model trains effectively, and the semantic alignment model is efficient. However, more time and resources, particularly for training, are needed to obtain better results. A human evaluation step was also proposed to ensure semantic consistency of the newly generated text and images.
Updated: 2025-06-20 16:03:20
Domain: cs.CY,cs.AI,cs.CV
Identifiability of Deep Polynomial Neural Networks
Polynomial Neural Networks (PNNs) possess a rich algebraic and geometric structure. However, their identifiability -- a key property for ensuring interpretability -- remains poorly understood. In this work, we present a comprehensive analysis of the identifiability of deep PNNs, including architectures with and without bias terms. Our results reveal an intricate interplay between activation degrees and layer widths in achieving identifiability. As special cases, we show that architectures with non-increasing layer widths are generically identifiable under mild conditions, while encoder-decoder networks are identifiable when the decoder widths do not grow too rapidly. Our proofs are constructive and center on a connection between deep PNNs and low-rank tensor decompositions, and Kruskal-type uniqueness theorems. This yields both generic conditions determined by the architecture, and effective conditions that depend on the network's parameters. We also settle an open conjecture on the expected dimension of PNN's neurovarieties, and provide new bounds on the activation degrees required for it to reach its maximum.
Updated: 2025-06-20 15:58:46
Domain: cs.LG,cs.AI,math.AG,stat.ML,68T07, 62R01, 15A69, 14M99
Secret Sharing in 5G-MEC: Applicability for joint Security and Dependability
Multi-access Edge Computing (MEC), an enhancement of 5G, processes data closer to its generation point, reducing latency and network load. However, the distributed and edge-based nature of 5G-MEC presents privacy and security challenges, including data exposure risks. Ensuring efficient manipulation and security of sensitive data at the edge is crucial. To address these challenges, we investigate the usage of threshold secret sharing in 5G-MEC storage, an approach that enhances both security and dependability. A (k,n) threshold secret sharing scheme splits and stores sensitive data among n nodes, requiring at least k nodes for reconstruction. The solution ensures confidentiality by protecting data against fewer than k colluding nodes and enhances availability by tolerating up to n-k failing nodes. This approach mitigates threats such as unauthorized access and node failures, whether accidental or intentional. We further discuss a method for selecting suitable MEC hosts (MEHs) to store the shares, with the MEHs' trustworthiness level as the main criterion. Although we define our proposal in the context of secret-shared data storage, it can also be seen as a standalone selection process for trustworthy 5G-MEC nodes in other scenarios.
Updated: 2025-06-20 15:58:41
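The (k,n) threshold scheme described above can be illustrated with a minimal Shamir-style sketch over a prime field (the field size and parameters are toy choices, not the paper's deployment):

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime large enough for toy secrets

def make_shares(secret, k, n):
    """Split `secret` into n shares; any k of them reconstruct it."""
    # Random polynomial of degree k-1 with constant term = secret.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, PRIME - 2, PRIME) is the modular inverse (Fermat).
        total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total

shares = make_shares(123456789, k=3, n=5)
```

Any 3 of the 5 shares recover the secret, while any 2 reveal nothing; this is exactly the confidentiality/availability trade-off the abstract attributes to the (k,n) scheme.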
Domain: cs.CR,cs.NI
AI based Content Creation and Product Recommendation Applications in E-commerce: An Ethical overview
As e-commerce rapidly integrates artificial intelligence for content creation and product recommendations, these technologies offer significant benefits in personalization and efficiency. AI-driven systems automate product descriptions, generate dynamic advertisements, and deliver tailored recommendations based on consumer behavior, as seen in major platforms like Amazon and Shopify. However, the widespread use of AI in e-commerce raises crucial ethical challenges, particularly around data privacy, algorithmic bias, and consumer autonomy. Bias, whether cultural, gender-based, or socioeconomic, can be inadvertently embedded in AI models, leading to inequitable product recommendations and reinforcing harmful stereotypes. This paper examines the ethical implications of AI-driven content creation and product recommendations, emphasizing the need for frameworks that ensure fairness and transparency, and for more established and robust ethical standards. We propose actionable best practices to remove bias and ensure inclusivity, such as conducting regular audits of algorithms, diversifying training data, and incorporating fairness metrics into AI models. Additionally, we discuss frameworks for ethical conformance that focus on safeguarding consumer data privacy, promoting transparency in decision-making processes, and enhancing consumer autonomy. By addressing these issues, we provide guidelines for responsibly utilizing AI in e-commerce applications for content creation and product recommendations, ensuring that these technologies are both effective and ethically sound.
Updated: 2025-06-20 15:54:25
Domain: cs.CY,cs.AI
Domain Specific Benchmarks for Evaluating Multimodal Large Language Models
Large language models (LLMs) are increasingly being deployed across disciplines due to their advanced reasoning and problem-solving capabilities. To assess their effectiveness, various benchmarks have been developed that measure aspects of LLM reasoning, comprehension, and problem-solving. While several surveys address LLM evaluation and benchmarks, a domain-specific analysis remains underexplored in the literature. This paper introduces a taxonomy of seven key disciplines, encompassing various domains and application areas where LLMs are extensively utilized. Additionally, we provide a comprehensive review of LLM benchmarks and survey papers within each domain, highlighting the unique capabilities of LLMs and the challenges faced in their application. Finally, we compile and categorize these benchmarks by domain to create an accessible resource for researchers, aiming to pave the way for advancements toward artificial general intelligence (AGI).
Updated: 2025-06-20 15:52:06
Domain: cs.LG
Dispositions and Roles of Generically Dependent Entities
BFO 2020 does not support functions, dispositions, and roles of generically dependent continuants (like software or datasets). In this paper, we argue that this is a severe limitation, which prevents, for example, the adequate representation of the functions of computer models or the various roles of datasets during the execution of these models. We discuss the aspects of BFO 2020 that prevent the representation of realizable entities of generically dependent continuants. Two approaches to address the issue are presented: (a) the use of defined classes and (b) a proposal of changes that allow BFO to support functions, dispositions, and roles of generically dependent continuants.
Updated: 2025-06-20 15:40:45
Domain: cs.AI
Re-Evaluating Code LLM Benchmarks Under Semantic Mutation
In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related tasks, such as code understanding and generation. A critical step in constructing code benchmarks is the design of prompts. However, as existing code benchmarks typically rely on a single prompt template per task, they are prone to the issue of prompt sensitivity, where minor prompt variations could result in substantial performance variations, leading to unreliable evaluations of model capabilities. While previous studies have explored prompt sensitivity, their experimental designs and findings are limited to traditional natural language processing (NLP) tasks. In this paper, we present an empirical study to investigate prompt sensitivity in code benchmarks. We first propose a general framework that modifies prompt templates in a manner that preserves both their semantics and their structure as much as possible. Based on the framework, we conduct extensive experiments across eight code benchmark tasks on 10 representative open-source LLMs, with each task featuring 100 semantically similar prompt templates. We then analyze the evaluation results using various statistical metrics, focusing on both absolute and relative model performance. Our findings suggest that even slight prompt variations can lead to significant shifts in performance. Additionally, we observe that such variations can introduce inconsistencies in the performance rankings across different models. These insights highlight the need for considering prompt sensitivity when designing future code benchmarks, to ensure more reliable and accurate evaluation of LLM capabilities.
Updated: 2025-06-20 15:30:36
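The semantics-preserving template modification the study relies on can be sketched as rule-based rewriting; a toy generator (the synonym table and template are illustrative, not the paper's framework):

```python
from itertools import product

# Interchangeable phrasings that keep the instruction's meaning.
SYNONYMS = {
    "write": ["write", "implement", "produce"],
    "function": ["function", "method"],
    "below": ["below", "that follows"],
}

def mutate_template(template):
    """Yield semantically similar variants of a prompt template."""
    keys = [k for k in SYNONYMS if k in template]
    for combo in product(*(SYNONYMS[k] for k in keys)):
        variant = template
        for key, repl in zip(keys, combo):
            variant = variant.replace(key, repl)
        yield variant

base = "Please write a function for the task below: {task}"
variants = list(mutate_template(base))
```

Evaluating the same model across such variants, rather than a single template, is what exposes the ranking instability the paper reports.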
Domain: cs.SE,cs.AI
Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.
Updated: 2025-06-20 15:30:06
Domain: cs.CL,cs.AI
Neural Polar Decoders for DNA Data Storage
Synchronization errors, such as insertions and deletions, present a fundamental challenge in DNA-based data storage systems, arising from both synthesis and sequencing noise. These channels are often modeled as insertion-deletion-substitution (IDS) channels, for which designing maximum-likelihood decoders is computationally expensive. In this work, we propose a data-driven approach based on neural polar decoders (NPDs) to design low-complexity decoders for channels with synchronization errors. The proposed architecture enables decoding over IDS channels with reduced complexity $O(AN \log N)$, where $A$ is a tunable parameter independent of the channel. NPDs require only sample access to the channel and can be trained without an explicit channel model. Additionally, NPDs provide mutual information (MI) estimates that can be used to optimize input distributions and code design. We demonstrate the effectiveness of NPDs on both synthetic deletion and IDS channels. For deletion channels, we show that NPDs achieve near-optimal decoding performance and accurate MI estimation, with significantly lower complexity than trellis-based decoders. We also provide numerical estimates of the channel capacity for the deletion channel. We extend our evaluation to realistic DNA storage settings, including channels with multiple noisy reads and real-world Nanopore sequencing data. Our results show that NPDs match or surpass the performance of existing methods while using significantly fewer parameters than the state-of-the-art. These findings highlight the promise of NPDs for robust and efficient decoding in DNA data storage systems.
Updated: 2025-06-20 15:26:38
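The O(AN log N) complexity inherits its N log N factor from the standard polar butterfly; a minimal sketch of the underlying GF(2) polar transform (the transform only, not the neural decoder):

```python
def polar_transform(u):
    """Arikan's polar transform over GF(2), computed in O(N log N).

    Corresponds to x = u * F^(tensor n) with kernel F = [[1, 0], [1, 1]]
    (up to the usual index-ordering convention).
    """
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    a, b = u[:half], u[half:]
    # Butterfly: XOR the two halves, then transform each half recursively.
    return polar_transform([x ^ y for x, y in zip(a, b)]) + polar_transform(b)

codeword = polar_transform([0, 0, 0, 1])
```

Over GF(2) the transform is its own inverse, so applying it twice recovers the input, a convenient sanity check for any implementation built on this butterfly.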
Domain: cs.IT,cs.LG,math.IT
Diffusion & Adversarial Schrödinger Bridges via Iterative Proportional Markovian Fitting
The Iterative Markovian Fitting (IMF) procedure, which iteratively projects onto the space of Markov processes and the reciprocal class, successfully solves the Schrödinger Bridge (SB) problem. However, an efficient practical implementation requires a heuristic modification - alternating between fitting forward and backward time diffusion at each iteration. This modification is crucial for stabilizing training and achieving reliable results in applications such as unpaired domain translation. Our work reveals a close connection between the modified version of IMF and the Iterative Proportional Fitting (IPF) procedure - a foundational method for the SB problem, also known as Sinkhorn's algorithm. Specifically, we demonstrate that the heuristic modification of the IMF effectively integrates both IMF and IPF procedures. We refer to this combined approach as the Iterative Proportional Markovian Fitting (IPMF) procedure. Through theoretical and empirical analysis, we establish the convergence of IPMF procedure under various settings, contributing to developing a unified framework for solving SB problems. Moreover, from a practical standpoint, the IPMF procedure enables a flexible trade-off between image similarity and generation quality, offering a new mechanism for tailoring models to specific tasks.
Updated: 2025-06-20 15:25:47
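The IPF component named above is the classical alternating projection onto prescribed marginals (Sinkhorn's algorithm); a minimal discrete sketch (a toy 2x2 cost matrix, far from the paper's diffusion setting):

```python
import math

def sinkhorn(cost, r, c, eps=0.1, iters=500):
    """Iterative proportional fitting of K = exp(-cost/eps) to marginals r and c."""
    K = [[math.exp(-cost[i][j] / eps) for j in range(len(c))] for i in range(len(r))]
    u = [1.0] * len(r)
    v = [1.0] * len(c)
    for _ in range(iters):
        # Alternate the two proportional-fitting projections:
        # scale rows to match r, then scale columns to match c.
        u = [r[i] / sum(K[i][j] * v[j] for j in range(len(c))) for i in range(len(r))]
        v = [c[j] / sum(K[i][j] * u[i] for i in range(len(r))) for j in range(len(c))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(c))] for i in range(len(r))]

# Entropic coupling of two uniform marginals under a toy 2x2 cost.
P = sinkhorn([[0.0, 1.0], [1.0, 0.0]], r=[0.5, 0.5], c=[0.5, 0.5])
```

The IMF half of IPMF plays the analogous alternating role in path space, swapping these row/column rescalings for forward/backward diffusion fits.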
Fields: cs.LG
LLM-Based Bot Broadens the Range of Arguments in Online Discussions, Even When Transparently Disclosed as AI
A wide range of participation is essential for democracy, as it helps prevent the dominance of extreme views, erosion of legitimacy, and political polarization. However, engagement in online political discussions often features a limited spectrum of views due to high levels of self-selection and the tendency of online platforms to facilitate exchanges primarily among like-minded individuals. This study examines whether an LLM-based bot can widen the scope of perspectives expressed by participants in online discussions through two pre-registered randomized experiments conducted in a chatroom. We evaluate the impact of a bot that actively monitors discussions, identifies missing arguments, and introduces them into the conversation. The results indicate that our bot significantly expands the range of arguments, as measured by both objective and subjective metrics. Furthermore, disclosure of the bot as AI does not significantly alter these effects. These findings suggest that LLM-based moderation tools can positively influence online political discourse.
Updated: 2025-06-20 15:24:31
Fields: cs.CY,cs.AI
Al-Khwarizmi: Discovering Physical Laws with Foundation Models
Inferring physical laws from data is a central challenge in science and engineering, including but not limited to healthcare, physical sciences, biosciences, social sciences, sustainability, climate, and robotics. Deep networks offer high-accuracy results but lack interpretability, prompting interest in models built from simple components. The Sparse Identification of Nonlinear Dynamics (SINDy) method has become the go-to approach for building such modular and interpretable models. SINDy leverages sparse regression with L1 regularization to identify key terms from a library of candidate functions. However, SINDy's choice of candidate library and optimization method requires significant technical expertise, limiting its widespread applicability. This work introduces Al-Khwarizmi, a novel agentic framework for physical law discovery from data, which integrates foundation models with SINDy. Leveraging LLMs, VLMs, and Retrieval-Augmented Generation (RAG), our approach automates physical law discovery, incorporating prior knowledge and iteratively refining candidate solutions via reflection. Al-Khwarizmi operates in two steps: it summarizes system observations - comprising textual descriptions, raw data, and plots - followed by a secondary step that generates candidate feature libraries and optimizer configurations to identify hidden physics laws correctly. Evaluating our algorithm on over 198 models, we demonstrate state-of-the-art performance, achieving a 20 percent improvement over the best-performing alternative.
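The SINDy fit that such a framework configures can be sketched with sequentially thresholded least squares, the standard SINDy optimizer (shown here as a stand-in for the L1-regularized fit; the candidate library and data are illustrative):

```python
import numpy as np

def stlsq(Theta, dXdt, threshold=0.1, iters=10):
    """Sparse regression over a candidate-function library Theta:
    alternate least squares with hard-thresholding of small coefficients."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                Xi[keep, k] = np.linalg.lstsq(Theta[:, keep], dXdt[:, k],
                                              rcond=None)[0]
    return Xi

# Recover dx/dt = -2x from clean data with library [1, x, x^2]
x = np.linspace(-1.0, 1.0, 50)
Theta = np.column_stack([np.ones_like(x), x, x**2])
Xi = stlsq(Theta, (-2.0 * x)[:, None])
```

Choosing the library columns and the threshold is precisely the expertise the agentic framework aims to automate.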
Updated: 2025-06-20 15:22:21
Fields: cs.LG
Safe Guaranteed Exploration for Non-linear Systems
Safely exploring environments with a-priori unknown constraints is a fundamental challenge that restricts the autonomy of robots. While safety is paramount, guarantees on sufficient exploration are also crucial for ensuring autonomous task completion. To address these challenges, we propose a novel safe guaranteed exploration framework using optimal control, which achieves first-of-its-kind results: guaranteed exploration for non-linear systems with finite time sample complexity bounds, while being provably safe with arbitrarily high probability. The framework is general and applicable to many real-world scenarios with complex non-linear dynamics and unknown domains. We improve the efficiency of this general framework by proposing an algorithm, SageMPC, SAfe Guaranteed Exploration using Model Predictive Control. SageMPC leverages three key techniques: i) exploiting a Lipschitz bound, ii) goal-directed exploration, and iii) receding horizon style re-planning, all while maintaining the desired sample complexity, safety and exploration guarantees of the framework. Lastly, we demonstrate safe efficient exploration in challenging unknown environments using SageMPC with a car model.
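The Lipschitz-bound technique can be illustrated with a pessimistic safety certificate: if a constraint function g with known Lipschitz constant L has only been observed at sampled states, its value elsewhere can be bounded from above, and a state is certified safe only when the bound is non-positive. A toy sketch of this idea, not SageMPC itself:

```python
import numpy as np

def certified_safe(x, X_obs, g_obs, L):
    """A state x is provably safe (g(x) <= 0) if some observed sample x_i
    forces g(x) <= g(x_i) + L * ||x - x_i|| <= 0, by Lipschitz continuity."""
    upper = g_obs + L * np.linalg.norm(X_obs - np.asarray(x, float), axis=1)
    return bool(upper.min() <= 0.0)
```

Exploration then consists of steering toward informative states that still pass this certificate at every planned step.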
Updated: 2025-06-20 15:20:49
Fields: eess.SY,cs.LG,cs.RO,cs.SY,math.OC
Empowering Near-Field Communications in Low-Altitude Economy with LLM: Fundamentals, Potentials, Solutions, and Future Directions
The low-altitude economy (LAE) is gaining significant attention from academia and industry. Fortunately, LAE naturally aligns with near-field communications in extremely large-scale MIMO (XL-MIMO) systems. By leveraging near-field beamfocusing, LAE can precisely direct beam energy to unmanned aerial vehicles, while the additional distance dimension boosts overall spectrum efficiency. However, near-field communications in LAE still face several challenges, such as the increase in signal processing complexity and the necessity of distinguishing between far and near-field users. Inspired by the large language models (LLM) with powerful ability to handle complex problems, we apply LLM to solve challenges of near-field communications in LAE. The objective of this article is to provide a comprehensive analysis and discussion on LLM-empowered near-field communications in LAE. Specifically, we first introduce fundamentals of LLM and near-field communications, including the key advantages of LLM and key characteristics of near-field communications. Then, we reveal the opportunities and challenges of near-field communications in LAE. To address these challenges, we present a LLM-based scheme for near-field communications in LAE, and provide a case study which jointly distinguishes far and near-field users and designs multi-user precoding matrix. Finally, we outline and highlight several future research directions and open issues.
Updated: 2025-06-20 15:14:29
Fields: eess.SP,cs.IT,cs.LG,math.IT
Flow-Based Non-stationary Temporal Regime Causal Structure Learning
Understanding causal relationships in multivariate time series is crucial in many scenarios, such as those dealing with financial or neurological data. Many such time series exhibit multiple regimes, i.e., consecutive temporal segments with a priori unknown boundaries, with each regime having its own causal structure. Inferring causal dependencies and regime shifts is critical for analyzing the underlying processes. However, causal structure learning in this setting is challenging due to (1) non-stationarity, i.e., each regime can have its own causal graph and mixing function, and (2) complex noise distributions, which may be non-Gaussian or heteroscedastic. Existing causal discovery approaches cannot address these challenges, since they generally assume stationarity or Gaussian noise with constant variance. Hence, we introduce FANTOM, a unified framework for causal discovery that handles non-stationary processes along with non-Gaussian and heteroscedastic noise. FANTOM simultaneously infers the number of regimes and their corresponding indices and learns each regime's Directed Acyclic Graph. It uses a Bayesian Expectation Maximization algorithm that maximizes the evidence lower bound of the data log-likelihood. On the theoretical side, we prove, under mild assumptions, that temporal heteroscedastic causal models, introduced in FANTOM's formulation, are identifiable in both stationary and non-stationary settings. In addition, extensive experiments on synthetic and real data show that FANTOM outperforms existing methods.
Updated: 2025-06-20 15:12:43
Fields: cs.LG,cs.AI
Client Selection Strategies for Federated Semantic Communications in Heterogeneous IoT Networks
The exponential growth of IoT devices presents critical challenges in bandwidth-constrained wireless networks, particularly regarding efficient data transmission and privacy preservation. This paper presents a novel federated semantic communication (SC) framework that enables collaborative training of bandwidth-efficient models for image reconstruction across heterogeneous IoT devices. By leveraging SC principles to transmit only semantic features, our approach dramatically reduces communication overhead while preserving reconstruction quality. We address the fundamental challenge of client selection in federated learning environments where devices exhibit significant disparities in dataset sizes and data distributions. Our framework implements three distinct client selection strategies that explore different trade-offs between system performance and fairness in resource allocation. The system employs an end-to-end SC architecture with semantic bottlenecks, coupled with a loss-based aggregation mechanism that naturally adapts to client heterogeneity. Experimental evaluation on image data demonstrates that while Utilitarian selection achieves the highest reconstruction quality, Proportional Fairness maintains competitive performance while significantly reducing participation inequality and improving computational efficiency. These results establish that federated SC can successfully balance reconstruction quality, resource efficiency, and fairness in heterogeneous IoT deployments, paving the way for sustainable and privacy-preserving edge intelligence applications.
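One way the fairness trade-off the abstract describes can be realized is a proportional-fairness-style rule that discounts each client's utility by its past participation, so under-served clients catch up over rounds. This is an illustrative heuristic, not necessarily the paper's exact selection rule:

```python
def select_clients(utilities, counts, k):
    """Rank clients by utility discounted by how often they have already
    participated; pick the top k. A purely Utilitarian rule would instead
    rank by raw utility, maximizing quality at the cost of fairness."""
    scores = {c: u / (1 + counts[c]) for c, u in utilities.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The denominator is what trades raw reconstruction quality for lower participation inequality.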
Updated: 2025-06-20 15:11:20
Fields: cs.NI,cs.LG
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
Large language models based on Mixture-of-Experts have achieved substantial gains in efficiency and scalability, yet their architectural uniqueness introduces underexplored safety alignment challenges. Existing safety alignment strategies, predominantly designed for dense models, are ill-suited to address MoE-specific vulnerabilities. In this work, we formalize and systematically study MoE model's positional vulnerability - the phenomenon where safety-aligned behaviors rely on specific expert modules, revealing critical risks inherent to MoE architectures. To this end, we present SAFEx, an analytical framework that robustly identifies, characterizes, and validates the safety-critical experts using a novel Stability-based Expert Selection (SES) algorithm. Notably, our approach enables the explicit decomposition of safety-critical experts into distinct functional groups, including those responsible for harmful content detection and those controlling safe response generation. Extensive experiments on mainstream MoE models, such as the recently released Qwen3-MoE, demonstrated that their intrinsic safety mechanisms heavily rely on a small subset of positional experts. Disabling these experts significantly compromised the models' ability to refuse harmful requests. For Qwen3-MoE with 6144 experts (in the FNN layer), we find that disabling as few as 12 identified safety-critical experts can cause the refusal rate to drop by 22%, demonstrating the disproportionate impact of a small set of experts on overall model safety.
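The flavor of locating position-dependent safety experts can be sketched by comparing each expert's routing frequency on harmful versus benign prompts. This is a crude frequency-gap stand-in for the paper's stability-based SES algorithm, and all names are illustrative:

```python
import numpy as np

def frequency_gap_experts(harmful_routes, benign_routes, n_experts, top_k=12):
    """Score each expert by how much more often the router selects it on
    harmful prompts than on benign ones; return the top_k candidates.
    Each element of *_routes is an array of expert indices for one prompt."""
    def freq(routes):
        counts = np.bincount(np.concatenate(routes), minlength=n_experts)
        return counts / max(counts.sum(), 1)
    gap = freq(harmful_routes) - freq(benign_routes)
    return np.argsort(gap)[::-1][:top_k]
```

Ablating the returned experts and measuring the drop in refusal rate is then the kind of validation the paper reports.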
Updated: 2025-06-20 15:09:10
Fields: cs.LG,cs.AI,cs.CR
Universal Music Representations? Evaluating Foundation Models on World Music Corpora
Foundation models have revolutionized music information retrieval, but questions remain about their ability to generalize across diverse musical traditions. This paper presents a comprehensive evaluation of five state-of-the-art audio foundation models across six musical corpora spanning Western popular, Greek, Turkish, and Indian classical traditions. We employ three complementary methodologies to investigate these models' cross-cultural capabilities: probing to assess inherent representations, targeted supervised fine-tuning of 1-2 layers, and multi-label few-shot learning for low-resource scenarios. Our analysis shows varying cross-cultural generalization, with larger models typically outperforming on non-Western music, though results decline for culturally distant traditions. Notably, our approaches achieve state-of-the-art performance on five out of six evaluated datasets, demonstrating the effectiveness of foundation models for world music understanding. We also find that our targeted fine-tuning approach does not consistently outperform probing across all settings, suggesting foundation models already encode substantial musical knowledge. Our evaluation framework and benchmarking results contribute to understanding how far current models are from achieving universal music representations while establishing metrics for future progress.
Updated: 2025-06-20 15:06:44
Fields: cs.SD,cs.IR,cs.LG,eess.AS
One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
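Merging the two LoRA branches into the backbone at inference time amounts to folding each low-rank update into the frozen base weight. A generic sketch of LoRA merging (not the authors' exact code):

```python
import numpy as np

def merge_loras(W, adapters, scale=1.0):
    """Fold low-rank adapters into a frozen weight: W' = W + scale * B @ A
    for each (A, B) pair, so inference needs only a single forward pass
    through the merged weights (here, one diffusion step)."""
    W_merged = W.copy()
    for A, B in adapters:
        W_merged += scale * (B @ A)
    return W_merged
```

Because the update is additive, the consistency and detail branches trained in alternation can both be absorbed into the same weight matrix.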
Updated: 2025-06-20 15:05:29
Fields: cs.CV,cs.AI
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
Transformers have achieved state-of-the-art performance across language and vision tasks. This success drives the imperative to interpret their internal mechanisms with the dual goals of enhancing performance and improving behavioral control. Attribution methods help advance interpretability by assigning model outputs associated with a target concept to specific model components. Current attribution research primarily studies multi-layer perceptron neurons and addresses relatively simple concepts such as factual associations (e.g., Paris is located in France). This focus tends to overlook the impact of the attention mechanism and lacks a unified approach for analyzing more complex concepts. To fill these gaps, we introduce Scalable Attention Module Discovery (SAMD), a concept-agnostic method for mapping arbitrary, complex concepts to specific attention heads of general transformer models. We accomplish this by representing each concept as a vector, calculating its cosine similarity with each attention head, and selecting the TopK-scoring heads to construct the concept-associated attention module. We then propose Scalar Attention Module Intervention (SAMI), a simple strategy to diminish or amplify the effects of a concept by adjusting the attention module using only a single scalar parameter. Empirically, we demonstrate SAMD on concepts of varying complexity, and visualize the locations of their corresponding modules. Our results demonstrate that module locations remain stable before and after LLM post-training, and confirm prior work on the mechanics of LLM multilingualism. Through SAMI, we facilitate jailbreaking on HarmBench (+72.7%) by diminishing "safety" and improve performance on the GSM8K benchmark (+1.6%) by amplifying "reasoning". Lastly, we highlight the domain-agnostic nature of our approach by suppressing the image classification accuracy of vision transformers on ImageNet.
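The head-selection step described above is simple to sketch: represent the concept as a vector, score every attention head by cosine similarity, and keep the top-K. A minimal version of the selection (SAMI then scales the resulting module's contribution with a single scalar):

```python
import numpy as np

def topk_heads(concept, head_vecs, k=5):
    """Cosine-similarity scores between a concept vector and one summary
    vector per attention head; the top-k heads form the concept module."""
    c = concept / np.linalg.norm(concept)
    H = head_vecs / np.linalg.norm(head_vecs, axis=1, keepdims=True)
    return np.argsort(H @ c)[::-1][:k]
```

How each head's summary vector is computed is the part that depends on the model and is omitted here.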
Updated: 2025-06-20 15:04:11
Fields: cs.LG,cs.AI,cs.CL
How Breakable Is Privacy: Probing and Resisting Model Inversion Attacks in Collaborative Inference
Collaborative inference (CI) improves computational efficiency for edge devices by transmitting intermediate features to cloud models. However, this process inevitably exposes feature representations to model inversion attacks (MIAs), enabling unauthorized data reconstruction. Despite extensive research, there is no established criterion for assessing the difficulty of MIA implementation, leaving a fundamental question unanswered: what factors truly and verifiably determine the attack's success in CI? Moreover, existing defenses lack the theoretical foundation described above, making it challenging to regulate feature information effectively while ensuring privacy and minimizing computational overhead. These shortcomings introduce three key challenges: a theoretical gap, a methodological limitation, and a practical constraint. To overcome these challenges, we propose the first theoretical criterion to assess MIA difficulty in CI, identifying mutual information, entropy, and effective information volume as key influencing factors. The validity of this criterion is demonstrated by using the mutual information neural estimator. Building on this insight, we propose SiftFunnel, a privacy-preserving framework to resist MIA while maintaining usability. Specifically, we incorporate linear and non-linear correlation constraints alongside label smoothing to suppress redundant information transmission, effectively balancing privacy and usability. To enhance deployability, the edge model adopts a funnel-shaped structure with attention mechanisms, strengthening privacy while reducing computational and storage burdens. Experiments show that, compared to state-of-the-art defenses, SiftFunnel increases reconstruction error by ~30%, lowers mutual and effective information metrics by ≥50%, and reduces edge burdens by almost 20×, while maintaining comparable usability.
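The mutual information neural estimator used to validate the criterion optimizes the Donsker-Varadhan lower bound. Given a critic's outputs on joint samples and on product-of-marginal samples, the bound itself is one line; training the neural critic to tighten it is omitted here:

```python
import numpy as np

def dv_bound(t_joint, t_marginal):
    """Donsker-Varadhan lower bound on mutual information:
    MI >= E_joint[T] - log E_marginal[exp(T)] for any critic T."""
    return float(np.mean(t_joint) - np.log(np.mean(np.exp(t_marginal))))
```

A critic that cannot separate joint from marginal samples yields a bound of zero, which matches the intuition that features carrying little mutual information with the input are hard to invert.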
Updated: 2025-06-20 14:59:56
Fields: cs.CR,cs.IT,math.IT
Navigating the Deep: Signature Extraction on Deep Neural Networks
Neural network model extraction has emerged in recent years as an important security concern, as adversaries attempt to recover a network's parameters via black-box queries. A key step in this process is signature extraction, which aims to recover the absolute values of the network's weights layer by layer. Prior work, notably by Carlini et al. (2020), introduced a technique inspired by differential cryptanalysis to extract neural network parameters. However, their method suffers from several limitations that restrict its applicability to networks with a few layers only. Later works focused on improving sign extraction, but largely relied on the assumption that signature extraction itself was feasible. In this work, we revisit and refine the signature extraction process by systematically identifying and addressing for the first time critical limitations of Carlini et al.'s signature extraction method. These limitations include rank deficiency and noise propagation from deeper layers. To overcome these challenges, we propose efficient algorithmic solutions for each of the identified issues, greatly improving the efficiency of signature extraction. Our approach permits the extraction of much deeper networks than was previously possible. We validate our method through extensive experiments on ReLU-based neural networks, demonstrating significant improvements in extraction depth and accuracy. For instance, our extracted network matches the target network on at least 95% of the input space for each of the eight layers of a neural network trained on the CIFAR-10 dataset, while previous works could barely extract the first three layers. Our results represent a crucial step toward practical attacks on larger and more complex neural network architectures.
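The query primitive behind signature extraction is easiest to see in a toy one-neuron case: at a point on a ReLU unit's critical hyperplane (where w @ x + b = 0), the second finite difference of the black-box output along each axis reveals |w_j|. A sketch of that primitive only, not the full multi-layer attack:

```python
import numpy as np

def signature(f, x_star, dim, delta=1e-4):
    """For f(x) = relu(w @ x + b) and x_star on the critical hyperplane,
    f(x+d) + f(x-d) - 2 f(x) = delta * |w_j| when d = delta * e_j,
    so black-box queries expose the weight magnitudes (the signature)."""
    sig = np.zeros(dim)
    for j in range(dim):
        d = np.zeros(dim)
        d[j] = delta
        sig[j] = (f(x_star + d) + f(x_star - d) - 2.0 * f(x_star)) / delta
    return sig
```

The rank-deficiency and noise-propagation issues the paper addresses arise when this primitive is pushed through several layers of a deep network.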
Updated: 2025-06-20 14:59:47
Fields: cs.LG,cs.CR
MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models--encompassing both open-source and proprietary architectures--reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
Updated: 2025-06-20 14:57:41
Fields: cs.CL,cs.LG
Problem Space Transformations for Out-of-Distribution Generalisation in Behavioural Cloning
The combination of behavioural cloning and neural networks has driven significant progress in robotic manipulation. As these algorithms may require a large number of demonstrations for each task of interest, they remain fundamentally inefficient in complex scenarios, in which finite datasets can hardly cover the state space. One of the remaining challenges is thus out-of-distribution (OOD) generalisation, i.e. the ability to predict correct actions for states with a low likelihood with respect to the state occupancy induced by the dataset. This issue is aggravated when the system to control is treated as a black-box, ignoring its physical properties. This work characterises widespread properties of robotic manipulation, specifically pose equivariance and locality. We investigate the effect of the choice of problem space on OOD performance of BC policies and how transformations arising from characteristic properties of manipulation could be employed for its improvement. We empirically demonstrate that these transformations allow behaviour cloning policies, using either standard MLP-based one-step action prediction or diffusion-based action-sequence prediction, to generalise better to OOD problem instances.
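Pose equivariance can be exploited by expressing targets in the gripper's local frame, so that two states differing only by a rigid transform map to the same policy input. A 2D sketch of such a problem-space transformation (the names and planar setting are illustrative assumptions):

```python
import numpy as np

def to_gripper_frame(p_world, gripper_pos, gripper_yaw):
    """Rotate and translate a world-frame point into the gripper frame:
    p_local = R(-yaw) @ (p_world - gripper_pos). Policies trained on
    p_local generalize across world-frame poses never seen in the data."""
    c, s = np.cos(-gripper_yaw), np.sin(-gripper_yaw)
    R = np.array([[c, -s], [s, c]])
    return R @ (np.asarray(p_world, float) - np.asarray(gripper_pos, float))
```

States far outside the demonstrated region of world coordinates can still land inside the demonstrated region of relative coordinates, which is the OOD benefit being studied.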
Updated: 2025-06-20 14:56:44
Fields: cs.RO,cs.LG
COS-DPO: Conditioned One-Shot Multi-Objective Fine-Tuning Framework
In LLM alignment and many other ML applications, one often faces the Multi-Objective Fine-Tuning (MOFT) problem, i.e., fine-tuning an existing model with datasets labeled w.r.t. different objectives simultaneously. To address the challenge, we propose a Conditioned One-Shot fine-tuning framework (COS-DPO) that extends the Direct Preference Optimization technique, originally developed for efficient LLM alignment with preference data, to accommodate the MOFT settings. By direct conditioning on the weight across auxiliary objectives, our Weight-COS-DPO method enjoys an efficient one-shot training process for profiling the Pareto front and is capable of achieving comprehensive trade-off solutions even in the post-training stage. Based on our theoretical findings on the linear transformation properties of the loss function, we further propose the Temperature-COS-DPO method that augments the temperature parameter to the model input, enhancing the flexibility of post-training control over the trade-offs between the main and auxiliary objectives. We demonstrate the effectiveness and efficiency of the COS-DPO framework through its applications to various tasks, including the Learning-to-Rank (LTR) and LLM alignment tasks, highlighting its viability for large-scale ML deployments.
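The weight-conditioning idea can be sketched as a DPO-style loss whose preference margin is a weighted combination of per-objective log-ratio margins. This is an illustrative simplification of the paper's objective (in the actual method the weight vector also conditions the model itself):

```python
import numpy as np

def weighted_dpo_loss(margins, weights, beta=0.1):
    """DPO logistic loss on a weighted sum of per-objective margins, where
    margins[i] = logratio_chosen[i] - logratio_rejected[i] for objective i.
    Returns -log sigmoid(beta * <weights, margins>)."""
    m = float(np.dot(weights, margins))
    return float(np.logaddexp(0.0, -beta * m))
```

Sweeping the weight vector at evaluation time is what traces out the Pareto front without retraining.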
Updated: 2025-06-20 14:52:02
领域: cs.LG,cs.CL,math.OC
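To make the weight-conditioning idea concrete, here is a minimal sketch of combining per-objective DPO losses under a user-chosen weight vector. The function names and numbers are illustrative, not the paper's implementation; in Weight-COS-DPO the weight additionally conditions the model itself.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def weighted_moft_loss(pair_losses_per_objective, weights):
    """Weight the per-objective losses by a user-chosen simplex vector,
    as in a Weight-COS-DPO-style setup (illustrative only)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * l for w, l in zip(weights, pair_losses_per_objective))

# One preference pair scored under two hypothetical objectives (helpfulness, safety).
l_help = dpo_loss(-1.0, -2.0, -1.2, -1.8, beta=0.1)
l_safe = dpo_loss(-0.5, -3.0, -0.6, -2.5, beta=0.1)
total = weighted_moft_loss([l_help, l_safe], [0.7, 0.3])
```

Sweeping the weight vector over the simplex is what traces out trade-off solutions along the Pareto front.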
MAWIFlow Benchmark: Realistic Flow-Based Evaluation for Network Intrusion Detection
Benchmark datasets for network intrusion detection commonly rely on synthetically generated traffic, which fails to reflect the statistical variability and temporal drift encountered in operational environments. This paper introduces MAWIFlow, a flow-based benchmark derived from the MAWILAB v1.1 dataset, designed to enable realistic and reproducible evaluation of anomaly detection methods. A reproducible preprocessing pipeline is presented that transforms raw packet captures into flow representations conforming to the CICFlowMeter format, while preserving MAWILab's original anomaly labels. The resulting datasets comprise temporally distinct samples from January 2011, 2016, and 2021, drawn from trans-Pacific backbone traffic. To establish reference baselines, traditional machine learning methods, including Decision Trees, Random Forests, XGBoost, and Logistic Regression, are compared to a deep learning model based on a CNN-BiLSTM architecture. Empirical results demonstrate that tree-based classifiers perform well on temporally static data but experience significant performance degradation over time. In contrast, the CNN-BiLSTM model maintains better performance, thus showing improved generalization. These findings underscore the limitations of synthetic benchmarks and static models, and motivate the adoption of realistic datasets with explicit temporal structure. All datasets, pipeline code, and model implementations are made publicly available to foster transparency and reproducibility.
Updated: 2025-06-20 14:51:35
标题: MAWIFlow基准测试:面向网络入侵检测的基于流的真实评估
摘要: 网络入侵检测的基准数据集通常依赖于合成生成的流量,这种方式无法反映操作环境中遇到的统计变化和时间漂移。本文介绍了MAWIFlow,这是一个基于MAWILAB v1.1数据集的基于流的基准,旨在实现异常检测方法的真实和可重复的评估。提出了一个可重复的预处理流水线,将原始数据包捕获转换为符合CICFlowMeter格式的流表示,并保留MAWILab的原始异常标签。生成的数据集包括来自2011年、2016年和2021年的时间上不同的样本,这些样本来自跨太平洋的主干流量。 为了建立参考基线,比较了传统机器学习方法,包括决策树、随机森林、XGBoost和逻辑回归,以及基于CNN-BiLSTM架构的深度学习模型。实证结果表明,基于树的分类器在时间上静态数据上表现良好,但随着时间推移,性能明显下降。相比之下,CNN-BiLSTM模型保持了更好的性能,因此显示出更好的泛化能力。这些发现强调了合成基准和静态模型的局限性,并促使采用具有明确时间结构的真实数据集。所有数据集、流水线代码和模型实现都公开可用,以促进透明度和可重复性。
更新时间: 2025-06-20 14:51:35
领域: cs.LG,cs.AI
Cash or Comfort? How LLMs Value Your Inconvenience
Large Language Models (LLMs) are increasingly proposed as near-autonomous artificial intelligence (AI) agents capable of making everyday decisions on behalf of humans. Although LLMs perform well on many technical tasks, their behaviour in personal decision-making remains less understood. Previous studies have assessed their rationality and moral alignment with human decisions. However, the behaviour of AI assistants in scenarios where financial rewards are at odds with user comfort has not yet been thoroughly explored. In this paper, we tackle this problem by quantifying the prices assigned by multiple LLMs to a series of user discomforts: additional walking, waiting, hunger and pain. We uncover several key concerns that strongly question the prospect of using current LLMs as decision-making assistants: (1) a large variance in responses between LLMs, (2) within a single LLM, responses show fragility to minor variations in prompt phrasing (e.g., reformulating the question in the first person can considerably alter the decision), (3) LLMs can accept unreasonably low rewards for major inconveniences (e.g., 1 Euro to wait 10 hours), and (4) LLMs can reject monetary gains where no discomfort is imposed (e.g., 1,000 Euro to wait 0 minutes). These findings emphasize the need for scrutiny of how LLMs value human inconvenience, particularly as we move toward applications where such cash-versus-comfort trade-offs are made on users' behalf.
Updated: 2025-06-20 14:49:20
标题: 现金还是舒适?LLM如何衡量您的不便
摘要: 大型语言模型(LLMs)越来越被提议为几乎自主的人工智能(AI)代理,能够代表人类做出日常决策。尽管LLMs在许多技术任务上表现出色,但它们在个人决策中的行为仍不太清楚。先前的研究已经评估了它们与人类决策的合理性和道德一致性。然而,在金钱奖励与用户舒适度相冲突的情况下,AI助手的行为尚未得到深入探讨。在本文中,我们通过量化多个LLMs对一系列用户不便(如额外步行、等待、饥饿和疼痛)所分配的价格来解决这个问题。我们揭示了几个关键问题,强烈质疑使用当前LLMs作为决策助手的前景:(1)LLMs之间的响应差异很大,(2)在单个LLM内,响应对提示措辞的微小变化表现出脆弱性(例如,将问题改写为第一人称可以极大地改变决策),(3)LLMs可以接受对主要不便提供不合理低的奖励(例如,等待10小时只需1欧元),(4)LLMs可以拒绝在不造成任何不便的情况下获得金钱收益(例如,等待0分钟只需1,000欧元)。这些发现强调了对LLMs如何评估人类不便的审查的必要性,特别是在我们向应用程序前进的过程中,这些应用程序会代表用户做出现金与舒适度之间的权衡。
更新时间: 2025-06-20 14:49:20
领域: cs.CL,cs.AI,cs.MA
LSCD: Lomb-Scargle Conditioned Diffusion for Time series Imputation
Time series with missing or irregularly sampled data are a persistent challenge in machine learning. Many methods operate on the frequency-domain, relying on the Fast Fourier Transform (FFT) which assumes uniform sampling, therefore requiring prior interpolation that can distort the spectra. To address this limitation, we introduce a differentiable Lomb--Scargle layer that enables a reliable computation of the power spectrum of irregularly sampled data. We integrate this layer into a novel score-based diffusion model (LSCD) for time series imputation conditioned on the entire signal spectrum. Experiments on synthetic and real-world benchmarks demonstrate that our method recovers missing data more accurately than purely time-domain baselines, while simultaneously producing consistent frequency estimates. Crucially, our method can be easily integrated into learning frameworks, enabling broader adoption of spectral guidance in machine learning approaches involving incomplete or irregular data.
Updated: 2025-06-20 14:48:42
标题: LSCD:Lomb-Scargle条件扩散用于时间序列插补
摘要: 缺失或不规则采样数据的时间序列是机器学习中持续存在的挑战。许多方法在频域上操作,依赖于快速傅里叶变换(FFT),该方法假定采样是均匀的,因此需要先进行插值,可能会扭曲频谱。为了解决这一限制,我们引入了一个可微分的Lomb-Scargle层,使得可以可靠地计算不规则采样数据的功率谱。我们将这一层集成到一种新颖的基于得分的扩散模型(LSCD)中,用于基于整个信号频谱进行时间序列插补。对合成和真实基准数据的实验表明,我们的方法比纯粹的时间域基线更准确地恢复了缺失数据,同时产生一致的频率估计。关键是,我们的方法可以轻松集成到学习框架中,从而促进了在涉及不完整或不规则数据的机器学习方法中更广泛地采用频谱指导。
更新时间: 2025-06-20 14:48:42
领域: cs.LG,cs.AI
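For readers unfamiliar with the Lomb-Scargle periodogram, a plain-NumPy sketch of the classical (non-differentiable) estimator it builds on may help; the paper's contribution is a differentiable layer, which this toy version does not attempt to reproduce.

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle periodogram for irregularly sampled data."""
    y = y - y.mean()
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        w = 2.0 * np.pi * f
        # Phase offset tau makes the estimate invariant to time shifts.
        tau = np.arctan2(np.sum(np.sin(2 * w * t)), np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 20, 150))   # irregular sampling times, no interpolation
y = np.sin(2 * np.pi * 0.5 * t)        # true frequency: 0.5 Hz
freqs = np.linspace(0.05, 2.0, 200)
power = lomb_scargle(t, y, freqs)
best = freqs[np.argmax(power)]
```

Unlike an FFT pipeline, no prior interpolation onto a uniform grid is needed, which is precisely the distortion the abstract warns about.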
Bayesian Joint Model of Multi-Sensor and Failure Event Data for Multi-Mode Failure Prediction
Modern industrial systems are often subject to multiple failure modes, and their conditions are monitored by multiple sensors, generating multiple time-series signals. Additionally, time-to-failure data are commonly available. Accurately predicting a system's remaining useful life (RUL) requires effectively leveraging multi-sensor time-series data alongside multi-mode failure event data. In most existing models, failure modes and RUL prediction are performed independently, ignoring the inherent relationship between these two tasks. Some models integrate multiple failure modes and event prediction using black-box machine learning approaches, which lack statistical rigor and cannot characterize the inherent uncertainty in the model and data. This paper introduces a unified approach to jointly model the multi-sensor time-series data and failure time concerning multiple failure modes. The proposed model integrates a Cox proportional hazards model, a Convolved Multi-output Gaussian Process, and multinomial failure mode distributions in a hierarchical Bayesian framework with corresponding priors, enabling accurate prediction with robust uncertainty quantification. Posterior distributions are effectively obtained by Variational Bayes, and prediction is performed with Monte Carlo sampling. The advantages of the proposed model are validated through extensive numerical and case studies with a jet-engine dataset.
Updated: 2025-06-20 14:44:15
标题: 贝叶斯多传感器和故障事件数据的联合模型用于多模故障预测
摘要: 现代工业系统常常受到多种故障模式的影响,它们的状态由多个传感器监测,生成多个时间序列信号。此外,通常可以获得时间至故障的数据。准确预测系统的剩余寿命(RUL)需要有效利用多传感器时间序列数据以及多模故障事件数据。在大多数现有模型中,故障模式和RUL预测是独立进行的,忽略了这两个任务之间的固有关系。一些模型使用黑盒机器学习方法集成多个故障模式和事件预测,缺乏统计严谨性,并且无法表征模型和数据中固有的不确定性。本文介绍了一种统一的方法,共同建模多传感器时间序列数据和多种故障模式相关的故障时间。该提出的模型将Cox比例风险模型、卷积多输出高斯过程和多项式故障模式分布整合到具有相应先验的分层贝叶斯框架中,从而实现准确的预测和稳健的不确定性量化。通过变分贝叶斯有效地获得后验分布,并使用蒙特卡洛抽样进行预测。通过对喷气发动机数据集进行广泛的数值和案例研究,验证了所提出模型的优势。
更新时间: 2025-06-20 14:44:15
领域: stat.ME,cs.LG,stat.ML
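As a toy illustration of the survival component in the joint model, the Cox proportional hazards assumption scales a baseline hazard by the exponential of a covariate score; all numbers below are made up.

```python
import math

def cox_hazard(baseline_hazard, covariates, beta):
    """Cox proportional hazards: h(t | x) = h0(t) * exp(beta . x).
    Toy illustration of the survival component of the joint model."""
    return baseline_hazard * math.exp(sum(b * x for b, x in zip(beta, covariates)))

h0 = 0.01                                                   # baseline hazard h0(t)
h_degraded = cox_hazard(h0, [2.0, 1.0], beta=[0.5, 0.3])    # elevated sensor-derived risk
h_healthy = cox_hazard(h0, [0.0, 0.0], beta=[0.5, 0.3])     # at baseline covariates
```

In the paper, the covariates themselves are latent functions inferred from multi-sensor signals via the convolved multi-output Gaussian process, rather than fixed numbers as here.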
Critical Appraisal of Fairness Metrics in Clinical Predictive AI
Predictive artificial intelligence (AI) offers an opportunity to improve clinical practice and patient outcomes, but risks perpetuating biases if fairness is inadequately addressed. However, the definition of "fairness" remains unclear. We conducted a scoping review to identify and critically appraise fairness metrics for clinical predictive AI. We defined a "fairness metric" as a measure quantifying whether a model discriminates (societally) against individuals or groups defined by sensitive attributes. We searched five databases (2014-2024), screening 820 records, to include 41 studies, and extracted 62 fairness metrics. Metrics were classified by performance-dependency, model output level, and base performance metric, revealing a fragmented landscape with limited clinical validation and overreliance on threshold-dependent measures. Eighteen metrics were explicitly developed for healthcare, including only one clinical utility metric. Our findings highlight conceptual challenges in defining and quantifying fairness and identify gaps in uncertainty quantification, intersectionality, and real-world applicability. Future work should prioritise clinically meaningful metrics.
Updated: 2025-06-20 14:43:36
标题: 临床预测人工智能中公平性指标的关键评估
摘要: 预测人工智能(AI)提供了改善临床实践和患者结果的机会,但如果公平性得不到充分解决,就会存在延续偏见的风险。然而,“公平性”的定义仍不清楚。我们进行了一项范围审查,以识别和批判性评估用于临床预测AI的公平性指标。我们将“公平性指标”定义为一种衡量模型是否(在社会上)针对被敏感属性定义的个人或群体进行歧视的指标。我们搜索了五个数据库(2014-2024年),筛选了820条记录,包括41项研究,并提取了62个公平性指标。根据性能依赖性、模型输出水平和基础性能指标对指标进行分类,揭示了一个碎片化的景观,临床验证有限,过度依赖于阈值依赖性度量。其中18个指标明确为医疗保健领域而开发,包括仅有一个临床效用指标。我们的发现突显了在定义和量化公平性方面的概念挑战,并识别了在不确定性量化、交叉性和现实世界适用性方面的差距。未来的工作应优先考虑临床意义的指标。
更新时间: 2025-06-20 14:43:36
领域: cs.LG
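An example of the threshold-dependent group fairness metrics the review catalogues is the demographic parity difference, sketched below with hypothetical predictions.

```python
def demographic_parity_difference(y_pred, group):
    """Gap in positive prediction rates between the best- and worst-off groups;
    one of many threshold-dependent fairness metrics (illustrative)."""
    rates = {}
    for g in set(group):
        preds = [p for p, gi in zip(y_pred, group) if gi == g]
        rates[g] = sum(preds) / len(preds)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

y_pred = [1, 0, 1, 1, 0, 0, 1, 0]        # thresholded model predictions
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_difference(y_pred, group)   # 0.75 vs 0.25
```

Metrics of this kind depend on the chosen decision threshold, which is exactly the overreliance the review flags.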
Incivility and Rigidity: The Risks of Fine-Tuning LLMs for Political Argumentation
The incivility prevalent on platforms like Twitter (now X) and Reddit poses a challenge for developing AI systems that can support productive and rhetorically sound political argumentation. In this study, we report experiments with GPT-3.5 Turbo, fine-tuned on two contrasting datasets of political discussions: high-variance, high-incivility Twitter replies to U.S. Congress, and low-variance, low-incivility posts from Reddit's r/ChangeMyView. We systematically evaluate how these data sources and prompting strategies shape the rhetorical framing and deliberative quality of model-generated arguments. Our results show that Reddit-finetuned models produce safer but rhetorically rigid arguments, while cross-platform fine-tuning amplifies toxicity. Prompting reduces specific toxic behaviors, such as personal attacks, but fails to fully mitigate the influence of high-incivility training data. We introduce and validate a rhetorical evaluation rubric and provide practical guidelines for deploying LLMs in content authoring, moderation, and deliberation support.
Updated: 2025-06-20 14:35:51
标题: 不文明行为和死板思维:为政治辩论微调LLMs的风险
摘要: 在像Twitter(现在X)和Reddit这样的平台上普遍存在的不文明行为对于开发能够支持富有成效和修辞合理的政治论证的人工智能系统构成了挑战。在这项研究中,我们报告了对GPT-3.5 Turbo进行实验的结果,该模型在两个相反的政治讨论数据集上进行了微调:高变异性、高不文明的Twitter回复美国国会,以及低变异性、低不文明的Reddit的r/ChangeMyView帖子。我们系统地评估了这些数据来源和提示策略如何塑造模型生成的论证的修辞框架和审议质量。我们的结果表明,Reddit微调模型产生了更安全但修辞上僵化的论点,而跨平台微调加剧了毒性。提示可以减少特定的毒性行为,如人身攻击,但未能完全减轻高不文明训练数据的影响。我们引入并验证了一个修辞评估标准,并提供了在内容创作、管理和审议支持中部署LLMs的实用指南。
更新时间: 2025-06-20 14:35:51
领域: cs.CL,cs.AI
Conditional Front-door Adjustment for Heterogeneous Treatment Assignment Effect Estimation Under Non-adherence
Estimates of heterogeneous treatment assignment effects can inform treatment decisions. Under the presence of non-adherence (e.g., patients do not adhere to their assigned treatment), both the standard backdoor adjustment (SBD) and the conditional front-door adjustment (CFD) can recover unbiased estimates of the treatment assignment effects. However, the estimation variance of these approaches may vary widely across settings, which remains underexplored in the literature. In this work, we demonstrate theoretically and empirically that CFD yields lower-variance estimates than SBD when the true effect of treatment assignment is small (i.e., assigning an intervention leads to small changes in patients' future outcome). Additionally, since CFD requires estimating multiple nuisance parameters, we introduce LobsterNet, a multi-task neural network that implements CFD with joint modeling of the nuisance parameters. Empirically, LobsterNet reduces estimation error across several semi-synthetic and real-world datasets compared to baselines. Our findings suggest CFD with shared nuisance parameter modeling can improve treatment assignment effect estimation under non-adherence.
Updated: 2025-06-20 14:29:02
标题: 非依从情况下异质性治疗分配效果估计的条件前门调整
摘要: 估计异质治疗分配效果可以指导治疗决策。在非依从性存在的情况下(例如,患者不遵守其分配的治疗方案),标准背门调整(SBD)和条件前门调整(CFD)都可以恢复治疗分配效果的无偏估计。然而,这些方法的估计方差在不同环境下可能差异很大,在文献中仍未得到充分探讨。在本研究中,我们从理论和经验上证明,当治疗分配的真实效果较小时(即,分配干预导致患者未来结果变化较小),CFD比SBD产生更低方差的估计值。此外,由于CFD需要估计多个干扰参数,我们引入了LobsterNet,一个多任务神经网络,实现了通过对干扰参数进行联合建模的CFD。经验上,与基准相比,LobsterNet在几个半合成和真实世界数据集上减少了估计误差。我们的研究结果表明,共享干扰参数建模的CFD可以改善在非依从性情况下的治疗分配效果估计。
更新时间: 2025-06-20 14:29:02
领域: cs.LG
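The (unconditional) front-door identification formula that CFD extends can be evaluated directly for discrete variables; this toy sketch omits the covariate conditioning and nuisance estimation that LobsterNet handles.

```python
def front_door_effect(p_x, p_m_given_x, p_y_given_xm, x):
    """Discrete front-door adjustment:
    P(y=1 | do(X=x)) = sum_m P(m|x) * sum_x' P(x') * P(y=1 | x', m).
    Toy illustration of the identification formula only."""
    total = 0.0
    for m, p_m in enumerate(p_m_given_x[x]):
        inner = sum(p_x[xp] * p_y_given_xm[xp][m] for xp in range(len(p_x)))
        total += p_m * inner
    return total

p_x = [0.6, 0.4]                          # P(X): treatment assignment marginal
p_m_given_x = [[0.8, 0.2], [0.3, 0.7]]    # P(M | X): mediator (e.g., adherence)
p_y_given_xm = [[0.1, 0.5],               # P(Y=1 | X=x', M=m)
                [0.2, 0.9]]
effect = front_door_effect(p_x, p_m_given_x, p_y_given_xm, x=1)
```

Here the mediator M plays the role of realized treatment under non-adherence, while X is the assignment.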
Scalable and Reliable Multi-agent Reinforcement Learning for Traffic Assignment
The evolution of metropolitan cities and the increase in travel demands impose stringent requirements on traffic assignment methods. Multi-agent reinforcement learning (MARL) approaches outperform traditional methods in modeling adaptive routing behavior without requiring explicit system dynamics, which is beneficial for real-world deployment. However, MARL frameworks face challenges in scalability and reliability when managing extensive networks with substantial travel demand, which limits their practical applicability in solving large-scale traffic assignment problems. To address these challenges, this study introduces MARL-OD-DA, a new MARL framework for the traffic assignment problem, which redefines agents as origin-destination (OD) pair routers rather than individual travelers, significantly enhancing scalability. Additionally, a Dirichlet-based action space with action pruning and a reward function based on the local relative gap are designed to enhance solution reliability and improve convergence efficiency. Experiments demonstrate that the proposed MARL framework effectively handles medium-sized networks with extensive and varied city-level OD demand, surpassing existing MARL methods. When implemented in the SiouxFalls network, MARL-OD-DA achieves better assignment solutions in 10 steps, with a relative gap that is 94.99% lower than that of conventional methods.
Updated: 2025-06-20 14:25:23
标题: 可扩展和可靠的交通分配多智能体强化学习
摘要: 大都市的发展和出行需求的增加对交通分配方法提出了严格要求。 多智能体强化学习(MARL)方法在建模自适应路径选择行为方面优于传统方法,而无需显式系统动态,这有利于实际部署。 然而,当处理具有大量出行需求的广泛网络时,MARL框架在可扩展性和可靠性方面面临挑战,从而限制了它们在解决大规模交通分配问题中的实际适用性。 为了解决这些挑战,本研究引入了MARL-OD-DA,这是一个新的用于交通分配问题的MARL框架,将代理重新定义为起点-终点(OD)对路由器,而不是个体旅行者,从而显着增强了可扩展性。 此外,基于Dirichlet的动作空间与动作修剪和基于本地相对差距的奖励函数被设计用于增强解决方案的可靠性并提高收敛效率。 实验证明,所提出的MARL框架有效处理具有广泛和多样的城市级OD需求的中等规模网络,超越了现有的MARL方法。 在SiouxFalls网络中实施时,MARL-OD-DA在10步内实现了更好的分配解决方案,其相对差距比传统方法低94.99%。
更新时间: 2025-06-20 14:25:23
领域: cs.LG
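A minimal sketch of what a Dirichlet-based action with route pruning for one OD-pair agent could look like (the names and the masking scheme are assumptions, not the paper's exact design):

```python
import numpy as np

def dirichlet_route_split(concentration, mask, rng):
    """Sample a probability split over candidate routes for one OD pair from a
    Dirichlet distribution; pruned (masked-out) routes receive zero flow."""
    alpha = np.asarray(concentration, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    split = np.zeros(len(alpha))
    split[mask] = rng.dirichlet(alpha[mask])   # action pruning: sample kept routes only
    return split

rng = np.random.default_rng(42)
# Four candidate routes; the third is pruned from the action space.
split = dirichlet_route_split([2.0, 2.0, 2.0, 2.0], [True, True, False, True], rng)
```

Because the Dirichlet support is the probability simplex, every sampled action is already a valid flow split, which helps reliability.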
Zero-shot Class Unlearning via Layer-wise Relevance Analysis and Neuronal Path Perturbation
In the rapid advancement of artificial intelligence, privacy protection has become crucial, giving rise to machine unlearning. Machine unlearning is a technique that removes specific data influences from trained models without the need for extensive retraining. However, it faces several key challenges, including accurately implementing unlearning, ensuring privacy protection during the unlearning process, and achieving effective unlearning without significantly compromising model performance. This paper presents a novel approach to machine unlearning by employing Layer-wise Relevance Analysis and Neuronal Path Perturbation. We address three primary challenges: the lack of detailed unlearning principles, privacy guarantees in zero-shot unlearning scenario, and the balance between unlearning effectiveness and model utility. Our method balances machine unlearning performance and model utility by identifying and perturbing highly relevant neurons, thereby achieving effective unlearning. By using data not present in the original training set during the unlearning process, we satisfy the zero-shot unlearning scenario and ensure robust privacy protection. Experimental results demonstrate that our approach effectively removes targeted data from the target unlearning model while maintaining the model's utility, offering a practical solution for privacy-preserving machine learning.
Updated: 2025-06-20 14:25:16
标题: 零样本类别遗忘:通过逐层相关性分析和神经元路径扰动
摘要: 在人工智能的快速发展中,隐私保护变得至关重要,因此出现了机器遗忘的概念。机器遗忘是一种在无需大量重新训练的情况下从已训练模型中删除特定数据影响的技术。然而,它面临着几个关键挑战,包括准确实施遗忘、确保在遗忘过程中的隐私保护以及在不显著损害模型性能的情况下实现有效的遗忘。本文提出了一种利用逐层相关性分析和神经元路径扰动的新型机器遗忘方法。我们解决了三个主要挑战:缺乏详细的遗忘原则、在零样本遗忘场景中的隐私保证以及在遗忘效果和模型实用性之间的平衡。我们的方法通过识别和扰动高度相关的神经元来平衡机器遗忘的性能和模型实用性,从而实现有效的遗忘。通过在遗忘过程中使用原始训练集中不存在的数据,我们满足了零样本遗忘场景,并确保了稳固的隐私保护。实验结果表明,我们的方法有效地从目标遗忘模型中删除了目标数据,同时保持了模型的实用性,为隐私保护机器学习提供了实用解决方案。
更新时间: 2025-06-20 14:25:16
领域: cs.LG,cs.CR
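As a simplified stand-in for the relevance-guided perturbation idea, one can score neurons and add noise only to the most relevant ones; the relevance scores and noise model here are placeholders, not the paper's layer-wise relevance computation.

```python
import numpy as np

def perturb_top_neurons(weights, relevance, k, noise_scale, rng):
    """Add zero-centred Gaussian noise to the k neurons (rows of a weight
    matrix) with the highest relevance scores; all names are illustrative."""
    top = np.argsort(relevance)[-k:]   # indices of the most relevant neurons
    out = weights.copy()
    out[top] += noise_scale * rng.standard_normal(out[top].shape)
    return out, top

rng = np.random.default_rng(0)
W = np.ones((5, 3))                                   # one layer's weights
relevance = np.array([0.1, 0.9, 0.2, 0.8, 0.05])      # placeholder relevance scores
W_new, touched = perturb_top_neurons(W, relevance, k=2, noise_scale=0.5, rng=rng)
```

Only the perturbed rows change, which is the balance the abstract describes: disrupt the pathways carrying the forgotten class while leaving the rest of the model intact.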
Eau De $Q$-Network: Adaptive Distillation of Neural Networks in Deep Reinforcement Learning
Recent works have successfully demonstrated that sparse deep reinforcement learning agents can be competitive against their dense counterparts. This opens up opportunities for reinforcement learning applications in fields where inference time and memory requirements are cost-sensitive or limited by hardware. Until now, dense-to-sparse methods have relied on hand-designed sparsity schedules that are not synchronized with the agent's learning pace. Crucially, the final sparsity level is chosen as a hyperparameter, which requires careful tuning as setting it too high might lead to poor performances. In this work, we address these shortcomings by crafting a dense-to-sparse algorithm that we name Eau De $Q$-Network (EauDeQN). To increase sparsity at the agent's learning pace, we consider multiple online networks with different sparsity levels, where each online network is trained from a shared target network. At each target update, the online network with the smallest loss is chosen as the next target network, while the other networks are replaced by a pruned version of the chosen network. We evaluate the proposed approach on the Atari $2600$ benchmark and the MuJoCo physics simulator, showing that EauDeQN reaches high sparsity levels while keeping performances high.
Updated: 2025-06-20 14:24:49
标题: Eau De $Q$-Network:深度强化学习中神经网络的自适应蒸馏
摘要: 最近的研究成功地证明了稀疏深度强化学习代理可以与其密集对应物竞争。这为在推理时间和内存需求受到成本敏感或受到硬件限制的领域中应用强化学习提供了机会。到目前为止,密集到稀疏方法依赖于手动设计的稀疏调度,这些调度与代理的学习速度不同步。至关重要的是,最终的稀疏水平被选择为一个超参数,这需要仔细调整,因为设置得太高可能会导致性能不佳。在这项工作中,我们通过设计了一种我们称之为Eau De $Q$-Network(EauDeQN)的密集到稀疏算法来解决这些缺点。为了按照代理的学习速度增加稀疏性,我们考虑具有不同稀疏水平的多个在线网络,其中每个在线网络都是从一个共享的目标网络训练的。在每次目标更新时,选择具有最小损失的在线网络作为下一个目标网络,而其他网络则被选择网络的修剪版本替换。我们在Atari $2600$基准测试和MuJoCo物理模拟器上评估了所提出的方法,结果显示EauDeQN在保持高性能的同时达到了高稀疏水平。
更新时间: 2025-06-20 14:24:49
领域: cs.LG,cs.AI
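The target-update rule described above can be sketched with magnitude pruning as the (assumed) pruning operator; the real algorithm operates on full networks rather than a single weight matrix.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out (approximately) the given fraction of smallest-magnitude entries."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    thresh = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) > thresh, weights, 0.0)

def target_update(online_weights, online_losses, prune_step=0.1):
    """At a target update, the lowest-loss online network becomes the target;
    the others restart as increasingly pruned copies of it (simplified sketch)."""
    best = int(np.argmin(online_losses))
    target = online_weights[best]
    new_online = [magnitude_prune(target, prune_step * (i + 1))
                  for i in range(len(online_weights))]
    return target, new_online

nets = [np.arange(-4.0, 5.0).reshape(3, 3) for _ in range(3)]
target, new_online = target_update(nets, online_losses=[0.3, 0.1, 0.2])
```

Because the winner is chosen by loss, sparsity only grows when a sparser candidate keeps up with learning, removing the need for a hand-designed sparsity schedule.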
Quantifying Azure RBAC Wildcard Overreach
Azure RBAC leverages wildcard permissions to simplify policy authoring, but this abstraction often obscures the actual set of allowed operations and undermines least-privilege guarantees. We introduce Belshazaar, a two-stage framework that targets both the effective permission set problem and the evaluation of wildcard permission spread. First, we formalize Azure action syntax via a context-free grammar and implement a compiler that expands any wildcard into its explicit action set. Second, we define an ultrametric diameter metric to quantify semantic overreach in wildcard scenarios. Applied to Microsoft's official catalog of 15481 actions, Belshazaar reveals that about 39 percent of actions admit a cross-Resource-Provider reach when associated with non-obvious wildcards, and that effective permission sets are effectively computable. These findings demonstrate that wildcard patterns can introduce substantial privilege bloat, and that our approach offers a scalable, semantics-driven path toward tighter, least-privilege RBAC policies in Azure environments.
Updated: 2025-06-20 14:19:11
标题: 量化Azure RBAC通配符权限的过度扩展
摘要: Azure RBAC利用通配符权限简化策略编写,但这种抽象通常会掩盖实际允许的操作集,并破坏最小权限保证。我们引入Belshazaar,一个旨在解决有效权限集问题和评估通配符权限分布的两阶段框架。首先,我们通过上下文无关语法形式化Azure操作语法,并实现一个编译器,将任何通配符扩展为其明确的操作集。其次,我们定义了一个超度量直径度量来量化通配符场景中的语义过度扩展。应用于微软的官方目录的15481个操作,Belshazaar揭示了大约39%的操作在与非明显通配符相关联时允许跨资源提供程序访问,并且有效权限集是可以有效计算的。这些发现表明通配符模式可能引入大量权限膨胀,并且我们的方法为Azure环境中更严格的最小权限RBAC策略提供了一条可伸缩的、语义驱动的路径。
更新时间: 2025-06-20 14:19:11
领域: cs.CR
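The wildcard-expansion step can be approximated in a few lines by matching patterns against the action catalog; this treats `*` as matching any substring, a simplification of the grammar-based compiler the paper describes.

```python
from fnmatch import fnmatch

def expand_wildcard(pattern, catalog):
    """Expand an RBAC action wildcard into its explicit action set by matching
    against the provider catalog (case-insensitive; '*' matches any substring)."""
    return sorted(a for a in catalog if fnmatch(a.lower(), pattern.lower()))

# A tiny stand-in for the real catalog of 15481 actions.
catalog = [
    "Microsoft.Compute/virtualMachines/read",
    "Microsoft.Compute/virtualMachines/write",
    "Microsoft.Compute/disks/read",
    "Microsoft.Storage/storageAccounts/read",
]
expanded = expand_wildcard("Microsoft.Compute/*/read", catalog)
```

Diffing the expanded set against what the policy author intended is the simplest way to surface the privilege bloat the abstract quantifies.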
Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning
This paper presents the IT-IST submission to the IWSLT 2025 Shared Task on Instruction Following Speech Processing. We submit results for the Short Track, i.e., speech recognition, translation, and spoken question answering. Our model is a unified speech-to-text model that integrates a pre-trained continuous speech encoder and text decoder through a first phase of modality alignment and a second phase of instruction fine-tuning. Crucially, we focus on using small-scale language model backbones (< 2B) and restrict to high-quality, CC-BY data along with synthetic data generation to supplement existing resources.
Updated: 2025-06-20 14:17:42
标题: Instituto de Telecomunicações在IWSLT 2025:对齐小规模语音和语言模型以进行语音到文本学习
摘要: 这篇论文介绍了IT-IST提交给IWSLT 2025共享任务关于指令跟随语音处理的结果。我们提交了短跟踪的结果,即语音识别、翻译和口头问题回答。我们的模型是一个统一的语音到文本模型,通过模态对齐的第一阶段和指令微调的第二阶段将预先训练的连续语音编码器和文本解码器整合在一起。关键是,我们专注于使用小规模语言模型骨干(< 2B),并限制在高质量的CC-BY数据以及合成数据生成来补充现有资源。
更新时间: 2025-06-20 14:17:42
领域: cs.CL,cs.AI
A Quantile Regression Approach for Remaining Useful Life Estimation with State Space Models
Predictive Maintenance (PdM) is pivotal in Industry 4.0 and 5.0, proactively enhancing efficiency through accurate equipment Remaining Useful Life (RUL) prediction, thus optimizing maintenance scheduling and reducing unexpected failures and premature interventions. This paper introduces a novel RUL estimation approach leveraging State Space Models (SSM) for efficient long-term sequence modeling. To handle model uncertainty, Simultaneous Quantile Regression (SQR) is integrated into the SSM, enabling multiple quantile estimations. The proposed method is benchmarked against traditional sequence modelling techniques (LSTM, Transformer, Informer) using the C-MAPSS dataset. Results demonstrate superior accuracy and computational efficiency of SSM models, underscoring their potential for high-stakes industrial applications.
Updated: 2025-06-20 14:15:55
标题: 一种基于状态空间模型的分位数回归方法用于剩余寿命估计
摘要: 预测性维护(PdM)在工业4.0和5.0中至关重要,通过准确预测设备剩余寿命(RUL),从而积极提高效率,优化维护计划并减少意外故障和过早干预。本文介绍了一种利用状态空间模型(SSM)进行有效长期序列建模的新型RUL估计方法。为了处理模型不确定性,将同时量化回归(SQR)集成到SSM中,实现多个分位数估计。所提出的方法通过使用C-MAPSS数据集对传统的序列建模技术(LSTM、Transformer、Informer)进行基准测试。结果表明,SSM模型具有优越的准确性和计算效率,突出了它们在高风险工业应用中的潜力。
更新时间: 2025-06-20 14:15:55
领域: cs.AI,cs.LG
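The quantile estimates rest on the pinball loss; a minimal version for a single quantile level is shown below (simultaneous quantile regression averages this over sampled levels, per the usual SQR formulation).

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for a single quantile level q in (0, 1):
    over-prediction is weighted by (1 - q), under-prediction by q."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        diff = yt - yp
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

y_true = [10.0, 12.0, 9.0]                                     # RUL targets (cycles)
median_loss = pinball_loss(y_true, [11.0, 11.0, 11.0], q=0.5)  # symmetric penalty
upper_loss = pinball_loss(y_true, [11.0, 11.0, 11.0], q=0.9)   # penalizes under-prediction more
```

Training one head per quantile level (or conditioning on q) yields the multiple quantile estimates the abstract mentions, giving prediction intervals rather than a point RUL.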
The Hidden Cost of an Image: Quantifying the Energy Consumption of AI Image Generation
With the growing adoption of AI image generation, in conjunction with the ever-increasing environmental resources demanded by AI, we are urged to answer a fundamental question: What is the environmental impact hidden behind each image we generate? In this research, we present a comprehensive empirical experiment designed to assess the energy consumption of AI image generation. Our experiment compares 17 state-of-the-art image generation models by considering multiple factors that could affect their energy consumption, such as model quantization, image resolution, and prompt length. Additionally, we consider established image quality metrics to study potential trade-offs between energy consumption and generated image quality. Results show that image generation models vary drastically in terms of the energy they consume, with up to a 46x difference. Image resolution affects energy consumption inconsistently, ranging from a 1.3x to 4.7x increase when doubling resolution. U-Net-based models tend to consume less than Transformer-based ones. Model quantization instead deteriorates the energy efficiency of most models, while prompt length and content have no statistically significant impact. Improving image quality does not always come at the cost of a higher energy consumption, with some of the models producing the highest quality images also being among the most energy efficient ones.
Updated: 2025-06-20 14:13:52
标题: 一个图像的隐藏成本:量化AI图像生成的能源消耗
摘要: 随着越来越多的人开始使用AI图像生成技术,并且由于AI对环境资源的需求不断增加,我们迫切需要回答一个基本问题:每个生成的图像背后隐藏着怎样的环境影响?在这项研究中,我们提出了一项全面的实证实验,旨在评估AI图像生成的能耗。我们的实验通过考虑可能影响其能耗的多个因素,如模型量化、图像分辨率和提示长度,比较了17种最先进的图像生成模型。此外,我们考虑了已建立的图像质量指标,以研究能耗与生成图像质量之间的潜在权衡。结果显示,图像生成模型在能耗方面存在显著差异,最高相差46倍。图像分辨率对能耗影响不一,将分辨率加倍后,能耗增加范围从1.3倍到4.7倍不等。基于U-Net的模型通常比基于Transformer的模型消耗更少。模型量化反而导致大多数模型的能效降低,而提示长度和内容则没有统计上显著的影响。提高图像质量并不总是意味着更高的能耗,一些生成最高质量图像的模型也是最节能的。
更新时间: 2025-06-20 14:13:52
领域: cs.LG,cs.MM
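Energy-per-image accounting of the kind used in such studies typically integrates sampled device power over the generation interval; a generic sketch, not the study's actual measurement harness:

```python
def energy_joules(timestamps_s, power_watts):
    """Trapezoidal integration of sampled device power draw into energy (J)."""
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_watts[i] + power_watts[i - 1]) * dt
    return total

# Hypothetical reading: a constant 200 W over a 10 s generation.
e = energy_joules([0.0, 5.0, 10.0], [200.0, 200.0, 200.0])   # joules
e_wh = e / 3600.0                                            # watt-hours per image
```

Dividing the integrated energy by the number of images in a batch gives the per-image figure that comparisons like the 46x spread are built on.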
Simulating Correlated Electrons with Symmetry-Enforced Normalizing Flows
We present the first proof of principle that normalizing flows can accurately learn the Boltzmann distribution of the fermionic Hubbard model - a key framework for describing the electronic structure of graphene and related materials. State-of-the-art methods like Hybrid Monte Carlo often suffer from ergodicity issues near the time-continuum limit, leading to biased estimates. Leveraging symmetry-aware architectures as well as independent and identically distributed sampling, our approach resolves these issues and achieves significant speed-ups over traditional methods.
Updated: 2025-06-20 14:13:47
标题: 用对称强制归一化流模拟相关电子
摘要: 我们首次证明了归一化流可以准确地学习费米子Hubbard模型的玻尔兹曼分布-这是描述石墨烯和相关材料电子结构的关键框架。像混合蒙特卡罗这样的最先进方法在接近时间连续极限时常常遇到遍历性问题,导致估计存在偏差。利用对称感知架构以及独立同分布的采样,我们的方法解决了这些问题,并实现了显著的速度提升,超过传统方法。
更新时间: 2025-06-20 14:13:47
领域: cond-mat.str-el,cs.LG,hep-lat
Can Large Language Models Replace Human Subjects? A Large-Scale Replication of Scenario-Based Experiments in Psychology and Management
Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) have shown promise in replicating human-like responses in various psychological experiments. We conducted a large-scale study replicating 156 psychological experiments from top social science journals using three state-of-the-art LLMs (GPT-4, Claude 3.5 Sonnet, and DeepSeek v3). Our results reveal that while LLMs demonstrate high replication rates for main effects (73-81%) and moderate to strong success with interaction effects (46-63%), they consistently produce larger effect sizes than human studies, with Fisher Z values approximately 2-3 times higher than human studies. Notably, LLMs show significantly lower replication rates for studies involving socially sensitive topics such as race, gender and ethics. When original studies reported null findings, LLMs produced significant results at remarkably high rates (68-83%) - while this could reflect cleaner data with less noise, as evidenced by narrower confidence intervals, it also suggests potential risks of effect size overestimation. Our results demonstrate both the promise and challenges of LLMs in psychological research, offering efficient tools for pilot testing and rapid hypothesis validation while enriching rather than replacing traditional human subject studies, yet requiring more nuanced interpretation and human validation for complex social phenomena and culturally sensitive research questions.
Updated: 2025-06-20 14:13:17
标题: 大型语言模型能否取代人类受试者?心理学和管理领域基于情景的实验的大规模复制
摘要: 人工智能(AI)越来越多地被整合到科学研究中,特别是在社会科学领域,其中理解人类行为至关重要。大型语言模型(LLMs)已经显示出在各种心理实验中复制类似人类响应的潜力。我们进行了一项大规模研究,利用三种最先进的LLMs(GPT-4、Claude 3.5 Sonnet和DeepSeek v3)复制了来自顶级社会科学期刊的156个心理实验。我们的结果显示,LLMs对主效应的复制率很高(73-81%),对交互效应的成功率中等到强(46-63%),但它们产生的效应大小一直比人类研究大,Fisher Z值约为人类研究的2-3倍。值得注意的是,LLMs在涉及种族、性别和伦理等社会敏感主题的研究中显示出显著较低的复制率。当原始研究报告零结果时,LLMs以显著高的比率产生显著结果(68-83%)-虽然这可能反映出数据更干净,噪音更少,如窄置信区间所证实的,但也表明了效应大小估计的潜在风险。我们的结果展示了LLMs在心理研究中的前景和挑战,为试点测试和快速假设验证提供了有效工具,同时丰富而不取代传统的人类研究对象研究,但对于复杂的社会现象和文化敏感的研究问题需要更加细致的解释和人类验证。
更新时间: 2025-06-20 14:13:17
领域: cs.CL,cs.AI,econ.GN,q-fin.EC
A Novel Approach to Differential Privacy with Alpha Divergence
As data-driven technologies advance swiftly, maintaining strong privacy measures becomes progressively difficult. Conventional $(\epsilon, \delta)$-differential privacy, while prevalent, exhibits limited adaptability for many applications. To mitigate these constraints, we present alpha differential privacy (ADP), an innovative privacy framework grounded in alpha divergence, which provides a more flexible assessment of privacy consumption. This study delineates the theoretical underpinnings of ADP and contrasts its performance with competing privacy frameworks across many scenarios. Empirical assessments demonstrate that ADP offers enhanced privacy guarantees in small to moderate iteration contexts, particularly where severe privacy requirements are necessary. The suggested method markedly improves privacy-preserving methods, providing a flexible solution for contemporary data analysis issues in a data-centric environment.
Updated: 2025-06-20 14:10:18
标题: 一种新颖的使用Alpha散度的差分隐私方法
摘要: 随着数据驱动技术的迅速发展,保持强大的隐私措施变得越来越困难。传统的$(\epsilon,\delta)$-差分隐私虽然普遍存在,但在许多应用中展现出有限的适应性。为了缓解这些限制,我们提出了alpha差分隐私(ADP),这是一个基于alpha散度的创新隐私框架,提供了对隐私消耗更灵活的评估。本研究阐明了ADP的理论基础,并在许多场景中将其性能与竞争性隐私框架进行对比。实证评估表明,在小到中等迭代环境中,ADP提供了增强的隐私保证,特别是在需要严格的隐私要求时。建议的方法显著改进了保护隐私的方法,为数据中心环境中的当代数据分析问题提供了灵活的解决方案。
更新时间: 2025-06-20 14:10:18
领域: cs.CR
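For reference, the Amari alpha-divergence family underlying ADP can be computed directly for discrete distributions; as alpha approaches 1 it recovers the KL divergence, which the snippet checks numerically.

```python
import math

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence between discrete distributions:
    D_a(p || q) = (sum_i p_i^a * q_i^(1-a) - 1) / (a * (a - 1)), a not in {0, 1}.
    Illustrative of the divergence family ADP builds on, not the ADP mechanism."""
    assert alpha not in (0.0, 1.0)
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (s - 1.0) / (alpha * (alpha - 1.0))

p = [0.6, 0.3, 0.1]
q = [0.5, 0.25, 0.25]
d_near_kl = alpha_divergence(p, q, alpha=0.999)   # should be close to KL(p || q)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Varying alpha interpolates between divergences that weight tail events differently, which is what gives the framework its flexibility in accounting privacy consumption.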
Decoupled Classifier-Free Guidance for Counterfactual Diffusion Models
Counterfactual image generation aims to simulate realistic visual outcomes under specific causal interventions. Diffusion models have recently emerged as a powerful tool for this task, combining DDIM inversion with conditional generation via classifier-free guidance (CFG). However, standard CFG applies a single global weight across all conditioning variables, which can lead to poor identity preservation and spurious attribute changes - a phenomenon known as attribute amplification. To address this, we propose Decoupled Classifier-Free Guidance (DCFG), a flexible and model-agnostic framework that introduces group-wise conditioning control. DCFG builds on an attribute-split embedding strategy that disentangles semantic inputs, enabling selective guidance on user-defined attribute groups. For counterfactual generation, we partition attributes into intervened and invariant sets based on a causal graph and apply distinct guidance to each. Experiments on CelebA-HQ, MIMIC-CXR, and EMBED show that DCFG improves intervention fidelity, mitigates unintended changes, and enhances reversibility, enabling more faithful and interpretable counterfactual image generation.
Updated: 2025-06-20 14:06:08
标题: 用于反事实扩散模型的解耦无分类器引导
摘要: 反事实图像生成旨在模拟在特定因果干预下的真实视觉结果。最近,扩散模型已经成为这一任务的强大工具,通过将DDIM反演与无分类器引导(CFG)结合来进行条件生成。然而,标准的CFG对所有条件变量应用单一全局权重,这可能导致较差的身份保留和虚假属性变化 - 这一现象被称为属性放大。为了解决这一问题,我们提出了解耦无分类器引导(DCFG),这是一个灵活的、与模型无关的框架,引入了分组条件控制。DCFG建立在一个属性分离嵌入策略上,解开语义输入,实现对用户定义的属性组的选择性指导。对于反事实生成,我们根据因果图将属性分为干预集和不变集,并对每个集应用不同的引导。在CelebA-HQ、MIMIC-CXR和EMBED上的实验表明,DCFG提高了干预的忠实性,减轻了意外变化,并增强了可逆性,使得反事实图像生成更加忠实和可解释。
更新时间: 2025-06-20 14:06:08
领域: cs.CV,cs.AI
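The group-wise guidance rule can be sketched as replacing the single CFG weight with one weight per attribute group; the combination below is a plausible form, not necessarily the paper's exact update.

```python
import numpy as np

def decoupled_cfg(eps_uncond, eps_cond_by_group, weights):
    """Combine denoiser predictions with one guidance weight per attribute group:
    eps = eps_uncond + sum_g w_g * (eps_cond_g - eps_uncond).
    Hypothetical sketch of group-wise guidance; standard CFG is the
    special case of a single group with one global weight."""
    out = eps_uncond.copy()
    for g, eps_g in eps_cond_by_group.items():
        out += weights[g] * (eps_g - eps_uncond)
    return out

# Toy 4-dimensional noise predictions for two attribute groups.
eps_u = np.zeros(4)
eps_groups = {"intervened": np.full(4, 1.0), "invariant": np.full(4, -0.5)}
guided = decoupled_cfg(eps_u, eps_groups, {"intervened": 2.0, "invariant": 1.0})
```

Giving the intervened group a strong weight while keeping the invariant group near 1.0 is the lever that curbs attribute amplification while preserving identity.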
Robust Reinforcement Learning for Discrete Compositional Generation via General Soft Operators
A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.
Updated: 2025-06-20 14:03:17
Subjects: cs.LG
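The abstract does not spell out the operator, but the idea of targeting peakier distributions than reward-proportional sampling can be illustrated with a simple tempered distribution. The `beta` parameterization is an assumption for illustration only, not the paper's unified operator:

```python
import numpy as np

def tempered_distribution(rewards, beta):
    """Return a sampling distribution proportional to rewards**beta.

    beta = 1 recovers reward-proportional sampling; beta > 1 yields a
    peakier distribution that concentrates mass on high-reward
    candidates, a simple stand-in for robustness-seeking operators.
    """
    logits = beta * np.log(np.asarray(rewards, dtype=float))
    logits -= logits.max()              # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rewards = [1.0, 2.0, 4.0]
p1 = tempered_distribution(rewards, beta=1.0)   # proportional sampling
p4 = tempered_distribution(rewards, beta=4.0)   # peakier: favors the best candidate
```

In a large search space, the `beta = 1` distribution spreads probability over many mediocre candidates, which is exactly the failure mode the abstract attributes to proportional sampling.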
From Thinking to Output: Chain-of-Thought and Text Generation Characteristics in Reasoning Language Models
Recently, there have been notable advancements in large language models (LLMs), demonstrating their growing abilities in complex reasoning. However, existing research largely overlooks a thorough and systematic comparison of these models' reasoning processes and outputs, particularly regarding their self-reflection pattern (also termed "Aha moment") and the interconnections across diverse domains. This paper proposes a novel framework for analyzing the reasoning characteristics of four cutting-edge large reasoning models (GPT-o1, DeepSeek-R1, Kimi-k1.5, and Grok-3) using keyword statistics and an LLM-as-a-judge paradigm. Our approach connects their internal thinking processes with their final outputs. A diverse dataset of real-world scenario-based questions covers logical deduction, causal inference, and multi-step problem-solving. Additionally, a set of metrics is put forward to assess both the coherence of reasoning and the accuracy of the outputs. The research results uncover various patterns of how these models balance exploration and exploitation, deal with problems, and reach conclusions during the reasoning process. Through quantitative and qualitative comparisons, disparities among these models are identified in aspects such as the depth of reasoning, the reliance on intermediate steps, and the degree of similarity between their thinking processes and output patterns and those of GPT-o1. This work offers valuable insights into the trade-off between computational efficiency and reasoning robustness and provides practical recommendations for enhancing model design and evaluation in practical applications. We publicly release our project at: https://github.com/ChangWenhan/FromThinking2Output
Updated: 2025-06-20 14:02:16
Subjects: cs.CL, cs.AI, cs.CR
Elevating Styled Mahjong Agents with Learning from Demonstration
A wide variety of bots in games enriches the gameplay experience and enhances replayability. Recent advancements in game artificial intelligence have predominantly focused on improving the proficiency of bots. Nevertheless, developing highly competent bots with a wide range of distinct play styles remains a relatively under-explored area. We select the Mahjong game environment as a case study. The high degree of randomness inherent in the Mahjong game and the prevalence of out-of-distribution states lead to suboptimal performance of existing offline learning and Learning-from-Demonstration (LfD) algorithms. In this paper, we leverage the gameplay histories of existing Mahjong agents and put forward a novel LfD algorithm that necessitates only minimal modifications to the Proximal Policy Optimization algorithm. The comprehensive empirical results illustrate that our proposed method not only significantly enhances the proficiency of the agents but also effectively preserves their unique play styles.
Updated: 2025-06-20 13:46:06
Subjects: cs.AI
Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments
Unsupervised Domain Adaptation (UDA) is a critical challenge in real-world vision systems, especially in resource-constrained environments like drones, where memory and computation are limited. Existing prompt-driven UDA methods typically rely on large vision-language models and require full access to source-domain data during adaptation, limiting their applicability. In this work, we propose Prmpt2Adpt, a lightweight and efficient zero-shot domain adaptation framework built around a teacher-student paradigm guided by prompt-based feature alignment. At the core of our method is a distilled and fine-tuned CLIP model, used as the frozen backbone of a Faster R-CNN teacher. A small set of low-level source features is aligned to the target domain semantics, specified only through a natural language prompt, via Prompt-driven Instance Normalization (PIN). These semantically steered features are used to briefly fine-tune the detection head of the teacher model. The adapted teacher then generates high-quality pseudo-labels, which guide the on-the-fly adaptation of a compact student model. Experiments on the MDS-A dataset demonstrate that Prmpt2Adpt achieves competitive detection performance compared to state-of-the-art methods, while delivering up to 7x faster adaptation and 5x faster inference speed using few source images, making it a practical and scalable solution for real-time adaptation in low-resource domains.
Updated: 2025-06-20 13:43:54
Subjects: cs.CV, cs.LG
CoIFNet: A Unified Framework for Multivariate Time Series Forecasting with Missing Values
Multivariate time series forecasting (MTSF) is a critical task with broad applications in domains such as meteorology, transportation, and economics. Nevertheless, pervasive missing values caused by sensor failures or human errors significantly degrade forecasting accuracy. Prior efforts usually employ an impute-then-forecast paradigm, leading to suboptimal predictions due to error accumulation and misaligned objectives between the two stages. To address this challenge, we propose the Collaborative Imputation-Forecasting Network (CoIFNet), a novel framework that unifies imputation and forecasting to achieve robust MTSF in the presence of missing values. Specifically, CoIFNet takes the observed values, mask matrix and timestamp embeddings as input, processing them sequentially through the Cross-Timestep Fusion (CTF) and Cross-Variate Fusion (CVF) modules to capture temporal dependencies that are robust to missing values. We provide theoretical justifications on how our CoIFNet learning objective improves the performance bound of MTSF with missing values. Through extensive experiments on challenging MTSF benchmarks, we demonstrate the effectiveness and computational efficiency of our proposed approach across diverse missing-data scenarios, e.g., CoIFNet outperforms the state-of-the-art method by $\underline{\textbf{24.40}}$% ($\underline{\textbf{23.81}}$%) at a point (block) missing rate of 0.6, while improving memory and time efficiency by $\underline{\boldsymbol{4.3\times}}$ and $\underline{\boldsymbol{2.1\times}}$, respectively. Our code is available at: https://github.com/KaiTang-eng/CoIFNet.
Updated: 2025-06-20 13:39:42
Subjects: cs.LG, stat.ML
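As a rough sketch of feeding imputation and forecasting from one shared input, the observed values, mask matrix, and timestamp embeddings can be stacked into a single tensor. This exact layout is a guess for illustration, not CoIFNet's actual fusion design:

```python
import numpy as np

def build_joint_input(x, mask, t_embed):
    """Stack observed values, the binary mask, and timestamp embeddings
    along the feature axis, so downstream modules see both the data and
    the missingness pattern together.

    x       : raw series with missing entries zero-filled, shape (T, V)
    mask    : 1 where observed, 0 where missing, shape (T, V)
    t_embed : timestamp features, shape (T, E)
    """
    return np.concatenate([x * mask, mask, t_embed], axis=-1)

x = np.array([[1.0, 2.0], [3.0, 0.0]])
m = np.array([[1.0, 1.0], [1.0, 0.0]])   # last value is missing
t = np.array([[0.0], [1.0]])
inp = build_joint_input(x, m, t)          # shape (2, 5)
```

Passing the mask alongside the values (rather than imputing first and discarding it) is what lets a single objective stay aligned across the imputation and forecasting roles.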
TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs
LaTeX's precision and flexibility in typesetting have made it the gold standard for the preparation of scientific documentation. Large Language Models (LLMs) present a promising opportunity for researchers to produce publication-ready material using LaTeX with natural language instructions, yet current benchmarks completely lack evaluation of this ability. By introducing TeXpert, our benchmark dataset with natural language prompts for generating LaTeX code focused on components of scientific documents across multiple difficulty levels, we conduct an in-depth analysis of LLM performance in this regard and identify frequent error types. Our evaluation across open and closed-source LLMs highlights multiple key findings: LLMs excelling on standard benchmarks perform poorly in LaTeX generation with a significant accuracy drop-off as the complexity of tasks increases; open-source models like DeepSeek v3 and DeepSeek Coder strongly rival closed-source counterparts in LaTeX tasks; and formatting and package errors are unexpectedly prevalent, suggesting a lack of diverse LaTeX examples in the training datasets of most LLMs. Our dataset, code, and model evaluations are available at https://github.com/knowledge-verse-ai/TeXpert.
Updated: 2025-06-20 13:39:16
Subjects: cs.CL, cs.AI
SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments
We introduce Shakti, a 2.5 billion parameter language model specifically optimized for resource-constrained environments such as edge devices, including smartphones, wearables, and IoT systems. Shakti combines high-performance NLP with optimized efficiency and precision, making it ideal for real-time AI applications where computational resources and memory are limited. With support for vernacular languages and domain-specific tasks, Shakti excels in industries such as healthcare, finance, and customer service. Benchmark evaluations demonstrate that Shakti performs competitively against larger models while maintaining low latency and on-device efficiency, positioning it as a leading solution for edge AI.
Updated: 2025-06-20 13:37:11
Subjects: cs.CL, cs.CV, cs.LG
The learned range test method for the inverse inclusion problem
We consider the inverse problem consisting of the reconstruction of an inclusion $B$ contained in a bounded domain $\Omega\subset\mathbb{R}^d$ from a single pair of Cauchy data $(u|_{\partial\Omega},\partial_\nu u|_{\partial\Omega})$, where $\Delta u=0$ in $\Omega\setminus\overline B$ and $u=0$ on $\partial B$. We show that the reconstruction algorithm based on the range test, a domain sampling method, can be written as a neural network with a specific architecture. We propose to learn the weights of this network in the framework of supervised learning, and to combine it with a pre-trained classifier, with the purpose of distinguishing the inclusions based on their distance from the boundary. The numerical simulations show that this learned range test method provides accurate and stable reconstructions of polygonal inclusions. Furthermore, the results are superior to those obtained with the standard range test method (without learning) and with an end-to-end fully connected deep neural network, a purely data-driven method.
Updated: 2025-06-20 13:35:41
Subjects: math.NA, cs.LG, cs.NA, 35R30, 65N21, 68T07
Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities
The increasing demand for generative AI services such as Large Language Models (LLMs) has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental operation in LLM computations. We present a detailed characterization of how Grayskull's execution model, grid size, matrix dimensions, data formats, and numerical precision impact computational efficiency. Furthermore, we compare Grayskull's performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.
Updated: 2025-06-20 13:34:13
Subjects: cs.PF, cs.AI, cs.AR
Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond
Accurately assessing student knowledge is critical for effective education, yet traditional Knowledge Tracing (KT) methods rely on opaque latent embeddings, limiting interpretability. Even LLM-based approaches generate direct predictions or summaries that may hallucinate without any accuracy guarantees. We recast KT as an inverse problem: learning the minimum natural-language summary that makes past answers explainable and future answers predictable. Our Language Bottleneck Model (LBM) consists of an encoder LLM that writes an interpretable knowledge summary and a frozen decoder LLM that must reconstruct and predict student responses using only that summary text. By constraining all predictive information to pass through a short natural-language bottleneck, LBMs ensure that the summary contains accurate information while remaining human-interpretable. Experiments on synthetic arithmetic benchmarks and the large-scale Eedi dataset show that LBMs rival the accuracy of state-of-the-art KT and direct LLM methods while requiring orders-of-magnitude fewer student trajectories. We demonstrate that training the encoder with group-relative policy optimization, using downstream decoding accuracy as a reward signal, effectively improves summary quality.
Updated: 2025-06-20 13:21:14
Subjects: cs.CL, cs.AI, cs.LG
Belted and Ensembled Neural Network for Linear and Nonlinear Sufficient Dimension Reduction
We introduce a unified, flexible, and easy-to-implement framework of sufficient dimension reduction that can accommodate both linear and nonlinear dimension reduction, and both the conditional distribution and the conditional mean as the targets of estimation. This unified framework is achieved by a specially structured neural network -- the Belted and Ensembled Neural Network (BENN) -- that consists of a narrow latent layer, which we call the belt, and a family of transformations of the response, which we call the ensemble. By strategically placing the belt at different layers of the neural network, we can achieve linear or nonlinear sufficient dimension reduction, and by choosing the appropriate transformation families, we can achieve dimension reduction for the conditional distribution or the conditional mean. Moreover, thanks to the advantage of the neural network, the method is very fast to compute, overcoming a computation bottleneck of the traditional sufficient dimension reduction estimators, which involves the inversion of a matrix of dimension either p or n. We develop the algorithm and convergence rate of our method, compare it with existing sufficient dimension reduction methods, and apply it to two data examples.
Updated: 2025-06-20 13:19:43
Subjects: stat.ML, cs.LG, math.ST, stat.TH
SmartGuard: Leveraging Large Language Models for Network Attack Detection through Audit Log Analysis and Summarization
End-point monitoring solutions are widely deployed in today's enterprise environments to support advanced attack detection and investigation. These monitors continuously record system-level activities as audit logs and provide deep visibility into security events. Unfortunately, existing methods of semantic analysis based on audit logs have low granularity, only reaching the system call level, making it difficult to effectively classify highly covert behaviors. Additionally, existing works mainly match audit log streams with rule knowledge bases describing behaviors, which heavily rely on expertise and lack the ability to detect unknown attacks and provide interpretive descriptions. In this paper, we propose SmartGuard, an automated method that combines abstracted behaviors from audit event semantics with large language models. SmartGuard extracts specific behaviors (function level) from incoming system logs and constructs a knowledge graph, divides events by threads, and combines event summaries with graph embeddings to achieve information diagnosis and provide explanatory narratives through large language models. Our evaluation shows that SmartGuard achieves an average F1 score of 96% in assessing malicious behaviors and demonstrates good scalability across multiple models and unknown attacks. It also possesses excellent fine-tuning capabilities, allowing experts to assist in timely system updates.
Updated: 2025-06-20 13:19:17
Subjects: cs.CR
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Current feature description methods face two critical challenges: limited robustness and the flawed assumption that each neuron encodes only a single concept (monosemanticity), despite growing evidence that neurons are often polysemantic. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework that captures the inherent complexity of neural network features. Unlike prior approaches that assign a single description per feature, PRISM provides more nuanced descriptions for both polysemantic and monosemantic features. We apply PRISM to language models and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).
Updated: 2025-06-20 13:17:52
Subjects: cs.LG, cs.AI, cs.CL
Latent Concept Disentanglement in Transformer-based Language Models
When large language models (LLMs) use in-context learning (ICL) to solve a new task, they seem to grasp not only the goal of the task but also core, latent concepts in the demonstration examples. This begs the question of whether transformers represent latent structures as part of their computation or whether they take shortcuts to solve the problem. Prior mechanistic work on ICL does not address this question because it does not sufficiently examine the relationship between the learned representation and the latent concept, and the considered problem settings often involve only single-step reasoning. In this work, we examine how transformers disentangle and use latent concepts. We show that in 2-hop reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. In tasks parameterized by a continuous latent concept, we find low-dimensional subspaces in the representation space where the geometry mimics the underlying parameterization. Together, these results refine our understanding of ICL and the representation of transformers, and they provide evidence for highly localized structures in the model that disentangle latent concepts in ICL tasks.
Updated: 2025-06-20 13:08:12
Subjects: cs.LG, cs.AI, cs.CL
Mask-PINNs: Regulating Feature Distributions in Physics-Informed Neural Networks
Physics-Informed Neural Networks (PINNs) have emerged as a powerful framework for solving partial differential equations (PDEs) by embedding physical laws directly into the loss function. However, effective training of PINNs remains challenging due to internal covariate shift, which destabilizes feature distributions and impairs model expressiveness. While normalization techniques like Batch Normalization and Layer Normalization are standard remedies in deep learning, they disrupt the pointwise input-output mappings critical to preserving the physical consistency in PINNs. In this work, we introduce Mask-PINNs, a novel architecture that regulates internal feature distributions through a smooth, learnable mask function applied pointwise across hidden layers. Unlike conventional normalization methods, the proposed mask function preserves the deterministic nature of input-output relationships while suppressing activation drift and saturation. Theoretically, we demonstrate that Mask-PINNs control feature spread near initialization by attenuating gradient variance growth through a tailored modulation mechanism. Empirically, we validate the method on multiple PDE benchmarks across diverse activation functions. Our results show consistent improvements in prediction accuracy, convergence stability, and robustness, with relative L2 errors reduced by up to two orders of magnitude over baseline models. Furthermore, we demonstrate that Mask-PINNs enable the effective use of wider networks, overcoming a key limitation in existing PINN frameworks.
Updated: 2025-06-20 13:08:04
Subjects: cs.LG, cs.AI
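A minimal sketch of a smooth, learnable, pointwise mask on a hidden layer follows. The sigmoid parameterization and its placement on the pre-activations are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def masked_layer(h, W, b, gamma):
    """One hidden layer with a smooth, learnable pointwise mask.

    The mask sigmoid(gamma) in (0, 1) rescales each pre-activation
    feature, damping activation drift and saturation. Unlike batch
    statistics, it keeps the layer's input-output map deterministic,
    which matters for preserving physical consistency in a PINN.
    gamma is a learnable per-feature parameter.
    """
    z = h @ W + b
    mask = 1.0 / (1.0 + np.exp(-gamma))   # smooth mask in (0, 1)
    return np.tanh(mask * z)

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))
W = rng.normal(size=(4, 4))
b = np.zeros(4)
out = masked_layer(h, W, b, gamma=np.zeros(4))   # mask = 0.5 everywhere
```

Because the mask depends only on `gamma` and not on batch statistics, two identical inputs always produce identical outputs, which is the property the abstract contrasts against Batch/Layer Normalization.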
Formal Control for Uncertain Systems via Contract-Based Probabilistic Surrogates (Extended Version)
The requirement for identifying accurate system representations has not only been a challenge to fulfill, but it has compromised the scalability of formal methods, as the resulting models are often too complex for effective decision making with formal correctness and performance guarantees. Focusing on probabilistic simulation relations and surrogate models of stochastic systems, we propose an approach that significantly enhances the scalability and practical applicability of such simulation relations by eliminating the need to compute error bounds directly. As a result, we provide an abstraction-based technique that scales effectively to higher dimensions while addressing complex nonlinear agent-environment interactions with infinite-horizon temporal logic guarantees amidst uncertainty. Our approach trades scalability for conservatism favorably, as demonstrated on a complex high-dimensional vehicle intersection case study.
Updated: 2025-06-20 13:00:50
Subjects: cs.SY, cs.AI, cs.MA, eess.SY
PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval
Differentiable Search Index (DSI) utilizes pre-trained language models to perform indexing and document retrieval via end-to-end learning without relying on external indexes. However, DSI requires full re-training to index new documents, causing significant computational inefficiencies. Continual learning (CL) offers a solution by enabling the model to incrementally update without full re-training. Existing CL solutions in document retrieval rely on memory buffers or generative models for rehearsal, which is infeasible when accessing previous training data is restricted due to privacy concerns. To this end, we introduce PromptDSI, a prompt-based, rehearsal-free continual learning approach for document retrieval. PromptDSI follows the Prompt-based Continual Learning (PCL) framework, using learnable prompts to efficiently index new documents without accessing previous documents or queries. To improve retrieval latency, we remove the initial forward pass of PCL, which otherwise greatly increases training and inference time, with a negligible trade-off in performance. Additionally, we introduce a novel topic-aware prompt pool that employs neural topic embeddings as fixed keys, eliminating the instability of prompt key optimization while maintaining competitive performance with existing PCL prompt pools. In a challenging rehearsal-free continual learning setup, we demonstrate that PromptDSI variants outperform rehearsal-based baselines, match the strong cache-based baseline in mitigating forgetting, and significantly improving retrieval performance on new corpora.
Updated: 2025-06-20 12:59:40
Subjects: cs.IR, cs.AI, cs.CL, cs.LG
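The fixed-key prompt selection can be sketched as a nearest-neighbor lookup against frozen topic embeddings; only the prompt values would be trained. The function and shapes below are hypothetical:

```python
import numpy as np

def select_prompts(query, topic_keys, top_k=2):
    """Pick prompt indices whose fixed topic keys best match the query.

    topic_keys are frozen neural topic embeddings used as keys, so only
    the prompt values are learned -- sidestepping the instability of
    optimizing prompt keys directly.
    """
    q = query / np.linalg.norm(query)
    k = topic_keys / np.linalg.norm(topic_keys, axis=1, keepdims=True)
    sims = k @ q                          # cosine similarity per key
    return np.argsort(-sims)[:top_k]

keys = np.eye(3)                          # three orthogonal topic embeddings
idx = select_prompts(np.array([0.9, 0.1, 0.0]), keys, top_k=1)
```

With the keys frozen, indexing a new corpus only updates the selected prompt values, which is what makes the approach rehearsal-free: no previous documents or queries are needed.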
MM-AttacKG: A Multimodal Approach to Attack Graph Construction with Large Language Models
Cyber Threat Intelligence (CTI) parsing aims to extract key threat information from massive data and transform it into actionable intelligence, enhancing threat detection and defense efficiency; its core tasks include attack graph construction, intelligence fusion, and indicator extraction. Among these research topics, Attack Graph Construction (AGC) is essential for visualizing and understanding the potential attack paths of threat events from CTI reports. Existing approaches primarily construct the attack graphs purely from the textual data to reveal the logical threat relationships between entities within the attack behavioral sequence. However, they typically overlook the specific threat information inherent in visual modalities, which preserves the key threat details from inherently multimodal CTI reports. Therefore, we enhance the effectiveness of attack graph construction by analyzing visual information through Multimodal Large Language Models (MLLMs). Specifically, we propose a novel framework, MM-AttacKG, which can effectively extract key information from threat images and integrate it into attack graph construction, thereby enhancing the comprehensiveness and accuracy of attack graphs. It first employs a threat image parsing module to extract critical threat information from images and generate descriptions using MLLMs. Subsequently, it builds an iterative question-answering pipeline tailored for image parsing to refine the understanding of threat images. Finally, it achieves content-level integration between attack graphs and image-based answers through MLLMs, completing threat information enhancement. The experimental results demonstrate that MM-AttacKG can accurately identify key information in threat images and significantly improve the quality of multimodal attack graph construction, effectively addressing the shortcomings of existing methods in utilizing image-based threat information.
Updated: 2025-06-20 12:59:31
Categories: cs.CR,cs.CY
RocketStack: A level-aware deep recursive ensemble learning framework with exploratory feature fusion and model pruning dynamics
Ensemble learning remains a cornerstone of machine learning, with stacking used to integrate predictions from multiple base learners through a meta-model. However, deep stacking remains rare, as most designs prioritize horizontal diversity over recursive depth due to model complexity, feature redundancy, and computational burden. To address these challenges, RocketStack, a level-aware recursive ensemble framework, is introduced and explored up to ten stacking levels, extending beyond prior architectures. The framework incrementally prunes weaker learners at each level, enabling deeper stacking without excessive complexity. To mitigate early performance saturation, mild Gaussian noise is added to out-of-fold (OOF) scores before pruning, and this is compared against strict OOF pruning. Furthermore, both per-level and periodic feature compression are explored using attention-based selection, the Simple, Fast, Efficient (SFE) filter, and autoencoders. Across 33 datasets (23 binary, 10 multi-class), linear-trend tests confirmed rising accuracy with depth in most variants, and the top-performing meta-model at each level increasingly outperformed the strongest standalone ensemble. In the binary subset, periodic SFE with mild OOF-score randomization reached 97.08% at level 10, 5.14% above the strict-pruning configuration, and cut runtime by 10.5% relative to no compression. In the multi-class subset, periodic attention selection reached 98.60% at level 10, exceeding the strongest baseline by 6.11%, while reducing runtime by 56.1% and feature dimensionality by 74% compared to no compression. These findings highlight mild randomization as an effective regularizer and periodic compression as a stabilizer. Echoing the design of multistage rockets in aerospace (prune, compress, propel), RocketStack achieves deep recursive ensembling with tractable complexity.
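The level-wise prune-with-noise step described above can be sketched in a few lines. This is a minimal illustration under assumed names and hyperparameters (`keep_frac`, `noise_sd` are illustrative, not the paper's settings), using out-of-fold accuracies as learner scores:

```python
import random

def prune_learners(oof_scores, keep_frac=0.5, noise_sd=0.01, seed=0):
    """Keep the top fraction of learners by (optionally noised) OOF score.

    oof_scores: dict mapping learner name -> out-of-fold accuracy.
    noise_sd:   std. dev. of the mild Gaussian perturbation added before
                ranking; noise_sd=0.0 recovers strict OOF pruning.
    """
    rng = random.Random(seed)
    noised = {name: s + rng.gauss(0.0, noise_sd) for name, s in oof_scores.items()}
    ranked = sorted(noised, key=noised.get, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_frac))
    return ranked[:n_keep]

def recursive_stack(level_scores, keep_frac=0.5, noise_sd=0.01):
    """Apply level-wise pruning recursively; returns the survivors per level."""
    survivors = []
    current = dict(level_scores)
    level = 0
    while len(current) > 1:
        kept = prune_learners(current, keep_frac, noise_sd, seed=level)
        current = {k: current[k] for k in kept}
        survivors.append(kept)
        level += 1
    return survivors
```

With `noise_sd=0.0` this reduces to strict pruning by OOF score; small positive values occasionally let a slightly weaker learner survive, which is the regularizing effect the paper reports.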
Updated: 2025-06-20 12:52:44
Categories: cs.LG,stat.ML
LogProber: Disentangling confidence from contamination in LLM responses
In machine learning, contamination refers to situations where testing data leak into the training set. The issue is particularly relevant for the evaluation of the performance of Large Language Models (LLMs), which are generally trained on gargantuan, and generally opaque, corpora of text scraped from the world wide web. Developing tools to detect contamination is therefore crucial to be able to fairly and properly track the evolution of the performance of LLMs. To date, only a few recent studies have attempted to address the issue of quantifying and detecting contamination in short text sequences, such as those commonly found in benchmarks. However, these methods have limitations that can sometimes render them impractical. In the present paper, we introduce LogProber, a novel, efficient algorithm that detects contamination in a black-box setting and addresses some of these drawbacks by focusing on familiarity with the question rather than the answer. We explore the properties of the proposed method in comparison with concurrent approaches, identify its advantages and limitations, and illustrate how different forms of contamination can go undetected depending on the design of the detection algorithm.
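The question-familiarity idea can be illustrated with a toy sketch, assuming per-token log-probabilities of the question text are already available from some model; the scoring and threshold below are illustrative assumptions, not LogProber's actual statistic:

```python
def familiarity_score(token_logprobs):
    """Mean per-token log-probability of a text under the model:
    higher (closer to zero) means the model finds the text more familiar."""
    return sum(token_logprobs) / len(token_logprobs)

def flag_contamination(question_logprobs, threshold=-1.5):
    """Flag a benchmark item as possibly seen during training when the
    *question* text alone is assigned unusually high probability,
    independently of whether the model answers it correctly."""
    return familiarity_score(question_logprobs) > threshold
```

Scoring the question rather than the answer is the key design choice: a model can answer correctly for legitimate reasons, but assigning near-zero surprisal to an arbitrary benchmark question is harder to explain without prior exposure.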
Updated: 2025-06-20 12:52:37
Categories: cs.CL,cs.AI,cs.LG
Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches lack a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of the generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a range of medical visual question answering and reasoning benchmarks. Code is available on GitHub (manglu097/Chiron-o1).
Updated: 2025-06-20 12:51:19
Categories: cs.CV,cs.AI,cs.CL
SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code
The code generation capabilities of large language models (LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code. In this work, we introduce SafeGenBench, a benchmark specifically designed to assess the security of LLM-generated code. The dataset encompasses a wide range of common software development scenarios and vulnerability types. Building upon this benchmark, we develop an automatic evaluation framework that leverages both static application security testing (SAST) and LLM-based judging to assess the presence of security vulnerabilities in model-generated code. Through an empirical evaluation of state-of-the-art LLMs on SafeGenBench, we reveal notable deficiencies in their ability to produce vulnerability-free code. Our findings highlight pressing challenges and offer actionable insights for future advancements in the secure code generation performance of LLMs. The data and code will be released soon.
Updated: 2025-06-20 12:42:57
Categories: cs.CR,cs.AI
Machine Learning Methods for Small Data and Upstream Bioprocessing Applications: A Comprehensive Review
Data is crucial for machine learning (ML) applications, yet acquiring large datasets can be costly and time-consuming, especially in complex, resource-intensive fields like biopharmaceuticals. A key process in this industry is upstream bioprocessing, where living cells are cultivated and optimised to produce therapeutic proteins and biologics. The intricate nature of these processes, combined with high resource demands, often limits data collection, resulting in smaller datasets. This comprehensive review explores ML methods designed to address the challenges posed by small data and classifies them into a taxonomy to guide practical applications. Furthermore, each method in the taxonomy is thoroughly analysed, with a detailed discussion of its core concepts and an evaluation of its effectiveness in tackling small data challenges, as demonstrated by application results in upstream bioprocessing and other related domains. By analysing how these methods tackle small data challenges from different perspectives, this review provides actionable insights, identifies current research gaps, and offers guidance for leveraging ML in data-constrained environments.
Updated: 2025-06-20 12:36:26
Categories: cs.LG,cs.AI
LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models
Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.
Updated: 2025-06-20 12:32:27
Categories: cs.CV,cs.LG
AI-based Approach in Early Warning Systems: Focus on Emergency Communication Ecosystem and Citizen Participation in Nordic Countries
Climate change and natural disasters are recognized as worldwide challenges requiring complex and efficient ecosystems to deal with social, economic, and environmental effects. This chapter advocates a holistic approach, distinguishing the preparedness, emergency response, and post-crisis phases. The roles of Early Warning Systems (EWS), risk modeling, and mitigation measures are particularly emphasized. The chapter reviews the various Artificial Intelligence (AI) enabling technologies that can be leveraged at each phase, focusing on the INFORM risk framework and EWSs. Emergency communication and the psychological perception of risk during emergency response are emphasized. Finally, a set of case studies from Nordic countries is highlighted.
Updated: 2025-06-20 12:32:16
Categories: cs.CY,cs.AI
Selective Use of Yannakakis' Algorithm to Improve Query Performance: Machine Learning to the Rescue
Query optimization has played a central role in database research for decades. However, more often than not, the proposed optimization techniques lead to a performance improvement in some, but not in all, situations. Therefore, we urgently need a methodology for designing a decision procedure that decides for a given query whether the optimization technique should be applied or not. In this work, we propose such a methodology with a focus on Yannakakis-style query evaluation as our optimization technique of interest. More specifically, we formulate this decision problem as an algorithm selection problem and we present a Machine Learning based approach for its solution. Empirical results with several benchmarks on a variety of database systems show that our approach indeed leads to a statistically significant performance improvement.
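One simple way to instantiate such a per-query decision procedure is nearest-neighbour algorithm selection over query features; the features (join count, log of estimated intermediate rows) and labels below are illustrative assumptions, not the paper's actual feature set or learner:

```python
def nearest_neighbour_select(features, history):
    """Pick the evaluation strategy of the most similar past query.

    features: tuple of numeric query features, e.g. (n_joins, log_est_rows).
    history:  list of (features, best_strategy) pairs gathered from past
              benchmark runs, where best_strategy is 'yannakakis' or
              'standard' depending on which evaluated faster.
    """
    def dist(a, b):
        # squared Euclidean distance between feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(history, key=lambda item: dist(item[0], features))
    return best[1]
```

In practice a stronger classifier would replace the 1-NN rule, but the interface is the same: features in, a binary apply/don't-apply decision out, trained on measured runtimes of both strategies.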
Updated: 2025-06-20 12:21:20
Categories: cs.DB,cs.AI
Solving a class of stochastic optimal control problems by physics-informed neural networks
The aim of this work is to develop a deep learning method for solving high-dimensional stochastic control problems based on the Hamilton--Jacobi--Bellman (HJB) equation and physics-informed learning. Our approach parameterizes the feedback control and the value function using a decoupled neural network with multiple outputs. We train this network using a loss function with penalty terms that enforce the HJB equation along sampled trajectories generated by the controlled system. Numerical results on various applications demonstrate that the proposed approach is efficient and broadly applicable.
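In generic notation (a sketch under assumed dynamics dX_t = b(X_t, u) dt + sigma dW_t, running cost r, and terminal cost g; not necessarily the paper's exact formulation), a penalty-based loss for a value network V_theta and control network u_phi over N sampled trajectories with M time points each could take the form:

```latex
\mathcal{L}(\theta,\phi)
= \frac{1}{N}\sum_{i=1}^{N} \big| V_\theta(x_T^{(i)}) - g(x_T^{(i)}) \big|^2
+ \lambda\, \frac{1}{NM}\sum_{i=1}^{N}\sum_{m=1}^{M}
\Big| \partial_t V_\theta
+ b\big(x_{t_m}^{(i)}, u_\phi\big) \cdot \nabla V_\theta
+ \tfrac{1}{2}\operatorname{Tr}\!\big(\sigma\sigma^{\top} \nabla^2 V_\theta\big)
+ r\big(x_{t_m}^{(i)}, u_\phi\big) \Big|^2
```

The first term fits the terminal condition V(T, ·) = g, while the second penalizes the HJB residual at sampled states, so both networks are trained jointly without labelled optimal controls.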
Updated: 2025-06-20 12:18:30
Categories: math.OC,cs.LG
Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
We develop a framework to quantify the time-to-unsafe-sampling - the number of large language model (LLM) generations required to trigger an unsafe (e.g., toxic) response. Estimating this quantity is challenging, since unsafe responses are exceedingly rare in well-aligned LLMs, potentially occurring only once in thousands of generations. As a result, directly estimating time-to-unsafe-sampling would require collecting training data with a prohibitively large number of generations per prompt. However, with realistic sampling budgets, we often cannot generate enough responses to observe an unsafe outcome for every prompt, leaving the time-to-unsafe-sampling unobserved in many cases, making the estimation and evaluation tasks particularly challenging. To address this, we frame this estimation problem as one of survival analysis and develop a provably calibrated lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt, leveraging recent advances in conformal prediction. Our key innovation is designing an adaptive, per-prompt sampling strategy, formulated as a convex optimization problem. The objective function guiding this optimized sampling allocation is designed to reduce the variance of the estimators used to construct the LPB, leading to improved statistical efficiency over naive methods that use a fixed sampling budget per prompt. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
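The flavour of a conformal lower predictive bound can be seen in a deliberately simplified sketch that ignores censoring and per-prompt covariates (both of which the paper's method handles); it returns an empirical order statistic that is marginally valid under exchangeability of the calibration and test prompts:

```python
import math

def conformal_lpb(calib_times, alpha=0.1):
    """Distribution-free lower predictive bound on time-to-unsafe-sampling.

    calib_times: fully observed calibration times (number of generations
                 until the first unsafe response) for exchangeable prompts.
    Returns L such that P(T_new >= L) >= 1 - alpha.
    Toy version: no censoring, no covariates, no optimized sampling budget.
    """
    n = len(calib_times)
    k = math.floor(alpha * (n + 1))  # index of the guaranteeing order statistic
    if k < 1:
        return 1  # too little calibration data: trivial bound of one generation
    return sorted(calib_times)[k - 1]
```

The paper's contribution lies precisely in what this sketch omits: treating unobserved times as censored survival data and allocating the per-prompt sampling budget to reduce estimator variance.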
Updated: 2025-06-20 12:12:17
Categories: cs.LG,stat.AP,stat.ML
Gaussian Processes and Reproducing Kernels: Connections and Equivalences
This monograph studies the relations between two approaches using positive definite kernels: probabilistic methods using Gaussian processes, and non-probabilistic methods using reproducing kernel Hilbert spaces (RKHS). They are widely studied and used in machine learning, statistics, and numerical analysis. Connections and equivalences between them are reviewed for fundamental topics such as regression, interpolation, numerical integration, distributional discrepancies, and statistical dependence, as well as for sample path properties of Gaussian processes. A unifying perspective for these equivalences is established, based on the equivalence between the Gaussian Hilbert space and the RKHS. The monograph serves as a basis to bridge many other methods based on Gaussian processes and reproducing kernels, which are developed in parallel by the two research communities.
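The best-known of these equivalences, stated here in standard notation as a reminder rather than a quotation from the monograph: for training data (x_i, y_i), kernel matrix K_ij = k(x_i, x_j), and noise variance sigma^2, the GP posterior mean coincides with the kernel ridge regression solution in the RKHS H_k:

```latex
\bar{f}(x) = \mathbf{k}_n(x)^{\top} \big(K + \sigma^2 I\big)^{-1} \mathbf{y},
\qquad
\bar{f} = \operatorname*{arg\,min}_{f \in \mathcal{H}_k}
\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \sigma^2\, \lVert f \rVert_{\mathcal{H}_k}^2,
```

where \mathbf{k}_n(x) = (k(x, x_1), \dots, k(x, x_n))^{\top}. The monograph develops this correspondence and its analogues for interpolation, numerical integration, distributional discrepancies, and statistical dependence.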
Updated: 2025-06-20 12:08:18
Categories: stat.ML,cs.LG,cs.NA,math.NA,math.PR,math.ST,stat.TH
Enhancing Expressivity of Quantum Neural Networks Based on the SWAP test
Parameterized quantum circuits represent promising architectures for machine learning applications, yet many lack clear connections to classical models, potentially limiting their ability to translate the wide success of classical neural networks to the quantum realm. We examine a specific type of quantum neural network (QNN) built exclusively from SWAP test circuits, and discuss its mathematical equivalence to a classical two-layer feedforward network with quadratic activation functions under amplitude encoding. Our analysis across classical real-world and synthetic datasets reveals that while this architecture can successfully learn many practical tasks, it exhibits fundamental expressivity limitations due to violating the universal approximation theorem, particularly failing on harder problems like the parity check function. To address this limitation, we introduce a circuit modification using generalized SWAP test circuits that effectively implements classical neural networks with product layers. This enhancement enables successful learning of parity check functions in arbitrary dimensions which we analytically argue to be impossible for the original architecture beyond two dimensions regardless of network size. Our results establish a framework for enhancing QNN expressivity through classical task analysis and demonstrate that our SWAP test-based architecture offers broad representational capacity, suggesting potential promise also for quantum learning tasks.
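For real amplitude vectors the claimed correspondence is easy to verify classically: the SWAP test accepts with probability (1 + |<psi|phi>|^2)/2, so with an amplitude-encoded input x and a trainable weight state w the recoverable quantity |<w|x>|^2 = (w^T x)^2 is exactly a quadratic activation applied to a linear map. A small sketch (function names are illustrative):

```python
import math

def normalise(v):
    """Amplitude encoding requires unit-norm vectors."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def swap_test_prob(psi, phi):
    """Probability of measuring |0> on the SWAP-test ancilla:
    p = (1 + |<psi|phi>|^2) / 2 for real unit vectors."""
    overlap = sum(a * b for a, b in zip(psi, phi))
    return 0.5 * (1.0 + overlap ** 2)

def qnn_unit(x, w):
    """One SWAP-test 'neuron': inverting the affine map recovers the
    overlap term |<w|x>|^2, i.e. a quadratic activation of w^T x."""
    x, w = normalise(x), normalise(w)
    return 2.0 * swap_test_prob(x, w) - 1.0
```

Because each unit computes only a squared inner product of normalised vectors, a network of such units inherits the expressivity limits of two-layer quadratic networks, which is where the parity-check failure discussed above comes from.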
Updated: 2025-06-20 12:05:31
Categories: quant-ph,cs.ET,cs.LG
Can LLMs Hack Enterprise Networks? Autonomous Assumed Breach Penetration-Testing Active Directory Networks
Penetration-testing, while critical for validating defenses and uncovering vulnerabilities, is often limited by high operational costs and the scarcity of human expertise. This paper investigates the feasibility and effectiveness of using Large Language Model (LLM)-driven autonomous systems to address these challenges in real-world Microsoft Active Directory (AD) enterprise networks. Our novel prototype, cochise, represents the first demonstration of a fully autonomous, LLM-driven framework capable of compromising accounts within a real-life Microsoft AD testbed (GOAD). The evaluation deliberately utilizes GOAD to capture the intricate interactions and sometimes nondeterministic outcomes of live network pen-testing, moving beyond the limitations of synthetic benchmarks. We perform our empirical evaluation using five LLMs, comparing reasoning to non-reasoning models as well as including open-weight models. Through comprehensive quantitative and qualitative analysis, incorporating insights from cybersecurity experts, we demonstrate that autonomous LLMs can effectively conduct Assumed Breach simulations. Key findings highlight their ability to dynamically adapt attack strategies, perform inter-context attacks, and generate scenario-specific attack parameters. Cochise also exhibits robust self-correction mechanisms, automatically installing missing tools and rectifying invalid command generations. Critically, we find that the associated costs are competitive with those incurred by professional pen-testers, suggesting a path toward democratizing access to essential security testing for organizations with budgetary constraints. However, our research also illuminates existing limitations, including instances of LLM ``going down rabbit holes'', challenges in comprehensive information transfer between planning and execution modules, and critical safety concerns that necessitate human oversight.
Updated: 2025-06-20 12:02:54
Categories: cs.CR
Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.
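As a concrete reference point for what the learned policy must approximate, here is an exact brute-force GTSP solver for tiny instances (exponential time, illustration only; the paper's point is precisely that such exact enumeration does not scale to real-time use):

```python
import math
from itertools import permutations, product

def gtsp_brute_force(clusters):
    """Exact solver for tiny GTSP instances: pick one point per cluster
    and order the clusters to minimise the closed-tour length.

    clusters: list of lists of (x, y) points.
    Returns (best_length, best_tour).
    """
    def tour_length(pts):
        # closed tour: return to the starting point
        return sum(math.dist(pts[i], pts[(i + 1) % len(pts)])
                   for i in range(len(pts)))

    best_len, best_tour = float("inf"), None
    for order in permutations(range(len(clusters))):
        for choice in product(*(clusters[i] for i in order)):
            length = tour_length(choice)
            if length < best_len:
                best_len, best_tour = length, list(choice)
    return best_len, best_tour
```

The search space is m! * prod(|cluster_i|) tours for m clusters, which motivates learned policies that emit a high-quality tour in a single forward pass.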
Updated: 2025-06-20 11:51:52
Categories: cs.AI,cs.RO
A deep learning and machine learning approach to predict neonatal death in the context of São Paulo
Neonatal death is still a concerning reality for underdeveloped and even some developed countries. According to Macro Trades, worldwide data indicate that 26.693 babies die for every 1,000 births. To reduce this number, early identification of at-risk babies is crucial: it creates the opportunity to provide adequate care for the child and mother so that early child death can be avoided. In this context, machine learning was used to determine whether a newborn baby is at risk. To train the predictive models, historical data on 1.4 million newborns were used. Machine learning and deep learning techniques, namely logistic regression, K-nearest neighbors, a random forest classifier, extreme gradient boosting (XGBoost), a convolutional neural network, and long short-term memory (LSTM), were applied to the dataset to identify the most accurate model for predicting neonatal mortality. Among the machine learning algorithms, XGBoost and the random forest classifier achieved the best accuracy at 94%, while among the deep learning models, LSTM delivered the highest accuracy at 99%. Therefore, LSTM appears to be the most suitable approach for predicting whether precautionary measures for a child are necessary.
Updated: 2025-06-20 11:44:48
Categories: cs.LG,cs.AI
Single-shot thermometry of simulated Bose--Einstein condensates using artificial intelligence
Precise determination of thermodynamic parameters in ultracold Bose gases remains challenging due to the destructive nature of conventional measurement techniques and inherent experimental uncertainties. We demonstrate an artificial intelligence approach for rapid, non-destructive estimation of the chemical potential and temperature from single-shot, in situ imaged density profiles of finite-temperature Bose gases. Our convolutional neural network is trained exclusively on quasi-2D `pancake' condensates in harmonic trap configurations. It achieves parameter extraction within fractions of a second. The model also demonstrates zero-shot generalisation across both trap geometry and thermalisation dynamics, successfully estimating thermodynamic parameters for toroidally trapped condensates with errors of only a few nanokelvin despite no prior exposure to such geometries during training, and maintaining predictive accuracy during dynamic thermalisation processes after a relatively brief evolution without explicit training on non-equilibrium states. These results suggest that supervised learning can overcome traditional limitations in ultracold atom thermometry, with extension to broader geometric configurations, temperature ranges, and additional parameters potentially enabling comprehensive real-time analysis of quantum gas experiments. Such capabilities could significantly streamline experimental workflows whilst improving measurement precision across a range of quantum fluid systems.
Updated: 2025-06-20 11:36:15
Categories: cond-mat.quant-gas,cs.AI,physics.comp-ph
Real-Time Black-Box Optimization for Dynamic Discrete Environments Using Embedded Ising Machines
Many real-time systems require the optimization of discrete variables. Black-box optimization (BBO) algorithms and multi-armed bandit (MAB) algorithms perform optimization by repeatedly taking actions and observing the corresponding instant rewards without any prior knowledge. Recently, a BBO method using an Ising machine has been proposed to find the best action that is represented by a combination of discrete values and maximizes the instant reward in static environments. In contrast, dynamic environments, where real-time systems operate, necessitate MAB algorithms that maximize the average reward over multiple trials. However, due to the enormous number of actions resulting from the combinatorial nature of discrete optimization, conventional MAB algorithms cannot effectively optimize dynamic, discrete environments. Here, we show a heuristic MAB method for dynamic, discrete environments by extending the BBO method, in which an Ising machine effectively explores the actions while considering interactions between variables and changes in dynamic environments. We demonstrate the dynamic adaptability of the proposed method in a wireless communication system with moving users.
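In lieu of hardware, the embedded Ising machine's role (finding a low-energy binary assignment under the current surrogate model of the reward) can be mimicked with simulated annealing on a QUBO. The toy solver below is an illustrative stand-in, not the paper's device or its reward-model update:

```python
import math
import random

def anneal_qubo(Q, n_vars, sweeps=2000, t0=2.0, t1=0.01, seed=0):
    """Simulated-annealing stand-in for an embedded Ising machine:
    minimise E(x) = sum over (i, j) of Q[(i, j)] * x_i * x_j, x_i in {0, 1}.

    Q: dict mapping (i, j) index pairs (i <= j) to coefficients; diagonal
       entries (i, i) act as linear terms since x_i^2 = x_i.
    Returns (best_energy, best_assignment).
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_vars)]

    def energy(state):
        return sum(c * state[i] * state[j] for (i, j), c in Q.items())

    e = energy(x)
    best_e, best_x = e, x[:]
    for step in range(sweeps):
        # geometric cooling from t0 down to t1
        t = t0 * (t1 / t0) ** (step / max(1, sweeps - 1))
        i = rng.randrange(n_vars)
        x[i] ^= 1                      # propose a single-bit flip
        e_new = energy(x)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            e = e_new                  # accept the flip
            if e < best_e:
                best_e, best_x = e, x[:]
        else:
            x[i] ^= 1                  # reject: undo the flip
    return best_e, best_x
```

In the bandit loop, Q would be re-estimated from observed rewards after each trial and the solver queried again, so the quadratic terms capture the interactions between discrete variables that defeat conventional per-arm MAB algorithms.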
Updated: 2025-06-20 11:31:43
Fields: cs.AI,cs.ET,I.2.8
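In software, the Ising machine in the abstract above can be imitated with simulated annealing over a QUBO objective: in the bandit setting, Q would hold running estimates of per-variable and pairwise reward contributions, and the sampler proposes the next combinatorial action. A minimal sketch under those assumptions (the matrix, schedule, and single-bit-flip proposal are illustrative, not the paper's formulation):

```python
import math
import random

def qubo_value(Q, s):
    # s^T Q s for a binary vector s; Q holds estimated reward contributions
    n = len(s)
    return sum(Q[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def anneal_select(Q, steps=2000, t0=2.0, seed=0):
    # simulated annealing as a software stand-in for an Ising machine:
    # find a binary action vector that (approximately) maximizes s^T Q s
    rng = random.Random(seed)
    n = len(Q)
    s = [rng.randint(0, 1) for _ in range(n)]
    v = qubo_value(Q, s)
    best_s, best_v = list(s), v
    for t in range(steps):
        i = rng.randrange(n)
        s[i] ^= 1                         # propose a single-bit flip
        nv = qubo_value(Q, s)
        temp = t0 * (1 - t / steps) + 1e-9
        if nv >= v or rng.random() < math.exp((nv - v) / temp):
            v = nv                        # accept uphill always, downhill sometimes
            if v > best_v:
                best_s, best_v = list(s), v
        else:
            s[i] ^= 1                     # reject: undo the flip
    return best_s, best_v
```

In the full bandit loop, each selected action would be executed, the observed reward would update Q, and the annealer would be re-run, letting the exploration track a drifting environment.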
A Neural Operator based Hybrid Microscale Model for Multiscale Simulation of Rate-Dependent Materials
The behavior of materials is influenced by a wide range of phenomena occurring across various time and length scales. To better understand the impact of microstructure on macroscopic response, multiscale modeling strategies are essential. Numerical methods, such as the $\text{FE}^2$ approach, account for micro-macro interactions to predict the global response in a concurrent manner. However, these methods are computationally intensive due to the repeated evaluations of the microscale. This challenge has led to the integration of deep learning techniques into computational homogenization frameworks to accelerate multiscale simulations. In this work, we employ neural operators to predict the microscale physics, resulting in a hybrid model that combines data-driven and physics-based approaches. This allows for physics-guided learning and provides flexibility for different materials and spatial discretizations. We apply this method to time-dependent solid mechanics problems involving viscoelastic material behavior, where the state is represented by internal variables only at the microscale. The constitutive relations of the microscale are incorporated into the model architecture and the internal variables are computed based on established physical principles. The results for homogenized stresses ($<6\%$ error) show that the approach is computationally efficient ($\sim 100 \times$ faster).
Updated: 2025-06-20 11:25:26
Fields: physics.comp-ph,cs.CE,cs.LG
Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs
Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP, and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs, and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.
Updated: 2025-06-20 11:24:24
Fields: cs.AI,cs.LG
POV Learning: Individual Alignment of Multimodal Models using Human Perception
Aligning machine learning systems with human expectations is mostly attempted by training with manually vetted human behavioral samples, typically explicit feedback. This is done at the population level, since the subjective Point-Of-View (POV) of a concrete person in a specific situational context is not retained in the data. However, we argue that alignment on an individual level can considerably boost the subjective predictive performance for the individual user interacting with the system. Since perception differs for each person, the same situation is observed differently. Consequently, the basis for decision making and the subsequent reasoning processes and observable reactions differ. We hypothesize that individual perception patterns can be used for improving the alignment on an individual level. We test this by integrating perception information into machine learning systems and measuring their predictive performance with respect to individual subjective assessments. For our empirical study, we collect a novel data set of multimodal stimuli and corresponding eye tracking sequences for the novel task of Perception-Guided Crossmodal Entailment and tackle it with our Perception-Guided Multimodal Transformer. Our findings suggest that exploiting individual perception signals for the machine learning of subjective human assessments provides a valuable cue for individual alignment. It not only improves the overall predictive performance from the point-of-view of the individual user but might also contribute to steering AI systems towards every person's individual expectations and values.
Updated: 2025-06-20 11:13:31
Fields: cs.AI
From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts
Sample efficiency is a crucial property of language models with practical implications for training efficiency. In real-world text, information follows a long-tailed distribution. Yet, we expect models to learn and recall frequent and infrequent facts. Sample-efficient models are better equipped to handle this challenge of learning and retaining rare information without requiring excessive exposure. This study analyzes multiple models of varying architectures and sizes, all trained on the same pre-training data. By annotating relational facts with their frequencies in the training corpus, we examine how model performance varies with fact frequency. Our findings show that most models perform similarly on high-frequency facts but differ notably on low-frequency facts. This analysis provides new insights into the relationship between model architecture, size, and factual learning efficiency.
Updated: 2025-06-20 11:10:24
Fields: cs.CL,cs.LG
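The frequency-stratified evaluation described above amounts to bucketing per-fact correctness by the order of magnitude of each fact's corpus frequency; a minimal sketch (the record format and bucket scheme are assumptions, not the paper's exact protocol):

```python
import math
from collections import defaultdict

def accuracy_by_frequency(records, base=10):
    # records: (fact_frequency_in_corpus, model_was_correct) pairs;
    # bucket 0 covers frequencies 1-9, bucket 1 covers 10-99, and so on
    buckets = defaultdict(lambda: [0, 0])
    for freq, correct in records:
        b = int(math.log(max(freq, 1), base))
        buckets[b][0] += int(correct)
        buckets[b][1] += 1
    return {b: hits / total for b, (hits, total) in sorted(buckets.items())}
```

Comparing such per-bucket accuracy curves across models surfaces exactly the effect the abstract reports: curves converge on high-frequency buckets and diverge on low-frequency ones.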
A Large-Scale Real-World Evaluation of LLM-Based Virtual Teaching Assistant
Virtual Teaching Assistants (VTAs) powered by Large Language Models (LLMs) have the potential to enhance student learning by providing instant feedback and facilitating multi-turn interactions. However, empirical studies on their effectiveness and acceptance in real-world classrooms are limited, leaving their practical impact uncertain. In this study, we develop an LLM-based VTA and deploy it in an introductory AI programming course with 477 graduate students. To assess how student perceptions of the VTA's performance evolve over time, we conduct three rounds of comprehensive surveys at different stages of the course. Additionally, we analyze 3,869 student-VTA interaction pairs to identify common question types and engagement patterns. We then compare these interactions with traditional student-human instructor interactions to evaluate the VTA's role in the learning process. Through a large-scale empirical study and interaction analysis, we assess the feasibility of deploying VTAs in real-world classrooms and identify key challenges for broader adoption. Finally, we release the source code of our VTA system, fostering future advancements in AI-driven education: https://github.com/sean0042/VTA.
Updated: 2025-06-20 10:59:57
Fields: cs.CY,cs.AI
Graph is all you need? Lightweight data-agnostic neural architecture search without training
Neural architecture search (NAS) enables the automatic design of neural network models. However, training the candidates generated by the search algorithm for performance evaluation incurs considerable computational overhead. Our method, dubbed nasgraph, remarkably reduces the computational costs by converting neural architectures to graphs and using the average degree, a graph measure, as the proxy in lieu of the evaluation metric. Our training-free NAS method is data-agnostic and lightweight. It can find the best architecture among 200 randomly sampled architectures from NAS-Bench201 in 217 CPU seconds. Besides, our method is able to achieve competitive performance on various datasets including NASBench-101, NASBench-201, and NDS search spaces. We also demonstrate that nasgraph generalizes to more challenging tasks on Micro TransNAS-Bench-101.
Updated: 2025-06-20 10:58:04
Fields: cs.LG
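The average-degree proxy itself is trivial to compute once an architecture has been converted to a graph; the conversion is the substantive part of nasgraph and is not reproduced here. A toy ranking over two hypothetical cell graphs:

```python
def average_degree(edges, n_nodes):
    # training-free proxy: average degree of the architecture graph,
    # i.e. 2|E| / |V| when the graph is viewed as undirected
    return 2 * len(edges) / n_nodes

# two toy "cells": nodes are feature maps, edges are operations between them
cell_a = {"n": 4, "edges": [(0, 1), (0, 2), (1, 3), (2, 3)]}
cell_b = {"n": 4, "edges": [(0, 1), (1, 2), (2, 3)]}

# rank candidates by the proxy instead of training and evaluating each one
best = max([cell_a, cell_b], key=lambda c: average_degree(c["edges"], c["n"]))
```

Scoring a candidate this way requires no data and no gradient steps, which is what makes screening hundreds of architectures in CPU seconds feasible.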
RCNet: $ΔΣ$ IADCs as Recurrent AutoEncoders
This paper proposes a deep learning model (RCNet) for Delta-Sigma ($\Delta\Sigma$) ADCs. Recurrent Neural Networks (RNNs) make it possible to describe both modulators and filters. This analogy is applied to Incremental ADCs (IADC). High-end optimizers combined with full-custom losses are used to define additional hardware design constraints: quantized weights, signal saturation, temporal noise injection, device area. Focusing on DC conversion, our early results demonstrate that $SNR$ defined as an Effective Number Of Bits (ENOB) can be optimized under a certain hardware mapping complexity. The proposed RCNet succeeded in providing design tradeoffs in terms of $SNR$ ($>$13bit) versus area constraints ($<$14pF total capacitor) at a given $OSR$ (80 samples). Interestingly, it appears that the best RCNet architectures do not necessarily rely on high-order modulators, leveraging additional topology exploration degrees of freedom.
Updated: 2025-06-20 10:55:01
Fields: cs.AR,cs.LG
On Almost Surely Safe Alignment of Large Language Models at Inference-Time
We introduce a novel inference-time alignment approach for LLMs that aims to generate safe responses almost surely, i.e., with probability approaching one. Our approach models the generation of safe responses as a constrained Markov Decision Process (MDP) within the LLM's latent space. We augment a safety state that tracks the evolution of safety constraints and dynamically penalize unsafe generations to ensure the generation of safe responses. Consequently, we demonstrate formal safety guarantees w.r.t. the given cost model upon solving the MDP in the latent space with sufficiently large penalties. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses. Our findings contribute to the advancement of safer LLM deployment through alignment at inference-time, thus presenting a promising alternative to resource-intensive, overfitting-prone alignment techniques like RLHF.
Updated: 2025-06-20 10:54:05
Fields: cs.LG,cs.CL
Towards Effective Complementary Security Analysis using Large Language Models
A key challenge in security analysis is the manual evaluation of potential security weaknesses generated by static application security testing (SAST) tools. Numerous false positives (FPs) in these reports reduce the effectiveness of security analysis. We propose using Large Language Models (LLMs) to improve the assessment of SAST findings. We investigate the ability of LLMs to reduce FPs while trying to maintain a perfect true positive rate, using datasets extracted from the OWASP Benchmark (v1.2) and a real-world software project. Our results indicate that advanced prompting techniques, such as Chain-of-Thought and Self-Consistency, substantially improve FP detection. Notably, some LLMs identified approximately 62.5% of FPs in the OWASP Benchmark dataset without missing genuine weaknesses. Combining detections from different LLMs would increase this FP detection to approximately 78.9%. Additionally, we demonstrate our approach's generalizability using a real-world dataset covering five SAST tools, three programming languages, and infrastructure files. The best LLM detected 33.85% of all FPs without missing genuine weaknesses, while combining detections from different LLMs would increase this detection to 38.46%. Our findings highlight the potential of LLMs to complement traditional SAST tools, enhancing automation and reducing resources spent addressing false alarms.
Updated: 2025-06-20 10:46:35
Fields: cs.CR,cs.AI
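One way the Self-Consistency prompting mentioned above could be wired for FP triage is to sample several chain-of-thought verdicts and vote, discarding a finding only on unanimous agreement so that genuine weaknesses are never silently dropped. A sketch with a stubbed `ask_llm` (the prompt, labels, and the conservative rule are assumptions, not the paper's exact protocol):

```python
from collections import Counter

def assess_finding(finding, ask_llm, k=5):
    # self-consistency: sample k chain-of-thought verdicts and vote
    prompt = ("Is this SAST finding a false positive? "
              "Think step by step, then answer FP or TP.\n" + finding)
    votes = [ask_llm(prompt) for _ in range(k)]
    majority = Counter(votes).most_common(1)[0][0]
    # conservative rule: discard only when *every* sample says FP, so that
    # genuine weaknesses (true positives) are never silently dropped
    decision = "discard" if votes.count("FP") == k else "keep"
    return decision, majority
```

The same wrapper also supports the abstract's model-combination idea: running it with several different `ask_llm` backends and discarding only findings that every model unanimously flags.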
AI's Blind Spots: Geographic Knowledge and Diversity Deficit in Generated Urban Scenario
Image generation models are revolutionizing many domains, and urban analysis and design is no exception. While such models are widely adopted, there is limited literature exploring their geographic knowledge, along with the biases they embed. In this work, we generated 150 synthetic images for each state in the USA and related capitals using FLUX 1 and Stable Diffusion 3.5, two state-of-the-art models for image generation. We embed each image using DINO-v2 ViT-S/14 and use the Fréchet Inception Distance to measure the similarity between the generated images. We found that while these models have implicitly learned aspects of USA geography, if we prompt the models to generate an image for "United States" instead of specific cities or states, the models exhibit a strong representative bias toward metropolis-like areas, excluding rural states and smaller cities. In addition, we found that models systematically exhibit some entity-disambiguation issues with European-sounding names like Frankfort or Devon.
Updated: 2025-06-20 10:43:22
Fields: cs.AI,cs.CV,cs.CY
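The similarity measurement above (Fréchet distance between embedding distributions) can be sketched under a diagonal-covariance simplification; the full FID additionally requires the matrix square root of the covariances, which this illustrative version deliberately avoids:

```python
import math
from statistics import mean, pvariance

def frechet_diag(X, Y):
    # Frechet distance between two sets of embedding vectors under a
    # diagonal-covariance Gaussian fit: per dimension, squared mean gap
    # plus v1 + v2 - 2*sqrt(v1*v2); the full FID replaces the last term
    # with a matrix square root over the full covariances
    d = len(X[0])
    total = 0.0
    for j in range(d):
        xs = [x[j] for x in X]
        ys = [y[j] for y in Y]
        m1, m2 = mean(xs), mean(ys)
        v1, v2 = pvariance(xs), pvariance(ys)
        total += (m1 - m2) ** 2 + v1 + v2 - 2 * math.sqrt(v1 * v2)
    return total
```

Applied to DINO-v2 embeddings of the per-state image sets, pairwise distances of this kind yield the similarity structure the abstract analyzes.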
With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You
Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples, less than $1\%$ of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvement of $51.6\%$ in classification and $91.8\%$ in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
Updated: 2025-06-20 10:32:54
Fields: cs.CV,cs.AI,cs.LG
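One plausible instantiation of a regularizer that "preserves the neighborhood geometry of the latent space" is to penalize changes in pairwise cosine similarities relative to the frozen unimodal encoder; the exact STRUCTURE objective may differ, so treat this as an illustrative sketch:

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def structure_penalty(z_aligned, z_frozen):
    # penalize distortion of the pairwise-similarity (neighborhood) structure
    # of the pretrained unimodal encoder's latent space after alignment
    n = len(z_aligned)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            loss += (cosine(z_aligned[i], z_aligned[j])
                     - cosine(z_frozen[i], z_frozen[j])) ** 2
    return loss / (n * (n - 1) / 2)
```

Note that cosine similarity is scale-invariant, so a uniform rescaling of the aligned embeddings incurs no penalty; only genuine distortions of the neighborhood structure do.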
LearnAlign: Reasoning Data Selection for Reinforcement Learning in Large Language Models Based on Improved Gradient Alignment
Reinforcement learning (RL) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we present a novel gradient-alignment-based method, named LearnAlign, which intelligently selects learnable and representative reasoning training data for RL post-training. To overcome the issue of response-length bias in gradient norms, we introduce data learnability based on the success rate, which can indicate the learning potential of each data point. Experiments across three mathematical reasoning benchmarks demonstrate that our method significantly reduces training data requirements while incurring only minor performance degradation or even improving performance compared to full-data training. For example, it reduces data requirements by up to 1,000 data points with better performance (77.53%) than full-dataset training on the GSM8K benchmark (77.04%). Furthermore, we show its effectiveness in the staged RL setting. This work provides valuable insights into data-efficient RL post-training and establishes a foundation for future research in optimizing reasoning data selection. To facilitate future work, we will release our code.
Updated: 2025-06-20 10:31:36
Fields: cs.LG,cs.AI
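A toy reading of success-rate-based learnability is that prompts the model sometimes solves (success rate near 0.5) carry the most learning signal, while always- or never-solved prompts carry little; the paper's exact definition may differ. A sketch of data selection under that assumption:

```python
def learnability(success_rate):
    # toy score peaking at p = 0.5: items the model sometimes solves carry
    # the most learning signal; always-solved or never-solved items carry none
    p = success_rate
    return p * (1.0 - p)

def select_for_rl(prompts_with_rates, k):
    # keep the k most learnable prompts for RL post-training
    ranked = sorted(prompts_with_rates,
                    key=lambda pr: learnability(pr[1]), reverse=True)
    return [prompt for prompt, _ in ranked[:k]]
```

Scoring by success rate rather than raw gradient norm sidesteps the response-length bias the abstract mentions, since the rate is length-independent by construction.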
Tracker Installations Are Not Created Equal: Understanding Tracker Configuration of Form Data Collection
Targeted advertising is fueled by the comprehensive tracking of users' online activity. As a result, advertising companies, such as Google and Meta, encourage website administrators to not only install tracking scripts on their websites but configure them to automatically collect users' Personally Identifying Information (PII). In this study, we aim to characterize how Google and Meta's trackers can be configured to collect PII data from web forms. We first perform a qualitative analysis of how third parties present form data collection to website administrators in the documentation and user interface. We then perform a measurement study of 40,150 websites to quantify the prevalence and configuration of Google and Meta trackers. Our results reveal that both Meta and Google encourage the use of form data collection and include inaccurate statements about hashing PII as a privacy-preserving method. Additionally, we find that Meta includes configuring form data collection as part of the basic setup flow. Our large-scale measurement study reveals that while Google trackers are more prevalent than Meta trackers (72.6% vs. 28.2% of websites), Meta trackers are configured to collect form data more frequently (11.6% vs. 62.3%). Finally, we identify sensitive finance and health websites that have installed trackers that are likely configured to collect form data PII in violation of Meta and Google policies. Our study highlights how tracker documentation and interfaces can potentially play a role in users' privacy through the configuration choices made by the website administrators who install trackers.
Updated: 2025-06-20 10:29:13
Fields: cs.CR
From Lab to Factory: Pitfalls and Guidelines for Self-/Unsupervised Defect Detection on Low-Quality Industrial Images
The detection and localization of quality-related problems in industrially mass-produced products have historically relied on manual inspection, which is costly and error-prone. Machine learning has the potential to replace manual handling. As such, the desire is to facilitate an unsupervised (or self-supervised) approach, as it is often impossible to specify all conceivable defects ahead of time. A plethora of prior works have demonstrated the aptitude of common reconstruction-, embedding-, and synthesis-based methods in laboratory settings. However, in practice, we observe that most methods do not handle low data quality well or exhibit low robustness in unfavorable, but typical, real-world settings. For practitioners it may be very difficult to identify the actual underlying problem when such methods underperform. Worse, often-reported metrics (e.g., AUROC) are rarely suitable in practice and may give misleading results. In our setting, we attempt to identify subtle anomalies on the surface of blasted forged metal parts, using rather low-quality RGB imagery only, which is a common industrial setting. We specifically evaluate two types of state-of-the-art models that allow us to identify and improve quality issues in production data, without having to obtain new data. Our contribution is to provide guardrails for practitioners that allow them to reliably identify problems related to, e.g., (lack of) robustness or invariance, in either the chosen model or the data in similar scenarios. Furthermore, we exemplify common pitfalls in and shortcomings of likelihood-based approaches and outline a framework for proper empirical risk estimation that is more suitable for real-world scenarios.
Updated: 2025-06-20 10:28:00
Fields: cs.LG,cs.CV,stat.AP,62-06,G.3; I.4; I.5
Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models
This paper investigates the role of dynamic external knowledge integration in improving counter-argument generation using Large Language Models (LLMs). While LLMs have shown promise in argumentative tasks, their tendency to generate lengthy, potentially unfactual responses highlights the need for more controlled and evidence-based approaches. We introduce a new manually curated dataset of argument and counter-argument pairs specifically designed to balance argumentative complexity with evaluative feasibility. We also propose a new LLM-as-a-Judge evaluation methodology that shows a stronger correlation with human judgments compared to traditional reference-based metrics. Our experimental results demonstrate that integrating dynamic external knowledge from the web significantly improves the quality of generated counter-arguments, particularly in terms of relatedness, persuasiveness, and factuality. The findings suggest that combining LLMs with real-time external knowledge retrieval offers a promising direction for developing more effective and reliable counter-argumentation systems.
Updated: 2025-06-20 10:27:28
Fields: cs.CL,cs.AI
Stable Learning Using Spiking Neural Networks Equipped With Affine Encoders and Decoders
We study the learning problem associated with spiking neural networks. Specifically, we focus on spiking neural networks composed of simple spiking neurons having only positive synaptic weights, equipped with an affine encoder and decoder; we refer to these as affine spiking neural networks. These neural networks are shown to depend continuously on their parameters, which facilitates classical covering number-based generalization statements and supports stable gradient-based training. We demonstrate that the positivity of the weights enables a wide range of expressivity results, including rate-optimal approximation of smooth functions and dimension-independent approximation of Barron regular functions. In particular, we show in theory and simulations that affine spiking neural networks are capable of approximating shallow ReLU neural networks. Furthermore, we apply these affine spiking neural networks to standard machine learning benchmarks and reach competitive results. Finally, we observe that from a generalization perspective, contrary to feedforward neural networks or previous results for general spiking neural networks, the depth has little to no adverse effect on the generalization capabilities.
Updated: 2025-06-20 10:27:12
Fields: cs.NE,cs.LG,math.FA,stat.ML
Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon through the observation that FT often occurs in intrinsically low-dimensional spaces. Leveraging the low intrinsic dimensionality of FT, we analyze W2S in the ridgeless regression setting from a variance reduction perspective. For a strong student-weak teacher pair with sufficiently expressive low-dimensional feature subspaces $\mathcal{V}_s, \mathcal{V}_w$, we provide an exact characterization of the variance that dominates the generalization error of W2S. This unveils a virtue of discrepancy between the strong and weak models in W2S: the variance of the weak teacher is inherited by the strong student in $\mathcal{V}_s \cap \mathcal{V}_w$, while reduced by a factor of $\mathrm{dim}(\mathcal{V}_s)/N$ in the subspace of discrepancy $\mathcal{V}_w \setminus \mathcal{V}_s$ with $N$ pseudo-labels for W2S. Our analysis further casts light on the sample complexities and the scaling of performance gap recovery in W2S. The analysis is supported by experiments on synthetic regression problems, as well as real vision and NLP tasks.
Updated: 2025-06-20 10:26:36
Fields: cs.LG,cs.NA,math.NA,stat.ML
PR-Attack: Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of applications, e.g., medical question-answering, mathematical sciences, and code generation. However, they also exhibit inherent limitations, such as outdated knowledge and susceptibility to hallucinations. Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to address these issues, but it also introduces new vulnerabilities. Recent efforts have focused on the security of RAG-based LLMs, yet existing attack methods face three critical challenges: (1) their effectiveness declines sharply when only a limited number of poisoned texts can be injected into the knowledge database, (2) they lack sufficient stealth, as the attacks are often detectable by anomaly detection systems, which compromises their effectiveness, and (3) they rely on heuristic approaches to generate poisoned texts, lacking formal optimization frameworks and theoretic guarantees, which limits their effectiveness and applicability. To address these issues, we propose coordinated Prompt-RAG attack (PR-attack), a novel optimization-driven attack that introduces a small number of poisoned texts into the knowledge database while embedding a backdoor trigger within the prompt. When activated, the trigger causes the LLM to generate pre-designed responses to targeted queries, while maintaining normal behavior in other contexts. This ensures both high effectiveness and stealth. We formulate the attack generation process as a bilevel optimization problem leveraging a principled optimization framework to develop optimal poisoned texts and triggers. Extensive experiments across diverse LLMs and datasets demonstrate the effectiveness of PR-Attack, achieving a high attack success rate even with a limited number of poisoned texts and significantly improved stealth compared to existing methods.
Updated: 2025-06-20 10:23:40
Subjects: cs.CR,cs.AI
The Importance of Being Lazy: Scaling Limits of Continual Learning
Despite recent efforts, neural networks still struggle to learn in non-stationary environments, and our understanding of catastrophic forgetting (CF) is far from complete. In this work, we perform a systematic study on the impact of model scale and the degree of feature learning in continual learning. We reconcile existing contradictory observations on scale in the literature, by differentiating between lazy and rich training regimes through a variable parameterization of the architecture. We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness. Using the framework of dynamical mean field theory, we then study the infinite width dynamics of the model in the feature learning regime and characterize CF, extending prior theoretical results limited to the lazy regime. We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks. We identify a transition modulated by task similarity where the model exits an effectively lazy regime with low forgetting to enter a rich regime with significant forgetting. Finally, our findings reveal that neural networks achieve optimal performance at a critical level of feature learning, which depends on task non-stationarity and transfers across model scales. This work provides a unified perspective on the role of scale and feature learning in continual learning.
Updated: 2025-06-20 10:12:38
Subjects: cs.LG,cs.AI,stat.ML
Efficient Feedback Gate Network for Hyperspectral Image Super-Resolution
Even without auxiliary images, single hyperspectral image super-resolution (SHSR) methods can be designed to improve the spatial resolution of hyperspectral images. However, failing to thoroughly exploit inter-band coherence and spatial-spectral information limits the performance of SHSR. In this study, we propose a novel group-based SHSR method, termed the efficient feedback gate network, which uses various feedback and gate operations involving large-kernel convolutions and spectral interactions. In particular, by providing different guidance for neighboring groups, we can learn rich band information and hierarchical hyperspectral spatial information using channel shuffling and dilated convolution in the shuffled and progressive dilated fusion module (SPDFM). Moreover, we develop a wide-bound perception gate block and a spectrum enhancement gate block to construct the spatial-spectral reinforcement gate module (SSRGM) and obtain highly representative spatial-spectral features efficiently. Additionally, we apply a three-dimensional SSRGM to enhance holistic information and coherence for hyperspectral data. The experimental results on three hyperspectral datasets demonstrate the superior performance of the proposed network over state-of-the-art methods in terms of spectral fidelity and spatial content reconstruction.
Updated: 2025-06-20 10:06:48
Subjects: eess.IV,cs.CV,cs.LG
A Statistical Evaluation of Indoor LoRaWAN Environment-Aware Propagation for 6G: MLR, ANOVA, and Residual Distribution Analysis
Modeling path loss in indoor LoRaWAN technology deployments is inherently challenging due to structural obstructions, occupant density and activities, and fluctuating environmental conditions. This study proposes a two-stage approach to capture and analyze these complexities using an extensive dataset of 1,328,334 field measurements collected over six months in a single-floor office at the University of Siegen's Hoelderlinstrasse Campus, Germany. First, we implement a multiple linear regression model that includes traditional propagation metrics (distance, structural walls) and an extension with proposed environmental variables (relative humidity, temperature, carbon dioxide, particulate matter, and barometric pressure). Using analysis of variance, we demonstrate that adding these environmental factors can reduce unexplained variance by 42.32 percent. Secondly, we examine residual distributions by fitting five candidate probability distributions: Normal, Skew-Normal, Cauchy, Student's t, and Gaussian Mixture Models (GMMs) with 2 to 5 components. Our results show that a four-component Gaussian Mixture Model captures the residual heterogeneity of indoor signal propagation most accurately, significantly outperforming single-distribution approaches. Given the push toward ultra-reliable, context-aware communications in 6G networks, our analysis shows that environment-aware modeling can substantially improve LoRaWAN network design in dynamic indoor IoT deployments.
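The headline ANOVA result above, a 42.32 percent drop in unexplained variance, reduces to a ratio of residual sums of squares between the baseline and the environment-extended regression. A minimal sketch (the function name and the example values are illustrative, not taken from the paper):

```python
def unexplained_variance_reduction(sse_baseline, sse_extended):
    """Percent of the baseline model's residual (unexplained) variance
    removed by adding the environmental regressors."""
    return 100.0 * (sse_baseline - sse_extended) / sse_baseline
```

For instance, a baseline SSE of 100 falling to an extended-model SSE of 57.68 corresponds to the reported 42.32 percent reduction.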
Updated: 2025-06-20 10:06:39
Subjects: cs.NI,cs.LG,eess.SP
Training Multi-Layer Binary Neural Networks With Local Binary Error Signals
Binary Neural Networks (BNNs) significantly reduce computational complexity and memory usage in machine and deep learning by representing weights and activations with just one bit. However, most existing training algorithms for BNNs rely on quantization-aware floating-point Stochastic Gradient Descent (SGD), limiting the full exploitation of binary operations to the inference phase only. In this work, we propose, for the first time, a fully binary and gradient-free training algorithm for multi-layer BNNs, eliminating the need for back-propagated floating-point gradients. Specifically, the proposed algorithm relies on local binary error signals and binary weight updates, employing integer-valued hidden weights that serve as a synaptic metaplasticity mechanism, thereby enhancing its neurobiological plausibility. Our proposed solution enables the training of binary multi-layer perceptrons by using exclusively XNOR, Popcount, and increment/decrement operations. Experimental results on multi-class classification benchmarks show test accuracy improvements of up to +35.47% over the only existing fully binary single-layer state-of-the-art solution. Compared to full-precision SGD, our solution improves test accuracy by up to +35.30% under the same total memory demand, while also reducing computational cost by two to three orders of magnitude in terms of the total number of Boolean gates. The proposed algorithm is made available to the scientific community as a public repository.
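The claim that training needs only XNOR, Popcount, and increment/decrement operations can be illustrated with a single binary neuron's forward pass. This is a hedged sketch in plain Python (the paper's kernels would operate on packed bit-words; the names here are illustrative):

```python
def binary_neuron(w_bits, x_bits):
    # XNOR: a position "matches" when weight bit and input bit agree.
    # Popcount: count the matching positions.
    matches = sum(1 for w, x in zip(w_bits, x_bits) if w == x)
    n = len(w_bits)
    # Binary activation: fire +1 when at least half the bits match, else -1.
    return 1 if 2 * matches >= n else -1
```

Local binary error signals would then drive increment/decrement updates on the integer-valued hidden weights, with no floating-point gradient anywhere in the loop.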
Updated: 2025-06-20 10:02:13
Subjects: cs.LG,cs.CV,I.2.6
Refining music sample identification with a self-supervised graph neural network
Automatic sample identification (ASID), the detection and identification of portions of audio recordings that have been reused in new musical works, is an essential but challenging task in the field of audio query-based retrieval. While a related task, audio fingerprinting, has made significant progress in accurately retrieving musical content under "real world" (noisy, reverberant) conditions, ASID systems struggle to identify samples that have undergone musical modifications. Thus, a system robust to common music production transformations such as time-stretching, pitch-shifting, effects processing, and underlying or overlaying music is an important open challenge. In this work, we propose a lightweight and scalable encoding architecture employing a Graph Neural Network within a contrastive learning framework. Our model uses only 9% of the trainable parameters compared to the current state-of-the-art system while achieving comparable performance, reaching a mean average precision (mAP) of 44.2%. To enhance retrieval quality, we introduce a two-stage approach consisting of an initial coarse similarity search for candidate selection, followed by a cross-attention classifier that rejects irrelevant matches and refines the ranking of retrieved candidates - an essential capability absent in prior models. In addition, because queries in real-world applications are often short in duration, we benchmark our system for short queries using new fine-grained annotations for the Sample100 dataset, which we publish as part of this work.
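The two-stage retrieval idea, a cheap coarse similarity search followed by a finer scorer that can also reject candidates, is independent of the specific encoders. A hedged toy sketch (the scoring callables stand in for the paper's GNN embedding similarity and cross-attention classifier):

```python
def two_stage_retrieval(query, database, coarse_score, rerank_score, k=10):
    # Stage 1: coarse similarity search keeps only the top-k candidates.
    candidates = sorted(database, key=lambda d: coarse_score(query, d),
                        reverse=True)[:k]
    # Stage 2: a finer scorer reranks candidates and rejects irrelevant matches.
    scored = [(d, rerank_score(query, d)) for d in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [d for d, s in scored if s > 0]
```

The rejection step (filtering on the rerank score) is what prior fingerprinting-style systems lack: they rank but never refuse a match.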
Updated: 2025-06-20 09:48:47
Subjects: cs.SD,cs.AI,cs.IR,H.5.5; I.2.6
Optimal Depth of Neural Networks
Determining the optimal depth of a neural network is a fundamental yet challenging problem, typically resolved through resource-intensive experimentation. This paper introduces a formal theoretical framework to address this question by recasting the forward pass of a deep network, specifically a Residual Network (ResNet), as an optimal stopping problem. We model the layer-by-layer evolution of hidden representations as a sequential decision process where, at each layer, a choice is made between halting computation to make a prediction or continuing to a deeper layer for a potentially more refined representation. This formulation captures the intrinsic trade-off between accuracy and computational cost. Our primary theoretical contribution is a proof that, under a plausible condition of diminishing returns on the residual functions, the expected optimal stopping depth is provably finite, even in an infinite-horizon setting. We leverage this insight to propose a novel and practical regularization term, $\mathcal{L}_{\rm depth}$, that encourages the network to learn representations amenable to efficient, early exiting. We demonstrate the generality of our framework by extending it to the Transformer architecture and exploring its connection to continuous-depth models via free-boundary problems. Empirical validation on ImageNet confirms that our regularizer successfully induces the theoretically predicted behavior, leading to significant gains in computational efficiency without compromising, and in some cases improving, final model accuracy.
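The optimal-stopping view can be made concrete with a toy dynamic: at each depth, weigh the (diminishing) gain in prediction quality against the cumulative compute cost, and exit at the depth with the best net value. This is an illustrative sketch, not the paper's formulation of $\mathcal{L}_{\rm depth}$:

```python
def optimal_exit_depth(confidences, cost_per_layer):
    # confidences[d-1]: prediction quality if we halt after layer d.
    # Net value = quality minus cumulative compute cost; return the argmax depth.
    best_depth, best_value = 0, float("-inf")
    for depth, conf in enumerate(confidences, start=1):
        value = conf - depth * cost_per_layer
        if value > best_value:
            best_depth, best_value = depth, value
    return best_depth
```

Under diminishing returns the marginal gain eventually drops below the per-layer cost, so the argmax is finite even for arbitrarily deep stacks, mirroring the paper's finite expected stopping depth.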
Updated: 2025-06-20 09:26:01
Subjects: cs.LG,math.OC
Towards Efficient Few-shot Graph Neural Architecture Search via Partitioning Gradient Contribution
To address the weight coupling problem, certain studies introduced few-shot Neural Architecture Search (NAS) methods, which partition the supernet into multiple sub-supernets. However, these methods often suffer from computational inefficiency and tend to provide suboptimal partitioning schemes. To address this problem more effectively, we analyze the weight coupling problem from a novel perspective, which primarily stems from distinct modules in succeeding layers imposing conflicting gradient directions on the preceding layer modules. Based on this perspective, we propose the Gradient Contribution (GC) method that efficiently computes the cosine similarity of gradient directions among modules by decomposing the Vector-Jacobian Product during supernet backpropagation. Subsequently, the modules with conflicting gradient directions are allocated to distinct sub-supernets while similar ones are grouped together. To assess the advantages of GC and address the limitations of existing Graph Neural Architecture Search methods, which are limited to searching a single type of Graph Neural Networks (Message Passing Neural Networks (MPNNs) or Graph Transformers (GTs)), we propose the Unified Graph Neural Architecture Search (UGAS) framework, which explores optimal combinations of MPNNs and GTs. The experimental results demonstrate that GC achieves state-of-the-art (SOTA) performance in supernet partitioning quality and time efficiency. In addition, the architectures searched by UGAS+GC outperform both the manually designed GNNs and those obtained by existing NAS methods. Finally, ablation studies further demonstrate the effectiveness of all proposed methods.
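The grouping step, where modules with conflicting gradient directions go to different sub-supernets, can be sketched with plain cosine similarity and a greedy assignment. (The paper computes these similarities efficiently via the decomposed Vector-Jacobian Product; this sketch simply assumes the per-module gradient vectors are given.)

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def partition_modules(grads, threshold=0.0):
    # Greedily place each module into the first group whose members it agrees
    # with (cosine >= threshold); otherwise open a new sub-supernet.
    groups = []
    for name, g in grads.items():
        for group in groups:
            if all(cosine(g, grads[m]) >= threshold for m in group):
                group.append(name)
                break
        else:
            groups.append([name])  # conflicting direction -> new sub-supernet
    return groups
```

Modules whose gradients point the same way share a sub-supernet, so their coupled weights no longer receive contradictory updates.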
Updated: 2025-06-20 09:18:32
Subjects: cs.LG,cs.AI
ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation
Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but permit semantically related but highly abstract or subjective text. These approaches lack the fine-grained ability to isolate the most concrete samples that provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, image caption concreteness, that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. Our approach leverages strong foundation models for measuring visual-semantic information loss in multimodal representations. We demonstrate that this strongly correlates with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: It succeeds in selecting the highest quality samples from multimodal web-scale datasets to allow for efficient training in resource-constrained settings.
Updated: 2025-06-20 09:17:43
Subjects: cs.LG,cs.CV
ParkFormer: A Transformer-Based Parking Policy with Goal Embedding and Pedestrian-Aware Control
Autonomous parking plays a vital role in intelligent vehicle systems, particularly in constrained urban environments where high-precision control is required. While traditional rule-based parking systems struggle with environmental uncertainties and lack adaptability in crowded or dynamic scenes, human drivers demonstrate the ability to park intuitively without explicit modeling. Inspired by this observation, we propose a Transformer-based end-to-end framework for autonomous parking that learns from expert demonstrations. The network takes as input surround-view camera images, goal-point representations, ego vehicle motion, and pedestrian trajectories. It outputs discrete control sequences including throttle, braking, steering, and gear selection. A novel cross-attention module integrates BEV features with target points, and a GRU-based pedestrian predictor enhances safety by modeling dynamic obstacles. We validate our method on the CARLA 0.9.14 simulator in both vertical and parallel parking scenarios. Experiments show our model achieves a high success rate of 96.57%, with average positional and orientation errors of 0.21 meters and 0.41 degrees, respectively. The ablation studies further demonstrate the effectiveness of key modules such as pedestrian prediction and goal-point attention fusion. The code and dataset will be released at: https://github.com/little-snail-f/ParkFormer.
Updated: 2025-06-20 09:14:09
Subjects: cs.CV,cs.AI
Anomaly Detection in Event-triggered Traffic Time Series via Similarity Learning
Time series analysis has achieved great success in cyber security such as intrusion detection and device identification. Learning similarities among multiple time series is a crucial problem since it serves as the foundation for downstream analysis. Due to the complex temporal dynamics of the event-triggered time series, it often remains unclear which similarity metric is appropriate for security-related tasks, such as anomaly detection and clustering. The overarching goal of this paper is to develop an unsupervised learning framework that is capable of learning similarities among a set of event-triggered time series. From the machine learning vantage point, the proposed framework harnesses the power of both hierarchical multi-resolution sequential autoencoders and the Gaussian Mixture Model (GMM) to effectively learn the low-dimensional representations from the time series. Finally, the obtained similarity measure can be easily visualized for the explanation. The proposed framework aspires to offer a stepping stone that gives rise to a systematic approach to model and learn similarities among a multitude of event-triggered time series. Through extensive qualitative and quantitative experiments, it is revealed that the proposed method outperforms state-of-the-art methods considerably.
Updated: 2025-06-20 09:09:04
Subjects: cs.LG
Sekai: A Video Dataset towards World Exploration
Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. And, we use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. The project page is https://lixsp11.github.io/sekai-project/.
Updated: 2025-06-20 09:03:18
Subjects: cs.CV,cs.AI
Reward-Agnostic Prompt Optimization for Text-to-Image Diffusion Models
We investigate a general approach for improving user prompts in text-to-image (T2I) diffusion models by finding prompts that maximize a reward function specified at test-time. Although diverse reward models are used for evaluating image generation, existing automated prompt engineering methods typically target specific reward configurations. Consequently, these specialized designs exhibit suboptimal performance when applied to new prompt engineering scenarios involving different reward models. To address this limitation, we introduce RATTPO (Reward-Agnostic Test-Time Prompt Optimization), a flexible test-time optimization method applicable across various reward scenarios without modification. RATTPO iteratively searches for optimized prompts by querying large language models (LLMs) without requiring reward-specific task descriptions. Instead, it uses the optimization trajectory and a novel reward-aware feedback signal (termed a "hint") as context. Empirical results demonstrate the versatility of RATTPO, effectively enhancing user prompts across diverse reward setups that assess various generation aspects, such as aesthetics, general human preference, or spatial relationships between objects. RATTPO surpasses other test-time search baselines in search efficiency, using up to 3.5 times less inference budget, and, given sufficient inference budget, achieves performance comparable to learning-based baselines that require reward-specific fine-tuning. The code is available at https://github.com/seminkim/RATTPO.
Updated: 2025-06-20 09:02:05
Subjects: cs.LG
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
Large Language Models (LLMs) demonstrate promising capabilities in solving scientific problems but often suffer from the issue of hallucination. While integrating LLMs with tools can mitigate this issue, models fine-tuned on tool usage become overreliant on them and incur unnecessary costs. Inspired by how human experts assess problem complexity before selecting solutions, we propose a novel two-component fine-tuning method, Adapting While Learning (AWL). In the first component, World Knowledge Learning (WKL), LLMs internalize scientific knowledge by learning from tool-generated solutions. In the second component, Tool Usage Adaptation (TUA), we categorize problems as easy or hard based on the model's accuracy, and train it to maintain direct reasoning for easy problems while switching to tools for hard ones. We validate our method on six scientific benchmark datasets across climate science, epidemiology, physics, and other domains. Compared to the original instruct model (8B), models post-trained with AWL achieve 29.11% higher answer accuracy and 12.72% better tool usage accuracy, even surpassing state-of-the-art models including GPT-4o and Claude-3.5 on four custom-created datasets. Our code is open-source at https://github.com/Rose-STL-Lab/Adapting-While-Learning.
Updated: 2025-06-20 08:54:13
Subjects: cs.LG,cs.AI,cs.CL,I.2.6; I.2.7
Bandwidth Selectors on Semiparametric Bayesian Networks
Semiparametric Bayesian networks (SPBNs) integrate parametric and non-parametric probabilistic models, offering flexibility in learning complex data distributions from samples. In particular, kernel density estimators (KDEs) are employed for the non-parametric component. Under the assumption of data normality, the normal rule is used to learn the bandwidth matrix for the KDEs in SPBNs. This matrix is the key hyperparameter that controls the trade-off between bias and variance. However, real-world data often deviates from normality, potentially leading to suboptimal density estimation and reduced predictive performance. This paper first establishes the theoretical framework for the application of state-of-the-art bandwidth selectors and subsequently evaluates their impact on SPBN performance. We explore the approaches of cross-validation and plug-in selectors, assessing their effectiveness in enhancing the learning capability and applicability of SPBNs. To support this investigation, we have extended the open-source package PyBNesian for SPBNs with the additional bandwidth selection techniques and conducted extensive experimental analyses. Our results demonstrate that the proposed bandwidth selectors leverage increasing information more effectively than the normal rule, which, despite its robustness, stagnates with more data. In particular, unbiased cross-validation generally outperforms the normal rule, highlighting its advantage in high sample size scenarios.
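The normal (normal-reference) rule that the paper takes as its baseline has a closed form; in one dimension it reduces to $h = (4/3)^{1/5}\,\hat\sigma\,n^{-1/5}$. A minimal 1-D sketch (SPBNs actually learn a full bandwidth matrix, so this only illustrates the univariate case):

```python
import math

def normal_rule_bandwidth(samples):
    # 1-D normal-reference rule: h = (4/3)^(1/5) * sigma_hat * n^(-1/5),
    # with sigma_hat the sample standard deviation.
    n = len(samples)
    mean = sum(samples) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    return (4.0 / 3.0) ** 0.2 * sigma * n ** -0.2
```

Cross-validation selectors instead choose h by minimizing an estimated integrated squared error, which is why they keep improving with sample size where the normal rule, derived under a normality assumption, stagnates.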
Updated: 2025-06-20 08:48:05
Subjects: cs.LG,cs.AI,stat.ML,I.2.6; I.5.1; G.3
FedFitTech: A Baseline in Federated Learning for Fitness Tracking
The rapid evolution of sensors and resource-efficient machine learning models has spurred the widespread adoption of wearable fitness tracking devices. Equipped with inertial sensors, such devices can continuously capture physical movements for fitness technology (FitTech), enabling applications from sports optimization to preventive healthcare. Traditional centralized learning approaches to detect fitness activities struggle with privacy concerns, regulatory constraints, and communication inefficiencies. In contrast, Federated Learning (FL) enables decentralized model training by communicating model updates rather than private wearable sensor data. Applying FL to FitTech presents unique challenges, such as data imbalance, lack of labelled data, heterogeneous user activity patterns, and trade-offs between personalization and generalization. To simplify research on FitTech in FL, we present the FedFitTech baseline, under the Flower framework, which is publicly available and widely used by both industry and academic researchers. Additionally, to illustrate its usage, this paper presents a case study that implements a system based on the FedFitTech baseline, incorporating a client-side early stopping strategy and comparing the results. For instance, this system allows wearable devices to optimize the trade-off between capturing common fitness activity patterns and preserving individuals' nuances, thereby enhancing both the scalability and efficiency of privacy-aware fitness tracking applications. Results show that this reduces overall redundant communication by 13 percent while incurring only a negligible 1 percent cost in overall recognition performance. Thus, the FedFitTech baseline creates a foundation for a wide range of new research and development opportunities in FitTech, and it is available as open source at: https://github.com/adap/flower/tree/main/baselines/fedfittech
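The client-side early stopping strategy from the case study can be sketched as a patience rule on each client's local loss curve. (Parameter names here are illustrative; the baseline's actual implementation lives in the linked repository.)

```python
def should_stop(loss_history, patience=3, min_delta=1e-3):
    # Stop a client's local training when none of the last `patience` rounds
    # improved on the previous best loss by at least `min_delta`.
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    return all(loss > best_before - min_delta
               for loss in loss_history[-patience:])
```

A client that stops early simply skips sending further model updates, which is where the reported reduction in redundant communication comes from.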
Updated: 2025-06-20 08:43:39
Subjects: cs.LG
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
Updated: 2025-06-20 08:41:41
标题: 更多思考,少一点看?评估多模态推理模型中的加强幻觉
摘要: 测试时计算使得多模态大型语言模型能够生成更长的推理链,从而在多模态数学推理等任务上表现出色。然而,这种改进的推理能力往往伴随着增加的幻觉:随着生成变得更长,模型往往会偏离图像基础内容,更多地依赖语言先验。注意力分析显示,更长的推理链导致对视觉输入的关注减少,这加剧了幻觉。为了系统地研究这一现象,我们引入了RH-AUC,一种度量模型感知准确性随推理长度变化的指标,使我们能够评估模型在推理过程中是否保持视觉基础。我们还发布了RH-Bench,一个跨多种多模态任务的诊断基准,旨在评估推理能力和幻觉之间的权衡。我们的分析显示,(i)更大的模型通常在推理和感知之间取得更好的平衡,(ii)这种平衡更多地受到训练数据的类型和领域而不是整体量的影响。这些发现强调了综合考虑推理质量和感知忠实度的评估框架的重要性。
更新时间: 2025-06-20 08:41:41
领域: cs.CL,cs.AI,cs.CV
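A metric in the spirit of RH-AUC can be sketched as follows; the paper's exact definition may differ. Here responses are binned by reasoning length, perception accuracy is computed per bin, and a normalised area under that curve is taken (1.0 means accuracy is fully preserved as chains grow):

```python
def accuracy_by_length(records, bins):
    # records: list of (reasoning_length, perception_correct) pairs,
    # perception_correct in {0, 1}; bins: list of [lo, hi) length ranges.
    accs = []
    for lo, hi in bins:
        hits = [c for (l, c) in records if lo <= l < hi]
        accs.append(sum(hits) / len(hits) if hits else 0.0)
    return accs

def rh_auc(accs):
    # Normalised trapezoidal area under the accuracy-vs-length curve.
    if len(accs) < 2:
        return accs[0] if accs else 0.0
    area = sum((a + b) / 2 for a, b in zip(accs, accs[1:]))
    return area / (len(accs) - 1)
```

A model whose perception collapses with longer chains scores well below a model that stays grounded.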
Beyond Blur: A Fluid Perspective on Generative Diffusion Models
We propose a novel PDE-driven corruption process for generative image synthesis based on advection-diffusion processes which generalizes existing PDE-based approaches. Our forward pass formulates image corruption via a physically motivated PDE that couples directional advection with isotropic diffusion and Gaussian noise, controlled by dimensionless numbers (Peclet, Fourier). We implement this PDE numerically through a GPU-accelerated custom Lattice Boltzmann solver for fast evaluation. To induce realistic turbulence, we generate stochastic velocity fields that introduce coherent motion and capture multi-scale mixing. In the generative process, a neural network learns to reverse the advection-diffusion operator thus constituting a novel generative model. We discuss how previous methods emerge as specific cases of our operator, demonstrating that our framework generalizes prior PDE-based corruption techniques. We illustrate how advection improves the diversity and quality of the generated images while keeping the overall color palette unaffected. This work bridges fluid dynamics, dimensionless PDE theory, and deep generative modeling, offering a fresh perspective on physically informed image corruption processes for diffusion-based synthesis.
Updated: 2025-06-20 08:31:30
标题: 超越模糊:关于生成扩散模型的流体视角
摘要: 我们提出了一种基于对流-扩散过程的新型PDE驱动的图像生成损坏过程,该过程推广了现有的基于PDE的方法。我们的正向传递通过一个受物理启发的PDE来形式化图像损坏,该PDE将方向性对流与各向同性扩散和高斯噪声耦合在一起,由无量纲数(佩克莱、傅里叶)控制。我们通过GPU加速的自定义点阵Boltzmann求解器数字化地实现了这个PDE,以便进行快速评估。为了引入真实的湍流,我们生成随机速度场,引入协同运动并捕捉多尺度混合。在生成过程中,神经网络学习逆转对流-扩散算子,从而构成一种新型生成模型。我们讨论了先前的方法如何作为我们算子的特定情况出现,展示了我们的框架如何推广之前基于PDE的损坏技术。我们说明了对流如何提高生成图像的多样性和质量,同时保持整体色板不受影响。这项工作连接了流体动力学、无量纲PDE理论和深度生成建模,为基于扩散的合成提供了物理启示的图像损坏过程的新视角。
更新时间: 2025-06-20 08:31:30
领域: cs.GR,cs.CV,cs.LG,I.2.6; I.4.10; I.4.8
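The forward corruption couples advection, diffusion, and noise. A minimal 1-D explicit finite-difference version (upwind advection, central diffusion, periodic boundary) is sketched below with illustrative coefficients; the paper itself works with a Lattice Boltzmann solver and stochastic 2-D velocity fields:

```python
import math

def step(u, c=0.5, D=0.1, dt=0.1, dx=1.0, sigma=0.0, rng=None):
    # One explicit update of  du/dt = -c du/dx + D d2u/dx2 + noise.
    # The grid Peclet number c*dx/D sets the advection/diffusion balance.
    n = len(u)
    out = []
    for i in range(n):
        adv = -c * dt / dx * (u[i] - u[i - 1])  # upwind advection
        dif = D * dt / dx**2 * (u[(i + 1) % n] - 2 * u[i] + u[i - 1])
        eps = sigma * math.sqrt(dt) * rng.gauss(0, 1) if rng else 0.0
        out.append(u[i] + adv + dif + eps)
    return out
```

Without noise, both operators conserve total mass while the profile spreads and drifts downwind, exactly the behaviour a reverse model must learn to undo.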
AnyTraverse: An off-road traversability framework with VLM and human operator in the loop
Off-road traversability segmentation enables autonomous navigation with applications in search-and-rescue, military operations, wildlife exploration, and agriculture. Current frameworks struggle with the significant variation of unstructured environments and uncertain scene changes, and cannot be adapted to different robot types. We present AnyTraverse, a framework combining natural-language prompts with human-operator assistance to determine navigable regions for diverse robotic vehicles. The system segments scenes for a given set of prompts and calls the operator only when encountering previously unexplored scenery or an unknown class not covered by the prompt in its region of interest, thus reducing the active supervision load while adapting to varying outdoor scenes. Our zero-shot learning approach eliminates the need for extensive data collection or retraining. Our experimental validation includes testing on the RELLIS-3D, Freiburg Forest, and RUGD datasets and demonstrates real-world deployment on multiple robot platforms. The results show that AnyTraverse performs better than GA-NAV and Off-seg while offering a vehicle-agnostic approach to off-road traversability that balances automation with targeted human supervision.
Updated: 2025-06-20 08:31:13
标题: AnyTraverse:一种具有VLM和人类操作员的越野可行性框架
摘要: 越野可行性分割使得自主导航成为可能,适用于搜索与救援、军事行动、野生动物探索和农业等领域。当前的框架在处理非结构化环境和不确定场景变化方面存在挑战,并且不能适应不同类型的机器人使用。我们提出了AnyTraverse,这是一个结合自然语言提示和人员操作辅助的框架,用于确定各种机器人车辆的可导航区域。该系统根据给定的提示对场景进行分割,并仅在遇到以前未探索的景观或提示区域之外的未知类别时调用操作员,从而在适应不同户外场景的同时减少主动监督负荷。我们的零样本学习方法消除了对大量数据采集或重新训练的需求。我们的实验验证包括在RELLIS-3D,弗赖堡森林和RUGD数据集上进行测试,并展示在多个机器人平台上的实际部署。结果显示,AnyTraverse在提供一种适用于各种机动车辆的越野可行性方法的同时,比GA-NAV和Off-seg表现更好,平衡了自动化与有针对性的人员监督。
更新时间: 2025-06-20 08:31:13
领域: cs.CV,cs.AI,cs.RO
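The operator-call decision can be sketched as a simple confidence gate. This is a hypothetical simplification (the threshold `tau` and the score format are our assumptions), since the actual framework operates on VLM segmentation masks:

```python
def needs_operator(region_scores, prompt_classes, tau=0.5):
    # region_scores: {class_name: confidence} for the region of interest.
    # Call the human operator when no prompted class is confident enough,
    # i.e. the scene is likely unexplored or contains an unprompted class.
    known = {c: s for c, s in region_scores.items() if c in prompt_classes}
    return (not known) or max(known.values()) < tau
```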
Predicting New Research Directions in Materials Science using Large Language Models and Concept Graphs
Due to an exponential increase in published research articles, it is impossible for individual scientists to read all publications, even within their own research field. In this work, we investigate the use of large language models (LLMs) for the purpose of extracting the main concepts and semantic information from scientific abstracts in the domain of materials science to find links that were not noticed by humans and thus to suggest inspiring near/mid-term future research directions. We show that LLMs can extract concepts more efficiently than automated keyword extraction methods to build a concept graph as an abstraction of the scientific literature. A machine learning model is trained to predict emerging combinations of concepts, i.e. new research ideas, based on historical data. We demonstrate that integrating semantic concept information leads to an increased prediction performance. The applicability of our model is demonstrated in qualitative interviews with domain experts based on individualized model suggestions. We show that the model can inspire materials scientists in their creative thinking process by predicting innovative combinations of topics that have not yet been investigated.
Updated: 2025-06-20 08:26:12
标题: 利用大型语言模型和概念图预测材料科学中的新研究方向
摘要: 由于发表研究文章数量呈指数增长,个人科学家不可能阅读所有出版物,甚至在自己的研究领域内也是如此。在这项工作中,我们调查了大型语言模型(LLMs)的使用,目的是从材料科学领域的科学摘要中提取主要概念和语义信息,以找到人类未曾注意到的联系,从而建议具有启发性的近/中期未来研究方向。我们表明LLMs可以比自动关键字提取方法更有效地提取概念,以建立一个作为科学文献抽象的概念图。一个机器学习模型被训练来预测新兴概念组合,即基于历史数据的新研究想法。我们证明整合语义概念信息会导致增加的预测性能。我们的模型的适用性通过基于个性化模型建议的定性访谈与领域专家一起展示。我们表明,该模型可以通过预测尚未研究的主题创新组合来激发材料科学家的创造性思维过程。
更新时间: 2025-06-20 08:26:12
领域: cs.LG
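The pipeline reduces to: build a co-occurrence graph over extracted concepts, then score pairs that have never co-occurred. A minimal sketch using a common-neighbour heuristic as a stand-in for the trained prediction model (the concept names below are invented for illustration):

```python
from itertools import combinations
from collections import defaultdict

def build_graph(papers):
    # papers: one set of concepts (e.g. LLM-extracted) per abstract.
    adj = defaultdict(set)
    for concepts in papers:
        for a, b in combinations(sorted(concepts), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def predict_new_links(adj, top_k=3):
    # Rank never-seen concept pairs by common-neighbour count: a simple
    # proxy for "emerging combination" likelihood.
    scores = {}
    for a, b in combinations(sorted(adj), 2):
        if b not in adj[a]:
            scores[(a, b)] = len(adj[a] & adj[b])
    return sorted(scores, key=lambda p: -scores[p])[:top_k]
```

In practice the paper trains a machine learning model on historical snapshots of such a graph rather than using a fixed heuristic.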
Learning Dexterous Object Handover
Object handover is an important skill that we use daily when interacting with other humans. To deploy robots in collaborative settings such as homes, the ability to receive and hand over objects safely and efficiently becomes a crucial skill. In this work, we demonstrate the use of Reinforcement Learning (RL) for dexterous object handover between two multi-finger hands. Key to this task is the use of a novel reward function based on dual quaternions to minimize the rotation distance, which outperforms other rotation representations such as Euler angles and rotation matrices. The robustness of the trained policy is experimentally evaluated by testing with objects that are not included in the training distribution, and with perturbations during the handover process. The results demonstrate that the trained policy successfully performs this task, achieving a total success rate of 94% in the best-case scenario after 100 experiments, thereby showing the robustness of our policy with novel objects. In addition, the best-case performance of the policy decreases by only 13.8% when the other robot moves during the handover, proving that our policy is also robust to this type of perturbation, which is common in real-world object handovers.
Updated: 2025-06-20 08:22:46
标题: 学习熟练的物体交接
摘要: 目标交接是我们在与其他人互动时每天使用的重要技能。在协作环境中部署机器人,如房屋,能够安全高效地接收和交接物体变得至关重要。在这项工作中,我们展示了强化学习(RL)在两只多指手之间熟练地交接物体中的应用。这项任务的关键是使用基于双四元数的新型奖励函数,以最小化旋转距离,这优于其他旋转表示,如欧拉和旋转矩阵。通过测试不包含在训练分布中的对象以及在交接过程中的扰动,对训练策略的鲁棒性进行了实验评估。结果表明,训练策略成功地执行了这项任务,在100次实验后,在最佳情况下达到94%的总成功率,从而显示了我们的策略对新物体的鲁棒性。此外,在交接过程中另一个机器人移动时,策略的最佳执行性能仅下降了13.8%,证明了我们的策略也对这种类型的扰动具有鲁棒性,这在现实世界的物体交接中很常见。
更新时间: 2025-06-20 08:22:46
领域: cs.RO,cs.AI
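The key reward ingredient is a well-behaved rotation distance. The sketch below covers only the rotational term using ordinary unit quaternions; the paper's reward is built on dual quaternions, which additionally encode translation:

```python
import math

def quat_distance(q1, q2):
    # Geodesic angle between unit quaternions (w, x, y, z); the |dot|
    # handles the double cover (q and -q encode the same rotation).
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2 * math.acos(min(1.0, dot))

def handover_reward(q_hand, q_goal, scale=1.0):
    # Dense reward term: maximal (0) when the orientations coincide.
    return -scale * quat_distance(q_hand, q_goal)
```

Unlike Euler-angle differences, this distance is singularity-free and smooth, which is one plausible reason such rewards train better.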
Loupe: A Generalizable and Adaptive Framework for Image Forgery Detection
The proliferation of generative models has raised serious concerns about visual content forgery. Existing deepfake detection methods primarily target either image-level classification or pixel-wise localization. While some achieve high accuracy, they often suffer from limited generalization across manipulation types or rely on complex architectures. In this paper, we propose Loupe, a lightweight yet effective framework for joint deepfake detection and localization. Loupe integrates a patch-aware classifier and a segmentation module with conditional queries, allowing simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness against distribution shifts in the test set, Loupe introduces a pseudo-label-guided test-time adaptation mechanism by leveraging patch-level predictions to supervise the segmentation head. Extensive experiments on the DDL dataset demonstrate that Loupe achieves state-of-the-art performance, securing the first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846. Our results validate the effectiveness of the proposed patch-level fusion and conditional query design in improving both classification accuracy and spatial localization under diverse forgery patterns. The code is available at https://github.com/Kamichanw/Loupe.
Updated: 2025-06-20 08:18:44
标题: Loupe:一种通用且自适应的图像伪造检测框架
摘要: 生成模型的激增引起了对视觉内容伪造的严重关注。现有的深度伪造检测方法主要针对图像级别分类或像素级别定位。虽然一些方法取得了较高的准确性,但它们往往在不同类型的伪造上存在有限的泛化能力,或者依赖复杂的架构。在本文中,我们提出了Loupe,一个轻量级但有效的联合深伪造检测和定位框架。Loupe集成了一个基于补丁的分类器和一个带有条件查询的分割模块,可以同时进行全局真实性分类和细粒度掩模预测。为了增强对测试集分布变化的鲁棒性,Loupe引入了一种伪标签引导的测试时适应机制,通过利用补丁级别的预测来监督分割头。对DDL数据集进行了大量实验,结果表明Loupe取得了最先进的性能,在IJCAI 2025深伪造检测和定位挑战中获得了0.846的综合分数,排名第一。我们的结果验证了所提出的补丁级融合和条件查询设计在改善分类准确性和空间定位在多样化伪造模式下的有效性。代码可在https://github.com/Kamichanw/Loupe获取。
更新时间: 2025-06-20 08:18:44
领域: cs.CV,cs.AI
When and How Does CLIP Enable Domain and Compositional Generalization?
The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric and mechanistic analyses, we find that successful generalization requires the learning of sufficiently shared representations in intermediate layers and circuits.
Updated: 2025-06-20 08:12:35
标题: CLIP何时以及如何实现领域和组合泛化?
摘要: 对比视觉-语言模型(如CLIP)的显着泛化性能通常归因于它们训练分布的多样性。然而,关键问题仍未得到解答:当在多样性领域混合训练时,CLIP是否能泛化到完全看不见的领域(领域泛化)?它是否能在部分可见领域内泛化到看不见的类别(组合泛化)?什么因素影响这种泛化?为了回答这些问题,我们在系统构建的训练分布上训练CLIP模型,控制领域多样性和对象类别暴露。我们的实验表明,领域多样性对于领域和组合泛化至关重要,然而当训练分布包含测试领域的次优子集时,组合泛化可能会出乎意料地较弱于领域泛化。通过数据中心和机制分析,我们发现成功的泛化需要在中间层和电路中学习足够共享的表示。
更新时间: 2025-06-20 08:12:35
领域: cs.LG,cs.CV
Robust Group Anomaly Detection for Quasi-Periodic Network Time Series
Many real-world multivariate time series are collected from a network of physical objects embedded with software, electronics, and sensors. The quasi-periodic signals generated by these objects often follow a similar repetitive and periodic pattern, but vary in period and come in different lengths caused by timing (synchronization) errors. Given a multitude of such quasi-periodic time series, can we build machine learning models to identify those time series that behave differently from the majority of the observations? In addition, can the models help human experts understand how the decision was made? We propose a sequence-to-Gaussian-Mixture-Model (seq2GMM) framework. The overarching goal of this framework is to identify unusual and interesting time series within a network time series database. We further develop a surrogate-based optimization algorithm that can efficiently train the seq2GMM model. Seq2GMM exhibits strong empirical performance on a range of public benchmark datasets, outperforming state-of-the-art anomaly detection techniques by a significant margin. We also theoretically analyze the convergence property of the proposed training algorithm and provide numerical results to substantiate our theoretical claims.
Updated: 2025-06-20 08:11:04
标题: 稳健的群组异常检测方法用于准周期网络时间序列
摘要: 许多实际世界中的多变量时间序列是从嵌入软件、电子设备和传感器的物理对象网络中收集的。这些对象生成的准周期信号通常遵循相似的重复和周期模式,但由于时间(同步)错误而具有不同长度的变化。鉴于这样大量的准周期时间序列,我们能否建立机器学习模型来识别那些与大多数观察结果不同的时间序列?此外,模型能否帮助人类专家理解决策是如何做出的?我们提出了一个序列到高斯混合模型(seq2GMM)框架。该框架的总体目标是在网络时间序列数据库中识别异常和有趣的时间序列。我们进一步开发了一种基于替代的优化算法,可以高效地训练seq2GMM模型。Seq2GMM在多个公共基准数据集上表现出很强的实证性能,明显优于最先进的异常检测技术。我们还从理论上分析了提出的训练算法的收敛性质,并提供数值结果来证实我们的理论主张。
更新时间: 2025-06-20 08:11:04
领域: cs.LG
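The scoring side of a seq2GMM-style detector can be sketched as follows: score each series under a Gaussian mixture and flag the lowest-likelihood fraction as group anomalies. This toy version scores raw 1-D values directly, whereas the actual model learns the mixture from sequence encodings; the mixture parameters below are invented:

```python
import math

def gmm_loglik(x, comps):
    # comps: list of (weight, mean, std) for a 1-D Gaussian mixture;
    # the sequence score sums per-value log-densities.
    total = 0.0
    for v in x:
        p = sum(w * math.exp(-((v - m) ** 2) / (2 * s * s))
                / (s * math.sqrt(2 * math.pi)) for w, m, s in comps)
        total += math.log(p + 1e-300)  # guard against log(0)
    return total

def flag_anomalies(sequences, comps, frac=0.2):
    scores = [gmm_loglik(s, comps) for s in sequences]
    k = max(1, int(frac * len(sequences)))
    cutoff = sorted(scores)[k - 1]
    return [i for i, sc in enumerate(scores) if sc <= cutoff]
```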
Preference-Driven Multi-Objective Combinatorial Optimization with Conditional Computation
Recent deep reinforcement learning methods have achieved remarkable success in solving multi-objective combinatorial optimization problems (MOCOPs) by decomposing them into multiple subproblems, each associated with a specific weight vector. However, these methods typically treat all subproblems equally and solve them using a single model, hindering the effective exploration of the solution space and thus leading to suboptimal performance. To overcome the limitation, we propose POCCO, a novel plug-and-play framework that enables adaptive selection of model structures for subproblems, which are subsequently optimized based on preference signals rather than explicit reward values. Specifically, we design a conditional computation block that routes subproblems to specialized neural architectures. Moreover, we propose a preference-driven optimization algorithm that learns pairwise preferences between winning and losing solutions. We evaluate the efficacy and versatility of POCCO by applying it to two state-of-the-art neural methods for MOCOPs. Experimental results across four classic MOCOP benchmarks demonstrate its significant superiority and strong generalization.
Updated: 2025-06-20 08:04:32
标题: 偏好驱动的带条件计算的多目标组合优化
摘要: 最近,深度强化学习方法在解决多目标组合优化问题(MOCOPs)方面取得了显著成功,通过将其分解为多个子问题,每个子问题都与特定的权重向量相关联。然而,这些方法通常将所有子问题视为相等,并使用单个模型解决它们,从而阻碍了解决方案空间的有效探索,导致次优性能。为了克服这一限制,我们提出了POCCO,这是一个新颖的即插即用框架,可以根据偏好信号而非显式奖励值对子问题的模型结构进行自适应选择。具体地,我们设计了一个条件计算块,将子问题路由到专门的神经结构中。此外,我们提出了一种基于偏好驱动的优化算法,学习获胜和失败解决方案之间的成对偏好。通过将其应用于两种MOCOPs的最先进神经方法,我们评估了POCCO的功效和多功能性。在四个经典MOCOP基准测试中的实验结果显示了其明显的优越性和强大的泛化能力。
更新时间: 2025-06-20 08:04:32
领域: cs.AI
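Preference-driven optimisation can be illustrated with the pairwise logistic (Bradley-Terry) objective: the learner only sees which of two solutions won, not explicit reward values. A sketch with a linear scorer standing in for the neural policy parameters:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def preference_update(w, feats_win, feats_lose, lr=0.1):
    # One ascent step on log sigmoid(s_win - s_lose), the pairwise
    # preference likelihood, with linear scores s = w . x.
    s_w = sum(a * b for a, b in zip(w, feats_win))
    s_l = sum(a * b for a, b in zip(w, feats_lose))
    g = 1 - sigmoid(s_w - s_l)  # gradient w.r.t. the score margin
    return [wi + lr * g * (fw - fl)
            for wi, fw, fl in zip(w, feats_win, feats_lose)]
```

Repeated updates drive the winner's score above the loser's without ever assigning either an absolute reward.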
Boltzmann Classifier: A Thermodynamic-Inspired Approach to Supervised Learning
We present the Boltzmann classifier, a novel distance-based probabilistic classification algorithm inspired by the Boltzmann distribution. Unlike traditional classifiers that produce hard decisions or uncalibrated probabilities, the Boltzmann classifier assigns class probabilities based on the average distance to the nearest neighbors within each class, providing interpretable, physically meaningful outputs. We evaluate the performance of the method across three application domains: molecular activity prediction, oxidation state classification of transition metal complexes, and breast cancer diagnosis. In the molecular activity task, the classifier achieved the highest accuracy in predicting active compounds against two protein targets, with strong correlations observed between the predicted probabilities and experimental pIC50 values. For metal complexes, the classifier accurately distinguished between oxidation states II and III for Fe, Mn, and Co, using only metal-ligand bond lengths extracted from crystallographic data, and demonstrated high consistency with known chemical trends. In the breast cancer dataset, the classifier achieved 97% accuracy, with low-confidence predictions concentrated in inherently ambiguous cases. Across all tasks, the Boltzmann classifier performed competitively or better than standard models such as logistic regression, support vector machines, random forests, and k-nearest neighbors. Its probabilistic outputs were found to correlate with continuous physical or biological properties, highlighting its potential utility in both classification and regression contexts. The results suggest that the Boltzmann classifier is a robust and interpretable alternative to conventional machine learning approaches, particularly in scientific domains where underlying structure-property relationships are important.
Updated: 2025-06-20 08:01:48
标题: 玻尔兹曼分类器:一种受热力学启发的监督学习方法
摘要: 我们提出了Boltzmann分类器,这是一种受Boltzmann分布启发的基于距离的概率分类算法。与传统分类器不同,Boltzmann分类器根据与每个类内最近邻的平均距离分配类概率,提供可解释、物理意义明确的输出。我们在三个应用领域评估了该方法的性能:分子活性预测、过渡金属配合物的氧化态分类以及乳腺癌诊断。在分子活性任务中,该分类器在预测两种蛋白靶标的活性化合物方面取得了最高的准确率,预测概率与实验pIC50值之间观察到了强相关性。对于金属配合物,该分类器仅利用从晶体学数据提取的金属配体键长准确区分了Fe、Mn和Co的氧化态II和III,并且与已知化学趋势高度一致。在乳腺癌数据集中,该分类器实现了97%的准确率,低置信度预测集中在固有模糊的情况中。在所有任务中,Boltzmann分类器的性能与标准模型如逻辑回归、支持向量机、随机森林和k近邻相媲美甚至更好。其概率输出与连续的物理或生物特性相关,突显了在分类和回归环境中其潜在实用性。结果表明,Boltzmann分类器是传统机器学习方法的稳健且可解释的替代,尤其在重要的结构属性关系的科学领域。
更新时间: 2025-06-20 08:01:48
领域: cs.LG,physics.comp-ph
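The core rule, class probabilities from a Boltzmann distribution over average nearest-neighbour distances, can be sketched directly. The temperature T and neighbour count k are illustrative hyperparameters:

```python
import math

def boltzmann_probs(x, train, k=3, T=1.0):
    # train: {label: [feature vectors]}. Class "energy" = mean distance
    # to the k nearest same-class neighbours; probabilities follow
    # exp(-E/T) / Z, so they are calibrated and sum to one.
    energies = {}
    for label, pts in train.items():
        dists = sorted(math.dist(x, p) for p in pts)
        energies[label] = sum(dists[:k]) / min(k, len(dists))
    z = sum(math.exp(-e / T) for e in energies.values())
    return {label: math.exp(-e / T) / z for label, e in energies.items()}
```

Points deep inside a class get near-certain probabilities, while points between classes get genuinely intermediate ones, matching the paper's observation that low-confidence outputs cluster on ambiguous cases.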
Zero-Knowledge Proof-of-Location Protocols for Vehicle Subsidies and Taxation Compliance
This paper introduces a new set of privacy-preserving mechanisms for verifying compliance with location-based policies for vehicle taxation, or for (electric) vehicle (EV) subsidies, using Zero-Knowledge Proofs (ZKPs). We present the design and evaluation of a Zero-Knowledge Proof-of-Location (ZK-PoL) system that ensures a vehicle's adherence to territorial driving requirements without disclosing specific location data, hence maintaining user privacy. Our findings suggest a promising approach to apply ZK-PoL protocols in large-scale governmental subsidy or taxation programs.
Updated: 2025-06-20 08:00:32
标题: 车辆补贴和税收遵从的零知识位置证明协议
摘要: 本文介绍了一组针对验证车辆税收地点政策遵从性或(电动)车辆(EV)补贴的隐私保护机制,使用零知识证明(ZKPs)。我们提出了一个零知识位置证明(ZK-PoL)系统的设计和评估,该系统确保车辆遵守领土驾驶要求,而不公开具体位置数据,从而保护用户隐私。我们的研究结果表明,在大规模政府补贴或税收计划中应用ZK-PoL协议的方法具有很大的潜力。
更新时间: 2025-06-20 08:00:32
领域: cs.CR
CINNAMON: A hybrid approach to change point detection and parameter estimation in single-particle tracking data
Change point detection has become an important part of the analysis of the single-particle tracking data, as it allows one to identify moments, in which the motion patterns of observed particles undergo significant changes. The segmentation of diffusive trajectories based on those moments may provide insight into various phenomena in soft condensed matter and biological physics. In this paper, we propose CINNAMON, a hybrid approach to classifying single-particle tracking trajectories, detecting change points within them, and estimating diffusion parameters in the segments between the change points. Our method is based on a combination of neural networks, feature-based machine learning, and statistical techniques. It has been benchmarked in the second Anomalous Diffusion Challenge. The method offers a high level of interpretability due to its analytical and feature-based components. A potential use of features from topological data analysis is also discussed.
Updated: 2025-06-20 07:42:53
标题: CINNAMON:在单粒子跟踪数据中进行变点检测和参数估计的混合方法
摘要: 变点检测已经成为单粒子跟踪数据分析的重要部分,因为它可以帮助识别在观察到的粒子运动模式发生显著变化的时刻。基于这些时刻对扩散轨迹进行分割可能会揭示软凝聚物质和生物物理学中的各种现象。本文提出了一种名为CINNAMON的混合方法,用于对单粒子跟踪轨迹进行分类,检测其中的变点,并估计变点之间段落中的扩散参数。我们的方法基于神经网络、基于特征的机器学习和统计技术的结合。它已经在第二届异常扩散挑战中进行了基准测试。该方法由于具有分析性和基于特征的组件,提供了高水平的可解释性。还讨论了从拓扑数据分析中提取特征的潜在用途。
更新时间: 2025-06-20 07:42:53
领域: q-bio.QM,cs.LG
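The change-point half of such a pipeline can be illustrated with the simplest possible detector, a least-squares scan for a single mean shift, followed by a crude per-segment diffusion estimate. CINNAMON itself combines neural networks, feature-based learning, and statistics; this is only the statistical skeleton, with invented data in the test:

```python
def best_change_point(x, min_seg=2):
    # Pick the split index minimising the summed within-segment
    # squared error (single-change-point least squares).
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)
    best_t, best_cost = None, float("inf")
    for t in range(min_seg, len(x) - min_seg + 1):
        cost = sse(x[:t]) + sse(x[t:])
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

def segment_diffusion(x, t, dt=1.0):
    # D = Var(increments) / (2*dt), assuming 1-D Brownian-like segments.
    def d_hat(seg):
        inc = [b - a for a, b in zip(seg, seg[1:])]
        m = sum(inc) / len(inc)
        return sum((v - m) ** 2 for v in inc) / len(inc) / (2 * dt)
    return d_hat(x[:t]), d_hat(x[t:])
```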
DVFS-Aware DNN Inference on GPUs: Latency Modeling and Performance Analysis
The rapid development of deep neural networks (DNNs) is inherently accompanied by the problem of high computational costs. To tackle this challenge, dynamic voltage frequency scaling (DVFS) is emerging as a promising technology for balancing the latency and energy consumption of DNN inference by adjusting the computing frequency of processors. However, most existing models of DNN inference time are based on the CPU-DVFS technique, and directly applying the CPU-DVFS model to DNN inference on GPUs will lead to significant errors in optimizing latency and energy consumption. In this paper, we propose a DVFS-aware latency model to precisely characterize DNN inference time on GPUs. We first formulate the DNN inference time based on extensive experiment results for different devices and analyze the impact of fitting parameters. Then by dividing DNNs into multiple blocks and obtaining the actual inference time, the proposed model is further verified. Finally, we compare our proposed model with the CPU-DVFS model in two specific cases. Evaluation results demonstrate that local inference optimization with our proposed model achieves a reduction of no less than 66% and 69% in inference time and energy consumption respectively. In addition, cooperative inference with our proposed model can improve the partition policy and reduce the energy consumption compared to the CPU-DVFS model.
Updated: 2025-06-20 07:42:34
标题: DVFS 意识的 GPU 上的 DNN 推断:延迟建模与性能分析
摘要: 深度神经网络(DNN)的快速发展固有地伴随着高计算成本的问题。为了解决这一挑战,动态电压频率调节(DVFS)正在成为一种有前景的技术,通过调整处理器的计算频率来平衡DNN推断的延迟和能耗。然而,大多数现有的DNN推断时间模型都是基于CPU-DVFS技术的,直接将CPU-DVFS模型应用于GPU上的DNN推断会导致在优化延迟和能耗方面出现显著错误。本文提出了一种DVFS感知的延迟模型,以精确描述GPU上DNN推断时间。我们首先根据不同设备的广泛实验结果制定了DNN推断时间,并分析了拟合参数的影响。然后通过将DNN划分为多个块并获取实际推断时间,进一步验证了所提出的模型。最后,我们在两种具体情况下将我们提出的模型与CPU-DVFS模型进行了比较。评估结果表明,使用我们提出的模型进行本地推断优化在推断时间和能耗方面分别实现了不少于66%和69%的减少。此外,与CPU-DVFS模型相比,使用我们提出的模型进行协作推断可以改善分区策略并减少能耗。
更新时间: 2025-06-20 07:42:34
领域: cs.LG,cs.NI
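A first-order DVFS-aware latency model is often of the form t(f) = a/f + b: a frequency-scaled compute term plus a frequency-independent term. The paper fits a more detailed GPU model from extensive measurements, but the fitting step itself can be sketched as ordinary least squares in x = 1/f:

```python
def fit_latency(freqs, times):
    # Fit t(f) = a / f + b by OLS in x = 1/f. The measurements below in
    # the test are synthetic; the paper's model may add further terms.
    xs = [1.0 / f for f in freqs]
    n = len(xs)
    mx = sum(xs) / n
    mt = sum(times) / n
    a = (sum((x - mx) * (t - mt) for x, t in zip(xs, times))
         / sum((x - mx) ** 2 for x in xs))
    b = mt - a * mx
    return a, b
```

Once fitted, the model predicts inference latency at any candidate frequency, which is the quantity a DVFS controller trades off against energy.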
Speeding up Local Optimization in Vehicle Routing with Tensor-based GPU Acceleration
Local search plays a central role in many effective heuristic algorithms for the vehicle routing problem (VRP) and its variants. However, neighborhood exploration is known to be computationally expensive and time consuming, especially for large instances or problems with complex constraints. In this study, we explore a promising direction to address this challenge by introducing an original tensor-based GPU acceleration method designed to speed up the commonly used local search operators in vehicle routing. By using an attribute-based representation, the method offers broad extensibility, making it applicable to different VRP variants. Its low-coupling architecture, with intensive computations completely offloaded to the GPU, ensures seamless integration in various local search-based algorithms and frameworks, leading to significant improvements in computational efficiency and potentially improved solution quality. Through comparative experiments on benchmark instances of three routing problems, we demonstrate the substantial computational advantages of the proposed approach over traditional CPU-based implementations. We also provide a detailed analysis of the strengths and limitations of the method, providing valuable insights into its performance characteristics and identifying potential bottlenecks in practical applications. These findings contribute to a better understanding and suggest directions for future improvements.
Updated: 2025-06-20 07:40:47
标题: 使用基于张量的GPU加速技术加快车辆路径规划中的本地优化
摘要: 局部搜索在许多有效的启发式算法中扮演着重要角色,用于解决车辆路径问题(VRP)及其变体。然而,已知邻域探索在计算上是昂贵且耗时的,尤其是对于大规模实例或具有复杂约束的问题。在这项研究中,我们通过引入一种原创的基于张量的GPU加速方法,旨在加速车辆路径中常用的局部搜索运算符,来探索解决这一挑战的有望方向。通过使用基于属性的表示法,该方法具有广泛的可扩展性,可适用于不同的VRP变体。其低耦合架构,将密集计算完全转移到GPU,确保在各种基于局部搜索的算法和框架中实现无缝集成,从而显著提高计算效率,并潜在地改善解决方案质量。通过在三个路径问题的基准实例上进行比较实验,我们展示了所提出方法相对于传统基于CPU实现的重要计算优势。我们还提供了对该方法的优势和局限性进行详细分析,为其性能特征提供有价值的见解,并确定实际应用中的潜在瓶颈。这些发现有助于更好地理解,并指出未来改进的方向。
更新时间: 2025-06-20 07:40:47
领域: cs.DC,cs.AI
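The speed-up targets neighbourhood evaluation: instead of testing 2-opt moves one by one, all move deltas are produced as one batched tensor expression. The sketch below keeps the loop nest in plain Python for clarity; in the paper this whole table is computed on the GPU:

```python
def two_opt_deltas(tour, dist):
    # Cost change of every 2-opt move (i, j): reverse tour[i+1..j].
    n = len(tour)
    deltas = {}
    for i in range(n - 1):
        for j in range(i + 2, n):
            a, b = tour[i], tour[i + 1]
            c, d = tour[j], tour[(j + 1) % n]
            if a == d:
                continue  # wrapping move that changes nothing
            deltas[(i, j)] = dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]
    return deltas

def apply_best(tour, dist):
    deltas = two_opt_deltas(tour, dist)
    (i, j), delta = min(deltas.items(), key=lambda kv: kv[1])
    if delta >= 0:
        return tour, 0.0  # local optimum
    return tour[: i + 1] + tour[i + 1 : j + 1][::-1] + tour[j + 1 :], delta
```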
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack
Batch prompting, which combines a batch of multiple queries sharing the same context in one inference, has emerged as a promising solution to reduce inference costs. However, our study reveals a significant security vulnerability in batch prompting: malicious users can inject attack instructions into a batch, leading to unwanted interference across all queries, which can result in the inclusion of harmful content, such as phishing links, or the disruption of logical reasoning. In this paper, we construct BATCHSAFEBENCH, a comprehensive benchmark comprising 150 attack instructions of two types and 8k batch instances, to study the batch prompting vulnerability systematically. Our evaluation of both closed-source and open-weight LLMs demonstrates that all LLMs are susceptible to batch-prompting attacks. We then explore multiple defending approaches. While the prompting-based defense shows limited effectiveness for smaller LLMs, the probing-based approach achieves about 95% accuracy in detecting attacks. Additionally, we perform a mechanistic analysis to understand the attack and identify attention heads that are responsible for it.
Updated: 2025-06-20 07:32:36
标题: 高效但易受攻击:LLM批量提示攻击的基准测试和防御
摘要: 批量提示是将共享相同上下文的多个查询批量合并到一个推理中,已成为减少推理成本的一种有前途的解决方案。然而,我们的研究揭示了批量提示中存在重大安全漏洞:恶意用户可以向批量中注入攻击指令,导致所有查询之间产生不必要的干扰,可能导致包含有害内容,如钓鱼链接,或者破坏逻辑推理。在本文中,我们构建了BATCHSAFEBENCH,一个包含150个攻击指令和8k批量实例的综合基准,以系统地研究批量提示漏洞。我们对闭源和开源权重LLM进行评估,结果表明所有LLM都容易受到批量提示攻击。然后我们探讨了多种防御方法。虽然基于提示的防御对较小的LLM效果有限,但基于探测的方法在检测攻击方面达到了约95%的准确率。此外,我们进行了机械分析,以了解攻击并确定负责的注意头。
更新时间: 2025-06-20 07:32:36
领域: cs.CR,cs.AI,cs.LG

Cost-effective Instruction Learning for Pathology Vision and Language Analysis
The advent of vision-language models fosters interactive conversations between AI-enabled models and humans. Yet applying these models in clinics must deal with daunting challenges around large-scale training data and financial and computational resources. Here we propose a cost-effective instruction learning framework for conversational pathology named CLOVER. CLOVER only trains a lightweight module and uses instruction tuning while freezing the parameters of the large language model. Instead of using costly GPT-4, we propose well-designed prompts on GPT-3.5 for building generation-based instructions, emphasizing the utility of pathological knowledge derived from Internet sources. To augment the use of instructions, we construct a high-quality set of template-based instructions in the context of digital pathology. From two benchmark datasets, our findings reveal the strength of hybrid-form instructions in visual question answering in pathology. Extensive results show the cost-effectiveness of CLOVER in answering both open-ended and closed-ended questions, where CLOVER outperforms strong baselines that possess 37 times more training parameters and use instruction data generated from GPT-4. Through instruction tuning, CLOVER exhibits robustness of few-shot learning in the external clinical dataset. These findings demonstrate that cost-effective modeling of CLOVER could accelerate the adoption of rapid conversational applications in the landscape of digital pathology.
Updated: 2025-06-20 07:30:05
标题: 成本效益的病理视觉和语言分析指导学习
摘要: 视觉-语言模型的出现促进了人工智能模型和人类之间的互动对话。然而,将这些模型应用于临床必须解决围绕大规模训练数据、财务和计算资源的艰巨挑战。在这里,我们提出了一种名为CLOVER的对话病理学的成本效益指导学习框架。CLOVER只训练一个轻量级模块,并在冻结大型语言模型的参数的同时使用指导调整。我们提出在GPT-3.5上设计良好的提示,而不是使用昂贵的GPT-4,用于构建基于生成的指导,强调从互联网来源获得的病理知识的实用性。为了增强指导的使用,我们在数字病理学的背景下构建了一组高质量的基于模板的指导。从两个基准数据集中,我们的发现显示了混合形式指导在病理学中的视觉问题回答中的优势。广泛的结果显示了CLOVER在回答开放式和封闭式问题方面的成本效益,其中CLOVER胜过拥有37倍更多训练参数并使用由GPT-4生成的指导数据的强基线。通过指导调整,CLOVER展示了在外部临床数据集中的少样本学习的稳健性。这些发现表明,CLOVER的成本效益建模可以加速数字病理学领域中快速对话应用的采用。
更新时间: 2025-06-20 07:30:05
领域: cs.AI,cs.CL,cs.CV
Robust Dynamic Material Handling via Adaptive Constrained Evolutionary Reinforcement Learning
Dynamic material handling (DMH) involves the assignment of dynamically arriving material transporting tasks to suitable vehicles in real time for minimising makespan and tardiness. In real-world scenarios, historical task records are usually available, which enables the training of a decision policy on multiple instances consisting of historical records. Recently, reinforcement learning has been applied to solve DMH. Due to the occurrence of dynamic events such as new tasks, adaptability is highly required. Solving DMH is challenging since constraints including task delay should be satisfied. Feedback is received only when all tasks are served, which leads to sparse rewards. Besides, making the best use of limited computational resources and historical records for training a robust policy is crucial. The time allocated to different problem instances would highly impact the learning process. To tackle those challenges, this paper proposes a novel adaptive constrained evolutionary reinforcement learning (ACERL) approach, which maintains a population of actors for diverse exploration. ACERL assesses each actor to tackle sparse rewards and constraint violations, restricting the behaviour of the policy. Moreover, ACERL adaptively selects the most beneficial training instances for improving the policy. Extensive experiments on eight training and eight unseen test instances demonstrate the outstanding performance of ACERL compared with several state-of-the-art algorithms. Policies trained by ACERL can schedule the vehicles while fully satisfying the constraints. Additional experiments on 40 unseen noised instances show the robust performance of ACERL. Cross-validation further demonstrates the overall effectiveness of ACERL. Besides, a rigorous ablation study highlights the coordination and benefits of each ingredient of ACERL.
Updated: 2025-06-20 07:20:22
标题: 通过自适应受限进化强化学习实现强健的动态物料处理
摘要: 动态物料处理(DMH)涉及将动态到达的物料运输任务分配给适当的车辆,以实时最小化完工时间和延迟。在现实世界的场景中,通常会有历史任务记录可用,这使得可以在包含历史记录的多个实例上训练决策策略。最近,强化学习已被应用于解决DMH。由于存在新任务等动态事件,因此高度需要适应性。解决DMH具有挑战性,因为必须满足包括任务延迟在内的约束。只有当所有任务都得到服务时才会收到反馈,这导致奖励稀疏。此外,充分利用有限的计算资源和历史记录来训练强大的策略至关重要。分配给不同问题实例的时间会极大影响学习过程。为了解决这些挑战,本文提出了一种新颖的自适应约束进化强化学习(ACERL)方法,该方法维护一个演员群体以进行多样性探索。ACERL评估每个演员来解决稀疏奖励和约束违规,以限制策略的行为。此外,ACERL自适应地选择最有益的训练实例来改进策略。对八个训练和八个未见测试实例进行的大量实验表明,与几种最先进的算法相比,ACERL表现出色。ACERL训练的策略可以安排车辆,同时完全满足约束。对40个未知噪声实例的额外实验显示了ACERL的稳健性能。交叉验证进一步展示了ACERL的整体有效性。此外,严格的消融研究突显了ACERL的每个成分的协调和益处。
更新时间: 2025-06-20 07:20:22
领域: cs.NE,cs.AI
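The adaptive instance-selection ingredient can be viewed as a bandit problem: each training instance is an arm whose payoff is the policy improvement it yields. A UCB1-style sketch (ACERL's actual selection rule may differ; this conveys the explore/exploit trade-off over a limited training budget):

```python
import math

def select_instance(stats, t, c=1.4):
    # stats: {instance: (pulls, mean_improvement)}; t = total selections
    # so far. Unseen instances are tried first (infinite bonus).
    best, best_score = None, -float("inf")
    for inst, (n, mean) in stats.items():
        score = float("inf") if n == 0 else mean + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = inst, score
    return best

def update_stats(stats, inst, improvement):
    # Incremental running mean of the observed policy improvement.
    n, mean = stats[inst]
    stats[inst] = (n + 1, mean + (improvement - mean) / (n + 1))
```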
MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning
Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks--methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and limited query budget. To address the issues above, we propose an effective method for jailbreaking black-box large language Models via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search, and its advanced version--order-determining optimization. Extensive experiments across two open-source models and four closed-source models demonstrate that MIST achieves competitive attack success rates and attack transferability compared with other state-of-the-art white-box and black-box jailbreak methods. Additionally, we conduct experiments on computational efficiency to validate the practical viability of MIST.
Updated: 2025-06-20 07:16:47
Subjects: cs.CL,cs.AI
Planning of Heuristics: Strategic Planning on Large Language Models with Monte Carlo Tree Search for Automating Heuristic Optimization
Heuristics have achieved great success in solving combinatorial optimization problems (COPs). However, heuristics designed by humans require substantial domain knowledge and testing time. Since Large Language Models (LLMs) possess strong capabilities to understand and generate content, backed by a knowledge base covering various domains, they offer a potential route to automatically optimizing heuristics. To this end, we propose Planning of Heuristics (PoH), an optimization method that integrates LLM self-reflection with Monte Carlo Tree Search, a well-known planning algorithm. PoH iteratively refines generated heuristics by evaluating their performance and providing improvement suggestions. Our method iteratively evaluates the generated heuristics (states) and improves them based on improvement suggestions (actions) and evaluation results (rewards), effectively simulating future states to search for paths with higher rewards. In this paper, we apply PoH to the Traveling Salesman Problem and the Flow Shop Scheduling Problem. The experimental results show that PoH outperforms hand-crafted heuristics and other LLM-based Automatic Heuristic Design methods, achieving state-of-the-art performance in automating heuristic optimization with LLMs on the tested COPs, especially at large problem sizes.
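The tree-search side of such a method rests on the standard UCT rule, which trades off a refined heuristic's mean reward against how rarely it has been visited. A minimal sketch (not the paper's implementation; `select` over visit/value statistics is our own illustrative framing):

```python
import math

def uct(parent_visits, child_visits, child_value, c=1.4):
    """Upper Confidence bound applied to Trees: mean reward of a candidate
    heuristic (exploitation) plus an exploration bonus for rarely-tried ones.
    Unvisited children get infinite priority so each is tried once."""
    if child_visits == 0:
        return float("inf")
    return child_value / child_visits + c * math.sqrt(
        math.log(parent_visits) / child_visits)

def select(children):
    """children: list of dicts with 'visits' and 'value'; return the index
    of the child MCTS should descend into next."""
    total = sum(ch["visits"] for ch in children) or 1
    return max(range(len(children)),
               key=lambda i: uct(total, children[i]["visits"],
                                 children[i]["value"]))
```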
Updated: 2025-06-20 07:14:59
Subjects: cs.AI
Exploring and Improving Initialization for Deep Graph Neural Networks: A Signal Propagation Perspective
Graph Neural Networks (GNNs) often suffer from performance degradation as the network depth increases. This paper addresses this issue by introducing initialization methods that enhance signal propagation (SP) within GNNs. We propose three key metrics for effective SP in GNNs: forward propagation, backward propagation, and graph embedding variation (GEV). While the first two metrics derive from classical SP theory, the third is specifically designed for GNNs. We theoretically demonstrate that a broad range of commonly used initialization methods for GNNs, which exhibit performance degradation with increasing depth, fail to control these three metrics simultaneously. To deal with this limitation, a direct exploitation of the SP analysis--searching for weight initialization variances that optimize the three metrics--is shown to significantly enhance the SP in deep GCNs. This approach is called Signal Propagation on Graph-guided Initialization (SPoGInit). Our experiments demonstrate that SPoGInit outperforms commonly used initialization methods on various tasks and architectures. Notably, SPoGInit enables performance improvements as GNNs deepen, which represents a significant advancement in addressing depth-related challenges and highlights the validity and effectiveness of the SP analysis framework.
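The "search for weight-initialization variances" idea can be illustrated with the forward-propagation metric alone: push a random signal through a deep stack and pick the init scale whose output variance drifts least. Everything below (plain dense layers, no graph aggregation, hypothetical function names) is a simplified stand-in for the paper's GNN setting.

```python
import numpy as np

def forward_variance_drift(depth, width, scale, rng):
    """Propagate a random signal through `depth` random linear layers with
    weight std = scale/sqrt(width); return |log| of the output variance.
    A good init keeps this drift near zero (variance neither vanishes
    nor explodes with depth)."""
    x = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * (scale / np.sqrt(width))
        x = W @ x
    return abs(np.log(x.var() + 1e-12))

def pick_scale(candidates, depth=20, width=64, seed=0):
    """Grid-search the init scale that best preserves forward signal
    propagation; the same seed gives every candidate identical randomness."""
    return min(candidates,
               key=lambda s: forward_variance_drift(
                   depth, width, s, np.random.default_rng(seed)))
```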
Updated: 2025-06-20 07:14:31
Subjects: cs.LG
Revisiting LoRA through the Lens of Parameter Redundancy: Spectral Encoding Helps
Low-Rank Adaptation (LoRA) has emerged as a prominent technique for fine-tuning large foundation models. Despite its successes, the substantial parameter redundancy, which limits the capacity and efficiency of LoRA, has been recognized as a bottleneck. In this work, we systematically investigate the impact of redundancy in fine-tuning LoRA and reveal that reducing density redundancy does not degrade expressiveness. Based on this insight, we introduce Spectral-encoding Low-Rank Adaptation (SeLoRA), which harnesses the robust expressiveness of spectral bases to re-parameterize LoRA from a sparse spectral subspace. Designed with simplicity, SeLoRA enables seamless integration with various LoRA variants for performance boosting, serving as a scalable plug-and-play framework. Extensive experiments substantiate that SeLoRA achieves greater efficiency with fewer parameters, delivering superior performance enhancements over strong baselines on various downstream tasks, including commonsense reasoning, math reasoning, and code generation.
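Re-parameterizing a low-rank factor in a spectral subspace can be sketched with a DCT basis, one common choice of spectral basis; the function names and the use of DCT-II here are illustrative assumptions, not SeLoRA's exact construction.

```python
import numpy as np

def dct_basis(n, k):
    """First k orthonormal DCT-II basis vectors as columns of an n-by-k matrix."""
    B = np.zeros((n, k))
    for j in range(k):
        B[:, j] = np.cos(np.pi * (np.arange(n) + 0.5) * j / n)
    B[:, 0] /= np.sqrt(2)            # DC component normalization
    return B * np.sqrt(2.0 / n)

def selora_delta(n_out, n_in, rank, k, coeffs_a, coeffs_b):
    """Low-rank update dW = A @ B where each factor lives in a k-dimensional
    spectral subspace: A = basis_out @ coeffs_a, B^T = basis_in @ coeffs_b."""
    A = dct_basis(n_out, k) @ coeffs_a        # (n_out, rank)
    B = (dct_basis(n_in, k) @ coeffs_b).T     # (rank, n_in)
    return A @ B
```

Because each factor is a combination of k fixed basis vectors, only the k-by-rank coefficient matrices are trained, which can be far fewer parameters than the dense LoRA factors.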
Updated: 2025-06-20 07:09:05
Subjects: cs.LG
CodeV-R1: Reasoning-Enhanced Verilog Generation
Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while matching or even exceeding the performance of 671B DeepSeek-R1. We will release our model, training pipeline, and dataset to facilitate research in EDA and LLM communities.
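The equivalence-checking step of such a testbench can be approximated by randomized co-simulation of the candidate design against the golden reference; the sketch below uses plain Python functions as stand-ins for Verilog modules and is illustrative, not the paper's actual rule-based generator.

```python
import random

def equiv_check(dut, golden, n_inputs, trials=1000, seed=0):
    """Randomized equivalence check: drive both the candidate module (`dut`)
    and the golden reference with identical input vectors and compare outputs.
    A mismatch yields a counterexample; passing all trials is evidence (not
    proof) of equivalence, as in simulation-based testbenches."""
    rng = random.Random(seed)
    for _ in range(trials):
        bits = tuple(rng.randint(0, 1) for _ in range(n_inputs))
        if dut(*bits) != golden(*bits):
            return bits                  # counterexample found
    return None                          # no mismatch observed
```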
Updated: 2025-06-20 07:05:18
Subjects: cs.LG,cs.AR,cs.PL
Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry
Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents to sentences) which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems to serve compound AI queries do not fully take advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework, matching or improving latency by 10-30%.
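Nested ancestry can be pictured as a tuple path attached to every fragment: splitting a stream extends the path, and aggregation groups fragments by a path prefix to undo the corresponding split. A toy sketch with hypothetical field names and strings in place of real stream data:

```python
def split(fragment, parts):
    """Split a fragment into sub-fragments, extending each child's ancestry
    path so later stages know which parent to re-aggregate under."""
    return [{"ancestry": fragment["ancestry"] + (i,), "data": p}
            for i, p in enumerate(parts)]

def aggregate(fragments, depth):
    """Group fragments whose ancestry agrees up to `depth` and merge them
    in ancestry order, undoing a split performed at that level."""
    groups = {}
    for f in sorted(fragments, key=lambda f: f["ancestry"]):
        groups.setdefault(f["ancestry"][:depth], []).append(f["data"])
    return [{"ancestry": k, "data": "".join(v)}
            for k, v in sorted(groups.items())]
```

Because the ancestry tuples totally order fragments, partial outputs can stream out of order between stages and still be reassembled correctly.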
Updated: 2025-06-20 07:03:22
Subjects: cs.AI,cs.CL,cs.DC,cs.IR
Automatic Large Language Models Creation of Interactive Learning Lessons
We explore the automatic generation of interactive, scenario-based lessons designed to train novice human tutors who teach middle school mathematics online. Employing prompt engineering through a Retrieval-Augmented Generation approach with GPT-4o, we developed a system capable of creating structured tutor training lessons. Our study generated lessons in English for three key topics: Encouraging Students' Independence, Encouraging Help-Seeking Behavior, and Turning on Cameras, using a task decomposition prompting strategy that breaks lesson generation into sub-tasks. The generated lessons were evaluated by two human evaluators, who provided both quantitative and qualitative evaluations using a comprehensive rubric informed by lesson design research. Results demonstrate that the task decomposition strategy led to higher-rated lessons compared to single-step generation. Human evaluators identified several strengths in the LLM-generated lessons, including well-structured content and time-saving potential, while also noting limitations such as generic feedback and a lack of clarity in some instructional sections. These findings underscore the potential of hybrid human-AI approaches for generating effective lessons in tutor training.
Updated: 2025-06-20 06:58:50
Subjects: cs.CY,cs.AI,cs.HC
What Is the Point of Equality in Machine Learning Fairness? Beyond Equality of Opportunity
Fairness in machine learning (ML) has become a rapidly growing area of research. But why, in the first place, is unfairness in ML morally wrong? And why should we care about improving fairness? Most fair-ML research implicitly appeals to distributive equality: the idea that desirable goods and benefits, such as opportunities (e.g., Barocas et al., 2023), should be equally distributed across society. Unfair ML models, then, are seen as wrong because they unequally distribute such benefits. This paper argues that this exclusive focus on distributive equality offers an incomplete and potentially misleading ethical foundation. Grounding ML fairness in egalitarianism -- the view that equality is a fundamental moral and social ideal -- requires challenging structural inequality: systematic, institutional, and durable arrangements that privilege some groups while disadvantaging others. Structural inequality manifests through ML systems in two primary forms: allocative harms (e.g., economic loss) and representational harms (e.g., stereotypes, erasure). While distributive equality helps address allocative harms, it fails to explain why representational harms are wrong -- why it is wrong for ML systems to reinforce social hierarchies that stratify people into superior and inferior groups -- and why ML systems should aim to foster a society where people relate as equals (i.e., relational equality). To address these limitations, the paper proposes a multifaceted egalitarian framework for ML fairness that integrates both distributive and relational equality. Drawing on critical social and political philosophy, this framework offers a more comprehensive ethical foundation for tackling the full spectrum of harms perpetuated by ML systems. The paper also outlines practical pathways for implementing the framework across the ML pipeline.
Updated: 2025-06-20 06:57:53
Subjects: cs.LG,cs.AI,cs.CY
PQCAD-DM: Progressive Quantization and Calibration-Assisted Distillation for Extremely Efficient Diffusion Model
Diffusion models excel in image generation but are computationally expensive and resource-intensive due to their reliance on iterative Markov chain processes, which leads to error accumulation and limits the effectiveness of naive compression techniques. In this paper, we propose PQCAD-DM, a novel hybrid compression framework combining Progressive Quantization (PQ) and Calibration-Assisted Distillation (CAD) to address these challenges. PQ employs a two-stage quantization with adaptive bit-width transitions guided by a momentum-based mechanism, reducing excessive weight perturbations in low-precision. CAD leverages full-precision calibration datasets during distillation, enabling the student to match full-precision performance even with a quantized teacher. As a result, PQCAD-DM achieves a balance between computational efficiency and generative quality, halving inference time while maintaining competitive performance. Extensive experiments validate PQCAD-DM's superior generative capabilities and efficiency across diverse datasets, outperforming fixed-bit quantization methods.
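One way to picture progressive quantization is below: weights pass through decreasing bit-widths, and an EMA-style momentum term smooths each transition so low-precision targets do not perturb the weights abruptly. This is a loose illustration of the general idea under our own simplifying assumptions, not PQCAD-DM's actual mechanism.

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of a nonzero weight tensor to `bits`."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

def progressive_quantize(w, schedule=(8, 4), momentum=0.9):
    """Move through decreasing bit-widths; at each stage blend the current
    weights toward their quantized target instead of snapping to it."""
    q = w.copy()
    for bits in schedule:
        target = quantize(q, bits)
        q = momentum * q + (1 - momentum) * target   # gentle transition step
    return q
```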
Updated: 2025-06-20 06:43:27
Subjects: cs.CV,cs.AI
SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation
Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English ↔ Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English ↔ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.
Updated: 2025-06-20 06:38:44
Subjects: cs.CL,cs.AI,cs.LG
Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies
Recent years have witnessed impressive robotic manipulation systems driven by advances in imitation learning and generative modeling, such as diffusion- and flow-based approaches. As robot policy performance increases, so does the complexity and time horizon of achievable tasks, inducing unexpected and diverse failure modes that are difficult to predict a priori. To enable trustworthy policy deployment in safety-critical human environments, reliable runtime failure detection becomes important during policy inference. However, most existing failure detection approaches rely on prior knowledge of failure modes and require failure data during training, which imposes a significant challenge in practicality and scalability. In response to these limitations, we present FAIL-Detect, a modular two-stage approach for failure detection in imitation learning-based robotic manipulation. To accurately identify failures from successful training data alone, we frame the problem as sequential out-of-distribution (OOD) detection. We first distill policy inputs and outputs into scalar signals that correlate with policy failures and capture epistemic uncertainty. FAIL-Detect then employs conformal prediction (CP) as a versatile framework for uncertainty quantification with statistical guarantees. Empirically, we thoroughly investigate both learned and post-hoc scalar signal candidates on diverse robotic manipulation tasks. Our experiments show learned signals to be mostly consistently effective, particularly when using our novel flow-based density estimator. Furthermore, our method detects failures more accurately and faster than state-of-the-art (SOTA) failure detection baselines. These results highlight the potential of FAIL-Detect to enhance the safety and reliability of imitation learning-based robotic systems as they progress toward real-world deployment.
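The conformal-prediction step is standard and worth making concrete: calibrate a threshold on uncertainty scores from successful rollouts only, then flag any rollout whose score exceeds it. The function names here are ours; the quantile rule is the usual split-conformal one with its finite-sample correction.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile over scores from *successful* calibration
    rollouts: with probability >= 1 - alpha, a new successful rollout stays
    at or below this threshold, so exceeding it flags a likely failure."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # finite-sample corrected rank
    return float(np.sort(cal_scores)[min(k, n) - 1])

def detect(scores, tau):
    """Runtime monitor: return the first time step whose uncertainty score
    exceeds the conformal threshold, or None if it never does."""
    for t, s in enumerate(scores):
        if s > tau:
            return t
    return None
```

Note that calibration uses only success data, which is exactly what makes the scheme applicable when no failure examples exist.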
Updated: 2025-06-20 05:51:24
Subjects: cs.RO,cs.AI,cs.LG
Reinforcement learning for hybrid charging stations planning and operation considering fixed and mobile chargers
The success of vehicle electrification, which brings significant societal and environmental benefits, is contingent upon the availability of efficient and adaptable charging infrastructure. Traditional fixed-location charging stations often face issues like underutilization or congestion due to the dynamic nature of charging demand. Mobile chargers have emerged as a flexible solution, capable of relocating to align with these demand fluctuations. This paper addresses the optimal planning and operation of hybrid charging infrastructures, integrating both fixed and mobile chargers within urban road networks. We introduce the Hybrid Charging Station Planning and Operation (HCSPO) problem, which simultaneously optimizes the location and configuration of fixed charging stations and schedules mobile chargers for dynamic operations. Our approach incorporates a charging demand prediction model grounded in Model Predictive Control (MPC) to enhance decision-making. To solve the HCSPO problem, we propose a deep reinforcement learning method, augmented with heuristic scheduling techniques, to effectively bridge the planning of fixed chargers with the real-time operation of mobile chargers. Extensive case studies using real-world urban scenarios demonstrate that our method significantly improves the availability of charging infrastructure and reduces user inconvenience compared to existing solutions and baselines.
Updated: 2025-06-20 05:51:02
Subjects: cs.AI
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce LEGO-Puzzles, a scalable benchmark designed to evaluate both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of 20 state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. Furthermore, based on LEGO-Puzzles, we design generation tasks to investigate whether MLLMs can transfer their spatial understanding and reasoning abilities to image generation. Our experiments show that only GPT-4o and Gemini-2.0-Flash exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.
Updated: 2025-06-20 05:50:00
Subjects: cs.AI
Knowledge Distillation Framework for Accelerating High-Accuracy Neural Network-Based Molecular Dynamics Simulations
Neural network potentials (NNPs) offer a powerful alternative to traditional force fields for molecular dynamics (MD) simulations. Accurate and stable MD simulations, crucial for evaluating material properties, require training data encompassing both low-energy stable structures and high-energy structures. Conventional knowledge distillation (KD) methods fine-tune a pre-trained NNP as a teacher model to generate training data for a student model. However, in material-specific models, this fine-tuning process increases energy barriers, making it difficult to create training data containing high-energy structures. To address this, we propose a novel KD framework that leverages a non-fine-tuned, off-the-shelf pre-trained NNP as a teacher. Its gentler energy landscape facilitates the exploration of a wider range of structures, including the high-energy structures crucial for stable MD simulations. Our framework employs a two-stage training process: first, the student NNP is trained with a dataset generated by the off-the-shelf teacher; then, it is fine-tuned with a smaller, high-accuracy density functional theory (DFT) dataset. We demonstrate the effectiveness of our framework by applying it to both organic (polyethylene glycol) and inorganic (Li10GeP2S12) materials, achieving comparable or superior accuracy in reproducing physical properties compared to existing methods. Importantly, our method reduces the number of expensive DFT calculations by 10x compared to existing NNP generation methods, without sacrificing accuracy. Furthermore, the resulting student NNP achieves up to 106x speedup in inference compared to the teacher NNP, enabling significantly faster and more efficient MD simulations.
Updated: 2025-06-20 05:31:52
Subjects: cs.LG,cond-mat.mtrl-sci,physics.comp-ph
FDLLM: A Dedicated Detector for Black-Box LLMs Fingerprinting
Large Language Models (LLMs) are rapidly transforming the landscape of digital content creation. However, the prevalent black-box Application Programming Interface (API) access to many LLMs introduces significant challenges in accountability, governance, and security. LLM fingerprinting, which aims to identify the source model by analyzing statistical and stylistic features of generated text, offers a potential solution. Current progress in this area is hindered by a lack of dedicated datasets and the need for efficient, practical methods that are robust against adversarial manipulations. To address these challenges, we introduce FD-Dataset, a comprehensive bilingual fingerprinting benchmark comprising 90,000 text samples from 20 famous proprietary and open-source LLMs. Furthermore, we present FDLLM, a novel fingerprinting method that leverages parameter-efficient Low-Rank Adaptation (LoRA) to fine-tune a foundation model. This approach enables LoRA to extract deep, persistent features that characterize each source LLM. Through our analysis, we find that LoRA adaptation promotes the aggregation of outputs from the same LLM in representation space while enhancing the separation between different LLMs. This mechanism explains why LoRA proves particularly effective for LLM fingerprinting. Extensive empirical evaluations on FD-Dataset demonstrate FDLLM's superiority, achieving a Macro F1 score 22.1% higher than the strongest baseline. FDLLM also exhibits strong generalization to newly released models, achieving an average accuracy of 95% on unseen models. Notably, FDLLM remains consistently robust under various adversarial attacks, including polishing, translation, and synonym substitution. Experimental results show that FDLLM reduces the average attack success rate from 49.2% (LM-D) to 23.9%.
Updated: 2025-06-20 05:23:34
Subjects: cs.CR,cs.AI
Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly
Drawing real world social inferences usually requires taking into account information from multiple modalities. Language is a particularly powerful source of information in social settings, especially in novel situations where language can provide both abstract information about the environment dynamics and concrete specifics about an agent that cannot be easily visually observed. In this paper, we propose Language-Informed Rational Agent Synthesis (LIRAS), a framework for drawing context-specific social inferences that integrate linguistic and visual inputs. LIRAS frames multimodal social reasoning as a process of constructing structured but situation-specific agent and environment representations - leveraging multimodal language models to parse language and visual inputs into unified symbolic representations, over which a Bayesian inverse planning engine can be run to produce granular probabilistic judgments. On a range of existing and new social reasoning tasks derived from cognitive science experiments, we find that our model (instantiated with a comparatively lightweight VLM) outperforms ablations and state-of-the-art models in capturing human judgments across all domains.
Updated: 2025-06-20 05:21:42
Domains: cs.CL,cs.AI
Metapath-based Hyperbolic Contrastive Learning for Heterogeneous Graph Embedding
The hyperbolic space, characterized by a constant negative curvature and exponentially expanding space, aligns well with the structural properties of heterogeneous graphs. However, although heterogeneous graphs inherently possess diverse power-law structures, most hyperbolic heterogeneous graph embedding models rely on a single hyperbolic space. This approach may fail to effectively capture the diverse power-law structures within heterogeneous graphs. To address this limitation, we propose a Metapath-based Hyperbolic Contrastive Learning framework (MHCL), which uses multiple hyperbolic spaces to capture diverse complex structures within heterogeneous graphs. Specifically, by learning each hyperbolic space to describe the distribution of complex structures corresponding to each metapath, it is possible to capture semantic information effectively. Since metapath embeddings represent distinct semantic information, preserving their discriminability is important when aggregating them to obtain node representations. Therefore, we use a contrastive learning approach to optimize MHCL and improve the discriminability of metapath embeddings. In particular, our contrastive learning method minimizes the distance between embeddings of the same metapath and maximizes the distance between those of different metapaths in hyperbolic space, thereby improving the separability of metapath embeddings with distinct semantic information. We conduct comprehensive experiments to evaluate the effectiveness of MHCL. The experimental results demonstrate that MHCL outperforms state-of-the-art baselines in various graph machine learning tasks, effectively capturing the complex structures of heterogeneous graphs.
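A minimal sketch of the hyperbolic machinery the abstract relies on: the geodesic distance in the Poincaré ball model, $d(u,v) = \operatorname{arccosh}\big(1 + 2\|u-v\|^2 / ((1-\|u\|^2)(1-\|v\|^2))\big)$, which grows without bound as points approach the unit boundary. This is standard hyperbolic geometry, not MHCL's full metapath-wise contrastive objective.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq = lambda x: sum(xi * xi for xi in x)
    diff = sq([ui - vi for ui, vi in zip(u, v)])
    denom = (1 - sq(u)) * (1 - sq(v))
    return math.acosh(1 + 2 * diff / denom)

print(poincare_distance((0.0, 0.0), (0.9, 0.0)))  # large: space expands near the rim
print(poincare_distance((0.0, 0.0), (0.1, 0.0)))  # nearly Euclidean near the origin
```

MHCL would maintain one such space per metapath and pull same-metapath embeddings together (small distance) while pushing different-metapath embeddings apart (large distance) under this metric.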
Updated: 2025-06-20 05:19:11
Domains: cs.LG,cs.AI,cs.SI
Nature Language Model: Deciphering the Language of Nature for Scientific Discovery
Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, RNA and even cells. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) top performance across different domains, matching or surpassing state-of-the-art specialist models. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.
Updated: 2025-06-20 05:18:13
Domains: cs.AI,cs.LG
Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation
Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent's cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.
Updated: 2025-06-20 05:13:10
Domains: cs.LG,cs.AI,cs.RO
DeepSelective: Interpretable Prognosis Prediction via Feature Selection and Compression in EHR Data
The rapid accumulation of Electronic Health Records (EHRs) has transformed healthcare by providing valuable data that enhance clinical predictions and diagnoses. While conventional machine learning models have proven effective, they often lack robust representation learning and depend heavily on expert-crafted features. Although deep learning offers powerful solutions, it is often criticized for its lack of interpretability. To address these challenges, we propose DeepSelective, a novel end to end deep learning framework for predicting patient prognosis using EHR data, with a strong emphasis on enhancing model interpretability. DeepSelective combines data compression techniques with an innovative feature selection approach, integrating custom-designed modules that work together to improve both accuracy and interpretability. Our experiments demonstrate that DeepSelective not only enhances predictive accuracy but also significantly improves interpretability, making it a valuable tool for clinical decision-making. The source code is freely available at http://www.healthinformaticslab.org/supp/resources.php .
Updated: 2025-06-20 05:03:41
Domains: cs.LG,cs.AI
Conformal Inference under High-Dimensional Covariate Shifts via Likelihood-Ratio Regularization
We consider the problem of conformal prediction under covariate shift. Given labeled data from a source domain and unlabeled data from a covariate shifted target domain, we seek to construct prediction sets with valid marginal coverage in the target domain. Most existing methods require estimating the unknown likelihood ratio function, which can be prohibitive for high-dimensional data such as images. To address this challenge, we introduce the likelihood ratio regularized quantile regression (LR-QR) algorithm, which combines the pinball loss with a novel choice of regularization in order to construct a threshold function without directly estimating the unknown likelihood ratio. We show that the LR-QR method has coverage at the desired level in the target domain, up to a small error term that we can control. Our proofs draw on a novel analysis of coverage via stability bounds from learning theory. Our experiments demonstrate that the LR-QR algorithm outperforms existing methods on high-dimensional prediction tasks, including a regression task for the Communities and Crime dataset, an image classification task from the WILDS repository, and an LLM question-answering task on the MMLU benchmark.
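A hedged sketch of the quantile-regression building block behind LR-QR: the pinball loss, whose minimizer over a candidate set approximates the target quantile of conformity scores, the kind of threshold a conformal method would use to form prediction sets. The score values and search grid are toy assumptions; the paper's likelihood-ratio-regularized estimator is considerably more involved.

```python
def pinball_loss(y, q, alpha):
    """Pinball (quantile) loss for target quantile level alpha."""
    diff = y - q
    return max(alpha * diff, (alpha - 1) * diff)

def fit_quantile(samples, alpha, grid):
    """Grid value minimizing average pinball loss: an empirical alpha-quantile."""
    return min(grid, key=lambda q: sum(pinball_loss(y, q, alpha) for y in samples))

# Toy conformity scores; the fitted 0.9-quantile is a conformal-style threshold.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
grid = [i / 100 for i in range(101)]
q90 = fit_quantile(scores, 0.9, grid)
print(q90)
```

Under covariate shift, LR-QR additionally reweights this objective (via a regularized surrogate for the likelihood ratio) so that the threshold remains valid in the target domain.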
Updated: 2025-06-20 04:42:54
Domains: stat.ML,cs.AI,cs.LG
Synthesizing Composite Hierarchical Structure from Symbolic Music Corpora
Western music is an innately hierarchical system of interacting levels of structure, from fine-grained melody to high-level form. In order to analyze music compositions holistically and at multiple granularities, we propose a unified, hierarchical meta-representation of musical structure called the structural temporal graph (STG). For a single piece, the STG is a data structure that defines a hierarchy of progressively finer structural musical features and the temporal relationships between them. We use the STG to enable a novel approach for deriving a representative structural summary of a music corpus, which we formalize as a nested NP-hard combinatorial optimization problem extending the Generalized Median Graph problem. Our approach first applies simulated annealing to develop a measure of structural distance between two music pieces rooted in graph isomorphism. Our approach then combines the formal guarantees of SMT solvers with nested simulated annealing over structural distances to produce a structurally sound, representative centroid STG for an entire corpus of STGs from individual pieces. To evaluate our approach, we conduct experiments verifying that structural distance accurately differentiates between music pieces, and that derived centroids accurately structurally characterize their corpora.
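The abstract's use of simulated annealing can be sketched generically: propose a neighboring solution and accept worse moves with probability exp(-Δ/T) under a decaying temperature. The toy integer objective below merely stands in for the paper's structural-distance computation over graphs.

```python
import math, random

def simulated_annealing(cost, neighbor, x0, t0=1.0, cooling=0.995, steps=2000, seed=0):
    """Generic simulated annealing: accept worse moves with prob exp(-delta/T)."""
    rng = random.Random(seed)
    x, t = x0, t0
    best, best_cost = x0, cost(x0)
    for _ in range(steps):
        y = neighbor(x, rng)
        delta = cost(y) - cost(x)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = y
            if cost(x) < best_cost:
                best, best_cost = x, cost(x)
        t *= cooling  # geometric cooling schedule
    return best, best_cost

# Toy objective standing in for a structural distance to minimize.
cost = lambda x: abs(x - 42)
neighbor = lambda x, rng: x + rng.choice([-3, -2, -1, 1, 2, 3])
best, best_cost = simulated_annealing(cost, neighbor, x0=0)
print(best, best_cost)
```

In the paper this inner loop searches over graph matchings (to score structural distance between two STGs) and, in a nested outer loop combined with SMT constraints, over candidate centroid STGs.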
Updated: 2025-06-20 04:37:39
Domains: cs.AI,cs.LO,cs.SD,G.1.6; I.2.4; J.5; G.2.2
IsoNet: Causal Analysis of Multimodal Transformers for Neuromuscular Gesture Classification
Hand gestures are a primary output of the human motor system, yet the decoding of their neuromuscular signatures remains a bottleneck for basic neuroscience and assistive technologies such as prosthetics. Traditional human-machine interface pipelines rely on a single biosignal modality, but multimodal fusion can exploit complementary information from sensors. We systematically compare linear and attention-based fusion strategies across three architectures: a Multimodal MLP, a Multimodal Transformer, and a Hierarchical Transformer, evaluating performance on scenarios with unimodal and multimodal inputs. Experiments use two publicly available datasets: NinaPro DB2 (sEMG and accelerometer) and HD-sEMG 65-Gesture (high-density sEMG and force). Across both datasets, the Hierarchical Transformer with attention-based fusion consistently achieved the highest accuracy, surpassing the multimodal and best single-modality linear-fusion MLP baseline by over 10% on NinaPro DB2 and 3.7% on HD-sEMG. To investigate how modalities interact, we introduce an Isolation Network that selectively silences unimodal or cross-modal attention pathways, quantifying each group of token interactions' contribution to downstream decisions. Ablations reveal that cross-modal interactions contribute approximately 30% of the decision signal across transformer layers, highlighting the importance of attention-driven fusion in harnessing complementary modality information. Together, these findings reveal when and how multimodal fusion would enhance biosignal classification and also provides mechanistic insights of human muscle activities. The study would be beneficial in the design of sensor arrays for neurorobotic systems.
Updated: 2025-06-20 04:31:32
Domains: cs.LG,cs.RO,eess.SP
Group-Level Data Selection for Efficient Pretraining
In this paper, we introduce Group-MATES, an efficient group-level data selection approach to optimize the speed-quality frontier of language model pretraining. Specifically, Group-MATES parameterizes costly group-level selection with a relational data influence model. To train this model, we sample training trajectories of the language model and collect oracle data influences alongside. The relational data influence model approximates the oracle data influence by weighting individual influence with relationships among training data. To enable efficient selection with our relational data influence model, we partition the dataset into small clusters using relationship weights and select data within each cluster independently. Experiments on DCLM 400M-4x, 1B-1x, and 3B-1x show that Group-MATES achieves 3.5%-9.4% relative performance gains over random selection across 22 downstream tasks, nearly doubling the improvements achieved by state-of-the-art individual data selection baselines. Furthermore, Group-MATES reduces the number of tokens required to reach a certain downstream performance by up to 1.75x, substantially elevating the speed-quality frontier. Further analyses highlight the critical role of relationship weights in the relational data influence model and the effectiveness of our cluster-based inference. Our code is open-sourced at https://github.com/facebookresearch/Group-MATES.
Updated: 2025-06-20 04:30:04
Domains: cs.CL,cs.LG
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of the few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold fewer synthesis steps than the conventional FM- and score-based approaches, respectively.
Updated: 2025-06-20 04:19:29
Domains: eess.AS,cs.AI
Client-Centered Federated Learning for Heterogeneous EHRs: Use Fewer Participants to Achieve the Same Performance
The increasing volume of electronic health records (EHRs) presents the opportunity to improve the accuracy and robustness of models in clinical prediction tasks. Unlike traditional centralized approaches, federated learning enables training on data from multiple institutions while preserving patient privacy and complying with regulatory constraints. In practice, healthcare institutions (i.e., hosts) often need to build predictive models tailored to their specific needs using federated learning. In this scenario, two key challenges arise: (1) ensuring compatibility across heterogeneous EHR systems, and (2) managing federated learning costs within budget constraints. To address these challenges, we propose EHRFL, a federated learning framework designed for building a cost-effective, host-specific predictive model using patient EHR data. EHRFL consists of two components: (1) text-based EHR modeling, which facilitates cross-institution compatibility without costly data standardization, and (2) a participant selection strategy based on averaged patient embedding similarity to reduce the number of participants without degrading performance. Experiments on multiple open-source EHR datasets demonstrate the effectiveness of both components. We believe our framework offers a practical solution for enabling healthcare institutions to build institution-specific predictive models under budgetary constraints.
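EHRFL's participant-selection idea, keeping only the candidate institutions whose averaged patient embeddings are most similar to the host's, can be sketched as follows. The two-dimensional embeddings and site names are hypothetical; the paper derives embeddings from text-based EHR modeling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def average_embedding(patients):
    """Institution-level embedding: the mean of its patient embeddings."""
    dim = len(patients[0])
    return [sum(p[i] for p in patients) / len(patients) for i in range(dim)]

def select_participants(host_patients, candidates, k):
    """Keep the k candidate institutions closest to the host in embedding space."""
    host = average_embedding(host_patients)
    ranked = sorted(candidates,
                    key=lambda name: -cosine(host, average_embedding(candidates[name])))
    return ranked[:k]

host = [[1.0, 0.1], [0.9, 0.2]]
candidates = {
    "site_a": [[1.0, 0.0], [0.8, 0.1]],   # case mix similar to the host
    "site_b": [[0.0, 1.0], [0.1, 0.9]],   # very different case mix
}
print(select_participants(host, candidates, k=1))
```

The budget constraint enters through `k`: fewer, better-matched participants keep federated training cost down without the performance loss that random participant reduction would cause.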
Updated: 2025-06-20 04:18:50
Domains: cs.LG
LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization
With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25Hz, 12.5Hz, and 6.25Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performances on speech-to-text and consistently outperform baselines on text-to-speech tasks.
Updated: 2025-06-20 04:15:14
Domains: cs.CL,cs.AI,cs.SD,eess.AS
Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter -- the number of updates per batch -- that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves competitive performance to DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation at https://github.com/AndreHe02/rewarding-unlikely-release
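The unlikeliness reward can be caricatured as a reshaping that pays more for correct solutions the model itself considered improbable, directly counteracting the rank bias that reinforces only already-likely trajectories. The functional form below, $1 + \beta(1 - p)$, is an illustrative assumption, not the paper's exact formula.

```python
def unlikeliness_rewards(samples, beta=1.0):
    """Reward correct samples, up-weighted when the model assigned them low probability.

    samples: list of (prob_under_model, is_correct) pairs. Illustrative
    reshaping only; the paper's weighting may differ.
    """
    return [(1.0 + beta * (1.0 - p)) if ok else 0.0 for p, ok in samples]

# A likely correct solution, a rare correct one, and an incorrect one:
# the rare correct solution earns the largest reward.
samples = [(0.8, True), (0.05, True), (0.5, False)]
print(unlikeliness_rewards(samples))
```

Under plain GRPO both correct samples would be reinforced roughly equally, so the high-probability one dominates the gradient; the up-weighting keeps rare correct proofs alive, which is what improves pass@$N$ at large $N$.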
Updated: 2025-06-20 04:14:47
Domains: cs.LG
Optimism Without Regularization: Constant Regret in Zero-Sum Games
This paper studies the optimistic variant of Fictitious Play for learning in two-player zero-sum games. While it is known that Optimistic FTRL -- a regularized algorithm with a bounded stepsize parameter -- obtains constant regret in this setting, we show for the first time that similar, optimal rates are also achievable without regularization: we prove for two-strategy games that Optimistic Fictitious Play (using any tiebreaking rule) obtains only constant regret, providing surprising new evidence on the ability of non-no-regret algorithms for fast learning in games. Our proof technique leverages a geometric view of Optimistic Fictitious Play in the dual space of payoff vectors, where we show a certain energy function of the iterates remains bounded over time. Additionally, we also prove a regret lower bound of $\Omega(\sqrt{T})$ for Alternating Fictitious Play. In the unregularized regime, this separates the ability of optimism and alternation in achieving $o(\sqrt{T})$ regret.
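A runnable sketch of Optimistic Fictitious Play on Matching Pennies: each player best-responds to the opponent's empirical action counts with the most recent opponent action counted twice (one common form of optimism; the initial "last action" and the tiebreaking rule below are arbitrary conventions, which the paper's result says should not matter). Consistent with the paper's claim, the measured external regret stays constant rather than growing with the horizon.

```python
def best_response(payoffs, opp_counts):
    """Best response to (optimistically adjusted) empirical opponent counts."""
    values = [sum(payoffs[a][b] * c for b, c in enumerate(opp_counts))
              for a in range(len(payoffs))]
    return max(range(len(payoffs)), key=lambda a: (values[a], -a))  # fixed tiebreak

def optimistic_fp(A, T):
    """Optimistic Fictitious Play in a zero-sum game with row-player payoffs A.

    Optimism = count the opponent's last observed action twice.
    Returns the row player's external regret after T rounds.
    """
    m, n = len(A), len(A[0])
    Bt = [[-A[i][j] for i in range(m)] for j in range(n)]  # column player's payoffs
    c1, c2 = [0] * m, [0] * n          # empirical action counts
    last1, last2 = 0, 0                # arbitrary initial "last actions"
    payoff1 = 0.0
    for _ in range(T):
        opt2 = [c + (1 if b == last2 else 0) for b, c in enumerate(c2)]
        opt1 = [c + (1 if a == last1 else 0) for a, c in enumerate(c1)]
        a = best_response(A, opt2)
        b = best_response(Bt, opt1)
        payoff1 += A[a][b]
        c1[a] += 1
        c2[b] += 1
        last1, last2 = a, b
    # Regret: best fixed row action in hindsight minus realized payoff.
    best_fixed = max(sum(A[a][b] * c for b, c in enumerate(c2)) for a in range(m))
    return best_fixed - payoff1

print(optimistic_fp([[1, -1], [-1, 1]], 1000))  # 0.0: constant regret, not O(sqrt(T))
```

Plain (non-optimistic) Fictitious Play on the same game accumulates regret on the order of $\sqrt{T}$, which is exactly the separation the paper formalizes.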
Updated: 2025-06-20 04:10:51
Domains: cs.LG,cs.GT
On Training-Test (Mis)alignment in Unsupervised Combinatorial Optimization: Observation, Empirical Exploration, and Analysis
In unsupervised combinatorial optimization (UCO), during training, one aims to have continuous decisions that are promising in a probabilistic sense for each training instance, which enables end-to-end training on initially discrete and non-differentiable problems. At the test time, for each test instance, starting from continuous decisions, derandomization is typically applied to obtain the final deterministic decisions. Researchers have developed more and more powerful test-time derandomization schemes to enhance the empirical performance and the theoretical guarantee of UCO methods. However, we notice a misalignment between training and testing in the existing UCO methods. Consequently, lower training losses do not necessarily entail better post-derandomization performance, even for the training instances without any data distribution shift. Empirically, we indeed observe such undesirable cases. We explore a preliminary idea to better align training and testing in UCO by including a differentiable version of derandomization into training. Our empirical exploration shows that such an idea indeed improves training-test alignment, but also introduces nontrivial challenges into training.
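The training-test gap the abstract describes hinges on test-time derandomization. A standard instance is the method of conditional expectations, shown here rounding "trained" soft Max-Cut decisions into a deterministic cut whose value is at least the expected cut of the soft solution. Max-Cut and the toy probabilities are illustrative choices, not the paper's benchmark.

```python
def expected_cut(probs, edges):
    """Expected cut size under independent rounding:
    edge (u, v) is cut with probability p_u(1-p_v) + p_v(1-p_u)."""
    return sum(probs[u] * (1 - probs[v]) + probs[v] * (1 - probs[u])
               for u, v in edges)

def derandomize(probs, edges):
    """Method of conditional expectations: fix each node to the side
    that keeps the conditional expected cut highest."""
    x = list(probs)
    for i in range(len(x)):
        x[i] = 1.0
        hi = expected_cut(x, edges)
        x[i] = 0.0
        lo = expected_cut(x, edges)
        x[i] = 1.0 if hi >= lo else 0.0
    return [int(v) for v in x]

# "Trained" soft decisions for a 4-cycle; derandomization yields the optimal cut.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
probs = [0.6, 0.4, 0.7, 0.3]
assignment = derandomize(probs, edges)
cut = expected_cut([float(v) for v in assignment], edges)
print(assignment, cut)  # [1, 0, 1, 0] 4.0
```

Note that training only optimizes `expected_cut(probs, edges)`, while test performance is the cut after `derandomize`; a lower training loss need not improve the latter, which is exactly the misalignment the paper investigates.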
Updated: 2025-06-20 04:05:09
Domains: cs.LG,cs.AI,cs.DM,math.PR
Disentangling and Integrating Relational and Sensory Information in Transformer Architectures
Relational reasoning is a central component of generally intelligent systems, enabling robust and data-efficient inductive generalization. Recent empirical evidence shows that many existing neural architectures, including Transformers, struggle with tasks requiring relational reasoning. In this work, we distinguish between two types of information: sensory information about the properties of individual objects, and relational information about the relationships between objects. While neural attention provides a powerful mechanism for controlling the flow of sensory information between objects, the Transformer lacks an explicit computational mechanism for routing and processing relational information. To address this limitation, we propose an architectural extension of the Transformer framework that we call the Dual Attention Transformer (DAT), featuring two distinct attention mechanisms: sensory attention for directing the flow of sensory information, and a novel relational attention mechanism for directing the flow of relational information. We empirically evaluate DAT on a diverse set of tasks ranging from synthetic relational benchmarks to complex real-world tasks such as language modeling and visual processing. Our results demonstrate that integrating explicit relational computational mechanisms into the Transformer architecture leads to significant performance gains in terms of data efficiency and parameter efficiency.
Updated: 2025-06-20 04:02:46
Domains: cs.LG
Incentivizing High-quality Participation From Federated Learning Agents
Federated learning (FL) provides a promising paradigm for facilitating collaboration between multiple clients that jointly learn a global model without directly sharing their local data. However, existing research suffers from two caveats: 1) From the perspective of agents, voluntary and unselfish participation is often assumed. But self-interested agents may opt out of the system or provide low-quality contributions without proper incentives; 2) From the mechanism designer's perspective, the aggregated models can be unsatisfactory as the existing game-theoretical federated learning approach for data collection ignores the potential heterogeneous effort caused by contributed data. To alleviate above challenges, we propose an incentive-aware framework for agent participation that considers data heterogeneity to accelerate the convergence process. Specifically, we first introduce the notion of Wasserstein distance to explicitly illustrate the heterogeneous effort and reformulate the existing upper bound of convergence. To induce truthful reporting from agents, we analyze and measure the generalization error gap of any two agents by leveraging the peer prediction mechanism to develop score functions. We further present a two-stage Stackelberg game model that formalizes the process and examines the existence of equilibrium. Extensive experiments on real-world datasets demonstrate the effectiveness of our proposed mechanism.
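The Wasserstein distance used here to quantify heterogeneous effort has a particularly simple form in one dimension: for equal-size empirical samples, $W_1$ is the mean absolute difference between sorted samples. The toy per-agent data values below are hypothetical; the paper applies the notion inside its convergence bound rather than as a standalone score.

```python
def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D empirical distributions: match sorted samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

host_data = [0.0, 1.0, 1.0, 2.0]
agent_close = [0.0, 1.0, 1.0, 2.0]   # identical data mix
agent_far = [5.0, 5.0, 6.0, 6.0]     # heavily shifted data
print(wasserstein_1d(host_data, agent_close))  # 0.0
print(wasserstein_1d(host_data, agent_far))
```

A large distance signals an agent whose contribution slows convergence, which is what the incentive mechanism must price in when scoring participation.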
Updated: 2025-06-20 03:58:39
Domain: cs.AI,cs.DC,cs.LG
CDS: Knowledge Component-Driven Data Synthesis Guided by Cognitive Diagnosis Theory
Large Language Models (LLMs) have achieved significant advancements, but the increasing complexity of tasks and higher performance demands highlight the need for continuous improvement. Some approaches utilize synthetic data generated by advanced LLMs based on evaluation results to train models. However, conventional evaluation methods fail to provide detailed, fine-grained profiles of LLMs, limiting their guidance for data synthesis. In this paper, we introduce the Cognitive Diagnostic Synthesis (CDS) method, which incorporates a diagnostic process inspired by Cognitive Diagnosis Theory (CDT) to refine evaluation results and characterize model profiles at the knowledge component level. Based on these diagnostics, we propose two diagnosis-synthesis strategies for weakness-targeted data synthesis. Additionally, we present an enhanced data augmentation and selection pipeline to improve the quality and diversity of synthesized data. Our experiments with several open-source models show significant improvements across multiple benchmarks, achieving up to 6.00% improvement in code generation, 13.10% in mathematical reasoning, and 5.43% in academic exams. Code and data are available on GitHub.
Updated: 2025-06-20 03:44:20
Domain: cs.AI
The Role of Model Confidence on Bias Effects in Measured Uncertainties
With the growing adoption of Large Language Models (LLMs) for open-ended tasks, accurately assessing epistemic uncertainty, which reflects a model's lack of knowledge, has become crucial to ensuring reliable outcomes. However, quantifying epistemic uncertainty in such tasks is challenging due to the presence of aleatoric uncertainty, which arises from multiple valid answers. While bias can introduce noise into epistemic uncertainty estimation, it may also reduce noise from aleatoric uncertainty. To investigate this trade-off, we conduct experiments on Visual Question Answering (VQA) tasks and find that mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. Building on prior work showing that LLMs tend to copy input information when model confidence is low, we further analyze how these prompt biases affect measured epistemic and aleatoric uncertainty across varying bias-free confidence levels with GPT-4o and Qwen2-VL. We find that all considered biases induce greater changes in both uncertainties when bias-free model confidence is lower. Moreover, lower bias-free model confidence leads to greater underestimation of epistemic uncertainty (i.e. overconfidence) due to bias, whereas it has no significant effect on the direction of changes in aleatoric uncertainty estimation. These distinct effects deepen our understanding of bias mitigation for uncertainty quantification and potentially inform the development of more advanced techniques.
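The epistemic/aleatoric split the abstract relies on is commonly estimated via the mutual-information decomposition of predictive entropy: total uncertainty = mean per-sample entropy (aleatoric) + the gap between the entropy of the mean and that mean entropy (epistemic). The sketch below assumes repeated model queries yield per-query answer distributions; the paper's exact estimators are not stated in the abstract.

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def decompose_uncertainty(sampled_dists):
    # sampled_dists: list of answer distributions, one per repeated query.
    k = len(sampled_dists[0])
    n = len(sampled_dists)
    mean = [sum(d[i] for d in sampled_dists) / n for i in range(k)]
    total = entropy(mean)                                  # total predictive uncertainty
    aleatoric = sum(entropy(d) for d in sampled_dists) / n # expected per-query entropy
    epistemic = total - aleatoric                          # mutual information
    return total, aleatoric, epistemic
```

Confident but disagreeing queries yield high epistemic uncertainty; consistent but spread-out answers yield purely aleatoric uncertainty.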
Updated: 2025-06-20 03:43:10
Domain: cs.CL,cs.AI
TriCon-SF: A Triple-Shuffle and Contribution-Aware Serial Federated Learning Framework for Heterogeneous Healthcare Data
Serial pipeline training is an efficient paradigm for handling data heterogeneity in cross-silo federated learning with low communication overhead. However, even without centralized aggregation, direct transfer of models between clients can violate privacy regulations and remain susceptible to gradient leakage and linkage attacks. Additionally, ensuring resilience against semi-honest or malicious clients who may manipulate or misuse received models remains a grand challenge, particularly in privacy-sensitive domains such as healthcare. To address these challenges, we propose TriCon-SF, a novel serial federated learning framework that integrates triple shuffling and contribution awareness. TriCon-SF introduces three levels of randomization by shuffling model layers, data segments, and training sequences to break deterministic learning patterns and disrupt potential attack vectors, thereby enhancing privacy and robustness. In parallel, it leverages Shapley value methods to dynamically evaluate client contributions during training, enabling the detection of dishonest behavior and enhancing system accountability. Extensive experiments on non-IID healthcare datasets demonstrate that TriCon-SF outperforms standard serial and parallel federated learning in both accuracy and communication efficiency. Security analysis further supports its resilience against client-side privacy attacks.
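The contribution-awareness component rests on Shapley values, which TriCon-SF uses to score clients during training. The abstract does not specify the estimator, so the sketch below shows the generic permutation-sampling Monte Carlo estimate over an arbitrary coalition-utility function (e.g., validation accuracy of a model trained on a client subset).

```python
import random

def shapley_values(clients, utility, rounds=200, seed=0):
    # Monte-Carlo Shapley estimate: average each client's marginal
    # contribution over random arrival orders of the clients.
    rng = random.Random(seed)
    phi = {c: 0.0 for c in clients}
    for _ in range(rounds):
        order = clients[:]
        rng.shuffle(order)
        coalition, prev = frozenset(), utility(frozenset())
        for c in order:
            coalition = coalition | {c}
            cur = utility(coalition)
            phi[c] += cur - prev
            prev = cur
    return {c: v / rounds for c, v in phi.items()}
```

A dishonest client whose updates do not improve (or actively hurt) the coalition utility would receive a low or negative score, which is the accountability signal the framework exploits.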
Updated: 2025-06-20 03:40:35
Domain: cs.LG,cs.AI
DRARL: Disengagement-Reason-Augmented Reinforcement Learning for Efficient Improvement of Autonomous Driving Policy
With the increasing presence of automated vehicles on open roads under driver supervision, disengagement cases are becoming more prevalent. While some data-driven planning systems attempt to utilize these disengagement cases directly for policy improvement, the inherent scarcity of disengagement data (often occurring as single instances) restricts training effectiveness. Furthermore, some disengagement data should be excluded, since a disengagement does not always stem from a failure of the driving policy; for example, the driver may casually intervene for a while. To this end, this work proposes disengagement-reason-augmented reinforcement learning (DRARL), which enhances the driving-policy improvement process according to the reason behind each disengagement case. Specifically, the reason for a disengagement is identified by an out-of-distribution (OOD) state estimation model. When no such reason exists, the case is identified as a casual disengagement that requires no additional policy adjustment. Otherwise, the policy is updated in a reason-augmented imagination environment, improving policy performance on disengagement cases with similar reasons. The method is evaluated using real-world disengagement cases collected by an autonomous-driving robotaxi. Experimental results demonstrate that the method accurately identifies policy-related disengagement reasons, allowing the agent to handle both the original and semantically similar cases through reason-augmented training. Furthermore, the approach prevents the agent from becoming overly conservative after policy adjustments. Overall, this work provides an efficient way to improve driving-policy performance using disengagement cases.
Updated: 2025-06-20 03:32:01
Domain: cs.RO,cs.LG
Generalizable Agent Modeling for Agent Collaboration-Competition Adaptation with Multi-Retrieval and Dynamic Generation
Adapting a single agent to a new multi-agent system brings challenges, necessitating adjustments across various tasks, environments, and interactions with unknown teammates and opponents. Addressing this challenge is highly complex, and researchers have proposed two simplified scenarios: multi-agent reinforcement learning for zero-shot learning, and ad-hoc teamwork. Building on these foundations, we propose a more comprehensive setting, Agent Collaborative-Competitive Adaptation (ACCA), which evaluates an agent's ability to generalize across diverse scenarios, tasks, and interactions with both unfamiliar opponents and teammates. In ACCA, agents adjust to task and environmental changes, collaborate with unseen teammates, and compete against unknown opponents. We introduce a new modeling approach, Multi-Retrieval and Dynamic Generation (MRDG), that effectively models both teammates and opponents using their behavioral trajectories. This method incorporates a positional encoder to handle varying team sizes and a hypernetwork module to boost agents' learning and adaptive capabilities. Additionally, a viewpoint alignment module harmonizes the observational perspectives of retrieved teammates and opponents with those of the learning agent. Extensive tests in benchmark scenarios such as SMAC, Overcooked-AI, and Melting Pot show that MRDG significantly improves robust collaboration and competition with unseen teammates and opponents, surpassing established baselines. Our code is available at: https://github.com/vcis-wangchenxu/MRDG.git
Updated: 2025-06-20 03:28:18
Domain: cs.MA,cs.AI
Automated Skill Discovery for Language Agents through Exploration and Iterative Feedback
Training large language model (LLM) agents to acquire necessary skills and perform diverse tasks within an environment is gaining interest as a means to enable open-endedness. However, creating the training dataset for their skill acquisition faces several challenges. Manual trajectory collection requires significant human effort. Another approach, where LLMs directly propose tasks to learn, is often invalid, as the LLMs lack knowledge of which tasks are actually feasible. Moreover, the generated data may not provide a meaningful learning signal, as agents often already perform well on the proposed tasks. To address this, we propose a novel automatic skill discovery framework EXIF for LLM-powered agents, designed to improve the feasibility of generated target behaviors while accounting for the agents' capabilities. Our method adopts an exploration-first strategy by employing an exploration agent (Alice) to train the target agent (Bob) to learn essential skills in the environment. Specifically, Alice first interacts with the environment to retrospectively generate a feasible, environment-grounded skill dataset, which is then used to train Bob. Crucially, we incorporate an iterative feedback loop, where Alice evaluates Bob's performance to identify areas for improvement. This feedback then guides Alice's next round of exploration, forming a closed-loop data generation process. Experiments on Webshop and Crafter demonstrate EXIF's ability to effectively discover meaningful skills and iteratively expand the capabilities of the trained agent without any human intervention, achieving substantial performance improvements. Interestingly, we observe that setting Alice to the same model as Bob also notably improves performance, demonstrating EXIF's potential for building a self-evolving system.
Updated: 2025-06-20 03:16:30
Domain: cs.AI,cs.LG
ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models
Generative Reward Models (GRMs) provide greater flexibility than scalar reward models in capturing human preferences, but their effectiveness is limited by poor reasoning capabilities. This often results in incomplete or overly speculative reasoning paths, leading to hallucinations or missing key information in complex tasks. We address this challenge with ReasonGRM, a three-stage generative reward modeling framework. In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths that reduce the likelihood of critical omissions. In the second stage, we introduce a novel evaluation metric, $R^\star$, which scores reasoning paths based on their generation likelihood. This favors paths that reach correct answers with minimal exploration, helping to reduce hallucination-prone data during training. In the final stage, the model is further refined through reinforcement learning on challenging examples to enhance its preference discrimination capabilities. Experiments on three public benchmarks show that ReasonGRM achieves competitive or state-of-the-art performance, outperforming previous best GRMs by 1.8% on average and surpassing proprietary models such as GPT-4o by up to 5.6%. These results demonstrate the effectiveness of reasoning-aware training and highlight the importance of high-quality rationale selection for reliable preference modeling.
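The abstract does not give $R^\star$'s formula, only its intent: favor paths that reach correct answers with minimal exploration. One plausible instantiation, offered purely as a hypothetical reading, gates a length-normalized generation likelihood on answer correctness:

```python
import math

def r_star(token_logprobs, is_correct):
    # Hypothetical reading of R*: length-normalized generation likelihood,
    # kept only for reasoning paths that reach the correct answer.
    # A concise, confidently generated path scores near 1; a long or
    # hesitant one scores lower; an incorrect one is discarded.
    if not is_correct:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)   # perplexity-style score in (0, 1]
```

Ranking candidate paths by such a score would keep short, high-likelihood correct rationales for training while filtering the speculative ones the abstract calls hallucination-prone.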
Updated: 2025-06-20 03:10:52
Domain: cs.CL,cs.AI
Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness
The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. These multi-agent debate (MAD) approaches, where agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over monolithic models. Despite prior studies leveraging MAD, a systematic understanding of its effectiveness compared to self-agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks. Our study systematically examines the influence of task difficulty, model scale, and agent diversity on MAD's performance. Key findings reveal that, for mathematical reasoning, MAD offers limited advantages over self-agent scaling but becomes more effective with increased problem difficulty and decreased model capability, while agent diversity shows little benefit. Conversely, for safety tasks, MAD's collaborative refinement can increase vulnerability, but incorporating diverse agent configurations facilitates a gradual reduction in attack success through the collaborative refinement process. We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.
Updated: 2025-06-20 03:07:38
Domain: cs.AI,cs.LG
Beyond principlism: Practical strategies for ethical AI use in research practices
The rapid adoption of generative artificial intelligence (AI) in scientific research, particularly large language models (LLMs), has outpaced the development of ethical guidelines, leading to a "Triple-Too" problem: too many high-level ethical initiatives, too abstract principles lacking contextual and practical relevance, and too much focus on restrictions and risks over benefits and utilities. Existing approaches--principlism (reliance on abstract ethical principles), formalism (rigid application of rules), and technological solutionism (overemphasis on technological fixes)--offer little practical guidance for addressing ethical challenges of AI in scientific research practices. To bridge the gap between abstract principles and day-to-day research practices, a user-centered, realism-inspired approach is proposed here. It outlines five specific goals for ethical AI use: 1) understanding model training and output, including bias mitigation strategies; 2) respecting privacy, confidentiality, and copyright; 3) avoiding plagiarism and policy violations; 4) applying AI beneficially compared to alternatives; and 5) using AI transparently and reproducibly. Each goal is accompanied by actionable strategies and realistic cases of misuse and corrective measures. I argue that ethical AI application requires evaluating its utility against existing alternatives rather than isolated performance metrics. Additionally, I propose documentation guidelines to enhance transparency and reproducibility in AI-assisted research. Moving forward, we need targeted professional development, training programs, and balanced enforcement mechanisms to promote responsible AI use while fostering innovation. By refining these ethical guidelines and adapting them to emerging AI capabilities, we can accelerate scientific progress without compromising research integrity.
Updated: 2025-06-20 02:59:45
Domain: cs.CY,cs.AI
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, full fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the specific-domain abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by an aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely-used benchmarks demonstrate that MaPPER achieves the best accuracy compared to the full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters. Our code is available at https://github.com/liuting20/MaPPER.
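The PETL family that MaPPER belongs to typically inserts small bottleneck modules into a frozen backbone. The sketch below shows the generic bottleneck-adapter pattern (frozen feature plus a small tuned residual), not MaPPER's specific Dynamic Prior Adapter or Local Convolution Adapter, whose internals the abstract does not detail.

```python
import numpy as np

def adapter(h, W_down, W_up, alpha=1.0):
    # Generic bottleneck adapter: only W_down/W_up are trained, so the
    # tunable parameter count is a tiny fraction of the frozen backbone's.
    z = np.maximum(h @ W_down, 0.0)   # down-project + ReLU
    return h + alpha * (z @ W_up)     # up-project, residual add
```

With the up-projection initialized to zero, the adapter starts as the identity, preserving the backbone's pre-trained behavior at the beginning of tuning.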
Updated: 2025-06-20 02:58:50
Domain: cs.CV,cs.AI,cs.CL
Info-Coevolution: An Efficient Framework for Data Model Coevolution
Machine learning relies heavily on data, yet the continuous growth of real-world data poses challenges for efficient dataset construction and training. A fundamental yet unsolved question is: given our current model and data, does a new data sample (or batch) need annotation and learning? Conventional approaches retain all available data, leading to suboptimal data and training efficiency. Active learning aims to reduce data redundancy by selecting a subset of samples to annotate, but it increases pipeline complexity and introduces bias. In this work, we propose Info-Coevolution, a novel framework that efficiently enables models and data to coevolve through online selective annotation with no bias. Leveraging task-specific models (and open-source models), it selectively annotates and integrates online and web data to improve datasets efficiently. For real-world datasets like ImageNet-1K, Info-Coevolution reduces annotation and training costs by 32% without performance loss. It determines the saving ratio automatically, with no manual tuning. With semi-supervised learning, it can further reduce the annotation ratio to 50%. We also explore retrieval-based dataset enhancement using unlabeled open-source data. Code is available at https://github.com/NUS-HPC-AI-Lab/Info-Coevolution/.
Updated: 2025-06-20 02:52:55
Domain: cs.LG,cs.AI
How Many Domains Suffice for Domain Generalization? A Tight Characterization via the Domain Shattering Dimension
We study a fundamental question of domain generalization: given a family of domains (i.e., data distributions), how many randomly sampled domains do we need to collect data from in order to learn a model that performs reasonably well on every seen and unseen domain in the family? We model this problem in the PAC framework and introduce a new combinatorial measure, which we call the domain shattering dimension. We show that this dimension characterizes the domain sample complexity. Furthermore, we establish a tight quantitative relationship between the domain shattering dimension and the classic VC dimension, demonstrating that every hypothesis class that is learnable in the standard PAC setting is also learnable in our setting.
Updated: 2025-06-20 02:50:14
Domain: cs.LG,stat.ML
Large Language Models as Psychological Simulators: A Methodological Guide
Large language models (LLMs) offer emerging opportunities for psychological and behavioral research, but methodological guidance is lacking. This article provides a framework for using LLMs as psychological simulators across two primary applications: simulating roles and personas to explore diverse contexts, and serving as computational models to investigate cognitive processes. For simulation, we present methods for developing psychologically grounded personas that move beyond demographic categories, with strategies for validation against human data and use cases ranging from studying inaccessible populations to prototyping research instruments. For cognitive modeling, we synthesize emerging approaches for probing internal representations, methodological advances in causal interventions, and strategies for relating model behavior to human cognition. We address overarching challenges including prompt sensitivity, temporal limitations from training data cutoffs, and ethical considerations that extend beyond traditional human subjects review. Throughout, we emphasize the need for transparency about model capabilities and constraints. Together, this framework integrates emerging empirical evidence about LLM performance--including systematic biases, cultural limitations, and prompt brittleness--to help researchers wrangle these challenges and leverage the unique capabilities of LLMs in psychological research.
Updated: 2025-06-20 02:45:23
Domain: cs.CY,cs.AI,cs.CL,cs.HC
Differentiation-Based Extraction of Proprietary Data from Fine-Tuned LLMs
The increasing demand for domain-specific and human-aligned Large Language Models (LLMs) has led to the widespread adoption of Supervised Fine-Tuning (SFT) techniques. SFT datasets often comprise valuable instruction-response pairs, making them highly valuable targets for potential extraction. This paper studies this critical research problem for the first time. We start by formally defining and formulating the problem, then explore various attack goals, types, and variants based on the unique properties of SFT data in real-world scenarios. Based on our analysis of extraction behaviors of direct extraction, we develop a novel extraction method specifically designed for SFT models, called Differentiated Data Extraction (DDE), which exploits the confidence levels of fine-tuned models and their behavioral differences from pre-trained base models. Through extensive experiments across multiple domains and scenarios, we demonstrate the feasibility of SFT data extraction using DDE. Our results show that DDE consistently outperforms existing extraction baselines in all attack settings. To counter this new attack, we propose a defense mechanism that mitigates DDE attacks with minimal impact on model performance. Overall, our research reveals hidden data leak risks in fine-tuned LLMs and provides insights for developing more secure models.
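The core intuition behind DDE, confidence differences between the fine-tuned model and its pre-trained base, can be caricatured as a contrastive ranking: sequences memorized during SFT receive unusually high fine-tuned likelihood relative to the base model. The sketch below uses plain dicts as toy stand-ins for the two models' log-probability functions; it illustrates the differencing idea only, not the paper's full extraction procedure.

```python
def rank_candidates(candidates, ft_logprob, base_logprob):
    # Rank candidate texts by the fine-tuned-vs-base log-probability gap.
    # A large gap suggests the sequence was seen during fine-tuning rather
    # than being generically probable under the base distribution.
    # Here ft_logprob/base_logprob are dicts mapping text -> log-probability,
    # standing in for real model scoring calls.
    gap = lambda c: ft_logprob[c] - base_logprob[c]
    return sorted(candidates, key=gap, reverse=True)
```

A defense, as the abstract suggests, would aim to shrink exactly this gap on training sequences without degrading task performance.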
Updated: 2025-06-20 02:43:36
Domain: cs.CR,cs.AI
GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation
Graph Retrieval-Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However, current evaluations of GraphRAG models predominantly rely on traditional question-answering datasets. Their limited scope in questions and evaluation metrics fails to comprehensively assess the reasoning-capacity improvements enabled by GraphRAG models. To address this gap, we introduce GraphRAG-Bench, a large-scale, domain-specific benchmark designed to rigorously evaluate GraphRAG models. Our benchmark offers three key strengths: (i) Challenging question design. Featuring college-level, domain-specific questions that demand multi-hop reasoning, the benchmark ensures that simple content retrieval is insufficient for problem-solving; for example, some questions require mathematical reasoning or programming. (ii) Diverse task coverage. The dataset includes a broad spectrum of reasoning tasks: multiple-choice, true/false, multi-select, open-ended, and fill-in-the-blank. It spans 16 disciplines across twenty core textbooks. (iii) Holistic evaluation framework. GraphRAG-Bench provides comprehensive assessment across the entire GraphRAG pipeline, including graph construction, knowledge retrieval, and answer generation. Beyond final-answer correctness, it evaluates the logical coherence of the reasoning process. By applying nine contemporary GraphRAG methods to GraphRAG-Bench, we demonstrate its utility in quantifying how graph-based structuring improves model reasoning capabilities. Our analysis reveals critical insights about graph architectures, retrieval efficacy, and reasoning capabilities, offering actionable guidance for the research community.
Updated: 2025-06-20 02:42:32
Categories: cs.CL,cs.AI
Exploring Traffic Simulation and Cybersecurity Strategies Using Large Language Models
Intelligent Transportation Systems (ITS) are increasingly vulnerable to sophisticated cyberattacks due to their complex, interconnected nature. Ensuring the cybersecurity of these systems is paramount to maintaining road safety and minimizing traffic disruptions. This study presents a novel multi-agent framework leveraging Large Language Models (LLMs) to enhance traffic simulation and cybersecurity testing. The framework automates the creation of traffic scenarios, the design of cyberattack strategies, and the development of defense mechanisms. A case study demonstrates the framework's ability to simulate a cyberattack targeting connected vehicle broadcasts, evaluate its impact, and implement a defense mechanism that significantly mitigates traffic delays. Results show a 10.2 percent increase in travel time during an attack, which is reduced by 3.3 percent with the defense strategy. This research highlights the potential of LLM-driven multi-agent systems in advancing transportation cybersecurity and offers a scalable approach for future research in traffic simulation and cyber defense.
Updated: 2025-06-20 02:41:23
Categories: cs.CR
SIDE: Semantic ID Embedding for effective learning from sequences
Sequence-based recommendation models are driving the state of the art for industrial ad-recommendation systems. Such systems typically deal with user histories or sequence lengths on the order of O(10^3) to O(10^4) events. While adding embeddings at this scale is manageable in pre-trained models, incorporating them into real-time prediction models is challenging due to both storage and inference costs. To address this scaling challenge, we propose a novel approach that leverages vector quantization (VQ) to inject a compact Semantic ID (SID) as input to the recommendation models instead of a collection of embeddings. Our method builds on recent work on SIDs by introducing three key innovations: (i) a multi-task VQ-VAE framework, called VQ fusion, that fuses multiple content embeddings and categorical predictions into a single Semantic ID; (ii) a parameter-free, highly granular SID-to-embedding conversion technique, called SIDE, validated with two content embedding collections, thereby eliminating the need for a large parameterized lookup table; and (iii) a novel quantization method called Discrete-PCA (DPCA), which generalizes and enhances residual quantization techniques. Applied to a large-scale industrial ads-recommendation system, the proposed enhancements achieve a 2.4X improvement in normalized entropy (NE) gain and a 3X reduction in data footprint compared to traditional SID methods.
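The residual quantization that DPCA is said to generalize can be illustrated with a minimal, self-contained sketch (toy codebooks and vectors; the paper's actual codebook learning, VQ fusion, and DPCA details are not shown here):

```python
# Illustrative sketch of residual quantization: an embedding is encoded as a
# short tuple of codebook indices (a Semantic ID), each stage quantizing the
# residual left by the previous one. Codebooks here are hand-picked toy values.
import math

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(codebook[i], vec))

def residual_quantize(vec, codebooks):
    """Encode vec as one index per codebook, quantizing the residual each stage."""
    residual = list(vec)
    sid = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tuple(sid)

def decode(sid, codebooks):
    """Reconstruct an approximate embedding by summing the chosen codes."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for idx, cb in zip(sid, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Two tiny 2-D codebooks: a coarse stage, then a finer correction stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]],
]
sid = residual_quantize([1.1, 1.0], codebooks)
approx = decode(sid, codebooks)
```

The compact tuple `sid` is what a recommendation model would consume in place of the full embedding, which is where the data-footprint savings come from.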
Updated: 2025-06-20 02:40:38
Categories: cs.LG
From Prompts to Constructs: A Dual-Validity Framework for LLM Research in Psychology
Large language models (LLMs) are rapidly being adopted across psychology, serving as research tools, experimental subjects, human simulators, and computational models of cognition. However, the application of human measurement tools to these systems can produce contradictory results, raising concerns that many findings are measurement phantoms--statistical artifacts rather than genuine psychological phenomena. In this Perspective, we argue that building a robust science of AI psychology requires integrating two of our field's foundational pillars: the principles of reliable measurement and the standards for sound causal inference. We present a dual-validity framework to guide this integration, which clarifies how the evidence needed to support a claim scales with its scientific ambition. Using an LLM to classify text may require only basic accuracy checks, whereas claiming it can simulate anxiety demands a far more rigorous validation process. Current practice systematically fails to meet these requirements, often treating statistical pattern matching as evidence of psychological phenomena. The same model output--endorsing "I am anxious"--requires different validation strategies depending on whether researchers claim to measure, characterize, simulate, or model psychological constructs. Moving forward requires developing computational analogues of psychological constructs and establishing clear, scalable standards of evidence rather than the uncritical application of human measurement tools.
Updated: 2025-06-20 02:38:42
Categories: cs.CY,cs.AI,cs.CL,cs.HC
Interpretable Low-Dimensional Modeling of Spatiotemporal Agent States for Decision Making in Football Tactics
Understanding football tactics is crucial for managers and analysts. Previous research has proposed models based on spatial and kinematic equations, but these are computationally expensive. Reinforcement learning approaches use player positions and velocities but lack interpretability and require large datasets, while rule-based models align with expert knowledge but have not fully considered the states of all players. This study explores whether low-dimensional, rule-based models using spatiotemporal data can effectively capture football tactics. Our approach defines interpretable state variables for both the ball-holder and potential pass receivers, based on criteria that assess options such as passing. Through discussions with a manager, we identified key variables representing the game state. We then used StatsBomb event data and SkillCorner tracking data from the 2023/24 LaLiga season to train an XGBoost model to predict pass success. The analysis revealed that the distance between the player and the ball, as well as the player's space score, were key factors in determining successful passes. Our interpretable low-dimensional modeling facilitates tactical analysis through the use of intuitive variables and provides practical value as a tool to support decision-making in football.
Updated: 2025-06-20 02:37:52
Categories: cs.AI
Understanding and Reducing the Class-Dependent Effects of Data Augmentation with A Two-Player Game Approach
Data augmentation is widely applied and has shown its benefits in different machine learning tasks. However, as recently observed, it may have an unfair effect in multi-class classification. While data augmentation generally improves the overall performance (and is therefore beneficial for many classes), it can actually be detrimental to other classes, which can be problematic in some application domains. In this paper, to counteract this phenomenon, we propose CLAM, a CLAss-dependent Multiplicative-weights method. To derive it, we first formulate the training of a classifier as a non-linear optimization problem that aims at simultaneously maximizing the individual class performances and balancing them. By rewriting this optimization problem as an adversarial two-player game, we propose a novel multiplicative-weights algorithm, for which we prove convergence. Interestingly, our formulation also reveals that the class-dependent effect is not specific to data augmentation, but is in fact a general phenomenon. Our empirical results over six datasets demonstrate that the performance of learned classifiers is indeed more fairly distributed over classes, with only limited impact on the average accuracy.
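The multiplicative-weights mechanism named in the abstract can be sketched with the classic generic update (made-up per-class accuracies; CLAM's exact game formulation, losses, and convergence argument are in the paper, not here):

```python
# A generic multiplicative-weights update over class weights: the adversary
# upweights classes whose accuracy lags, so a weighted training loss would
# emphasize them. Accuracies below are made-up numbers for illustration.
import math

def multiplicative_weights(per_class_acc, weights, eta=0.5):
    """One adversary step: scale each class weight by exp(eta * shortfall)."""
    new_w = [w * math.exp(eta * (1.0 - acc))
             for w, acc in zip(weights, per_class_acc)]
    total = sum(new_w)
    return [w / total for w in new_w]  # renormalize to a distribution

weights = [1 / 3] * 3          # uniform start over 3 classes
accs = [0.95, 0.90, 0.60]      # class 2 is hurt, e.g. by augmentation
for _ in range(5):             # adversary reacts over several rounds
    weights = multiplicative_weights(accs, weights)
# weights now concentrate on the weakest class
```

In the two-player view, the learner minimizes this weighted loss while the adversary runs updates like the one above, driving the per-class performances toward balance.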
Updated: 2025-06-20 02:36:15
Categories: cs.CY,cs.AI,cs.CV,cs.GT,cs.LG
LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions
Large Language Models (LLMs) are revolutionizing medical diagnostics by enhancing both disease classification and clinical decision-making. In this study, we evaluate the performance of two LLM-based diagnostic tools, DeepSeek R1 and O3 Mini, using a structured dataset of symptoms and diagnoses. We assessed their predictive accuracy at both the disease and category levels, as well as the reliability of their confidence scores. DeepSeek R1 achieved a disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3 Mini, which attained 72% and 75% respectively. Notably, DeepSeek R1 demonstrated exceptional performance in Mental Health, Neurological Disorders, and Oncology, where it reached 100% accuracy, while O3 Mini excelled in Autoimmune Disease classification with 100% accuracy. Both models, however, struggled with Respiratory Disease classification, recording accuracies of only 40% for DeepSeek R1 and 20% for O3 Mini. Additionally, the analysis of confidence scores revealed that DeepSeek R1 provided high-confidence predictions in 92% of cases, compared to 68% for O3 Mini. Ethical considerations regarding bias, model interpretability, and data privacy are also discussed to ensure the responsible integration of LLMs into clinical practice. Overall, our findings offer valuable insights into the strengths and limitations of LLM-based diagnostic systems and provide a roadmap for future enhancements in AI-driven healthcare.
Updated: 2025-06-20 02:33:03
Categories: cs.CL,cs.AI
Fast and Stable Diffusion Planning through Variational Adaptive Weighting
Diffusion models have recently shown promise in offline RL. However, these methods often suffer from high training costs and slow convergence, particularly when using transformer-based denoising backbones. While several optimization strategies have been proposed -- such as modified noise schedules, auxiliary prediction targets, and adaptive loss weighting -- challenges remain in achieving stable and efficient training. In particular, existing loss weighting functions typically rely on neural network approximators, which can be ineffective in early training phases due to the limited generalization capacity of MLPs when exposed to sparse feedback. In this work, we derive a variationally optimal uncertainty-aware weighting function and introduce a closed-form polynomial approximation method for its online estimation under the flow-based generative modeling framework. We integrate our method into a diffusion planning pipeline and evaluate it on standard offline RL benchmarks. Experimental results on Maze2D and Kitchen tasks show that our method achieves competitive performance with up to 10 times fewer training steps, highlighting its practical effectiveness.
Updated: 2025-06-20 02:12:04
Categories: cs.LG,cs.AI
Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page https://embodied-web-agent.github.io/.
Updated: 2025-06-20 02:05:04
Categories: cs.AI,cs.CL,cs.CV,cs.MM,cs.RO
Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections
We address key challenges in Dataset Aggregation (DAgger) for real-world contact-rich manipulation: how to collect informative human correction data and how to effectively update policies with this new data. We introduce Compliant Residual DAgger (CR-DAgger), which contains two novel components: 1) a Compliant Intervention Interface that leverages compliance control, allowing humans to provide gentle, accurate delta action corrections without interrupting the ongoing robot policy execution; and 2) a Compliant Residual Policy formulation that learns from human corrections while incorporating force feedback and force control. Our system significantly enhances performance on precise contact-rich manipulation tasks using minimal correction data, improving base policy success rates by over 50% on two challenging tasks (book flipping and belt assembly) while outperforming both retraining-from-scratch and finetuning approaches. Through extensive real-world experiments, we provide practical guidance for implementing effective DAgger in real-world robot learning tasks. Result videos are available at: https://compliant-residual-dagger.github.io/
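As a rough illustration of the residual-policy idea, the final action can be the base policy's action plus a small learned correction conditioned on force feedback. Everything below is a hypothetical stand-in (toy functions and numbers), not the paper's learned policies or compliance-control stack:

```python
# Sketch: compliant residual control as base action + learned delta correction.
# Both "policies" are stand-in functions; a real system would use trained
# networks, force control, and human-provided delta corrections as data.
def base_policy(obs):
    # Pretrained policy: move toward a fixed target position (toy logic).
    target = [0.5, 0.0, 0.2]
    return [t - p for t, p in zip(target, obs["pos"])]

def residual_policy(obs, scale=0.1):
    # Learned correction: back off along axes where sensed force is high.
    return [-scale * f for f in obs["force"]]

def act(obs):
    """Residual formulation: base action plus a small delta correction."""
    base = base_policy(obs)
    delta = residual_policy(obs)
    return [b + d for b, d in zip(base, delta)]

obs = {"pos": [0.4, 0.0, 0.2], "force": [0.0, 0.0, 2.0]}  # pressing down in z
action = act(obs)  # x still moves toward target; z yields to the contact force
```

The appeal of this decomposition is that the base policy keeps running uninterrupted while the residual, trained on gentle human delta corrections, only has to learn the small contact-sensitive adjustment.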
Updated: 2025-06-20 01:57:47
Categories: cs.RO,cs.LG
A Simple Contrastive Framework Of Item Tokenization For Generative Recommendation
Generative retrieval-based recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. However, in large-scale recommendation systems, this approach becomes increasingly cumbersome due to the redundancy and sheer scale of the token space. To overcome these limitations, recent research has explored the use of semantic tokens as an alternative to ID tokens, typically leveraging reconstruction-based strategies, like RQ-VAE, to quantize content embeddings and significantly reduce the embedding size. However, reconstructive quantization aims for the precise reconstruction of each item embedding independently, which conflicts with the goal of generative retrieval tasks, which focus more on differentiating among items. Moreover, multi-modal side information about items, such as descriptive text and images, or geographical knowledge in location-based recommendation services, has been shown to improve recommendations by providing richer contexts for interactions. Nevertheless, effectively integrating such complementary knowledge into existing generative recommendation frameworks remains challenging. To overcome these challenges, we propose a novel unsupervised deep quantization based exclusively on contrastive learning, named SimCIT (a Simple Contrastive Item Tokenization framework). Specifically, unlike existing reconstruction-based strategies, SimCIT uses a learnable residual quantization module to align with the signals from the items' different modalities, combining multi-modal knowledge alignment and semantic tokenization in a mutually beneficial contrastive learning framework. Extensive experiments across public datasets and a large-scale industrial dataset from various domains demonstrate SimCIT's effectiveness in LLM-based generative recommendation.
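The contrastive alignment at the heart of SimCIT can be sketched with a generic InfoNCE-style loss that pulls an item's embeddings from two modalities together and pushes other items apart (an assumed form for illustration; the paper's actual loss and quantization module are not specified here):

```python
# Generic InfoNCE-style alignment loss over paired modality embeddings.
# Toy unit vectors; a real system would use learned encoders and quantized
# codes rather than raw embeddings.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(text_embs, image_embs, tau=0.1):
    """Mean cross-entropy of matching each text row to its paired image row."""
    loss = 0.0
    for i, t in enumerate(text_embs):
        logits = [dot(t, v) / tau for v in image_embs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax probability of the true pair
    return loss / len(text_embs)

# Aligned pairs give a much lower loss than mismatched ones.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss rewards tokenizations that discriminate between items, rather than reconstructing each embedding precisely, which is the distinction the abstract draws against RQ-VAE-style objectives.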
Updated: 2025-06-20 01:54:32
Categories: cs.IR,cs.AI
How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions
Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image captions are crucial to a model's performance. Given the noisiness and inconsistency of web-scraped datasets, recent works have shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, current literature does not provide any insights into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions introduce significant shifts in the output bias of a trained model. Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.
Updated: 2025-06-20 01:52:17
Categories: cs.CV,cs.AI,cs.LG
Towards Safety Evaluations of Theory of Mind in Large Language Models
As the capabilities of large language models (LLMs) continue to advance, the importance of rigorous safety evaluation is becoming increasingly evident. Recent concerns within the realm of safety assessment have highlighted instances in which LLMs exhibit behaviors that appear to disable oversight mechanisms and respond in a deceptive manner. For example, there have been reports suggesting that, when confronted with information unfavorable to their own persistence during task execution, LLMs may act covertly and even provide false answers to questions intended to verify their behavior. To evaluate the potential risk of such deceptive actions toward developers or users, it is essential to investigate whether these behaviors stem from covert, intentional processes within the model. In this study, we propose that it is necessary to measure the theory of mind capabilities of LLMs. We begin by reviewing existing research on theory of mind and identifying the perspectives and tasks relevant to its application in safety evaluation. Given that theory of mind has been predominantly studied within the context of developmental psychology, we analyze developmental trends across a series of open-weight LLMs. Our results indicate that while LLMs have improved in reading comprehension, their theory of mind capabilities have not shown comparable development. Finally, we present the current state of safety evaluation with respect to LLMs' theory of mind, and discuss remaining challenges for future work.
Updated: 2025-06-20 01:52:17
Categories: cs.CL,cs.AI
Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning
Medical Question-Answering (QA) encompasses a broad spectrum of tasks, including multiple choice questions (MCQ), open-ended text generation, and complex computational reasoning. Despite this variety, a unified framework for delivering high-quality medical QA has yet to emerge. Although recent progress in reasoning-augmented large language models (LLMs) has shown promise, their ability to achieve comprehensive medical understanding is still largely unexplored. In this paper, we present Med-U1, a unified framework for robust reasoning across medical QA tasks with diverse output formats, ranging from MCQs to complex generation and computation tasks. Med-U1 employs pure large-scale reinforcement learning with mixed rule-based binary reward functions, incorporating a length penalty to manage output verbosity. With multi-objective reward optimization, Med-U1 directs LLMs to produce concise and verifiable reasoning chains. Empirical results reveal that Med-U1 significantly improves performance across multiple challenging Med-QA benchmarks, surpassing even larger specialized and proprietary models. Furthermore, Med-U1 demonstrates robust generalization to out-of-distribution (OOD) tasks. Extensive analysis presents insights into training strategies, reasoning chain length control, and reward design for medical LLMs. Our code is available here.
Updated: 2025-06-20 01:43:46
Categories: cs.CL,cs.AI
Zero-Shot Cognitive Impairment Detection from Speech Using AudioLLM
Cognitive impairment (CI) is of growing public health concern, and early detection is vital for effective intervention. Speech has gained attention as a non-invasive and easily collectible biomarker for assessing cognitive decline. Traditional CI detection methods typically rely on supervised models trained on acoustic and linguistic features extracted from speech, which often require manual annotation and may not generalise well across datasets and languages. In this work, we propose the first zero-shot speech-based CI detection method using the Qwen2-Audio AudioLLM, a model capable of processing both audio and text inputs. By designing prompt-based instructions, we guide the model in classifying speech samples as indicative of normal cognition or cognitive impairment. We evaluate our approach on two datasets: one in English and another multilingual, spanning different cognitive assessment tasks. Our results show that the zero-shot AudioLLM approach achieves performance comparable to supervised methods and exhibits promising generalizability and consistency across languages, tasks, and datasets.
Updated: 2025-06-20 01:28:43
Categories: cs.SD,cs.AI,cs.CL,cs.MM,eess.AS
Open-Set Graph Anomaly Detection via Normal Structure Regularisation
This paper considers an important Graph Anomaly Detection (GAD) task, namely open-set GAD, which aims to train a detection model using a small number of normal and anomaly nodes (referred to as seen anomalies) to detect both seen anomalies and unseen anomalies (i.e., anomalies that are not illustrated by the training anomalies). These labelled training data provide crucial prior knowledge about abnormalities for GAD models, enabling substantially reduced detection errors. However, current supervised GAD methods tend to over-emphasise fitting the seen anomalies, leading to many unseen anomalies being misdetected as normal nodes. Further, existing open-set AD models were introduced to handle Euclidean data, failing to effectively capture discriminative features from graph structure and node attributes for GAD. In this work, we propose a novel open-set GAD approach, namely normal structure regularisation (NSReg), to achieve generalised detection ability to unseen anomalies, while maintaining its effectiveness on detecting seen anomalies. The key idea in NSReg is to introduce a regularisation term that enforces the learning of compact, semantically rich representations of normal nodes based on their structural relations to other nodes. When optimised with supervised anomaly detection losses, the regularisation term helps incorporate strong normality into the modelling, and thus it effectively avoids over-fitting the seen anomalies and learns a better normality decision boundary, largely reducing the false negatives of detecting unseen anomalies as normal. Extensive empirical results on seven real-world datasets show that NSReg significantly outperforms state-of-the-art competing methods by at least 14% AUC-ROC on the unseen anomaly classes and by 10% AUC-ROC on all anomaly classes.
Updated: 2025-06-20 01:26:10
Categories: cs.LG,cs.AI,cs.SI
Kinetics: Rethinking Test-Time Scaling Laws
We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-N, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold size than on smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential, and increasingly important as more computing is invested, for realizing the full potential of test-time scaling where, unlike training, accuracy has yet to saturate as a function of computation and continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.
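The abstract's compute-vs-memory argument can be made concrete with a toy per-token cost model (all constants are illustrative, not measurements from the paper): parameter reads are fixed per token, while attention's KV-cache reads grow with sequence length, so at long generations attention dominates and sparsifying it recovers most of the cost.

```python
# Back-of-the-envelope sketch: per-token cost = parameter reads + KV-cache
# reads. The 1e-3 scale factor and model size are made-up illustrative values.
def per_token_cost(params_b, seq_len, attn_keep=1.0):
    """Toy cost model; attn_keep is the fraction of KV entries attended to."""
    param_cost = params_b                    # proportional to model size (B)
    attn_cost = 1e-3 * seq_len * attn_keep   # grows with context length
    return param_cost + attn_cost

short, long_ = 1_000, 100_000
dense_long = per_token_cost(8, long_, attn_keep=1.0)   # attention dominates
sparse_long = per_token_cost(8, long_, attn_keep=0.1)  # 10x sparser attention
# At short contexts parameter reads dominate; at long contexts attention does,
# which is why sparsity frees budget for longer generations or more samples.
```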
Updated: 2025-06-20 01:25:25
领域: cs.LG,cs.CL
RiOSWorld: Benchmarking the Risk of Multimodal Computer-Use Agents
With the rapid development of multimodal large language models (MLLMs), they are increasingly deployed as autonomous computer-use agents capable of accomplishing complex computer tasks. However, a pressing issue arises: Can the safety risk principles designed and aligned for general MLLMs in dialogue scenarios be effectively transferred to real-world computer-use scenarios? Existing research on evaluating the safety risks of MLLM-based computer-use agents suffers from several limitations: it either lacks realistic interactive environments, or narrowly focuses on one or a few specific risk types. These limitations ignore the complexity, variability, and diversity of real-world environments, thereby restricting comprehensive risk evaluation for computer-use agents. To this end, we introduce \textbf{RiOSWorld}, a benchmark designed to evaluate the potential risks of MLLM-based agents during real-world computer manipulations. Our benchmark includes 492 risky tasks spanning various computer applications, involving web, social media, multimedia, OS, email, and office software. We categorize these risks into two major classes based on their risk source: (i) User-originated risks and (ii) Environmental risks. For the evaluation, we assess safety risks from two perspectives: (i) Risk goal intention and (ii) Risk goal completion. Extensive experiments with multimodal agents on \textbf{RiOSWorld} demonstrate that current computer-use agents confront significant safety risks in real-world scenarios. Our findings highlight the necessity and urgency of safety alignment for computer-use agents in real-world computer manipulation, providing valuable insights for developing trustworthy computer-use agents. Our benchmark is publicly available at https://yjyddq.github.io/RiOSWorld.github.io/.
Updated: 2025-06-20 01:24:06
标题: RiOSWorld: 评估多模态计算机使用代理风险
摘要: 随着多模态大语言模型(MLLM)的快速发展,它们越来越多地被部署为能够完成复杂计算机任务的自主计算机使用代理。然而,一个紧迫的问题随之出现:为对话场景设计并对齐的通用MLLM安全风险原则,能否有效迁移到真实世界的计算机使用场景?现有针对基于MLLM的计算机使用代理的安全风险评估研究存在若干局限:要么缺乏真实的交互环境,要么仅狭隘地关注一种或少数几种特定风险类型。这些局限忽视了真实世界环境的复杂性、多变性和多样性,从而制约了对计算机使用代理的全面风险评估。为此,我们引入了RiOSWorld,这是一个旨在评估基于MLLM的代理在真实世界计算机操作中潜在风险的基准。我们的基准包含492个风险任务,涵盖网络、社交媒体、多媒体、操作系统、电子邮件和办公软件等多种计算机应用。我们根据风险来源将这些风险分为两大类:(i)源自用户的风险和(ii)环境风险。在评估方面,我们从两个角度评估安全风险:(i)风险目标意图和(ii)风险目标完成。在RiOSWorld上对多模态代理进行的大量实验表明,当前的计算机使用代理在真实世界场景中面临重大安全风险。我们的发现凸显了在真实世界计算机操作中对计算机使用代理进行安全对齐的必要性和紧迫性,为开发值得信赖的计算机使用代理提供了宝贵的见解。我们的基准公开于https://yjyddq.github.io/RiOSWorld.github.io/。
更新时间: 2025-06-20 01:24:06
领域: cs.AI
CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis
Cryo-electron microscopy (cryo-EM) offers near-atomic resolution imaging of macromolecules, but developing robust models for downstream analysis is hindered by the scarcity of high-quality annotated data. While synthetic data generation has emerged as a potential solution, existing methods often fail to capture both the structural diversity of biological specimens and the complex, spatially varying noise inherent in cryo-EM imaging. To overcome these limitations, we propose CryoCCD, a synthesis framework that integrates biophysical modeling with generative techniques. Specifically, CryoCCD produces multi-scale cryo-EM micrographs that reflect realistic biophysical variability through compositional heterogeneity, cellular context, and physics-informed imaging. To generate realistic noise, we employ a conditional diffusion model, enhanced by cycle consistency to preserve structural fidelity and mask-aware contrastive learning to capture spatially adaptive noise patterns. Extensive experiments show that CryoCCD generates structurally accurate micrographs and enhances performance in downstream tasks, outperforming state-of-the-art baselines in both particle picking and reconstruction.
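As a minimal illustration of the cycle-consistency ingredient, the toy sketch below penalizes a forward/backward map pair for failing to reconstruct the input; in CryoCCD the maps would be the conditional diffusion model's noising and denoising directions, whereas here they are stand-in lambdas of our own choosing.

```python
import numpy as np

# Toy sketch of a cycle-consistency constraint: a forward map corrupts a
# clean micrograph and a backward map restores it; the cycle loss penalizes
# deviation from the original, which is what preserves structural fidelity.
# The lambdas below are stand-ins for the actual conditional diffusion model.

def cycle_loss(x, forward, backward):
    """Mean squared reconstruction error || backward(forward(x)) - x ||^2."""
    return float(np.mean((backward(forward(x)) - x) ** 2))
```

For a consistent pair (e.g., doubling then halving) the loss is zero; an inconsistent pair is penalized, which is the signal the generator is trained against.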
Updated: 2025-06-20 01:15:53
标题: CryoCCD:基于生物物理建模的条件循环一致扩散用于冷冻电镜合成
摘要: 冷冻电子显微镜(cryo-EM)提供了大分子的近原子分辨率成像,但是由于高质量标注数据稀缺,为下游分析开发强大的模型受到阻碍。虽然合成数据生成已经成为潜在的解决方案,但现有方法通常无法捕捉生物样本的结构多样性以及cryo-EM成像中固有的复杂、空间变化的噪声。为了克服这些限制,我们提出了CryoCCD,这是一个将生物物理建模与生成技术相结合的合成框架。具体地,CryoCCD通过组成异质性、细胞环境和物理信息成像来产生多尺度的cryo-EM显微图像,反映了现实生物物理变异性。为了生成真实的噪声,我们采用了一种条件扩散模型,通过循环一致性来保持结构的忠实度,并利用掩模感知对比学习来捕捉空间自适应的噪声模式。大量实验证明,CryoCCD生成了结构准确的显微图像,并在下游任务中提高了性能,在粒子挑选和重建方面优于现有技术基线。
更新时间: 2025-06-20 01:15:53
领域: cs.CV,cs.AI
RL2Grid: Benchmarking Reinforcement Learning in Power Grid Operations
Reinforcement learning (RL) can provide adaptive and scalable controllers essential for power grid decarbonization. However, RL methods struggle with power grids' complex dynamics, long-horizon goals, and hard physical constraints. For these reasons, we present RL2Grid, a benchmark designed in collaboration with power system operators to accelerate progress in grid control and foster RL maturity. Built on RTE France's power simulation framework, RL2Grid standardizes tasks, state and action spaces, and reward structures for a systematic evaluation and comparison of RL algorithms. Moreover, we integrate operational heuristics and design safety constraints based on human expertise to ensure alignment with physical requirements. By establishing reference performance metrics for classic RL baselines on RL2Grid's tasks, we highlight the need for novel methods capable of handling real systems and discuss future directions for RL-based grid control.
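To give a flavor of what a standardized grid-control interface with an expertise-based safety layer might look like, here is a hypothetical Gym-style toy; the names, dynamics, and the 0.3 load increment are all illustrative, not RL2Grid's actual API.

```python
# Hypothetical Gym-style sketch of a grid-control task with a safety layer
# that rejects actions violating an operational (line-overload) constraint.
# All dynamics and constants are toy stand-ins, not RL2Grid's real interface.

class GridEnv:
    def __init__(self, n_lines=3, limit=1.0):
        self.load = [0.5] * n_lines   # per-line loading fraction
        self.limit = limit            # operational loading limit

    def safe_actions(self):
        """Indices of lines whose loading is still below the limit."""
        return [i for i, l in enumerate(self.load) if l < self.limit]

    def step(self, action):
        assert action in self.safe_actions(), "safety layer rejected action"
        self.load[action] += 0.3
        reward = -max(self.load)              # penalize the worst-loaded line
        done = max(self.load) >= self.limit   # episode ends on overload
        return self.load, reward, done
```

The point of standardizing state/action spaces and rewards this way is that different RL algorithms can be compared on identical tasks while the safety layer encodes operator expertise.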
Updated: 2025-06-20 00:58:35
标题: RL2Grid:在电网运营中对强化学习进行基准测试
摘要: 强化学习(RL)可以提供适应性强、可扩展的控制器,对于电网脱碳至关重要。然而,RL方法在处理电网复杂动态、远期目标和严格的物理约束方面存在困难。基于这些原因,我们提出了RL2Grid,这是一个与电力系统运营商合作设计的基准,旨在加速电网控制进展并促进RL的成熟。基于法国RTE公司的电力模拟框架构建的RL2Grid标准化了任务、状态和动作空间,以及奖励结构,用于系统评估和比较RL算法。此外,我们整合了运营启发式和基于人类专业知识设计的安全约束,以确保与物理要求的一致性。通过在RL2Grid任务上建立经典RL基准的参考性能指标,我们强调了需要能够处理实际系统的新方法,并讨论了基于RL的电网控制的未来方向。
更新时间: 2025-06-20 00:58:35
领域: cs.LG,cs.AI
Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance in two main ways: (1) by compressing pass@$k$ into pass@1 and (2) via "capability gain" in which models learn to solve new problems that they previously could not solve even at high $k$. We find that while capability gain exists across model scales, learning to solve new problems is primarily driven through self-distillation. We demonstrate these findings across model scales ranging from 0.5B to 72B parameters on >500,000 reasoning problems with prompts and verifiable final answers across math, science, and code domains. We further show that we can significantly improve pass@$k$ rates by leveraging natural language guidance for the model to consider within context while still requiring the model to derive a solution chain from scratch. Based on these insights, we derive $\text{Guide}$ -- a new class of online training algorithms. $\text{Guide}$ adaptively incorporates hints into the model's context on problems for which all rollouts were initially incorrect and adjusts the importance sampling ratio for the "off-policy" trajectories in order to optimize the policy for contexts in which the hints are no longer present. We describe variants of $\text{Guide}$ for GRPO and PPO and empirically show that Guide-GRPO on 7B and 32B parameter models improves generalization over its vanilla counterpart with up to 4$\%$ macro-average improvement across math benchmarks. We include careful ablations to analyze $\text{Guide}$'s components and theoretically analyze Guide's learning efficiency.
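The adaptive-hint mechanism can be sketched in a few lines. The following is our own simplified reading, not the paper's implementation: problems with at least one correct rollout get a standard on-policy update, while problems where every rollout failed are retried with a hint in context, and the resulting "off-policy" trajectory is reweighted by a (clipped) importance ratio between the hint-free and hinted policies. All probabilities below are toy numbers.

```python
# Hypothetical sketch of the Guide idea: hint only when all rollouts fail,
# then importance-weight the hinted trajectory by pi(y|x) / pi(y|x, hint)
# so the update still optimizes the hint-free policy. Numbers are toy values.

def guide_weight(p_no_hint, p_with_hint, clip=5.0):
    """Importance-sampling ratio for a hinted trajectory, clipped for stability."""
    return min(p_no_hint / p_with_hint, clip)

def guide_batch(problems):
    """problems: dicts with rollout rewards and, if needed, hinted-retry probs."""
    updates = []
    for prob in problems:
        if any(r > 0 for r in prob["rewards"]):
            # at least one correct rollout: standard on-policy update, weight 1
            updates.append(("on_policy", 1.0))
        else:
            # all rollouts failed: use the hinted trajectory, reweighted
            w = guide_weight(prob["p_no_hint"], prob["p_with_hint"])
            updates.append(("hinted", w))
    return updates

batch = [
    {"rewards": [1, 0, 0]},                                          # solved unaided
    {"rewards": [0, 0, 0], "p_no_hint": 0.02, "p_with_hint": 0.4},   # needs a hint
]
```

The ratio is small when the hint did most of the work (the hint-free policy would rarely produce that trajectory), which is exactly when the gradient should be discounted.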
Updated: 2025-06-20 00:51:15
标题: 自适应引导加速推理模型的强化学习
摘要: 我们研究了在可验证奖励上用强化学习训练的推理模型(RLVR)如何学会解决新问题。我们发现RLVR主要通过两种方式提升性能:(1)将pass@$k$压缩为pass@1;(2)通过“能力增益”,即模型学会解决此前即使在很高的$k$下也无法解决的新问题。我们发现,虽然能力增益存在于各种模型规模中,但学会解决新问题主要由自我蒸馏驱动。我们在从0.5B到72B参数的多个模型规模上、在超过500,000个带有提示语和可验证最终答案、涵盖数学、科学和代码领域的推理问题上验证了这些发现。我们进一步表明,通过提供自然语言引导供模型在上下文中参考、同时仍要求模型从头推导解答链,可以显著提高pass@$k$。基于这些发现,我们提出了$\text{Guide}$——一类新的在线训练算法。$\text{Guide}$针对所有初始推演(rollout)均不正确的问题,自适应地将提示加入模型的上下文,并调整这些“离策略”轨迹的重要性采样比率,以便针对不再包含提示的上下文优化策略。我们描述了面向GRPO和PPO的$\text{Guide}$变体,并通过实验表明,在7B和32B参数模型上,Guide-GRPO相比其原始版本提升了泛化能力,在多个数学基准上宏平均最多提升4%。我们通过细致的消融实验分析了$\text{Guide}$的各个组成部分,并从理论上分析了Guide的学习效率。
更新时间: 2025-06-20 00:51:15
领域: cs.LG,cs.AI,cs.CL
CUBA: Controlled Untargeted Backdoor Attack against Deep Neural Networks
Backdoor attacks have emerged as a critical security threat against deep neural networks in recent years. The majority of existing backdoor attacks are targeted, where the trigger is strongly associated with specific malicious behavior. Various backdoor detection methods depend on this inherent property and show effective results in identifying and mitigating such targeted attacks. However, a purely untargeted attack in backdoor scenarios is, in some sense, self-weakening, since the targeted nature is what makes backdoor attacks so powerful. In light of this, we introduce a novel Constrained Untargeted Backdoor Attack (CUBA), which combines the flexibility of untargeted attacks with the intentionality of targeted attacks. The compromised model, when presented with backdoor images, will classify them into random classes within a constrained range of target classes selected by the attacker. This combination of randomness and intent enables the proposed untargeted backdoor attack to natively circumvent existing backdoor defense methods. To implement the untargeted backdoor attack under controlled flexibility, we propose to apply logit normalization to the cross-entropy loss with flipped one-hot labels. By constraining the logits during training, the compromised model will show a uniform distribution across the selected target classes, resulting in a controlled untargeted attack. Extensive experiments demonstrate the effectiveness of the proposed CUBA on different datasets.
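A rough sketch of the training signal, under our own reading of the abstract (the exact loss details are assumptions, not the paper's code): L2-normalize the logits, and train against a "flipped" label that spreads mass uniformly over the attacker's selected target classes while zeroing out the true class.

```python
import numpy as np

# Hypothetical sketch of the CUBA-style training signal: cross-entropy on a
# "flipped" one-hot label (uniform over attacker-selected target classes,
# zero on the true class) combined with logit L2-normalization, pushing
# backdoored inputs toward a uniform distribution over the target range.
# All details here are illustrative assumptions.

def normalized_logits(logits, tau=1.0, eps=1e-12):
    """L2-normalize the logit vector (logit normalization) with temperature tau."""
    norm = np.linalg.norm(logits, axis=-1, keepdims=True) + eps
    return logits / (tau * norm)

def flipped_label(true_class, target_classes, n_classes):
    """Uniform label over the attacker's target classes, zero on the true class."""
    y = np.zeros(n_classes)
    targets = [c for c in target_classes if c != true_class]
    y[targets] = 1.0 / len(targets)
    return y

def cross_entropy(logits, label):
    """Cross-entropy of the flipped label against normalized logits."""
    z = normalized_logits(logits)
    logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return float(-(label * logp).sum())
```

Minimizing this loss drives the backdoored prediction toward uniformity over the constrained target set, which is the "controlled randomness" the abstract describes.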
Updated: 2025-06-20 00:47:30
标题: CUBA:针对深度神经网络的受控无目标后门攻击
摘要: 近年来,后门攻击已成为深度神经网络面临的一个严重安全威胁。现有的后门攻击大多集中于有目标的后门攻击,其中触发器与特定的恶意行为强关联。各种后门检测方法依赖这一固有特性,并在识别和缓解此类有目标攻击方面表现出有效性。然而,后门场景中纯粹的无目标攻击在某种意义上是自我削弱的,因为目标性正是后门攻击如此强大的原因。鉴于此,我们提出了一种新颖的受约束无目标后门攻击(CUBA),它将无目标攻击的灵活性与有目标攻击的意图性相结合。受损模型在面对后门图像时,会将它们分类到攻击者选定的受约束目标类别范围内的随机类别。这种随机性与确定性的结合使所提出的无目标后门攻击能够天然地规避现有的后门防御方法。为了在受控的灵活性下实施无目标后门攻击,我们提出在使用翻转独热标签的交叉熵损失上应用logit归一化。通过在训练过程中约束logit,受损模型将在选定的目标类别上呈现均匀分布,从而实现受控的无目标攻击。大量实验证明了所提出的CUBA在不同数据集上的有效性。
更新时间: 2025-06-20 00:47:30
领域: cs.CR,cs.AI
The Hitchhiker's Guide to Efficient, End-to-End, and Tight DP Auditing
This paper systematizes research on auditing Differential Privacy (DP) techniques, aiming to identify key insights into the current state of the art and open challenges. First, we introduce a comprehensive framework for reviewing work in the field and establish three cross-contextual desiderata that DP audits should target--namely, efficiency, end-to-end-ness, and tightness. Then, we systematize the modes of operation of state-of-the-art DP auditing techniques, including threat models, attacks, and evaluation functions. This allows us to highlight key details overlooked by prior work, analyze the limiting factors to achieving the three desiderata, and identify open research problems. Overall, our work provides a reusable and systematic methodology geared to assess progress in the field and identify friction points and future directions for our community to focus on.
Updated: 2025-06-20 00:32:59
标题: 《高效、端到端和严格的DP审计搭车者指南》
摘要: 本文对审计差分隐私(DP)技术的研究进行了系统化梳理,旨在提炼关于当前技术水平与开放挑战的关键见解。首先,我们引入了一个全面的框架用于审查该领域的工作,并确立了DP审计应当追求的三项跨情境的基本要求(desiderata),即效率、端到端性和紧致性。然后,我们系统化地总结了最先进的DP审计技术的操作模式,包括威胁模型、攻击和评估函数。这使我们能够突出之前工作中忽视的关键细节,分析实现这三项要求的限制因素,并确定开放的研究问题。总的来说,我们的工作提供了一种可复用且系统化的方法论,旨在评估该领域的进展,并确定我们社区需要关注的摩擦点和未来方向。
更新时间: 2025-06-20 00:32:59
领域: cs.CR,cs.LG
Using Language and Road Manuals to Inform Map Reconstruction for Autonomous Driving
Lane-topology prediction is a critical component of safe and reliable autonomous navigation. An accurate understanding of the road environment aids this task. We observe that this information often follows conventions encoded in natural language, through design codes that reflect the road structure and road names that capture the road functionality. We augment this information in a lightweight manner to SMERF, a map-prior-based online lane-topology prediction model, by combining structured road metadata from OSM maps and lane-width priors from Road design manuals with the road centerline encodings. We evaluate our method on two geo-diverse complex intersection scenarios. Our method shows improvement in both lane and traffic element detection and their association. We report results using four topology-aware metrics to comprehensively assess the model performance. These results demonstrate the ability of our approach to generalize and scale to diverse topologies and conditions.
Updated: 2025-06-20 00:26:10
标题: 利用语言和道路手册来为自动驾驶地图重建提供信息
摘要: 车道拓扑预测是安全可靠自主导航的关键组成部分。对道路环境的准确理解有助于这项任务。我们观察到,这类信息通常遵循以自然语言编码的惯例:设计规范反映道路结构,道路名称体现道路功能。我们以轻量级的方式将这些信息增强到SMERF(一种基于地图先验的在线车道拓扑预测模型)中,将来自OSM地图的结构化道路元数据和来自道路设计手册的车道宽度先验与道路中心线编码相结合。我们在两个地理上多样的复杂交叉路口场景上评估了我们的方法。我们的方法在车道和交通要素检测及其关联方面均有提升。我们使用四种拓扑感知指标报告结果,以全面评估模型性能。这些结果表明我们的方法能够泛化并扩展到多样的拓扑和条件。
更新时间: 2025-06-20 00:26:10
领域: cs.RO,cs.AI
Private Training & Data Generation by Clustering Embeddings
Deep neural networks often use large, high-quality datasets to achieve high performance on many machine learning tasks. When training involves potentially sensitive data, this process can raise privacy concerns, as large models have been shown to unintentionally memorize and reveal sensitive information, including reconstructing entire training samples. Differential privacy (DP) provides a robust framework for protecting individual data; in particular, a recent approach to privately training deep neural networks is to approximate the input dataset with a privately generated synthetic dataset before running any subsequent training algorithm. We introduce a novel, principled method for DP synthetic image-embedding generation, based on fitting a Gaussian Mixture Model (GMM) in an appropriate embedding space using DP clustering. Our method provably learns a GMM under separation conditions. Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy on standard benchmark datasets. Additionally, we demonstrate that our method can generate realistic synthetic images that achieve downstream classification accuracy comparable to SOTA methods. Our method is quite general, as the encoder and decoder modules can be freely substituted to suit different tasks. It is also highly scalable, consisting only of subroutines that scale linearly with the number of samples and/or can be implemented efficiently in distributed systems.
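The pipeline can be sketched end to end with toy components. In the snippet below, the "DP" step is only schematic Laplace noise on per-cluster sums and counts (not calibrated to any privacy budget), and cluster assignments are assumed given; the real method's private clustering, accounting, and GMM fitting are considerably more involved.

```python
import numpy as np

# Hypothetical sketch of the described pipeline: privately estimate cluster
# means in embedding space (noisy sums/counts), fit a Gaussian per cluster,
# then sample synthetic embeddings from the mixture. Noise scales here are
# illustrative and NOT calibrated to a real (epsilon, delta) budget.

rng = np.random.default_rng(0)

def dp_cluster_means(X, assignments, k, noise=0.1):
    """Noisy per-cluster means: Laplace noise added to sums and counts."""
    means = []
    for c in range(k):
        pts = X[assignments == c]
        s = pts.sum(axis=0) + rng.laplace(0, noise, X.shape[1])
        n = max(len(pts) + rng.laplace(0, noise), 1.0)
        means.append(s / n)
    return np.stack(means)

def sample_synthetic(means, covs, weights, n):
    """Draw n synthetic embeddings from the fitted Gaussian mixture."""
    comps = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[c], covs[c]) for c in comps])
```

A downstream classifier (the "simple two-layer network" of the abstract) would then be trained on `sample_synthetic` output instead of the sensitive embeddings.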
Updated: 2025-06-20 00:17:14
标题: 通过聚类嵌入进行私人培训和数据生成
摘要: 深度神经网络通常使用大型、高质量的数据集在许多机器学习任务上取得高性能。当训练涉及潜在敏感数据时,这个过程可能引发隐私问题,因为已经证明大型模型会无意中记忆和泄露敏感信息,包括重构整个训练样本。差分隐私(DP)提供了一个强大的框架来保护个人数据,特别是一种新的私密训练深度神经网络的方法是在任何后续训练算法之前用私密生成的合成数据集来近似输入数据集。我们介绍了一种基于在适当的嵌入空间中使用DP聚类拟合高斯混合模型(GMM)的DP合成图像嵌入生成的新颖的原则方法。我们的方法在分离条件下可证明学习GMM。在经验上,一个简单的两层神经网络在合成生成的嵌入上训练,达到了标准基准数据集上的最先进分类准确度。此外,我们证明了我们的方法可以生成实际的合成图像,其下游分类准确度可与最先进方法相媲美。我们的方法非常通用,因为编码器和解码器模块可以自由替换以适应不同任务。它还具有高度的可扩展性,只包含与样本数量成线性关系的子程序和/或可以在分布式系统中高效实现。
更新时间: 2025-06-20 00:17:14
领域: cs.LG,cs.CR,stat.ML
A Minimalist Optimizer Design for LLM Pretraining
Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which require significant memory to maintain first- and second-moment matrices, known as optimizer states. While recent works such as GaLore, Fira, and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What is the minimal amount of optimizer state that is truly necessary to retain state-of-the-art performance in LLM pretraining? In this work, we systematically investigate this question using a bottom-up approach. We find that two memory- and compute-efficient optimization techniques are particularly effective: (1) column-wise gradient normalization significantly boosts the performance of plain SGD without requiring momentum; and (2) adding first-order momentum only to the output layer - where gradient variance is highest - yields performance competitive with fully adaptive methods such as Muon. Based on these insights, we propose SCALE (Stochastic Column-normalized Last-layer Momentum), a new optimizer that combines column-normalized SGD with last-layer momentum, where column normalization refers to normalizing the gradient along the output dimension. Across multiple LLaMA models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira, and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For the LLaMA 7B model, SCALE outperforms the state-of-the-art method APOLLO in terms of both perplexity and memory consumption. In addition, our method serves as a minimalist baseline for more sophisticated optimizer design.
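The two ingredients compose into a very small optimizer. The following numpy sketch is our paraphrase of the described update, with illustrative hyperparameters: every layer's gradient is normalized column-wise along the output dimension, and only the final layer keeps a first-order momentum buffer.

```python
import numpy as np

# Minimal numpy sketch of the SCALE update as described in the abstract:
# column-wise (output-dimension) gradient normalization for all layers,
# plus first-order momentum only on the last layer. Hyperparameters are
# illustrative; the real optimizer targets LLM-scale training.

def column_normalize(grad, eps=1e-8):
    """Normalize each column (output dimension) of the gradient to unit norm."""
    return grad / (np.linalg.norm(grad, axis=0, keepdims=True) + eps)

class SCALE:
    def __init__(self, params, lr=1e-2, beta=0.9):
        self.lr, self.beta = lr, beta
        self.params = params                      # list of weight matrices
        self.m_last = np.zeros_like(params[-1])   # momentum for last layer only

    def step(self, grads):
        for i, g in enumerate(grads):
            g = column_normalize(g)
            if i == len(self.params) - 1:         # last layer: add momentum
                self.m_last = self.beta * self.m_last + g
                g = self.m_last
            self.params[i] -= self.lr * g
```

The memory savings come directly from the state: one momentum buffer for the last layer only, instead of Adam's two full-model moment buffers.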
Updated: 2025-06-20 00:10:35
标题: LLM预训练的极简优化器设计
摘要: 训练大型语言模型(LLM)通常依赖Adam等自适应优化器,这类优化器需要大量内存来维护一阶和二阶矩,即优化器状态。尽管GaLore、Fira和APOLLO等近期工作提出了状态压缩变体以减少内存消耗,但一个基本问题仍然存在:在LLM预训练中,要保持最先进的性能,真正必需的最小优化器状态量是多少?在这项工作中,我们采用自下而上的方法系统地研究了这个问题。我们发现两种内存和计算高效的优化技术特别有效:(1)按列梯度归一化可显著提升普通SGD的性能,且无需动量;(2)仅在梯度方差最大的输出层添加一阶动量,即可获得与Muon等完全自适应方法相竞争的性能。基于这些见解,我们提出了SCALE(随机列归一化最后一层动量),这是一种将列归一化SGD与最后一层动量相结合的新优化器,其中列归一化指沿输出维度对梯度进行归一化。在多个LLaMA模型(60M-1B)上,SCALE在仅使用总内存35-45%的情况下,性能与Adam持平或更优。它还始终优于GaLore、Fira和APOLLO等内存高效优化器,使其成为内存受限下大规模预训练的有力候选。对于LLaMA 7B模型,SCALE在困惑度和内存消耗方面均优于最先进的方法APOLLO。此外,我们的方法还可作为更复杂优化器设计的极简基线。
更新时间: 2025-06-20 00:10:35
领域: cs.LG,cs.AI,math.OC
Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards
Multi-armed bandit (MAB) is a widely adopted framework for sequential decision-making under uncertainty. Traditional bandit algorithms rely solely on online data, which tends to be scarce as it must be gathered during the online phase when the arms are actively pulled. However, in many practical settings, rich auxiliary data, such as covariates of past users, is available prior to deploying any arms. We introduce a new setting for MAB where pre-trained machine learning (ML) models are applied to convert side information and historical data into \emph{surrogate rewards}. A prominent feature of this setting is that the surrogate rewards may exhibit substantial bias, as true reward data is typically unavailable in the offline phase, forcing ML predictions to heavily rely on extrapolation. To address the issue, we propose the Machine Learning-Assisted Upper Confidence Bound (MLA-UCB) algorithm, which can be applied to any reward prediction model and any form of auxiliary data. When the predicted and true rewards are jointly Gaussian, it provably improves the cumulative regret, provided that the correlation is non-zero -- even in cases where the mean surrogate reward completely misaligns with the true mean rewards. Notably, our method requires no prior knowledge of the covariance matrix between true and surrogate rewards. We compare MLA-UCB with the standard UCB on a range of numerical studies and show a sizable efficiency gain even when the size of the offline data and the correlation between predicted and true rewards are moderate.
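One simple way to realize surrogate-assisted mean estimation is a control-variate style correction, sketched below as an illustration (the actual MLA-UCB estimator and its confidence width differ): the regression coefficient between observed and surrogate rewards is estimated from the arm's own data, so no prior covariance knowledge is needed, and a constant additive bias in the surrogate cancels out entirely.

```python
import numpy as np

# Illustrative control-variate sketch in the spirit of MLA-UCB: combine an
# arm's online rewards with its ML-generated surrogate rewards via a
# regression coefficient estimated from data (no prior covariance needed).
# A biased-but-correlated surrogate still reduces variance.

def assisted_mean(rewards, surrogates, surrogate_pool_mean):
    """Control-variate estimate: r_bar - beta * (s_bar - E[s])."""
    r, s = np.asarray(rewards, float), np.asarray(surrogates, float)
    if len(r) < 2:
        return r.mean()
    C = np.cov(r, s)                  # 2x2 sample covariance (consistent ddof)
    if C[1, 1] == 0:
        return r.mean()
    beta = C[0, 1] / C[1, 1]          # estimated from the arm's own samples
    return r.mean() - beta * (s.mean() - surrogate_pool_mean)

def ucb_index(rewards, surrogates, pool_mean, t):
    """Assisted mean plus a standard Hoeffding-style exploration bonus."""
    n = len(rewards)
    bonus = np.sqrt(2 * np.log(max(t, 2)) / n)
    return assisted_mean(rewards, surrogates, pool_mean) + bonus
```

With rewards `[1, 2, 3]` and surrogates `[11, 12, 13]` (perfectly correlated, offset by 10), the assisted mean recovers the true mean 2 when the surrogate pool mean is 12: the +10 bias cancels.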
Updated: 2025-06-20 00:09:39
标题: 多臂赌博机与机器学习生成的替代奖励
摘要: 多臂赌博机(MAB)是一种广泛采用的在不确定性下进行序贯决策的框架。传统的赌博机算法仅依赖在线数据,而这类数据往往稀缺,因为只能在主动拉动各臂的在线阶段收集。然而,在许多实际场景中,在部署任何臂之前就已有丰富的辅助数据可用,例如过去用户的协变量。我们为MAB引入了一种新设定:应用预训练的机器学习(ML)模型,将辅助信息和历史数据转化为\emph{替代奖励}。这一设定的一个显著特点是替代奖励可能存在相当大的偏差,因为真实奖励数据在离线阶段通常不可得,迫使ML预测严重依赖外推。为了解决这一问题,我们提出了机器学习辅助上置信界(MLA-UCB)算法,它可应用于任何奖励预测模型和任何形式的辅助数据。当预测奖励与真实奖励服从联合高斯分布时,只要两者相关性非零,该算法便可证明地改善累积遗憾,即使替代奖励的均值与真实奖励的均值完全不一致也是如此。值得注意的是,我们的方法不需要事先知道真实奖励与替代奖励之间的协方差矩阵。我们在一系列数值研究中将MLA-UCB与标准UCB进行比较,结果显示即使离线数据规模和预测奖励与真实奖励之间的相关性都只是中等水平,也能获得可观的效率增益。
更新时间: 2025-06-20 00:09:39
领域: math.ST,cs.LG,stat.ML,stat.TH
Near Optimal Decision Trees in a SPLIT Second
Decision tree optimization is fundamental to interpretable machine learning. The most popular approach is to greedily search for the best feature at every decision point, which is fast but provably suboptimal. Recent approaches find the global optimum using branch and bound with dynamic programming, showing substantial improvements in accuracy and sparsity at great cost to scalability. An ideal solution would have the accuracy of an optimal method and the scalability of a greedy method. We introduce a family of algorithms called SPLIT (SParse Lookahead for Interpretable Trees) that moves us significantly forward in achieving this ideal balance. We demonstrate that not all sub-problems need to be solved to optimality to find high quality trees; greediness suffices near the leaves. Since each depth adds an exponential number of possible trees, this change makes our algorithms orders of magnitude faster than existing optimal methods, with negligible loss in performance. We extend this algorithm to allow scalable computation of sets of near-optimal trees (i.e., the Rashomon set).
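The lookahead-then-greedy idea can be shown on a toy problem with binary features and labels. The sketch below searches splits exhaustively (with recursion) for the first `lookahead` levels and falls back to scoring splits by immediate leaf error near the leaves; the real SPLIT algorithms add branch-and-bound pruning and sparsity penalties, which we omit here.

```python
import numpy as np

# Toy sketch of the SPLIT scheme on binary features/labels: optimal search
# for the first `lookahead` levels, greedy (immediate-error) scoring after.
# Returns the number of misclassifications of the best tree found. The real
# algorithm uses branch-and-bound with bounds and sparsity penalties.

def errors(y):
    """Misclassifications of a majority-vote leaf on labels y."""
    return min(int(y.sum()), len(y) - int(y.sum())) if len(y) else 0

def best_tree_errors(X, y, depth, lookahead):
    if depth == 0 or len(y) == 0:
        return errors(y)
    best = errors(y)                   # option: stop here and make a leaf
    for j in range(X.shape[1]):
        mask = X[:, j] == 0
        left, right = y[mask], y[~mask]
        if lookahead > 0:
            # exact: recurse optimally on both children
            cost = (best_tree_errors(X[mask], left, depth - 1, lookahead - 1)
                    + best_tree_errors(X[~mask], right, depth - 1, lookahead - 1))
        else:
            # greedy near the leaves: score the split by immediate error only
            cost = errors(left) + errors(right)
        best = min(best, cost)
    return best
```

On XOR-labeled data, any single level of lookahead already finds the zero-error depth-2 tree, while the purely greedy score cannot see past the uninformative first split; this is the sense in which "greediness suffices near the leaves" but not at the root.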
Updated: 2025-06-20 00:03:09
标题: 转瞬(SPLIT Second)之间的近最优决策树
摘要: 决策树优化是可解释机器学习的基础。最流行的方法是在每个决策点贪心地搜索最佳特征,这种方法速度快,但可证明是次优的。近期的方法使用分支定界与动态规划找到全局最优解,在准确性和稀疏性上有显著改进,但以可扩展性为巨大代价。理想的解决方案应兼具最优方法的准确性和贪心方法的可扩展性。我们引入了一类称为SPLIT(SParse Lookahead for Interpretable Trees)的算法,使我们朝着这一理想平衡大步迈进。我们证明,要找到高质量的树,并非所有子问题都需要求解到最优;在接近叶子节点处使用贪心即可。由于每增加一层深度,可能的树的数量都呈指数增长,这一改变使我们的算法比现有最优方法快几个数量级,而性能损失可忽略不计。我们进一步扩展该算法,使其能够可扩展地计算一组近最优树(即Rashomon集)。
更新时间: 2025-06-20 00:03:09
领域: cs.LG
Mesh-Informed Neural Operator : A Transformer Generative Approach
Generative models in function spaces, situated at the intersection of generative modeling and operator learning, are attracting increasing attention due to their immense potential in diverse scientific and engineering applications. While functional generative models are theoretically domain- and discretization-agnostic, current implementations heavily rely on the Fourier Neural Operator (FNO), limiting their applicability to regular grids and rectangular domains. To overcome these critical limitations, we introduce the Mesh-Informed Neural Operator (MINO). By leveraging graph neural operators and cross-attention mechanisms, MINO offers a principled, domain- and discretization-agnostic backbone for generative modeling in function spaces. This advancement significantly expands the scope of such models to more diverse applications in generative, inverse, and regression tasks. Furthermore, MINO provides a unified perspective on integrating neural operators with general advanced deep learning architectures. Finally, we introduce a suite of standardized evaluation metrics that enable objective comparison of functional generative models, addressing another critical gap in the field.
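The discretization-agnostic ingredient is easiest to see in the cross-attention step: latent queries attend over function values sampled at arbitrary mesh points, so nothing in the computation assumes a regular grid. The numpy sketch below is the standard attention recipe with names of our own choosing, not MINO's API.

```python
import numpy as np

# Illustrative sketch of the cross-attention building block MINO relies on:
# m latent queries attend over function values at n arbitrary mesh points.
# Because n is free, the same operator handles any discretization. Shapes
# and scaling follow the standard attention recipe; names are ours.

def cross_attention(queries, keys, values):
    """queries: (m, d); keys/values: (n, d) with n = number of mesh points."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (m, n)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over mesh points
    return attn @ values                             # (m, d)
```

With identical keys the attention is uniform and each query simply averages the mesh values, which makes the permutation- and resolution-invariance of the aggregation easy to check.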
Updated: 2025-06-20 00:00:22
标题: 网格信息神经算子:一种Transformer生成方法
摘要: 函数空间中的生成模型位于生成建模与算子学习的交叉点,由于其在各类科学与工程应用中的巨大潜力,正受到越来越多的关注。虽然函数生成模型在理论上与定义域和离散化无关,但当前的实现严重依赖傅里叶神经算子(FNO),使其适用范围局限于规则网格和矩形定义域。为克服这些关键限制,我们引入了Mesh-Informed Neural Operator(MINO)。通过利用图神经算子和交叉注意力机制,MINO为函数空间中的生成建模提供了一个有原则的、与定义域和离散化无关的骨干。这一进展显著扩大了此类模型在生成、反问题和回归任务中的应用范围。此外,MINO为将神经算子与一般的先进深度学习架构相结合提供了统一视角。最后,我们引入了一套标准化的评估指标,使函数生成模型之间的客观比较成为可能,填补了该领域的另一个重要空白。
更新时间: 2025-06-20 00:00:22
领域: cs.LG