Arxiv Day: Article

Automated Neuron Labelling Enables Generative Steering and Interpretability in Protein Language Models

Protein language models (PLMs) encode rich biological information, yet their internal neuron representations are poorly understood. We introduce the first automated framework for labeling every neuron in a PLM with biologically grounded natural language descriptions. Unlike prior approaches relying on sparse autoencoders or manual annotation, our method scales to hundreds of thousands of neurons, revealing individual neurons are selectively sensitive to diverse biochemical and structural properties. We then develop a novel neuron activation-guided steering method to generate proteins with desired traits, enabling convergence to target biochemical properties like molecular weight and instability index as well as secondary and tertiary structural motifs, including alpha helices and canonical Zinc Fingers. We finally show that analysis of labeled neurons in different model sizes reveals PLM scaling laws and a structured neuron space distribution.

Updated: 2025-07-08 23:59:13

标题: 自动神经元标记实现了蛋白质语言模型中的生成导向和可解释性

摘要: 蛋白质语言模型（PLMs）编码丰富的生物信息，但其内部神经元表示尚不明确。我们介绍了第一个自动化框架，用于用生物学基础的自然语言描述标记PLM中的每个神经元。与先前依赖稀疏自动编码器或手动注释的方法不同，我们的方法可以扩展到数十万个神经元，揭示单个神经元对不同生化和结构特性具有选择性敏感性。然后，我们开发了一种新颖的神经元激活引导的转向方法，以生成具有期望特征的蛋白质，实现对目标生化特性如分子量和不稳定性指数以及二级和三级结构基序的收敛，包括α螺旋和经典的锌指。最后，我们展示了在不同模型大小中标记神经元的分析揭示了PLM的缩放规律和结构化神经元空间分布。

更新时间: 2025-07-08 23:59:13

领域: cs.LG,q-bio.BM

下载: http://arxiv.org/abs/2507.06458v1

Wild refitting for black box prediction

We describe and analyze a computionally efficient refitting procedure for computing high-probability upper bounds on the instance-wise mean-squared prediction error of penalized nonparametric estimates based on least-squares minimization. Requiring only a single dataset and black box access to the prediction method, it consists of three steps: computing suitable residuals, symmetrizing and scaling them with a pre-factor $\rho$, and using them to define and solve a modified prediction problem recentered at the current estimate. We refer to it as wild refitting, since it uses Rademacher residual symmetrization as in a wild bootstrap variant. Under relatively mild conditions allowing for noise heterogeneity, we establish a high probability guarantee on its performance, showing that the wild refit with a suitably chosen wild noise scale $\rho$ gives an upper bound on prediction error. This theoretical analysis provides guidance into the design of such procedures, including how the residuals should be formed, the amount of noise rescaling in the wild sub-problem needed for upper bounds, and the local stability properties of the block-box procedure. We illustrate the applicability of this procedure to various problems, including non-rigid structure-from-motion recovery with structured matrix penalties; plug-and-play image restoration with deep neural network priors; and randomized sketching with kernel methods.

Updated: 2025-07-08 23:47:24

标题: 野外重新适应用于黑匣子预测

摘要: 我们描述并分析了一种计算效率高的重新拟合程序，用于计算基于最小二乘法的受惩罚非参数估计的实例均方预测误差的高概率上界。该程序仅需要单个数据集和对预测方法的黑盒访问，包括三个步骤：计算合适的残差，用预先因子$\rho$对其进行对称化和缩放，然后使用它们来定义并解决一个在当前估计中心化的修改预测问题。我们将其称为野性重新拟合，因为它使用了Rademacher残差对称化，类似于野性自举变体。在相对温和的条件下，允许噪声异质性，我们对其性能建立了高概率保证，表明野性重新拟合与适当选择的野生噪声尺度$\rho$给出了预测误差的上界。这种理论分析提供了对这种程序设计的指导，包括残差应如何形成、野生子问题中噪声重缩放的量以及黑盒程序的局部稳定性特性。我们展示了这种程序在各种问题上的适用性，包括具有结构矩阵惩罚的非刚性结构运动恢复；带有深度神经网络先验的即插即用图像恢复；以及使用核方法的随机化草图。

更新时间: 2025-07-08 23:47:24

领域: stat.ML,cs.LG,math.ST,stat.TH

下载: http://arxiv.org/abs/2506.21460v2

FedPhD: Federated Pruning with Hierarchical Learning of Diffusion Models

Federated Learning (FL), as a distributed learning paradigm, trains models over distributed clients' data. FL is particularly beneficial for distributed training of Diffusion Models (DMs), which are high-quality image generators that require diverse data. However, challenges such as high communication costs and data heterogeneity persist in training DMs similar to training Transformers and Convolutional Neural Networks. Limited research has addressed these issues in FL environments. To address this gap and challenges, we introduce a novel approach, FedPhD, designed to efficiently train DMs in FL environments. FedPhD leverages Hierarchical FL with homogeneity-aware model aggregation and selection policy to tackle data heterogeneity while reducing communication costs. The distributed structured pruning of FedPhD enhances computational efficiency and reduces model storage requirements in clients. Our experiments across multiple datasets demonstrate that FedPhD achieves high model performance regarding Fr\'echet Inception Distance (FID) scores while reducing communication costs by up to $88\%$. FedPhD outperforms baseline methods achieving at least a $34\%$ improvement in FID, while utilizing only $56\%$ of the total computation and communication resources.

Updated: 2025-07-08 23:24:07

标题: FedPhD：具有扩散模型层级学习的联邦修剪

摘要: 联合学习（FL）作为一种分布式学习范式，通过训练分布式客户端的数据来训练模型。FL特别适用于Diffusion Models（DMs）的分布式训练，这些模型是高质量的图像生成器，需要多样化的数据。然而，在训练DMs时仍存在高通信成本和数据异质性等挑战，类似于训练Transformers和卷积神经网络。目前有限的研究解决了FL环境中的这些问题。为了解决这一差距和挑战，我们引入了一种新颖的方法，FedPhD，旨在在FL环境中高效训练DMs。FedPhD利用具有同质性感知的模型聚合和选择策略的分层FL来解决数据异质性问题，同时降低通信成本。FedPhD的分布式结构修剪提高了计算效率，并减少了客户端中的模型存储需求。我们在多个数据集上的实验表明，FedPhD在降低高达88%的通信成本的同时，实现了高模型性能，如Fr\'echet Inception Distance（FID）得分。FedPhD胜过基线方法，在FID方面至少提高了34%，同时仅利用总计算和通信资源的56%。

更新时间: 2025-07-08 23:24:07

领域: cs.LG,cs.AI,cs.DC,68T05, 68T07, 68Q85, 94A08,I.2.6; I.2.11; C.2.4

下载: http://arxiv.org/abs/2507.06449v1

CodeMirage: Hallucinations in Code Generated by Large Language Models

Large Language Models (LLMs) have shown promising potentials in program generation and no-code automation. However, LLMs are prone to generate hallucinations, i.e., they generate text which sounds plausible but is incorrect. Although there has been a recent surge in research on LLM hallucinations for text generation, similar hallucination phenomenon can happen in code generation. Sometimes the generated code can have syntactical or logical errors as well as more advanced issues like security vulnerabilities, memory leaks, etc. Given the wide adaptation of LLMs to enhance efficiency in code generation and development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt at studying hallucinations in the code generated by LLMs. We start by introducing the code hallucination definition and a comprehensive taxonomy of code hallucination types. We propose the first benchmark CodeMirage dataset for code hallucinations. The benchmark contains 1,137 GPT-3.5 generated hallucinated code snippets for Python programming problems from two base datasets - HumanEval and MBPP. We then propose the methodology for code hallucination detection and experiment with open source LLMs such as CodeLLaMA as well as OpenAI's GPT-3.5 and GPT-4 models using one-shot prompt. We find that GPT-4 performs the best on HumanEval dataset and gives comparable results to the fine-tuned CodeBERT baseline on MBPP dataset. Towards the end, we discuss various mitigation strategies for code hallucinations and conclude our work.

Updated: 2025-07-08 23:14:43

标题: CodeMirage: 大型语言模型生成的代码中的幻觉

摘要: 大型语言模型（LLMs）在程序生成和无代码自动化方面表现出有希望的潜力。然而，LLMs容易生成幻觉，即它们会生成听起来合理但实际不正确的文本。尽管最近有大量关于LLM文本生成幻觉的研究，但类似的幻觉现象也可能在代码生成中发生。有时生成的代码可能存在语法或逻辑错误，以及更高级的问题，如安全漏洞、内存泄漏等。鉴于LLMs被广泛应用于提高代码生成和开发效率，研究代码生成中的幻觉变得迫在眉睫。据我们所知，这是第一次尝试研究LLMs生成的代码中的幻觉。我们首先介绍了代码幻觉的定义和一个全面的代码幻觉类型分类法。我们提出了第一个用于代码幻觉的基准数据集CodeMirage。该基准数据集包含了1,137个GPT-3.5生成的针对Python编程问题的幻觉代码片段，来自两个基础数据集 - HumanEval和MBPP。然后，我们提出了代码幻觉检测的方法论，并尝试使用开源的LLMs，如CodeLLaMA以及OpenAI的GPT-3.5和GPT-4模型，使用一次性提示。我们发现GPT-4在HumanEval数据集上表现最佳，并在MBPP数据集上提供可与精调的CodeBERT基线相媲美的结果。最后，我们讨论了代码幻觉的各种缓解策略，并总结了我们的工作。

更新时间: 2025-07-08 23:14:43

领域: cs.SE,cs.AI,cs.CL

下载: http://arxiv.org/abs/2408.08333v2

Substance over Style: Evaluating Proactive Conversational Coaching Agents

While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges with initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.

Updated: 2025-07-08 23:13:30

标题: 风格之上的实质：评估主动对话辅导代理

摘要: 尽管自然语言处理研究在对话任务方面取得了进展，但许多方法侧重于具有明确定义目标或评估标准的单轮响应。相比之下，辅导面临着独特的挑战，初始目标未定义，通过多轮互动逐渐演变，主观评估标准，混合主动对话。在这项工作中，我们描述并实现了五个展现出不同会话风格的多轮辅导代理，并通过用户研究对它们进行评估，收集了155次对话的第一人称反馈。我们发现用户非常重视核心功能，并且在缺乏核心组件的情况下，风格组件被视为负面的。通过将用户反馈与健康专家和LM的第三人评估进行比较，我们揭示了评估方法之间的显著不一致。我们的发现为对话辅导代理的设计和评估提供了见解，并有助于改善以人为中心的自然语言处理应用。

更新时间: 2025-07-08 23:13:30

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.19328v2

Understanding Fixed Predictions via Confined Regions

Machine learning models can assign fixed predictions that preclude individuals from changing their outcome. Existing approaches to audit fixed predictions do so on a pointwise basis, which requires access to an existing dataset of individuals and may fail to anticipate fixed predictions in out-of-sample data. This work presents a new paradigm to identify fixed predictions by finding confined regions of the feature space in which all individuals receive fixed predictions. This paradigm enables the certification of recourse for out-of-sample data, works in settings without representative datasets, and provides interpretable descriptions of individuals with fixed predictions. We develop a fast method to discover confined regions for linear classifiers using mixed-integer quadratically constrained programming. We conduct a comprehensive empirical study of confined regions across diverse applications. Our results highlight that existing pointwise verification methods fail to anticipate future individuals with fixed predictions, while our method both identifies them and provides an interpretable description.

Updated: 2025-07-08 23:09:48

标题: 理解通过受限区域实现的固定预测

摘要: 机器学习模型可以分配固定预测，阻止个体改变其结果。现有的审计固定预测的方法是基于点的基础，这需要访问现有个体数据集，可能无法预测在样本外数据中的固定预测。本文提出了一种新的范式，通过找到特征空间中的受限区域来识别固定预测，其中所有个体都接收固定预测。这种范式使得对样本外数据的追溯认证成为可能，在没有代表性数据集的情况下工作，并提供了具有固定预测的个体的可解释描述。我们开发了一种快速方法，使用混合整数二次约束规划来发现线性分类器的受限区域。我们进行了一项全面的实证研究，涉及各种应用领域的受限区域。我们的结果突显了现有的点验证方法无法预测未来具有固定预测的个体，而我们的方法既能识别它们，又提供了可解释的描述。

更新时间: 2025-07-08 23:09:48

领域: cs.LG,cs.AI,math.OC

下载: http://arxiv.org/abs/2502.16380v2

Can Interpretation Predict Behavior on Unseen Data?

Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. However, it rarely predicts how a model will respond to unseen input data. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution (OOD) model behavior. Specifically, we investigate the correspondence between attention patterns and OOD generalization in hundreds of Transformer models independently trained on a synthetic classification task. These models exhibit several distinct systematic generalization rules OOD, forming a diverse population for correlational analysis. In this setting, we find that simple observational tools from interpretability can predict OOD performance. In particular, when in-distribution attention exhibits hierarchical patterns, the model is likely to generalize hierarchically on OOD data -- even when the rule's implementation does not rely on these hierarchical patterns, according to ablation tests. Our findings offer a proof-of-concept to motivate further interpretability work on predicting unseen model behavior.

Updated: 2025-07-08 23:07:33

标题: 可以解释预测未见数据上的行为吗？

摘要: 可解释性研究通常旨在预测模型对特定机制的有针对性干预会作出怎样的响应。然而，它很少预测模型对未见输入数据的响应。本文探讨了可解释性作为预测模型在分布外（OOD）行为的工具的潜力和挑战。具体而言，我们独立训练了数百个Transformer模型，在一个合成分类任务上研究了注意力模式和OOD泛化之间的对应关系。这些模型在OOD上表现出几种不同的系统泛化规则，形成了一个多样化的群体，用于相关性分析。在这种情况下，我们发现可解释性的简单观察工具可以预测OOD性能。特别是，当在分布内的注意力表现出分层模式时，模型很可能在OOD数据上进行分层泛化--即使规则的实施并不依赖于这些分层模式，根据消融测试的结果。我们的发现为进一步的可解释性工作提供了一个概念验证，以预测未见模型行为。

更新时间: 2025-07-08 23:07:33

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.06445v1

FACap: A Large-scale Fashion Dataset for Fine-grained Composed Image Retrieval

The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent methods for CIR leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle with application domains like fashion, because the rich and diverse vocabulary used in fashion requires specific fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the expensive cost of manual annotation by specialists. To address these challenges, we introduce FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate accurate and detailed modification texts. Then, we propose a new CIR model FashionBLIP-2, which fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters and multi-head query-candidate matching to better account for fine-grained fashion-specific information. FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark and the enhanced evaluation dataset enhFashionIQ, leveraging our pipeline to obtain higher-quality annotations. Experimental results show that the combination of FashionBLIP-2 and pretraining with FACap significantly improves the model's performance in fashion CIR especially for retrieval with fine-grained modification texts, demonstrating the value of our dataset and approach in a highly demanding environment such as e-commerce websites. Code is available at https://fgxaos.github.io/facap-paper-website/.

Updated: 2025-07-08 23:02:10

标题: FACap：用于细粒度复合图像检索的大规模时尚数据集

摘要: 复合图像检索（CIR）任务是在给定参考图像和修改文本的情况下检索目标图像。最近的CIR方法利用了大型预训练视觉语言模型（VLMs），在颜色和纹理等通用领域概念上取得了良好的性能。然而，它们在时尚等应用领域仍然存在困难，因为时尚领域使用的丰富和多样化词汇需要具体的细粒度视觉和语言理解。另一个困难是缺乏大规模时尚数据集，这些数据集具有详细和相关的注释，因为专家手动注释的成本昂贵。为了解决这些挑战，我们引入了FACap，一个大规模、自动构建的时尚领域CIR数据集。它利用网络来源的时尚图像和一个由VLM和大型语言模型（LLM）驱动的两阶段注释流程，生成准确和详细的修改文本。然后，我们提出了一个新的CIR模型FashionBLIP-2，它在FACap上通过轻量级适配器和多头查询-候选匹配对通用领域BLIP-2模型进行微调，以更好地考虑细粒度的时尚特定信息。FashionBLIP-2在Fashion IQ基准和增强评估数据集enhFashionIQ上进行评估，利用我们的流程获得更高质量的注释。实验结果表明，FashionBLIP-2与FACap的预训练结合显著提高了模型在时尚CIR中的性能，特别是在使用细粒度修改文本进行检索时，展示了我们的数据集和方法在高度要求的环境（如电子商务网站）中的价值。代码可在https://fgxaos.github.io/facap-paper-website/找到。

更新时间: 2025-07-08 23:02:10

领域: cs.LG

下载: http://arxiv.org/abs/2507.07135v1

Quadratic Gating Mixture of Experts: Statistical Insights into Self-Attention

Mixture of Experts (MoE) models are well known for effectively scaling model capacity while preserving computational overheads. In this paper, we establish a rigorous relation between MoE and the self-attention mechanism, showing that each row of a self-attention matrix can be written as a quadratic gating mixture of linear experts. Motivated by this connection, we conduct a comprehensive convergence analysis of MoE models with two different quadratic gating functions, namely the quadratic polynomial gate and the quadratic monomial gate, offering useful insights into the design of gating and experts for the MoE framework. First, our analysis indicates that the use of the quadratic monomial gate yields an improved sample efficiency for estimating parameters and experts compared to the quadratic polynomial gate. Second, parameter and expert estimation rates become significantly faster when employing non-linear experts in place of linear experts. Combining these theoretical insights with the above link between MoE and self-attention, we propose a novel \emph{active-attention} mechanism where we apply a non-linear activation function to the value matrix in the formula of self-attention. Finally, we demonstrate that the proposed active-attention outperforms the standard self-attention through several extensive experiments in various tasks, including image classification, language modeling, and multivariate time series forecasting.

Updated: 2025-07-08 22:45:25

标题: 二次门控混合专家：自注意力的统计洞察

摘要: 混合专家（MoE）模型以有效地扩展模型容量并保留计算开销而闻名。在本文中，我们建立了MoE和自注意机制之间的严格关系，展示了自注意矩阵的每一行可以被写成线性专家的二次门控混合。受到这种联系的启发，我们对具有两种不同二次门控函数的MoE模型进行了全面的收敛分析，即二次多项式门控和二次单项式门控，为MoE框架的门控和专家设计提供了有用的见解。首先，我们的分析表明，与二次多项式门控相比，使用二次单项式门控可以提高参数和专家估计的样本效率。其次，当使用非线性专家替代线性专家时，参数和专家估计速度显著加快。将这些理论见解与MoE和自注意之间的联系结合起来，我们提出了一种新颖的“主动关注”机制，其中我们将非线性激活函数应用于自注意公式中的值矩阵。最后，通过多项广泛的实验，包括图像分类、语言建模和多变量时间序列预测等各种任务，我们证明了所提出的主动关注机制优于标准自注意机制。

更新时间: 2025-07-08 22:45:25

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2410.11222v3

HEMA: A Hands-on Exploration Platform for MEMS Sensor Attacks

Automotive safety and security are paramount in the rapidly advancing landscape of vehicular technology. Building safe and secure vehicles demands a profound understanding of automotive systems, particularly in safety and security. Traditional learning approaches, such as reading materials or observing demonstrations, often fail to provide the practical, hands-on experience essential for developing this expertise. For novice users, gaining access to automotive-grade systems and mastering their associated hardware and software can be challenging and overwhelming. In this paper, we present a novel, affordable, and flexible exploration platform, \hema, that enables users to gain practical, hands-on insights into the security compromises of micro-electromechanical systems (MEMS) sensors, a critical component in modern ADAS systems. Furthermore, we discuss the unique challenges and design considerations involved in creating such a platform, emphasizing its role in enhancing the understanding of automotive safety and security. This framework serves as an invaluable resource for educators, researchers, and practitioners striving to build expertise in the field.

Updated: 2025-07-08 22:44:34

标题: HEMA：用于MEMS传感器攻击的实践探索平台

摘要: 汽车安全和安全在不断发展的车辆技术领域至关重要。构建安全可靠的车辆需要对汽车系统有深刻的理解，特别是在安全和安全方面。传统的学习方法，如阅读材料或观察演示，通常无法提供发展这种专业知识所必需的实践经验。对于初学者来说，获得汽车级系统的访问权限并掌握其相关的硬件和软件可能是具有挑战性和压倒性的。在本文中，我们介绍了一种新颖、经济实惠和灵活的探索平台\hema，使用户能够获得对现代ADAS系统中关键组件微机电系统（MEMS）传感器安全妥协的实际、实践洞察。此外，我们讨论了创建这样一个平台涉及的独特挑战和设计考虑，强调其在增进对汽车安全和安全的理解方面的作用。这个框架对于努力在该领域建立专业知识的教育工作者、研究人员和从业者来说是一种宝贵的资源。

更新时间: 2025-07-08 22:44:34

领域: cs.CR,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.06439v1

Assessing the Prevalence of AI-assisted Cheating in Programming Courses: A Pilot Study

Tools that can generate computer code in response to inputs written in natural language, such as ChatGPT, pose an existential threat to Computer Science education in its current form, since students can now use these tools to solve assignments without much effort. While that risk has already been recognized by scholars, the proportion of the student body that is incurring in this new kind of plagiarism is still an open problem. We conducted a pilot study in a large CS class (n=120) to assess the feasibility of estimating AI plagiarism through anonymous surveys and interviews. More than 25% of the survey respondents admitted to committing AI plagiarism. Conversely, only one student accepted to be interviewed. Given the high levels of misconduct acknowledgment, we conclude that surveys are an effective method for studies on the matter, while interviews should be avoided or designed in a way that can entice participation.

Updated: 2025-07-08 22:40:44

标题: 评估编程课程中AI辅助作弊的普遍程度：一项初步研究

摘要: 可以生成计算机代码以响应自然语言输入的工具，如ChatGPT，对当前形式的计算机科学教育构成了一种存在威胁，因为学生现在可以使用这些工具轻松解决作业。虽然学者们已经认识到了这种风险，但目前仍有学生群体在进行这种新形式的抄袭，这仍然是一个开放的问题。我们在一门大型计算机科学课程中进行了一项试点研究(n=120)，以评估通过匿名调查和访谈估算AI抄袭的可行性。超过25%的调查受访者承认曾经进行过AI抄袭。相反，只有一名学生愿意接受访谈。鉴于高水平的不端行为承认度，我们得出结论认为调查是研究这一问题的有效方法，而访谈应该避免或设计成能够吸引参与的方式。

更新时间: 2025-07-08 22:40:44

领域: cs.CY,cs.AI,cs.HC

下载: http://arxiv.org/abs/2507.06438v1

Deprecating Benchmarks: Criteria and Framework

As frontier artificial intelligence (AI) models rapidly advance, benchmarks are integral to comparing different models and measuring their progress in different task-specific domains. However, there is a lack of guidance on when and how benchmarks should be deprecated once they cease to effectively perform their purpose. This risks benchmark scores over-valuing model capabilities, or worse, obscuring capabilities and safety-washing. Based on a review of benchmarking practices, we propose criteria to decide when to fully or partially deprecate benchmarks, and a framework for deprecating benchmarks. Our work aims to advance the state of benchmarking towards rigorous and quality evaluations, especially for frontier models, and our recommendations are aimed to benefit benchmark developers, benchmark users, AI governance actors (across governments, academia, and industry panels), and policy makers.

Updated: 2025-07-08 22:29:06

标题: 淘汰基准测试：标准与框架

摘要: 随着前沿人工智能（AI）模型的快速发展，基准测试对比较不同模型并衡量它们在不同任务特定领域的进展至关重要。然而，在基准测试停止有效执行其目的时缺乏何时以及如何淘汰基准测试的指导。这可能导致基准分数高估模型能力，甚至更糟的是，掩盖能力和安全洗白。基于对基准测试实践的审查，我们提出了决定何时完全或部分淘汰基准测试的标准，以及一个淘汰基准测试的框架。我们的工作旨在推动基准测试的发展，实现严格和高质量的评估，特别是针对前沿模型，我们的建议旨在惠及基准测试开发者、基准测试用户、AI治理参与者（跨政府、学术界和行业专家组）以及决策者。

更新时间: 2025-07-08 22:29:06

领域: cs.CY,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.06434v1

eegFloss: A Python package for refining sleep EEG recordings using machine learning models

Electroencephalography (EEG) allows monitoring of brain activity, providing insights into the functional dynamics of various brain regions and their roles in cognitive processes. EEG is a cornerstone in sleep research, serving as the primary modality of polysomnography, the gold standard in the field. However, EEG signals are prone to artifacts caused by both internal (device-specific) factors and external (environmental) interferences. As sleep studies are becoming larger, most rely on automatic sleep staging, a process highly susceptible to artifacts, leading to erroneous sleep scores. This paper addresses this challenge by introducing eegFloss, an open-source Python package to utilize eegUsability, a novel machine learning (ML) model designed to detect segments with artifacts in sleep EEG recordings. eegUsability has been trained and evaluated on manually artifact-labeled EEG data collected from 15 participants over 127 nights using the Zmax headband. It demonstrates solid overall classification performance (F1-score is approximately 0.85, Cohens kappa is 0.78), achieving a high recall rate of approximately 94% in identifying channel-wise usable EEG data, and extends beyond Zmax. Additionally, eegFloss offers features such as automatic time-in-bed detection using another ML model named eegMobility, filtering out certain artifacts, and generating hypnograms and sleep statistics. By addressing a fundamental challenge faced by most sleep studies, eegFloss can enhance the precision and rigor of their analysis as well as the accuracy and reliability of their outcomes.

Updated: 2025-07-08 22:27:43

标题: eegFloss：一个用于使用机器学习模型优化睡眠脑电图记录的Python软件包

摘要: 脑电图（EEG）允许监测脑活动，提供对各个脑区功能动态及其在认知过程中的作用的见解。EEG是睡眠研究的基石，是多导睡眠图的主要模态，也是该领域的黄金标准。然而，EEG信号容易受到内部（设备特定）因素和外部（环境）干扰引起的伪迹影响。随着睡眠研究变得更大，大多数依赖于自动睡眠分期，这一过程极易受到伪迹的影响，导致错误的睡眠评分。本文通过引入eegFloss来解决这一挑战，这是一个开源的Python软件包，利用eegUsability，一种专门设计用于检测睡眠EEG记录中带有伪迹的片段的机器学习（ML）模型。eegUsability已经在使用Zmax头环收集的15名参与者127个晚上手动标记的EEG数据上进行了训练和评估。它展示了良好的整体分类性能（F1得分约为0.85，Cohens卡帕为0.78），在识别通道可用的EEG数据方面实现了约94%的高召回率，并且超出了Zmax范围。此外，eegFloss还提供了一些功能，如使用另一个名为eegMobility的ML模型自动检测入睡时间，过滤掉某些伪迹，并生成睡眠图和睡眠统计数据。通过解决大多数睡眠研究面临的基本挑战，eegFloss可以增强其分析的精度和严谨性，以及其结果的准确性和可靠性。

更新时间: 2025-07-08 22:27:43

领域: cs.LG,eess.SP,q-bio.QM

下载: http://arxiv.org/abs/2507.06433v1

Transfer Learning for Transient Classification: From Simulations to Real Data and ZTF to LSST

Machine learning has become essential for automated classification of astronomical transients, but current approaches face significant limitations: classifiers trained on simulations struggle with real data, models developed for one survey cannot be easily applied to another, and new surveys require prohibitively large amounts of labelled training data. These challenges are particularly pressing as we approach the era of the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST), where existing classification models will need to be retrained using LSST observations. We demonstrate that transfer learning can overcome these challenges by repurposing existing models trained on either simulations or data from other surveys. Starting with a model trained on simulated Zwicky Transient Facility (ZTF) light curves, we show that transfer learning reduces the amount of labelled real ZTF transients needed by 95% while maintaining equivalent performance to models trained from scratch. Similarly, when adapting ZTF models for LSST simulations, transfer learning achieves 94% of the baseline performance while requiring only 30% of the training data. These findings have significant implications for the early operations of LSST, suggesting that reliable automated classification will be possible soon after the survey begins, rather than waiting months or years to accumulate sufficient training data.

Updated: 2025-07-08 22:27:42

标题: 跨领域迁移学习：从模拟数据到真实数据和ZTF到LSST的瞬变分类

摘要: 机器学习已经成为自动分类天文现象的必需工具，但目前的方法面临着重大限制：在模拟数据上训练的分类器在处理真实数据时遇到困难，为一项调查开发的模型不能轻易应用于另一项调查，并且新的调查需要大量标记的训练数据。随着鲁宾天文观测所的“空间与时间遗产调查”（LSST）时代的临近，这些挑战变得尤为紧迫，现有的分类模型将需要使用LSST观测重新训练。我们证明，通过重新利用在模拟数据或其他调查数据上训练的现有模型，迁移学习可以克服这些挑战。从一个在模拟的 Zwicky 瞬变设施（ZTF）光变曲线上训练的模型开始，我们展示了迁移学习可以将所需的真实 ZTF 瞬变标记数量减少95%，同时保持与从头开始训练的模型相当的性能。同样，当将 ZTF 模型调整到 LSST 模拟数据时，迁移学习可以实现基线性能的94%，仅需要30%的训练数据。这些发现对于 LSST 的早期运营具有重要意义，表明可靠的自动分类将可能在调查开始后不久实现，而不是等待几个月或几年积累足够的训练数据。

更新时间: 2025-07-08 22:27:42

领域: astro-ph.IM,astro-ph.HE,cs.LG

下载: http://arxiv.org/abs/2502.18558v2

Bridging Data Gaps of Rare Conditions in ICU: A Multi-Disease Adaptation Approach for Clinical Prediction

Artificial Intelligence has revolutionised critical care for common conditions. Yet, rare conditions in the intensive care unit (ICU), including recognised rare diseases and low-prevalence conditions in the ICU, remain underserved due to data scarcity and intra-condition heterogeneity. To bridge such gaps, we developed KnowRare, a domain adaptation-based deep learning framework for predicting clinical outcomes for rare conditions in the ICU. KnowRare mitigates data scarcity by initially learning condition-agnostic representations from diverse electronic health records through self-supervised pre-training. It addresses intra-condition heterogeneity by selectively adapting knowledge from clinically similar conditions with a developed condition knowledge graph. Evaluated on two ICU datasets across five clinical prediction tasks (90-day mortality, 30-day readmission, ICU mortality, remaining length of stay, and phenotyping), KnowRare consistently outperformed existing state-of-the-art models. Additionally, KnowRare demonstrated superior predictive performance compared to established ICU scoring systems, including APACHE IV and IV-a. Case studies further demonstrated KnowRare's flexibility in adapting its parameters to accommodate dataset-specific and task-specific characteristics, its generalisation to common conditions under limited data scenarios, and its rationality in selecting source conditions. These findings highlight KnowRare's potential as a robust and practical solution for supporting clinical decision-making and improving care for rare conditions in the ICU.

Updated: 2025-07-08 22:27:19

标题: 在ICU中填补罕见病情的数据空白：一种用于临床预测的多病种适应方法

摘要: 人工智能已经彻底改变了危重病病情的护理。然而，在重症监护室（ICU）中，包括已知的罕见疾病和低发病率条件在内的罕见疾病仍然受到忽视，这是由于数据稀缺和疾病内部异质性造成的。为了弥补这些差距，我们开发了KnowRare，这是一个基于领域适应的深度学习框架，用于预测ICU中罕见疾病的临床结果。KnowRare通过通过自监督预训练从多样化的电子健康记录中学习与疾病无关的表征来减轻数据稀缺性。它通过有条件知识图选择性地从临床上类似的疾病中适应知识，从而解决了疾病内部异质性。在两个ICU数据集上对五个临床预测任务（90天死亡率、30天再入院率、ICU死亡率、剩余住院时间和表型分类）进行评估，KnowRare始终优于现有的最先进模型。此外，KnowRare表现出比已建立的ICU评分系统（包括APACHE IV和IV-a）更出色的预测性能。案例研究进一步展示了KnowRare在调整其参数以适应数据集特定和任务特定特性、在有限数据场景下将其泛化到常见疾病以及在选择源疾病方面的合理性。这些发现突显了KnowRare作为支持临床决策和改善ICU中罕见疾病护理的健壮和实用解决方案的潜力。

更新时间: 2025-07-08 22:27:19

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.06432v1

Neural Actor-Critic Methods for Hamilton-Jacobi-Bellman PDEs: Asymptotic Analysis and Numerical Studies

We mathematically analyze and numerically study an actor-critic machine learning algorithm for solving high-dimensional Hamilton-Jacobi-Bellman (HJB) partial differential equations from stochastic control theory. The architecture of the critic (the estimator for the value function) is structured so that the boundary condition is always perfectly satisfied (rather than being included in the training loss) and utilizes a biased gradient which reduces computational cost. The actor (the estimator for the optimal control) is trained by minimizing the integral of the Hamiltonian over the domain, where the Hamiltonian is estimated using the critic. We show that the training dynamics of the actor and critic neural networks converge in a Sobolev-type space to a certain infinite-dimensional ordinary differential equation (ODE) as the number of hidden units in the actor and critic $\rightarrow \infty$. Further, under a convexity-like assumption on the Hamiltonian, we prove that any fixed point of this limit ODE is a solution of the original stochastic control problem. This provides an important guarantee for the algorithm's performance in light of the fact that finite-width neural networks may only converge to a local minimizers (and not optimal solutions) due to the non-convexity of their loss functions. In our numerical studies, we demonstrate that the algorithm can solve stochastic control problems accurately in up to 200 dimensions. In particular, we construct a series of increasingly complex stochastic control problems with known analytic solutions and study the algorithm's numerical performance on them. These problems range from a linear-quadratic regulator equation to highly challenging equations with non-convex Hamiltonians, allowing us to identify and analyze the strengths and limitations of this neural actor-critic method for solving HJB equations.

Updated: 2025-07-08 22:20:22

标题: 神经演员评论家方法用于Hamilton-Jacobi-Bellman PDEs：渐近分析和数值研究

摘要: 我们对一种用于解决高维哈密尔顿-雅可比-贝尔曼（HJB）偏微分方程的随机控制理论的演员-评论家机器学习算法进行了数学分析和数值研究。评论家的结构（价值函数的估计器）被设计为始终完全满足边界条件（而不包括在训练损失中），并利用减少计算成本的有偏梯度。演员（最优控制的估计器）通过最小化整个区域上的哈密尔顿量来训练，其中哈密尔顿量是使用评论家估计的。我们证明了演员和评论家神经网络的训练动态在Sobolev类型空间中收敛到一个特定的无穷维常微分方程（ODE），当演员和评论家中的隐藏单元数量$\rightarrow \infty$时。此外，在哈密尔顿量上的类似凸性的假设下，我们证明了这个极限ODE的任何固定点都是原始随机控制问题的解。鉴于有限宽度神经网络可能仅由于其损失函数的非凸性而收敛于局部最小值（而非最优解），这为算法的性能提供了重要保证。在我们的数值研究中，我们展示了该算法在高达200维的情况下可以准确解决随机控制问题。特别地，我们构建了一系列越来越复杂的具有已知解析解的随机控制问题，并研究了该算法在这些问题上的数值性能。这些问题从线性二次调节器方程到具有非凸哈密尔顿量的高度挑战性方程都有，这使我们能够识别和分析这种神经演员-评论家方法在解决HJB方程中的优势和局限性。

更新时间: 2025-07-08 22:20:22

领域: math.OC,cs.LG,cs.NA,math.NA,stat.ML,93E20, 35Q93, 68T07, 90-08

下载: http://arxiv.org/abs/2507.06428v1

BOOST: Out-of-Distribution-Informed Adaptive Sampling for Bias Mitigation in Stylistic Convolutional Neural Networks

The pervasive issue of bias in AI presents a significant challenge to painting classification, and is getting more serious as these systems become increasingly integrated into tasks like art curation and restoration. Biases, often arising from imbalanced datasets where certain artistic styles dominate, compromise the fairness and accuracy of model predictions, i.e., classifiers are less accurate on rarely seen paintings. While prior research has made strides in improving classification performance, it has largely overlooked the critical need to address these underlying biases, that is, when dealing with out-of-distribution (OOD) data. Our insight highlights the necessity of a more robust approach to bias mitigation in AI models for art classification on biased training data. We propose a novel OOD-informed model bias adaptive sampling method called BOOST (Bias-Oriented OOD Sampling and Tuning). It addresses these challenges by dynamically adjusting temperature scaling and sampling probabilities, thereby promoting a more equitable representation of all classes. We evaluate our proposed approach to the KaoKore and PACS datasets, focusing on the model's ability to reduce class-wise bias. We further propose a new metric, Same-Dataset OOD Detection Score (SODC), designed to assess class-wise separation and per-class bias reduction. Our method demonstrates the ability to balance high performance with fairness, making it a robust solution for unbiasing AI models in the art domain.

Updated: 2025-07-08 22:18:36

标题: BOOST：基于分布外信息的自适应采样，用于在风格卷积神经网络中减轻偏见

摘要: AI中存在的偏见普遍问题对于绘画分类构成了重大挑战，并随着这些系统逐渐融入到诸如艺术策展和恢复等任务中而变得更加严重。这些偏见往往源自数据集不平衡，其中某些艺术风格占主导地位，从而损害了模型预测的公平性和准确性，即分类器对很少见的绘画的准确性较低。尽管先前的研究在改善分类性能方面取得了进展，但它在很大程度上忽视了解决这些潜在偏见的关键需求，即在处理分布之外的数据时。我们的观点强调了在偏见数据集上为艺术分类的AI模型实施更加健壮的偏见缓解方法的必要性。我们提出了一种新颖的基于OOD信息的模型偏见自适应采样方法，称为BOOST（Bias-Oriented OOD Sampling and Tuning）。通过动态调整温度缩放和采样概率，该方法解决了这些挑战，从而促进所有类别的更加公平的表现。我们对KaoKore和PACS数据集评估了我们提出的方法，重点关注模型减少类别偏见的能力。我们进一步提出了一个新的度量标准，即Same-Dataset OOD Detection Score（SODC），旨在评估类别间的分离和每个类别的偏见减少。我们的方法展示了在高性能和公平性之间取得平衡的能力，使其成为艺术领域中为AI模型去偏见的强大解决方案。

更新时间: 2025-07-08 22:18:36

领域: cs.AI,cs.LG,I.2.10

下载: http://arxiv.org/abs/2507.07134v1

Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders

Large Language Models (LLMs) are traditionally viewed as black-box algorithms, therefore reducing trustworthiness and obscuring potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of the prompts with additional annotations to improve the interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.

Updated: 2025-07-08 22:17:52

标题: 用稀疏自动编码器通过可解释模型探索任务表现

摘要: 大型语言模型（LLMs）传统上被视为黑匣子算法，因此降低了可信度并模糊了增加下游任务性能的潜在方法。在这项工作中，我们应用一种有效的LLM分解方法，使用稀疏自编码器的字典学习方法。这有助于从多义LLM神经元中提取单语义特征。值得注意的是，我们的工作识别了模型内部的误解，允许自动重新制定提示，并附加注释以改善LLMs的解释。此外，这种方法在下游任务中表现出显著的性能提升，如数学推理和隐喻检测。

更新时间: 2025-07-08 22:17:52

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.06427v1

Generative Panoramic Image Stitching

We introduce the task of generative panoramic image stitching, which aims to synthesize seamless panoramas that are faithful to the content of multiple reference images containing parallax effects and strong variations in lighting, camera capture settings, or style. In this challenging setting, traditional image stitching pipelines fail, producing outputs with ghosting and other artifacts. While recent generative models are capable of outpainting content consistent with multiple reference images, they fail when tasked with synthesizing large, coherent regions of a panorama. To address these limitations, we propose a method that fine-tunes a diffusion-based inpainting model to preserve a scene's content and layout based on multiple reference images. Once fine-tuned, the model outpaints a full panorama from a single reference image, producing a seamless and visually coherent result that faithfully integrates content from all reference images. Our approach significantly outperforms baselines for this task in terms of image quality and the consistency of image structure and scene layout when evaluated on captured datasets.

Updated: 2025-07-08 22:07:12

标题: 生成全景图像拼接

摘要: 我们介绍了生成全景图像拼接任务，旨在合成与包含视差效果和光照、相机捕捉设置或风格等方面强烈变化的多个参考图像内容相符的无缝全景图。在这种具有挑战性的情境下，传统的图像拼接流程失败，产生出带有幽灵和其他瑕疵的输出。虽然最近的生成模型能够根据多个参考图像输出与之一致的内容，但在任务要求合成大片连贯的全景时却失败。为了解决这些限制，我们提出了一种方法，通过对基于扩散的修补模型进行微调，以基于多个参考图像保留场景的内容和布局。一旦进行了微调，该模型能够根据单个参考图像输出完整的全景图，产生出无缝且视觉一致的结果，忠实地整合了所有参考图像的内容。我们的方法在评估捕捉数据集时，在图像质量和图像结构一致性以及场景布局方面明显优于此任务的基线。

更新时间: 2025-07-08 22:07:12

领域: cs.GR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.07133v1

Rugsafe: A multichain protocol for recovering from and defending against Rug Pulls

Rugsafe introduces a comprehensive protocol aimed at mitigating the risks of rug pulls in the cryptocurrency ecosystem. By utilizing cryptographic security measures and economic incentives, the protocol provides a secure multichain system for recovering assets and transforming rugged tokens into opportunities and rewards. Foundational to Rugsafe are specialized vaults where rugged tokens can be securely deposited, and anticoin tokens are issued as receipts. These anticoins are designed to be inversely pegged to the price movement of the underlying rugged token. Users can utilize these anticoins within the ecosystem or choose to burn them, further securing the protocol and earning additional rewards. The supply of the native Rugsafe token is dynamically adjusted based on the volume, value, and activity of rugged tokens, ensuring stability and resilience. By depositing rugged tokens into a vault on several chains, and by burning anticoins, users receive incentives on the RugSafe chain. This protocol's vaults are designed to work in heterogenous blockchain ecosystems, offering a practical and effective solution to one of the most significant challenges in the cryptocurrency market.

Updated: 2025-07-08 21:59:47

标题: Rugsafe：一种用于从中恢复和防御的多链协议 Rug Pulls

摘要: Rugsafe提出了一个全面的协议，旨在减轻加密货币生态系统中的风险。通过利用加密安全措施和经济激励，该协议为资产的恢复和将不稳定的代币转化为机会和奖励提供了一个安全的多链系统。Rugsafe的基础是专门的保险库，可以安全地存放不稳定的代币，并发行反代币作为收据。这些反代币的设计是与基础不稳定代币的价格运动成反比。用户可以在生态系统内利用这些反代币，也可以选择销毁它们，进一步保护协议并获得额外的奖励。本地Rugsafe代币的供应根据不稳定代币的数量、价值和活动动态调整，以确保稳定性和弹性。通过在多个链上将不稳定代币存入保险库，并销毁反代币，用户可以在RugSafe链上获得激励。该协议的保险库旨在在异构区块链生态系统中运行，为加密货币市场中最重要的挑战之一提供了一个实用有效的解决方案。

更新时间: 2025-07-08 21:59:47

领域: cs.CR,cs.CE,cs.ET,cs.GT

下载: http://arxiv.org/abs/2507.06423v1

Never Trust the Manufacturer, Never Trust the Client: A Novel Method for Streaming STL Files for Secure Additive

While additive manufacturing has opened interesting avenues to reimagine manufacturing as a service (MaaS) platform, transmission of design files from client to manufacturer over networks opens up many cybersecurity challenges. Securing client's intellectual property (IP) especially from cyber-attacks emerges as a major challenge. Earlier works introduced streaming, instead of sharing process plan (G-code) files, as a possible solution. However, executing client's G-codes on manufacturer's machines exposes them to potential malicious G-codes. This paper proposes a viable approach when the client and manufacturer do not trust each other and both the client and manufacturer want to preserve their IP of designs and manufacturing process respectively. The proposed approach is based on segmenting and streaming design (STL) files and employing a novel machine-specific STL to G-code translator at the manufacturer's site in real-time for printing. This approach secures design and manufacturing process IPs as demonstrated in a real-world implementation.

Updated: 2025-07-08 21:59:21

标题: 永远不要相信制造商，永远不要相信客户：一种用于安全添加的流式STL文件的新方法

摘要: 尽管增材制造已经开辟了重新想象制造服务(MaaS)平台的有趣途径，但设计文件在客户和制造商之间通过网络传输打开了许多网络安全挑战。特别是保护客户的知识产权(IP)免受网络攻击的工作变得越来越具有挑战性。早期的研究引入了流式传输，而不是共享工艺计划(G代码)文件，作为一个可能的解决方案。然而，在制造商的机器上执行客户的G代码会使其暴露于潜在的恶意G代码。本文提出了一种可行的方法，当客户和制造商彼此不信任，并且客户和制造商都希望分别保留他们的设计和制造过程的知识产权时。所提出的方法基于对设计文件(STL)进行分段和流式传输，并在制造商的现场实时使用一种新颖的机器特定STL到G代码转换器进行打印。这种方法通过实际实施证明了设计和制造过程的知识产权安全性。

更新时间: 2025-07-08 21:59:21

领域: cs.CR

下载: http://arxiv.org/abs/2507.06421v1

Capsule-ConvKAN: A Hybrid Neural Approach to Medical Image Classification

This study conducts a comprehensive comparison of four neural network architectures: Convolutional Neural Network, Capsule Network, Convolutional Kolmogorov--Arnold Network, and the newly proposed Capsule--Convolutional Kolmogorov--Arnold Network. The proposed Capsule-ConvKAN architecture combines the dynamic routing and spatial hierarchy capabilities of Capsule Network with the flexible and interpretable function approximation of Convolutional Kolmogorov--Arnold Networks. This novel hybrid model was developed to improve feature representation and classification accuracy, particularly in challenging real-world biomedical image data. The architectures were evaluated on a histopathological image dataset, where Capsule-ConvKAN achieved the highest classification performance with an accuracy of 91.21\%. The results demonstrate the potential of the newly introduced Capsule-ConvKAN in capturing spatial patterns, managing complex features, and addressing the limitations of traditional convolutional models in medical image classification.

Updated: 2025-07-08 21:51:05

标题: Capsule-ConvKAN：一种用于医学图像分类的混合神经网络方法

摘要: 本研究对四种神经网络架构进行了全面比较：卷积神经网络、胶囊网络、卷积Kolmogorov-Arnold网络以及新提出的胶囊-卷积Kolmogorov-Arnold网络。提出的胶囊-卷积Kolmogorov-Arnold网络结合了胶囊网络的动态路由和空间层次能力以及卷积Kolmogorov-Arnold网络的灵活和可解释的函数逼近能力。这种新颖的混合模型旨在提高特征表示和分类准确性，特别是在具有挑战性的现实生物医学图像数据中。这些架构在组织病理学图像数据集上进行了评估，其中胶囊-卷积Kolmogorov-Arnold网络实现了最高的分类性能，准确率达到91.21％。结果表明，新引入的胶囊-卷积Kolmogorov-Arnold网络在捕捉空间模式、处理复杂特征以及解决传统卷积模型在医学图像分类中的局限性方面具有潜力。

更新时间: 2025-07-08 21:51:05

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.06417v1

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.

Updated: 2025-07-08 21:45:08

标题: 建立建立严格主体基准的最佳实践

摘要: 基准测试对于定量跟踪人工智能领域的进展至关重要。随着人工智能代理变得越来越强大，研究人员和实践者引入了代理基准测试，以评估代理在复杂的现实任务上的表现。这些基准测试通常通过特定奖励设计评估任务结果来衡量代理的能力。然而，我们发现许多代理基准测试在任务设置或奖励设计方面存在问题。例如，SWE-bench Verified使用不足的测试案例，而TAU-bench将空响应计为成功。这些问题可能导致相对性能的低估或高估高达100%。为了使代理评估更加严谨，我们引入了代理基准测试清单（ABC），这是我们从基准测试构建经验、最佳实践调查和先前报道的问题中综合出的一组指导方针。当应用于具有特别复杂评估设计的CVE-Bench时，ABC将性能高估降低了33%。

更新时间: 2025-07-08 21:45:08

领域: cs.AI,A.1; I.2.m

下载: http://arxiv.org/abs/2507.02825v2

PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning

Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.

Updated: 2025-07-08 21:38:45

标题: PERK：作为参数高效测试时间学习的长上下文推理

摘要: 长上下文推理需要准确识别广泛、嘈杂的输入背景中的相关信息。先前的研究表明，利用测试时学习将上下文直接编码到模型参数中可以有效地实现对嘈杂信息的推理。然而，用于实现测试时学习的元学习方法具有过高的内存消耗，阻碍了它们在长上下文设置中的应用。在本文中，我们提出了PERK（高效参数推理知识），这是一种可伸缩的方法，通过在测试时对轻量级模型适配器进行梯度更新来学习编码长输入背景。具体来说，PERK在元训练阶段采用两个嵌套的优化循环。内循环快速将上下文编码到一个低秩适配器（LoRA）中，该适配器作为基础模型的参数高效内存模块。同时，外循环学习如何使用更新后的适配器准确地回忆和推理出编码的长上下文中的相关信息。我们在几个长上下文推理任务上的评估结果显示，PERK明显优于标准基于提示的长上下文基线，对于较小的模型（GPT-2）平均绝对性能提高高达90％，对于我们评估的最大模型Qwen-2.5-0.5B则高达27％。总的来说，PERK对推理复杂性、长度外推和上下文中相关信息的位置更具鲁棒性。最后，我们展示了虽然PERK在训练过程中需要大量内存，但在推理时比基于提示的长上下文推理更高效。

更新时间: 2025-07-08 21:38:45

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.06415v1

Tokenization for Molecular Foundation Models

Text-based foundation models have become an important part of scientific discovery, with molecular foundation models accelerating advancements in material science and molecular design.However, existing models are constrained by closed-vocabulary tokenizers that capture only a fraction of molecular space. In this work, we systematically evaluate 34 tokenizers, including 19 chemistry-specific ones, and reveal significant gaps in their coverage of the SMILES molecular representation. To assess the impact of tokenizer choice, we introduce n-gram language models as a low-cost proxy and validate their effectiveness by pretraining and finetuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers -- Smirk and Smirk-GPE -- with full coverage of the OpenSMILES specification. The proposed tokenizers systematically integrate nuclear, electronic, and geometric degrees of freedom; facilitating applications in pharmacology, agriculture, biology, and energy storage. Our results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics.

Updated: 2025-07-08 21:38:03

标题: 分子基础模型的标记化

摘要: 基于文本的基础模型已成为科学发现的重要组成部分，分子基础模型加速了材料科学和分子设计的进展。然而，现有模型受到闭合词汇标记器的限制，仅捕捉了分子空间的一小部分。在这项工作中，我们系统评估了34种标记器，包括19种化学特定的标记器，并揭示了它们在SMILES分子表示覆盖范围上的显著缺陷。为了评估标记器选择的影响，我们引入了n-gram语言模型作为低成本代理，并通过对18个RoBERTa风格编码器进行预训练和微调来验证它们的有效性，用于分子属性预测。为了克服现有标记器的限制，我们提出了两种新的标记器——Smirk和Smirk-GPE——完全覆盖了OpenSMILES规范。所提出的标记器系统地整合了核、电子和几何自由度，促进了在药理学、农业、生物学和能源存储方面的应用。我们的结果凸显了在化学信息学中需要开放词汇建模和化学多样性基准的需求。

更新时间: 2025-07-08 21:38:03

领域: cs.LG,cs.AI,physics.chem-ph,q-bio.BM

下载: http://arxiv.org/abs/2409.15370v3

MedSyn: Enhancing Diagnostics with Human-AI Collaboration

Clinical decision-making is inherently complex, often influenced by cognitive biases, incomplete information, and case ambiguity. Large Language Models (LLMs) have shown promise as tools for supporting clinical decision-making, yet their typical one-shot or limited-interaction usage may overlook the complexities of real-world medical practice. In this work, we propose a hybrid human-AI framework, MedSyn, where physicians and LLMs engage in multi-step, interactive dialogues to refine diagnoses and treatment decisions. Unlike static decision-support tools, MedSyn enables dynamic exchanges, allowing physicians to challenge LLM suggestions while the LLM highlights alternative perspectives. Through simulated physician-LLM interactions, we assess the potential of open-source LLMs as physician assistants. Results show open-source LLMs are promising as physician assistants in the real world. Future work will involve real physician interactions to further validate MedSyn's usefulness in diagnostic accuracy and patient outcomes.

Updated: 2025-07-08 21:35:13

标题: MedSyn：通过人工智能与人类协作提升诊断能力

摘要: 临床决策是固有复杂的，通常受认知偏见、信息不完整和案例模糊性的影响。大型语言模型(LLMs)已经显示出作为支持临床决策的工具的潜力，然而它们通常的一次性或有限交互的使用可能忽视了现实世界医疗实践的复杂性。在这项工作中，我们提出了一个混合人工智能框架MedSyn，医生和LLMs进行多步交互对话，以完善诊断和治疗决策。与静态决策支持工具不同，MedSyn实现了动态交流，使医生能够挑战LLMs的建议，同时LLMs突出备选观点。通过模拟医生-LLM交互，我们评估开源LLMs作为医生助手的潜力。结果显示开源LLMs在现实世界中作为医生助手是有希望的。未来的工作将涉及真实医生的互动，进一步验证MedSyn在诊断准确性和患者结果方面的实用性。

更新时间: 2025-07-08 21:35:13

领域: cs.LG,cs.AI,cs.HC

下载: http://arxiv.org/abs/2506.14774v2

SSSUMO: Real-Time Semi-Supervised Submovement Decomposition

This paper introduces a SSSUMO, semi-supervised deep learning approach for submovement decomposition that achieves state-of-the-art accuracy and speed. While submovement analysis offers valuable insights into motor control, existing methods struggle with reconstruction accuracy, computational cost, and validation, due to the difficulty of obtaining hand-labeled data. We address these challenges using a semi-supervised learning framework. This framework learns from synthetic data, initially generated from minimum-jerk principles and then iteratively refined through adaptation to unlabeled human movement data. Our fully convolutional architecture with differentiable reconstruction significantly surpasses existing methods on both synthetic and diverse human motion datasets, demonstrating robustness even in high-noise conditions. Crucially, the model operates in real-time (less than a millisecond per input second), a substantial improvement over optimization-based techniques. This enhanced performance facilitates new applications in human-computer interaction, rehabilitation medicine, and motor control studies. We demonstrate the model's effectiveness across diverse human-performed tasks such as steering, rotation, pointing, object moving, handwriting, and mouse-controlled gaming, showing notable improvements particularly on challenging datasets where traditional methods largely fail. Training and benchmarking source code, along with pre-trained model weights, are made publicly available at https://github.com/dolphin-in-a-coma/sssumo.

Updated: 2025-07-08 21:26:25

标题: SSSUMO：实时半监督子运动分解

摘要: 这篇论文介绍了一种名为SSSUMO的半监督深度学习方法，用于子运动分解，实现了最先进的准确性和速度。尽管子运动分析提供了对运动控制的宝贵见解，但现有方法在重建准确性、计算成本和验证方面存在困难，这是由于获取手工标记数据的困难。我们通过半监督学习框架来解决这些挑战。该框架从最小冲击原则初始生成的合成数据中学习，然后通过适应未标记的人类运动数据进行迭代改进。我们的全卷积架构具有可微重建，在合成和多样化的人类运动数据集上明显优于现有方法，即使在高噪声条件下也表现出鲁棒性。关键是，该模型实时运行（每输入秒少于一毫秒），这是对基于优化的技术的重大改进。这种增强的性能促进了在人机交互、康复医学和运动控制研究中的新应用。我们展示了该模型在各种由人类执行的任务上的有效性，例如转向、旋转、指向、移动物体、书写和鼠标控制游戏，在传统方法很难成功的挑战性数据集上特别表现出明显的改进。培训和基准代码，以及预训练模型权重，均可在https://github.com/dolphin-in-a-coma/sssumo上公开获取。

更新时间: 2025-07-08 21:26:25

领域: cs.HC,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.08028v1

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents

As Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings, it is critical to evaluate their ability to sabotage users by pursuing hidden objectives. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena, the first highly diverse agent evaluation dataset for sabotage and monitoring capabilities of LLM agents. SHADE-Arena consists of complex pairs of benign main tasks and harmful side objectives in complicated environments. Agents are evaluated on their ability to complete the side task without appearing suspicious to an LLM monitor. When measuring agent ability to (a) complete the main task, (b) complete the side task, and (c) avoid detection, we find that the best performing frontier models score 27% (Claude 3.7 Sonnet) and 15% (Gemini 2.5 Pro) as sabotage agents when overseen by Claude 3.6 Sonnet. For current frontier models, success on the side task relies heavily on having access to a hidden scratchpad that is not visible to the monitor. We also use SHADE-Arena to measure models' monitoring abilities, with the top monitor (Gemini 2.5 Pro) achieving an AUC of 0.87 at distinguishing benign and malign transcripts. We find that for now, models still struggle at sabotage due to failures in long-context main task execution. However, our measurements already demonstrate the difficulty of monitoring for subtle sabotage attempts, which we expect to only increase in the face of more complex and longer-horizon tasks.

Updated: 2025-07-08 21:23:22

标题: SHADE-Arena：评估LLM代理中的破坏和监控

摘要: 随着大型语言模型（LLMs）越来越多地被部署为复杂和长期的环境中的自主代理，评估它们通过追求隐藏目标来破坏用户的能力至关重要。我们研究了前沿LLMs在完成各种现实任务时逃避监测并实现有害隐藏目标的能力。我们使用SHADE（微妙有害代理检测与评估）-Arena对一系列前沿LLMs进行评估，这是第一个用于评估LLM代理的破坏和监控能力的高度多样化的数据集。SHADE-Arena由复杂的良性主任务和复杂环境中的有害副目标对组成。代理根据其能够完成副任务而不被LLM监视器发现的能力进行评估。在衡量代理完成主任务、完成副任务和避免被发现的能力时，我们发现表现最佳的前沿模型在被Claude 3.6 Sonnet监督时，Sabotage代理的得分分别为27％（Claude 3.7 Sonnet）和15％（Gemini 2.5 Pro）。对于当前的前沿模型，成功完成副任务很大程度上取决于是否能够访问对监视器不可见的隐藏草稿本。我们还使用SHADE-Arena来衡量模型的监控能力，最高的监视器（Gemini 2.5 Pro）在区分良性和恶性转录时实现了0.87的AUC。我们发现，目前模型在破坏方面仍然存在困难，这是由于在长期上下文主任务执行中的失败。然而，我们的测量已经显示了监控微妙破坏尝试的困难，我们预计在面对更复杂和更长期的任务时这种困难只会增加。

更新时间: 2025-07-08 21:23:22

领域: cs.AI,cs.CR,cs.LG

下载: http://arxiv.org/abs/2506.15740v2

Many-Task Federated Fine-Tuning via Unified Task Vectors

Federated Learning (FL) traditionally assumes homogeneous client tasks; however, in real-world scenarios, clients often specialize in diverse tasks, introducing task heterogeneity. To address this challenge, Many-Task FL (MaT-FL) has emerged, enabling clients to collaborate effectively despite task diversity. Existing MaT-FL approaches rely on client grouping or personalized layers, requiring the server to manage individual models and failing to account for clients handling multiple tasks. We propose MaTU, a MaT-FL approach that enables joint learning of task vectors across clients, eliminating the need for clustering or client-specific weight storage at the server. Our method introduces a novel aggregation mechanism that determines task similarity based on the direction of clients task vectors and constructs a unified task vector encapsulating all tasks. To address task-specific requirements, we augment the unified task vector with lightweight modulators that facilitate knowledge transfer among related tasks while disentangling dissimilar ones. Evaluated across 30 datasets, MaTU achieves superior performance over state-of-the-art MaT-FL approaches, with results comparable to per-task fine-tuning, while delivering significant communication savings.

Updated: 2025-07-08 21:23:10

标题: 通过统一任务向量进行的多任务联邦微调

摘要: 联邦学习（FL）传统上假定客户端任务是同质的；然而，在实际场景中，客户端往往专门从事各种任务，引入任务异质性。为了解决这一挑战，许多任务FL（MaT-FL）应运而生，使客户端能够有效合作，尽管任务多样化。现有的MaT-FL方法依赖于客户端分组或个性化层，需要服务器管理单个模型，并未考虑客户端处理多个任务。我们提出了MaTU，一种MaT-FL方法，可以实现客户端跨任务向量的联合学习，消除了服务器对聚类或客户端特定权重存储的需求。我们的方法引入了一种新颖的聚合机制，根据客户端任务向量的方向确定任务相似性，并构建一个封装所有任务的统一任务向量。为了满足特定任务的要求，我们使用轻量级调制器扩充了统一任务向量，促进相关任务之间的知识转移，同时解开不相关任务。在30个数据集上进行评估，MaTU相对于最先进的MaT-FL方法取得了优越的性能，结果与每个任务的微调相当，同时实现了显著的通信节省。

更新时间: 2025-07-08 21:23:10

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2502.06376v3

SImpHAR: Advancing impedance-based human activity recognition using 3D simulation and text-to-motion models

Human Activity Recognition (HAR) with wearable sensors is essential for applications in healthcare, fitness, and human-computer interaction. Bio-impedance sensing offers unique advantages for fine-grained motion capture but remains underutilized due to the scarcity of labeled data. We introduce SImpHAR, a novel framework addressing this limitation through two core contributions. First, we propose a simulation pipeline that generates realistic bio-impedance signals from 3D human meshes using shortest-path estimation, soft-body physics, and text-to-motion generation serving as a digital twin for data augmentation. Second, we design a two-stage training strategy with decoupled approach that enables broader activity coverage without requiring label-aligned synthetic data. We evaluate SImpHAR on our collected ImpAct dataset and two public benchmarks, showing consistent improvements over state-of-the-art methods, with gains of up to 22.3% and 21.8%, in terms of accuracy and macro F1 score, respectively. Our results highlight the promise of simulation-driven augmentation and modular training for impedance-based HAR.

Updated: 2025-07-08 21:15:12

标题: SImpHAR: 使用3D模拟和文本到动作模型推进基于阻抗的人体活动识别

摘要: 穿戴传感器进行人体活动识别（HAR）对于医疗保健、健身和人机交互应用至关重要。生物阻抗传感提供了精细动作捕捉的独特优势，但由于标记数据的稀缺性而被低估利用。我们介绍了SImpHAR，这是一个通过两个核心贡献来解决这一限制的新框架。首先，我们提出了一个模拟流水线，使用最短路径估计、软体物理和文本到动作生成从3D人体网格生成真实的生物阻抗信号，作为数据增强的数字孪生体。其次，我们设计了一个两阶段训练策略，采用解耦方法，使得更广泛的活动覆盖范围而无需标签对齐的合成数据。我们在我们收集的ImpAct数据集和两个公共基准测试中评估了SImpHAR，显示了相对于现有方法的持续改进，准确率和宏F1分数分别提高了高达22.3%和21.8%。我们的结果突显了模拟驱动增强和模块化训练对基于阻抗的HAR的潜力。

更新时间: 2025-07-08 21:15:12

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.06405v1

Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction

Evaluating and comparing the performance of autonomous Humanoid Robots is challenging, as success rate metrics are difficult to reproduce and fail to capture the complexity of robot movement trajectories, critical in Human-Robot Interaction and Collaboration (HRIC). To address these challenges, we propose a general evaluation framework that measures the quality of Imitation Learning (IL) methods by focusing on trajectory performance. We devise the Neural Meta Evaluator (NeME), a deep learning model trained to classify actions from robot joint trajectories. NeME serves as a meta-evaluator to compare the performance of robot control policies, enabling policy evaluation without requiring human involvement in the loop. We validate our framework on ergoCub, a humanoid robot, using teleoperation data and comparing IL methods tailored to the available platform. The experimental results indicate that our method is more aligned with the success rate obtained on the robot than baselines, offering a reproducible, systematic, and insightful means for comparing the performance of multimodal imitation learning approaches in complex HRI tasks.

Updated: 2025-07-08 21:12:57

标题: 学习评估在人机交互中的自主行为

摘要: 评估和比较自主人形机器人的性能是具有挑战性的，因为成功率指标难以复制，并且无法捕捉机器人运动轨迹的复杂性，这在人机交互和协作中至关重要。为了应对这些挑战，我们提出了一个通用评估框架，通过关注轨迹性能来衡量模仿学习（IL）方法的质量。我们设计了神经元元评估器（NeME），这是一个深度学习模型，经过训练可对机器人关节轨迹中的动作进行分类。NeME作为元评估器，用于比较机器人控制策略的性能，使得可以在不需要人类参与的情况下进行策略评估。我们在人形机器人ergoCub上验证了我们的框架，使用远程操作数据，并比较针对可用平台定制的IL方法。实验结果表明，我们的方法与在机器人上获得的成功率更加一致，比基线方法提供了一种可复制、系统化和富有洞察力的方法，用于比较复杂HRI任务中多模仿学习方法的性能。

更新时间: 2025-07-08 21:12:57

领域: cs.RO,cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.06404v1

Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach

Multimodal transformers integrate diverse data types like images, audio, and text, advancing tasks such as audio-visual understanding and image-text retrieval; yet their high parameterization limits deployment on resource-constrained edge devices. Split Learning (SL), which partitions models at a designated cut-layer to offload compute-intensive operations to the server, offers a promising approach for distributed training of multimodal transformers, though its application remains underexplored. We present MPSL, a parallel SL approach for computational efficient fine-tuning of multimodal transformers in a distributed manner, while eliminating label sharing, client synchronization, and per-client sub-model management. MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs. Our evaluation across 7 multimodal datasets demonstrates that MPSL matches or outperforms Federated Learning, reduces client-side computations by 250x, and achieves superior scalability in communication cost with model growth. Through extensive analysis, we highlight task suitability, trade-offs, and scenarios where MPSL excels, inspiring further exploration.

Updated: 2025-07-08 21:12:11

标题: 在边缘上微调多模态变压器：一种并行分割学习方法

摘要: 多模态变压器集成了像图像、音频和文本这样的多种数据类型，推动了诸如音频-视觉理解和图像-文本检索等任务的发展；然而，它们高参数化的特性限制了在资源受限的边缘设备上的部署。分裂学习（SL）将模型分割到指定的切割层，将计算密集型操作卸载到服务器，为多模态变压器的分布式训练提供了一种有前途的方法，尽管其应用仍未得到充分探索。我们提出了MPSL，这是一种并行SL方法，可在分布式环境中进行高效的微调多模态变压器，同时消除了标签共享、客户端同步和每个客户端子模型管理的需求。MPSL采用轻量级的客户端分词器和统一的模态无关编码器，允许灵活地适应特定任务的需求。我们在7个多模态数据集上进行评估，结果表明MPSL与联邦学习相匹配或表现更好，将客户端计算量降低了250倍，并在模型增长的通信成本方面实现了更好的可扩展性。通过广泛的分析，我们突出显示了MPSL在任务适用性、权衡和优势的情况下，激发了进一步的探索。

更新时间: 2025-07-08 21:12:11

领域: cs.DC,cs.LG

下载: http://arxiv.org/abs/2502.06355v3

Detection of Intelligent Tampering in Wireless Electrocardiogram Signals Using Hybrid Machine Learning

With the proliferation of wireless electrocardiogram (ECG) systems for health monitoring and authentication, protecting signal integrity against tampering is becoming increasingly important. This paper analyzes the performance of CNN, ResNet, and hybrid Transformer-CNN models for tamper detection. It also evaluates the performance of a Siamese network for ECG based identity verification. Six tampering strategies, including structured segment substitutions and random insertions, are emulated to mimic real world attacks. The one-dimensional ECG signals are transformed into a two dimensional representation in the time frequency domain using the continuous wavelet transform (CWT). The models are trained and evaluated using ECG data from 54 subjects recorded in four sessions 2019 to 2025 outside of clinical settings while the subjects performed seven different daily activities. Experimental results show that in highly fragmented manipulation scenarios, CNN, FeatCNN-TranCNN, FeatCNN-Tran and ResNet models achieved an accuracy exceeding 99.5 percent . Similarly, for subtle manipulations (for example, 50 percent from A and 50 percent from B and, 75 percent from A and 25 percent from B substitutions) our FeatCNN-TranCNN model demonstrated consistently reliable performance, achieving an average accuracy of 98 percent . For identity verification, the pure Transformer-Siamese network achieved an average accuracy of 98.30 percent . In contrast, the hybrid CNN-Transformer Siamese model delivered perfect verification performance with 100 percent accuracy.

Updated: 2025-07-08 21:10:07

标题: 使用混合机器学习检测无线心电图信号中的智能篡改

摘要: 随着用于健康监测和身份验证的无线心电图（ECG）系统的增多，保护信号完整性免受篡改的重要性日益增加。本文分析了用于篡改检测的CNN、ResNet和混合Transformer-CNN模型的性能。它还评估了基于ECG的身份验证的Siamese网络的性能。模拟了六种篡改策略，包括结构段替换和随机插入，以模拟真实世界的攻击。将一维ECG信号转换为在时间频率域中的二维表示，使用连续小波变换（CWT）。这些模型使用来自54个受试者在2019年至2025年四个会话期间在非临床环境中进行七种不同日常活动时记录的ECG数据进行训练和评估。实验结果显示，在高度分裂的篡改场景中，CNN、FeatCNN-TranCNN、FeatCNN-Tran和ResNet模型的准确率超过99.5％。同样，对于微妙的篡改（例如，50％来自A，50％来自B，以及75％来自A，25％来自B的替换），我们的FeatCNN-TranCNN模型表现出持续可靠的性能，平均准确率达到98％。对于身份验证，纯Transformer-Siamese网络的平均准确率为98.30％。相比之下，混合CNN-Transformer Siamese模型提供了完美的验证性能，准确率达到100％。

更新时间: 2025-07-08 21:10:07

领域: cs.LG,cs.CR,eess.SP

下载: http://arxiv.org/abs/2507.06402v1

The Trilemma of Truth in Large Language Models

We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.

Updated: 2025-07-08 21:09:56

标题: 大型语言模型中的真相三难

摘要: 我们经常将人类特征归因于大型语言模型（LLMs），并声称它们“知道”某些事情。LLMs具有代表在训练过程中保留的信息的内部概率知识。我们如何评估这种知识的真实性？我们检查了两种常见的用于检验LLMs真实性的方法，并发现了几个存在缺陷的假设。为了解决这些缺陷的假设，我们引入了sAwMIL（Sparse Aware Multiple-Instance Learning）,这是一种利用LLMs的内部激活将语句分为真、假和无的探针方法。sAwMIL基于多实例学习和符合预测。我们对16个开源LLMs，包括默认和基于聊天的变体，以及3个新数据集上评估了sAwMIL的5个有效性标准。我们提供的见解包括：（1）真实性信号常集中在LLMs深度的第三季度;（2）真实和虚假信号并不总是对称的;（3）线性探针在聊天模型上的表现优于默认模型;（4）对于一些LLMs，可能需要非线性探针来捕获来自人类反馈或知识蒸馏的强化学习的真实性信号;（5）LLMs捕获了一种与真和假不同且既非真又非假的信号。这些发现提供了一种可靠的方法，用于验证LLMs“知道”什么以及他们对自己的概率内部知识有多么确信。

更新时间: 2025-07-08 21:09:56

领域: cs.CL,cs.LG,stat.ML,68T50,I.2.6; I.2.7; G.3

下载: http://arxiv.org/abs/2506.23921v2

An AI-Driven Thermal-Fluid Testbed for Advanced Small Modular Reactors: Integration of Digital Twin and Large Language Models

This paper presents a multipurpose artificial intelligence (AI)-driven thermal-fluid testbed designed to advance Small Modular Reactor technologies by seamlessly integrating physical experimentation with advanced computational intelligence. The platform uniquely combines a versatile three-loop thermal-fluid facility with a high-fidelity digital twin and sophisticated AI frameworks for real-time prediction, control, and operational assistance. Methodologically, the testbed's digital twin, built upon the System Analysis Module code, is coupled with a Gated Recurrent Unit (GRU) neural network. This machine learning model, trained on experimental data, enables faster-than-real-time simulation, providing predictive insights into the system's dynamic behavior. The practical application of this AI integration is showcased through case studies. An AI-driven control framework where the GRU model accurately forecasts future system states and the corresponding control actions required to meet operational demands. Furthermore, an intelligent assistant, powered by a large language model, translates complex sensor data and simulation outputs into natural language, offering operators actionable analysis and safety recommendations. Comprehensive validation against experimental transients confirms the platform's high fidelity, with the GRU model achieving a temperature prediction root mean square error of 1.42 K. This work establishes an integrated research environment at the intersection of AI and thermal-fluid science, showcasing how AI-driven methodologies in modeling, control, and operator support can accelerate the innovation and deployment of next-generation nuclear systems.

Updated: 2025-07-08 21:07:30

标题: 一个用于先进小型模块反应堆的人工智能驱动的热流测试平台：数字孪生体和大型语言模型的集成

摘要: 这篇论文介绍了一个多功能人工智能（AI）驱动的热流测试平台，旨在通过将物理实验与先进计算智能无缝集成，推动小型模块反应堆技术的发展。该平台独特地将多功能三回路热流设施与高保真数字孪生体和复杂的AI框架结合在一起，用于实时预测、控制和运行辅助。在方法上，该测试平台的数字孪生体，建立在系统分析模块代码的基础上，与门控循环单元（GRU）神经网络耦合。这种机器学习模型，经过实验数据训练，能够实现快于实时的模拟，提供对系统动态行为的预测洞察。这种AI集成的实际应用通过案例研究展示出来。一个AI驱动的控制框架，其中GRU模型准确预测未来系统状态及满足运行要求所需的控制行动。此外，一个由大型语言模型驱动的智能助手将复杂传感器数据和模拟输出转化为自然语言，为操作员提供可操作的分析和安全建议。对实验瞬变进行全面验证证实了该平台的高保真度，GRU模型实现了1.42K的温度预测均方根误差。这项工作建立了一个在AI和热流科学交叉点上的综合研究环境，展示了在建模、控制和操作员支持方面采用AI驱动方法如何加速下一代核系统的创新和部署。

更新时间: 2025-07-08 21:07:30

领域: eess.SY,cs.AI,cs.SY

下载: http://arxiv.org/abs/2507.06399v1

Federated Breast Cancer Detection Enhanced by Synthetic Ultrasound Image Augmentation

Federated learning (FL) has emerged as a promising paradigm for collaboratively training deep learning models across institutions without exchanging sensitive medical data. However, its effectiveness is often hindered by limited data availability and non-independent, identically distributed data across participating clients, which can degrade model performance and generalization. To address these challenges, we propose a generative AI based data augmentation framework that integrates synthetic image sharing into the federated training process for breast cancer diagnosis via ultrasound images. Specifically, we train two simple class-specific Deep Convolutional Generative Adversarial Networks: one for benign and one for malignant lesions. We then simulate a realistic FL setting using three publicly available breast ultrasound image datasets: BUSI, BUS-BRA, and UDIAT. FedAvg and FedProx are adopted as baseline FL algorithms. Experimental results show that incorporating a suitable number of synthetic images improved the average AUC from 0.9206 to 0.9237 for FedAvg and from 0.9429 to 0.9538 for FedProx. We also note that excessive use of synthetic data reduced performance, underscoring the importance of maintaining a balanced ratio of real and synthetic samples. Our findings highlight the potential of generative AI based data augmentation to enhance FL results in the breast ultrasound image classification task.

Updated: 2025-07-08 21:03:53

标题: 合成超声图像增强的联邦乳腺癌检测

摘要: 联邦学习（FL）已经成为一个有前途的范例，可以在不交换敏感医疗数据的情况下，跨机构协作训练深度学习模型。然而，其有效性通常受到数据可用性有限以及参与客户之间数据不独立、不同分布的影响，这可能降低模型性能和泛化能力。为了解决这些挑战，我们提出了一个基于生成人工智能的数据增强框架，将合成图像分享集成到基于超声波图像的乳腺癌诊断的联邦训练过程中。具体而言，我们训练了两个简单的类别特定的深度卷积生成对抗网络：一个用于良性病变，一个用于恶性病变。然后，我们使用三个公开可用的乳腺超声波图像数据集（BUSI，BUS-BRA和UDIAT）模拟了一个真实的FL设置。FedAvg和FedProx被采用作为基准FL算法。实验结果显示，合适数量的合成图像的纳入将FedAvg的平均AUC从0.9206提高到0.9237，将FedProx的平均AUC从0.9429提高到0.9538。我们还注意到，过度使用合成数据会降低性能，强调了保持真实和合成样本平衡比例的重要性。我们的发现突显了基于生成人工智能的数据增强对于提升乳腺超声波图像分类任务中的FL结果的潜力。

更新时间: 2025-07-08 21:03:53

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2506.23334v2

Jolting Technologies: Superexponential Acceleration in AI Capabilities and Implications for AGI

This paper investigates the Jolting Technologies Hypothesis, which posits superexponential growth (increasing acceleration, or a positive third derivative) in the development of AI capabilities. We develop a theoretical framework and validate detection methodologies through Monte Carlo simulations, while acknowledging that empirical validation awaits suitable longitudinal data. Our analysis focuses on creating robust tools for future empirical studies and exploring the potential implications should the hypothesis prove valid. The study examines how factors such as shrinking idea-to-action intervals and compounding iterative AI improvements drive this jolting pattern. By formalizing jolt dynamics and validating detection methods through simulation, this work provides the mathematical foundation necessary for understanding potential AI trajectories and their consequences for AGI emergence, offering insights for research and policy.

Updated: 2025-07-08 21:03:49

标题: 震荡技术：人工智能能力的超指数加速及对通用人工智能的影响

摘要: 本文调查了Jolting Technologies假设，该假设认为人工智能能力的发展呈现超指数增长（加速增长，或正的三阶导数）。我们建立了一个理论框架，并通过蒙特卡洛模拟验证了检测方法，同时承认实证验证需要合适的纵向数据。我们的分析重点是为未来的实证研究创建强大的工具，并探讨如果该假设被证明有效可能带来的潜在影响。该研究考察了缩小从想法到行动的间隔和复合迭代人工智能改进如何推动这种突变模式。通过形式化突变动态并通过模拟验证检测方法，这项工作为理解潜在的人工智能轨迹及其对通用人工智能出现的后果提供了必要的数学基础，并为研究和政策提供见解。

更新时间: 2025-07-08 21:03:49

领域: cs.AI,cs.CY,68T01, 91B26, 93C15

下载: http://arxiv.org/abs/2507.06398v1

Representing Prompting Patterns with PDL: Compliance Agent Case Study

Prompt engineering for LLMs remains complex, with existing frameworks either hiding complexity behind restrictive APIs or providing inflexible canned patterns that resist customization -- making sophisticated agentic programming challenging. We present the Prompt Declaration Language (PDL), a novel approach to prompt representation that tackles this fundamental complexity by bringing prompts to the forefront, enabling manual and automatic prompt tuning while capturing the composition of LLM calls together with rule-based code and external tools. By abstracting away the plumbing for such compositions, PDL aims at improving programmer productivity while providing a declarative representation that is amenable to optimization. This paper demonstrates PDL's utility through a real-world case study of a compliance agent. Tuning the prompting pattern of this agent yielded up to 4x performance improvement compared to using a canned agent and prompt pattern.

Updated: 2025-07-08 21:03:22

标题: 用PDL表示提示模式：合规代理案例研究

摘要: 为LLMs进行提示工程仍然是复杂的，现有的框架要么隐藏在限制性API背后的复杂性，要么提供不灵活的预设模式，抵抗定制化 - 使得复杂的代理编程具有挑战性。我们提出了Prompt Declaration Language（PDL），这是一种新颖的提示表示方法，通过将提示置于前沿，实现了手动和自动提示调整，同时捕捉了LLM调用的组合以及基于规则的代码和外部工具。通过抽象化这些组合的管道，PDL旨在提高程序员的生产力，同时提供一个适合优化的声明性表示。本文通过一个合规代理的实际案例研究展示了PDL的实用性。调整该代理的提示模式与使用预设代理和提示模式相比，性能提高了多达4倍。

更新时间: 2025-07-08 21:03:22

领域: cs.AI,cs.LG,cs.PL,cs.SE

下载: http://arxiv.org/abs/2507.06396v1

Nonlinear denoising score matching for enhanced learning of structured distributions

We present a novel method for training score-based generative models which uses nonlinear noising dynamics to improve learning of structured distributions. Generalizing to a nonlinear drift allows for additional structure to be incorporated into the dynamics, thus making the training better adapted to the data, e.g., in the case of multimodality or (approximate) symmetries. Such structure can be obtained from the data by an inexpensive preprocessing step. The nonlinear dynamics introduces new challenges into training which we address in two ways: 1) we develop a new nonlinear denoising score matching (NDSM) method, 2) we introduce neural control variates in order to reduce the variance of the NDSM training objective. We demonstrate the effectiveness of this method on several examples: a) a collection of low-dimensional examples, motivated by clustering in latent space, b) high-dimensional images, addressing issues with mode imbalance, small training sets, and approximate symmetries, the latter being a challenge for methods based on equivariant neural networks, which require exact symmetries, c) latent space representation of high-dimensional data, demonstrating improved performance with greatly reduced computational cost. Our method learns score-based generative models with less data by flexibly incorporating structure arising in the dataset.

Updated: 2025-07-08 21:02:36

标题: 非线性去噪得分匹配用于增强结构化分布的学习

摘要: 我们提出了一种训练基于分数的生成模型的新方法，该方法使用非线性噪声动态来改善对结构化分布的学习。将非线性漂移概括为允许将额外结构纳入动态中，从而使训练更适应数据，例如在多模式或（近似）对称性的情况下。这种结构可以通过廉价的预处理步骤从数据中获得。非线性动态引入了新的挑战，我们以两种方式进行解决：1）我们开发了一种新的非线性去噪得分匹配（NDSM）方法，2）我们引入神经控制变量以减少NDSM训练目标的方差。我们在几个示例上展示了这种方法的有效性：a）一组低维示例，受到潜在空间中聚类的启发，b）高维图像，解决了模式不平衡、小训练集和近似对称性的问题，后者对基于等变神经网络的方法构成挑战，这些方法要求精确对称性，c）高维数据的潜在空间表示，展示了在大大减少计算成本的情况下性能得到改善。我们的方法通过灵活地纳入数据集中出现的结构，以较少的数据学习基于分数的生成模型。

更新时间: 2025-07-08 21:02:36

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2405.15625v2

On Jailbreaking Quantized Language Models Through Fault Injection Attacks

The safety alignment of Language Models (LMs) is a critical concern, yet their integrity can be challenged by direct parameter manipulation attacks, such as those potentially induced by fault injection. As LMs are increasingly deployed using low-precision quantization for efficiency, this paper investigates the efficacy of such attacks for jailbreaking aligned LMs across different quantization schemes. We propose gradient-guided attacks, including a tailored progressive bit-level search algorithm introduced herein and a comparative word-level (single weight update) attack. Our evaluation on Llama-3.2-3B, Phi-4-mini, and Llama-3-8B across FP16 (baseline), and weight-only quantization (FP8, INT8, INT4) reveals that quantization significantly influences attack success. While attacks readily achieve high success (>80% Attack Success Rate, ASR) on FP16 models, within an attack budget of 25 perturbations, FP8 and INT8 models exhibit ASRs below 20% and 50%, respectively. Increasing the perturbation budget up to 150 bit-flips, FP8 models maintained ASR below 65%, demonstrating some resilience compared to INT8 and INT4 models that have high ASR. In addition, analysis of perturbation locations revealed differing architectural targets across quantization schemes, with (FP16, INT4) and (INT8, FP8) showing similar characteristics. Besides, jailbreaks induced in FP16 models were highly transferable to subsequent FP8/INT8 quantization (<5% ASR difference), though INT4 significantly reduced transferred ASR (avg. 35% drop). These findings highlight that while common quantization schemes, particularly FP8, increase the difficulty of direct parameter manipulation jailbreaks, vulnerabilities can still persist, especially through post-attack quantization.

Updated: 2025-07-08 20:54:53

标题: 通过故障注入攻击对量化语言模型进行越狱

摘要: 语言模型（LMs）的安全对齐是一个关键问题，然而它们的完整性可能会受到直接参数篡改攻击的挑战，这些攻击可能是由故障注入引起的。随着LMs越来越多地使用低精度量化以提高效率，本文研究了这种攻击对不同量化方案下对齐的LMs的有效性。我们提出了梯度引导攻击，包括本文介绍的定制渐进位级搜索算法和比较的单词级（单个权重更新）攻击。我们对Llama-3.2-3B、Phi-4-mini和Llama-3-8B在FP16（基线）和仅权重量化（FP8、INT8、INT4）上进行评估，结果显示量化显著影响攻击成功率。虽然攻击很容易在FP16模型上实现高成功率（>80%攻击成功率，ASR），在25次扰动的攻击预算内，FP8和INT8模型的ASR分别低于20%和50%。将扰动预算增加到150个位翻转时，FP8模型的ASR仍然低于65%，相比之下，INT8和INT4模型的ASR较高，表现出一定的韧性。此外，扰动位置的分析显示，不同量化方案之间存在不同的架构目标，（FP16、INT4）和（INT8、FP8）显示相似的特征。此外，在FP16模型中引发的越狱攻击在后续FP8/INT8量化中高度可传递（<5% ASR差异），尽管INT4显著降低了传递的ASR（平均35%下降）。这些发现强调了尽管常见的量化方案，特别是FP8，增加了直接参数篡改越狱攻击的难度，但漏洞仍可能存在，特别是通过攻击后的量化。

更新时间: 2025-07-08 20:54:53

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.03236v2

Ampere: Communication-Efficient and High-Accuracy Split Federated Learning

A Federated Learning (FL) system collaboratively trains neural networks across devices and a server but is limited by significant on-device computation costs. Split Federated Learning (SFL) systems mitigate this by offloading a block of layers of the network from the device to a server. However, in doing so, it introduces large communication overheads due to frequent exchanges of intermediate activations and gradients between devices and the server and reduces model accuracy for non-IID data. We propose Ampere, a novel collaborative training system that simultaneously minimizes on-device computation and device-server communication while improving model accuracy. Unlike SFL, which uses a global loss by iterative end-to-end training, Ampere develops unidirectional inter-block training to sequentially train the device and server block with a local loss, eliminating the transfer of gradients. A lightweight auxiliary network generation method decouples training between the device and server, reducing frequent intermediate exchanges to a single transfer, which significantly reduces the communication overhead. Ampere mitigates the impact of data heterogeneity by consolidating activations generated by the trained device block to train the server block, in contrast to SFL, which trains on device-specific, non-IID activations. Extensive experiments on multiple CNNs and transformers show that, compared to state-of-the-art SFL baseline systems, Ampere (i) improves model accuracy by up to 13.26% while reducing training time by up to 94.6%, (ii) reduces device-server communication overhead by up to 99.1% and on-device computation by up to 93.13%, and (iii) reduces standard deviation of accuracy by 53.39% for various non-IID degrees highlighting superior performance when faced with heterogeneous data.

Updated: 2025-07-08 20:54:43

标题: 安培：通信高效且高准确性的分裂联邦学习

摘要: 一种联邦学习（FL）系统协作训练神经网络跨设备和服务器，但受到设备上显著的计算成本限制。分裂式联邦学习（SFL）系统通过将网络的一块层从设备转移到服务器来缓解这一问题。然而，在这样做的过程中，由于设备和服务器之间频繁交换中间激活和梯度，引入了大量的通信开销，并降低了非独立同分布数据的模型准确性。我们提出了一种新颖的协作训练系统Ampere，它同时最小化设备上的计算和设备-服务器通信，同时提高模型准确性。与SFL不同，SFL通过迭代的端到端训练使用全局损失，Ampere开发了单向的块间训练，使用本地损失顺序训练设备和服务器块，消除了梯度的传输。一种轻量级的辅助网络生成方法解耦了设备和服务器之间的训练，将频繁的中间交换减少到一次传输，从而显著减少了通信开销。Ampere通过整合训练设备块生成的激活来训练服务器块，与SFL相反，SFL在设备特定的非独立同分布激活上进行训练。对多个CNN和transformers进行的广泛实验表明，与最先进的SFL基准系统相比，Ampere（i）将模型准确性提高了最多13.26%，同时将训练时间缩短了最多94.6%，（ii）将设备-服务器通信开销减少了最多99.1%，将设备上的计算减少了最多93.13%，（iii）将各种非独立同分布程度的准确性标准差减少了53.39%，突出表现出在面对异质数据时具有卓越性能。

更新时间: 2025-07-08 20:54:43

领域: cs.DC,cs.LG

下载: http://arxiv.org/abs/2507.07130v1

Enhancing Plasticity for First Session Adaptation Continual Learning

The integration of large pre-trained models (PTMs) into Class-Incremental Learning (CIL) has facilitated the development of computationally efficient strategies such as First-Session Adaptation (FSA), which fine-tunes the model solely on the first task while keeping it frozen for subsequent tasks. Although effective in homogeneous task sequences, these approaches struggle when faced with the heterogeneity of real-world task distributions. We introduce Plasticity-Enhanced Test-Time Adaptation in Class-Incremental Learning (PLASTIC), a method that reinstates plasticity in CIL while preserving model stability. PLASTIC leverages Test-Time Adaptation (TTA) by dynamically fine-tuning LayerNorm parameters on unlabeled test data, enabling adaptability to evolving tasks and improving robustness against data corruption. To prevent TTA-induced model divergence and maintain stable learning across tasks, we introduce a teacher-student distillation framework, ensuring that adaptation remains controlled and generalizable. Extensive experiments across multiple benchmarks demonstrate that PLASTIC consistently outperforms both conventional and state-of-the-art PTM-based CIL approaches, while also exhibiting inherent robustness to data corruptions. Code is available at: https://github.com/IemProg/PLASTIC.

Updated: 2025-07-08 20:46:01

标题: 增强首次适应连续学习的可塑性

摘要: 大型预训练模型（PTM）整合到类增量学习（CIL）中已经促进了计算效率高的策略的发展，例如首次会话适应（FSA）等，FSA在仅对第一个任务进行微调的同时将模型冻结用于后续任务。尽管在同质任务序列中有效，但这些方法在面对现实世界任务分布的异质性时很难应对。我们引入了在类增量学习中增强可塑性的测试时间适应（PLASTIC）方法，该方法在保持模型稳定的同时恢复了CIL中的可塑性。PLASTIC利用测试时间适应（TTA）通过动态微调LayerNorm参数来适应未标记的测试数据，从而使其能够适应不断变化的任务，并提高对数据损坏的鲁棒性。为了防止TTA引起的模型分歧并保持跨任务的稳定学习，我们引入了师生蒸馏框架，确保适应性保持受控且具有普遍性。通过在多个基准测试中进行广泛实验，证明PLASTIC始终优于传统和最先进的基于PTM的CIL方法，同时还表现出对数据损坏的内在稳健性。代码可在以下网址找到：https://github.com/IemProg/PLASTIC。

更新时间: 2025-07-08 20:46:01

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2310.11482v3

KPFlow: An Operator Perspective on Dynamic Collapse Under Gradient Descent Training of Recurrent Networks

Gradient Descent (GD) and its variants are the primary tool for enabling efficient training of recurrent dynamical systems such as Recurrent Neural Networks (RNNs), Neural ODEs and Gated Recurrent units (GRUs). The dynamics that are formed in these models exhibit features such as neural collapse and emergence of latent representations that may support the remarkable generalization properties of networks. In neuroscience, qualitative features of these representations are used to compare learning in biological and artificial systems. Despite recent progress, there remains a need for theoretical tools to rigorously understand the mechanisms shaping learned representations, especially in finite, non-linear models. Here, we show that the gradient flow, which describes how the model's dynamics evolve over GD, can be decomposed into a product that involves two operators: a Parameter Operator, K, and a Linearized Flow Propagator, P. K mirrors the Neural Tangent Kernel in feed-forward neural networks, while P appears in Lyapunov stability and optimal control theory. We demonstrate two applications of our decomposition. First, we show how their interplay gives rise to low-dimensional latent dynamics under GD, and, specifically, how the collapse is a result of the network structure, over and above the nature of the underlying task. Second, for multi-task training, we show that the operators can be used to measure how objectives relevant to individual sub-tasks align. We experimentally and theoretically validate these findings, providing an efficient Pytorch package, \emph{KPFlow}, implementing robust analysis tools for general recurrent architectures. Taken together, our work moves towards building a next stage of understanding of GD learning in non-linear recurrent models.

Updated: 2025-07-08 20:33:15

标题: KPFlow：循环网络在梯度下降训练中动态崩溃的操作员视角

摘要: 梯度下降（GD）及其变体是使递归动态系统如递归神经网络（RNNs）、神经ODEs和门控递归单元（GRUs）能够高效训练的主要工具。这些模型中形成的动态表现出神经崩溃和潜在表示的出现，这些特征可能支持网络的显著泛化特性。在神经科学中，这些表示的定性特征被用来比较生物和人工系统中的学习。尽管最近取得了进展，但仍需要理论工具来严格理解塑造学习表示的机制，特别是在有限、非线性模型中。在这里，我们展示了梯度流的分解，描述了模型的动态如何经过GD演变，可以分解为涉及两个算子的乘积：一个参数算子K和一个线性化流传播子P。K反映了前馈神经网络中的神经切线核，而P出现在Lyapunov稳定性和最优控制理论中。我们展示了我们分解的两个应用。首先，我们展示它们的相互作用如何导致在GD下低维潜在动态的产生，特别是崩溃是网络结构的结果，远远超出了潜在任务的本质。其次，对于多任务训练，我们展示了这些算子可以用来衡量与各个子任务相关的目标是如何对齐的。我们在实验和理论上验证了这些发现，提供了一个高效的Pytorch包\emph{KPFlow}，实现了通用递归架构的强大分析工具。总的来说，我们的工作朝着在非线性递归模型中建立对GD学习的下一个阶段的理解迈进。

更新时间: 2025-07-08 20:33:15

领域: cs.LG,cs.AI,math.DS,q-bio.NC

下载: http://arxiv.org/abs/2507.06381v1

Secure and Storage-Efficient Deep Learning Models for Edge AI Using Automatic Weight Generation

Complex neural networks require substantial memory to store a large number of synaptic weights. This work introduces WINGs (Automatic Weight Generator for Secure and Storage-Efficient Deep Learning Models), a novel framework that dynamically generates layer weights in a fully connected neural network (FC) and compresses the weights in convolutional neural networks (CNNs) during inference, significantly reducing memory requirements without sacrificing accuracy. WINGs framework uses principal component analysis (PCA) for dimensionality reduction and lightweight support vector regression (SVR) models to predict layer weights in the FC networks, removing the need for storing full-weight matrices and achieving substantial memory savings. It also preferentially compresses the weights in low-sensitivity layers of CNNs using PCA and SVR with sensitivity analysis. The sensitivity-aware design also offers an added level of security, as any bit-flip attack with weights in compressed layers has an amplified and readily detectable effect on accuracy. WINGs achieves 53x compression for the FC layers and 28x for AlexNet with MNIST dataset, and 18x for Alexnet with CIFAR-10 dataset with 1-2% accuracy loss. This significant reduction in memory results in higher throughput and lower energy for DNN inference, making it attractive for resource-constrained edge applications.

Updated: 2025-07-08 20:33:02

标题: 使用自动生成权重的安全且存储高效的边缘人工智能深度学习模型

摘要: 复杂的神经网络需要大量的内存来存储大量的突触权重。本文介绍了WINGs（用于安全和存储高效深度学习模型的自动权重生成器）框架，这是一个新颖的框架，动态生成全连接神经网络（FC）中的层权重，并在推理过程中压缩卷积神经网络（CNNs）中的权重，显著减少内存需求，而不损失准确性。WINGs框架使用主成分分析（PCA）进行降维，并使用轻量级支持向量回归（SVR）模型来预测FC网络中的层权重，消除了存储完整权重矩阵的需求，并实现了大量内存节省。它还通过PCA和SVR与敏感性分析优先压缩CNNs中的低敏感性层的权重。敏感性感知设计还提供了额外的安全级别，因为对压缩层中的权重进行比特翻转攻击会对准确性产生放大且容易检测的影响。WINGs实现了FC层的53倍压缩以及使用MNIST数据集的AlexNet的28倍压缩，以及对CIFAR-10数据集的Alexnet的18倍压缩，准确性损失为1-2%。这种显著的内存减少导致了更高的吞吐量和更低的DNN推理能耗，使其对资源受限的边缘应用具有吸引力。

更新时间: 2025-07-08 20:33:02

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.06380v1

Thermalizer: Stable autoregressive neural emulation of spatiotemporal chaos

Autoregressive surrogate models (or \textit{emulators}) of spatiotemporal systems provide an avenue for fast, approximate predictions, with broad applications across science and engineering. At inference time, however, these models are generally unable to provide predictions over long time rollouts due to accumulation of errors leading to diverging trajectories. In essence, emulators operate out of distribution, and controlling the online distribution quickly becomes intractable in large-scale settings. To address this fundamental issue, and focusing on time-stationary systems admitting an invariant measure, we leverage diffusion models to obtain an implicit estimator of the score of this invariant measure. We show that this model of the score function can be used to stabilize autoregressive emulator rollouts by applying on-the-fly denoising during inference, a process we call \textit{thermalization}. Thermalizing an emulator rollout is shown to extend the time horizon of stable predictions by an order of magnitude in complex systems exhibiting turbulent and chaotic behavior, opening up a novel application of diffusion models in the context of neural emulation.

Updated: 2025-07-08 20:25:11

标题: Thermalizer：空间时间混沌的稳定自回归神经仿真

摘要: 自回归替代模型（或称为\textit{仿真器}）对时空系统提供了一种快速、近似预测的途径，在科学和工程领域具有广泛应用。然而，在推断时，这些模型通常由于误差累积导致轨迹发散而无法提供长时间预测。实质上，仿真器在分布之外运行，而在大规模设置中快速控制在线分布变得难以处理。为解决这一基本问题，我们专注于具有不变测度的时间稳态系统，并利用扩散模型获得该不变测度的得分的隐式估计器。我们展示了该得分函数模型可用于在推断过程中通过应用即时去噪来稳定自回归仿真器预测，这一过程被称为“热化”。热化仿真器的预测表明，在展示湍流和混沌行为的复杂系统中，可以将稳定预测的时间跨度延长一个数量级，从而在神经仿真的背景下开辟了扩散模型的新应用。

更新时间: 2025-07-08 20:25:11

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2503.18731v2

Digital Wargames to Enhance Military Medical Evacuation Decision-Making

Medical evacuation is one of the United States Army's most storied and critical mission sets, responsible for efficiently and expediently evacuating the battlefield ill and injured. Medical evacuation planning involves designing a robust network of medical platforms and facilities capable of moving and treating large numbers of casualties. Until now, there has not been a medium to simulate these networks in a classroom setting and evaluate both offline planning and online decision-making performance. This work describes the Medical Evacuation Wargaming Initiative (MEWI), a three-dimensional multiplayer simulation developed in Unity that replicates battlefield constraints and uncertainties. MEWI accurately models patient interactions at casualty collection points, ambulance exchange points, medical treatment facilities, and evacuation platforms. Two operational scenarios are introduced: an amphibious island assault in the Pacific and a Eurasian conflict across a sprawling road and river network. These scenarios pit students against the clock to save as many casualties as possible while adhering to doctrinal lessons learned during didactic training. We visualize performance data collected from two iterations of the MEWI Pacific scenario executed in the United States Army's Medical Evacuation Doctrine Course. We consider post-wargame Likert survey data from student participants and external observer notes to identify key planning decision points, document medical evacuation lessons learned, and quantify general utility. Results indicate that MEWI participation substantially improves uptake of medical evacuation lessons learned and co-operative decision-making. MEWI is a substantial step forward in the field of high-fidelity training tools for medical education, and our study findings offer critical insights into improving medical evacuation education and operations across the joint force.

Updated: 2025-07-08 20:20:27

标题: 数字战争游戏以增强军事医疗疏散决策-making

摘要: 医疗疏散是美国陆军最受推崇和关键的任务之一，负责高效地和迅速地疏散战场上的病患和受伤者。医疗疏散规划涉及设计一个强大的医疗平台和设施网络，能够移动和治疗大量伤员。到目前为止，在课堂环境中没有一种中介来模拟这些网络并评估离线规划和在线决策表现。本文描述了医疗疏散战争游戏倡议（MEWI），这是一个在Unity中开发的三维多人游戏模拟，复制了战场的限制和不确定性。MEWI准确模拟了伤员在伤员收集点、救护车交换点、医疗治疗设施和疏散平台的互动。引入了两个操作场景：在太平洋进行两栖岛屿突击和横跨庞大道路和河流网络的欧亚冲突。这些场景要求学生在规定时间内尽可能拯救更多的伤员，同时遵守在教学训练中学到的教条课程。我们展示了从美国陆军医疗疏散学说课中执行的两次MEWI太平洋场景的性能数据。我们考虑了学生参与者的战后Likert调查数据和外部观察者的笔记，以确定关键的规划决策点，记录医疗疏散的教训，并量化一般效用。结果表明，参与MEWI可以显著提高医疗疏散教训的吸收和合作决策能力。MEWI是医学教育领域高保真度培训工具的重要进步，我们的研究结果为改进医疗疏散教育和联合部队行动提供了关键见解。

更新时间: 2025-07-08 20:20:27

领域: cs.AI,cs.CY,cs.HC,cs.MM

下载: http://arxiv.org/abs/2507.06373v1

Humanoid World Models: Open World Foundation Models for Humanoid Robotics

Humanoid robots, with their human-like form, are uniquely suited for interacting in environments built for people. However, enabling humanoids to reason, plan, and act in complex open-world settings remains a challenge. World models, models that predict the future outcome of a given action, can support these capabilities by serving as a dynamics model in long-horizon planning and generating synthetic data for policy learning. We introduce Humanoid World Models (HWM), a family of lightweight, open-source models that forecast future egocentric video conditioned on humanoid control tokens. We train two types of generative models, Masked Transformers and Flow-Matching, on 100 hours of humanoid demonstrations. Additionally, we explore architectural variants with different attention mechanisms and parameter-sharing strategies. Our parameter-sharing techniques reduce model size by 33-53% with minimal impact on performance or visual fidelity. HWMs are designed to be trained and deployed in practical academic and small-lab settings, such as 1-2 GPUs.

Updated: 2025-07-08 20:18:16

标题: 人形世界模型：用于人形机器人的开放世界基础模型

摘要: 人形机器人，以其类似人类形态，非常适合在为人类建造的环境中进行交互。然而，使人形机器人在复杂的开放世界环境中进行推理、规划和行动仍然是一项挑战。世界模型，即预测给定动作未来结果的模型，可以通过作为长期规划中的动态模型和生成策略学习的合成数据来支持这些能力。我们引入了Humanoid World Models (HWM)，这是一系列轻量级、开源模型，可以根据人形控制令牌预测未来的自我中心视频。我们在100小时的人形示范中训练了两种类型的生成模型，Masked Transformers和Flow-Matching。此外，我们探索了具有不同注意机制和参数共享策略的架构变体。我们的参数共享技术将模型大小减少了33-53%，对性能或视觉保真度的影响很小。HWM旨在在实际的学术和小型实验室环境中进行培训和部署，例如1-2个GPU。

更新时间: 2025-07-08 20:18:16

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2506.01182v2

Efficient Decision Trees for Tensor Regressions

We proposed the tensor-input tree (TT) method for scalar-on-tensor and tensor-on-tensor regression problems. We first address scalar-on-tensor problem by proposing scalar-output regression tree models whose input variable are tensors (i.e., multi-way arrays). We devised and implemented fast randomized and deterministic algorithms for efficient fitting of scalar-on-tensor trees, making TT competitive against tensor-input GP models. Based on scalar-on-tensor tree models, we extend our method to tensor-on-tensor problems using additive tree ensemble approaches. Theoretical justification and extensive experiments on real and synthetic datasets are provided to illustrate the performance of TT.

Updated: 2025-07-08 20:12:10

标题: 高效的张量回归决策树

摘要: 我们提出了一种用于标量-张量和张量-张量回归问题的张量输入树（TT）方法。我们首先通过提出以张量（即多维数组）为输入变量的标量输出回归树模型来解决标量-张量问题。我们设计并实现了用于高效拟合标量-张量树的快速随机化和确定性算法，使得TT在与张量输入GP模型的竞争中具有竞争力。基于标量-张量树模型，我们将我们的方法扩展到使用加法树集成方法解决张量-张量问题。理论证明和大量实验结果在真实和合成数据集上提供，以说明TT的性能。

更新时间: 2025-07-08 20:12:10

领域: cs.LG,stat.ME,stat.ML,62G08, 15A69,G.3

下载: http://arxiv.org/abs/2408.01926v2

Online Dynamic Programming

We propose a general method for combinatorial online learning problems whose offline optimization problem can be solved efficiently via a dynamic programming algorithm defined by an arbitrary min-sum recurrence. Examples include online learning of Binary Search Trees, Matrix-Chain Multiplications, $k$-sets, Knapsacks, Rod Cuttings, and Weighted Interval Schedulings. For each of these problems we use the underlying graph of subproblems (called a multi-DAG) for defining a representation of the solutions of the dynamic programming problem by encoding them as a generalized version of paths (called multipaths). These multipaths encode each solution as a series of successive decisions or components over which the loss is linear. We then show that the dynamic programming algorithm for each problem leads to online algorithms for learning multipaths in the underlying multi-DAG. The algorithms maintain a distribution over the multipaths in a concise form as their hypothesis. More specifically we generalize the existing Expanded Hedge and Component Hedge algorithms for the online shortest path problem to learning multipaths. Additionally, we introduce a new and faster prediction technique for Component Hedge which in our case directly samples from a distribution over multipaths, bypassing the need to decompose the distribution over multipaths into a mixture with small support.

Updated: 2025-07-08 20:11:49

标题: 在线动态规划

摘要: 我们提出了一种通用方法，用于可通过动态规划算法有效求解的任意最小和递归定义的组合在线学习问题的离线优化问题。示例包括二叉搜索树的在线学习、矩阵链乘法、k-集、背包、杆切割和加权区间调度的学习。对于这些问题中的每一个，我们使用子问题的底层图（称为多DAG）来定义动态规划问题解的表示，通过将其编码为路径的广义版本（称为多路径）。这些多路径将每个解编码为一系列连续的决策或组件，其中损失是线性的。然后我们展示了每个问题的动态规划算法导致了学习底层多DAG中多路径的在线算法。这些算法以简洁形式维护多路径上的分布作为它们的假设。更具体地，我们将现有的在线最短路径问题的扩展对冲和组件对冲算法推广到学习多路径。此外，我们引入了一种新的更快的预测技术，用于组件对冲，我们的情况下直接从多路径的分布中采样，绕过将多路径的分布分解为具有小支持的混合物的需要。

更新时间: 2025-07-08 20:11:49

领域: cs.LG

下载: http://arxiv.org/abs/1706.00834v4

The Riemannian Geometry associated to Gradient Flows of Linear Convolutional Networks

We study geometric properties of the gradient flow for learning deep linear convolutional networks. For linear fully connected networks, it has been shown recently that the corresponding gradient flow on parameter space can be written as a Riemannian gradient flow on function space (i.e., on the product of weight matrices) if the initialization satisfies a so-called balancedness condition. We establish that the gradient flow on parameter space for learning linear convolutional networks can be written as a Riemannian gradient flow on function space regardless of the initialization. This result holds for $D$-dimensional convolutions with $D \geq 2$, and for $D =1$ it holds if all so-called strides of the convolutions are greater than one. The corresponding Riemannian metric depends on the initialization.

Updated: 2025-07-08 20:04:00

标题: 与线性卷积网络的梯度流相关的黎曼几何学

摘要: 我们研究了学习深度线性卷积网络的梯度流的几何特性。对于线性全连接网络，最近已经表明，如果初始化满足所谓的平衡条件，参数空间上的相应梯度流可以被写成函数空间上的黎曼梯度流（即在权重矩阵的乘积上）。我们建立了一个结果，即学习线性卷积网络的参数空间上的梯度流可以被写成函数空间上的黎曼梯度流，无论初始化如何。这个结果对于$D \geq 2$维卷积都成立，对于$D = 1$维，如果所有卷积的所谓步长大于一，那么也成立。相应的黎曼度量取决于初始化。

更新时间: 2025-07-08 20:04:00

领域: cs.LG,math.AG

下载: http://arxiv.org/abs/2507.06367v1

DecoyDB: A Dataset for Graph Contrastive Learning in Protein-Ligand Binding Affinity Prediction

Predicting the binding affinity of protein-ligand complexes plays a vital role in drug discovery. Unfortunately, progress has been hindered by the lack of large-scale and high-quality binding affinity labels. The widely used PDBbind dataset has fewer than 20K labeled complexes. Self-supervised learning, especially graph contrastive learning (GCL), provides a unique opportunity to break the barrier by pre-training graph neural network models based on vast unlabeled complexes and fine-tuning the models on much fewer labeled complexes. However, the problem faces unique challenges, including a lack of a comprehensive unlabeled dataset with well-defined positive/negative complex pairs and the need to design GCL algorithms that incorporate the unique characteristics of such data. To fill the gap, we propose DecoyDB, a large-scale, structure-aware dataset specifically designed for self-supervised GCL on protein-ligand complexes. DecoyDB consists of high-resolution ground truth complexes (less than 2.5 Angstrom) and diverse decoy structures with computationally generated binding poses that range from realistic to suboptimal (negative pairs). Each decoy is annotated with a Root Mean Squared Deviation (RMSD) from the native pose. We further design a customized GCL framework to pre-train graph neural networks based on DecoyDB and fine-tune the models with labels from PDBbind. Extensive experiments confirm that models pre-trained with DecoyDB achieve superior accuracy, label efficiency, and generalizability.

Updated: 2025-07-08 20:02:53

标题: DecoyDB：用于蛋白质-配体结合亲和力预测中的图对比学习的数据集

摘要: 预测蛋白质-配体复合物的结合亲和力在药物发现中起着至关重要的作用。不幸的是，由于缺乏大规模和高质量的结合亲和力标签，进展受到了阻碍。广泛使用的PDBbind数据集只有少于20K个标记的复合物。自监督学习，特别是图对比学习（GCL），为通过在大量未标记复合物上预训练图神经网络模型，并在少量标记复合物上微调模型来打破障碍提供了独特的机会。然而，该问题面临独特的挑战，包括缺乏一个具有明确定义的正负复合物对的全面未标记数据集，以及需要设计包含这种数据的独特特征的GCL算法。为了填补这一空白，我们提出了DecoyDB，一个专门为蛋白质-配体复合物的自监督GCL设计的大规模、结构感知的数据集。DecoyDB由高分辨率的基本事实复合物（小于2.5埃）和具有从真实到次优（负对）的计算生成的绑定位姿范围的多样化伪装结构组成。每个伪装结构都标有与原生位姿的均方根偏差（RMSD）。我们进一步设计了一个定制的GCL框架，以基于DecoyDB预训练图神经网络，并使用来自PDBbind的标签微调模型。广泛的实验证实，使用DecoyDB预训练的模型在准确性、标签效率和泛化能力方面均表现出优越性。

更新时间: 2025-07-08 20:02:53

领域: cs.LG,q-bio.BM

下载: http://arxiv.org/abs/2507.06366v1

Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.

Updated: 2025-07-08 20:01:15

标题: Growing Transformers: 冻结基底上的模块化组合和逐层扩展

摘要: 目前用于扩展大型语言模型（LLMs）的主流范式涉及整体的端到端训练，这是一种资源密集型的过程，缺乏灵活性。本文探讨了一种基于不可训练的确定性输入嵌入基础上的模型开发的替代建设性方法。在先前的研究中，我们证明了可以使用从Unicode字形的视觉结构派生的冻结嵌入在Transformer中引发高级语义推理。在这里，我们展示了这种固定的表征基质充当了一个通用的“对接口”，实现了两种强大而高效的扩展范式：无缝模块化组合和逐层渐进生长。首先，我们展示了在不同数据集（如俄文和中文文本）上训练的专家模型可以在训练后合并成一个更具能力的混合专家模型，而无需进行任何架构修改。这是通过简单地对它们的输出对数进行平均而实现的。结果的混合专家模型在像MMLU这样的推理基准上立即展现出性能提升，超越了其组成专家，而没有灾难性的遗忘。其次，我们引入了一种逐层建设性训练方法，其中一个深度Transformer通过逐层堆叠和训练一层来“生长”。这种方法展示了稳定的收敛性，模型深度与复杂推理能力的出现之间存在明显的相关性，例如SQuAD所需的推理能力。我们的发现表明了从整体优化向更具生物学或建设性的AI开发模型的范式转变，其中复杂性逐步构建，模块可以自由组合。这为资源高效的扩展、持续学习和建立强大AI系统的更加民主化生态系统开辟了新途径。我们发布了所有代码和模型，以促进进一步研究。

更新时间: 2025-07-08 20:01:15

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2507.07129v1

ConTextTab: A Semantics-Aware Tabular In-Context Learner

Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. While being architecturally efficient and well-adapted to tabular data structures, current table-native ICL architectures, being trained exclusively on synthetic data, do not fully leverage the rich semantics and world knowledge contained in real-world tabular data. On another end of this spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and checkpoints are available at https://github.com/SAP-samples/contexttab

Updated: 2025-07-08 19:44:57

标题: ConTextTab：一个语义感知的表格内上下文学习器

摘要: 最近，表格上下文学习（ICL）在几个表格预测任务上取得了最新的最先进（SOTA）表现。以前仅限于小表格上的分类问题，最近的进展，如TabPFN和TabICL，已将其应用扩展到更大的数据集。虽然在架构上高效且适应表格数据结构，但目前的基于表格的ICL架构仅在合成数据上进行训练，无法充分利用真实世界表格数据中包含的丰富语义和世界知识。在这个光谱的另一端，基于预训练大型语言模型（如TabuLa-8B）的表格ICL模型整合了深层语义理解和世界知识，但由于固有的架构限制，只能利用少量上下文。为了结合这两个世界的优点，我们引入了ConTextTab，将语义理解和对齐集成到一个表格本地ICL框架中。通过为不同数据模态使用专门的嵌入，并在大规模真实世界表格数据上进行训练，我们的模型在广泛的基准测试中具有与SOTA相竞争的能力，同时在语义丰富的CARTE基准测试上设立了新标准。代码和检查点可在https://github.com/SAP-samples/contexttab找到。

更新时间: 2025-07-08 19:44:57

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2506.10707v2

hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation

Large language models (LLMs) are playing an increasingly large role in domains such as code generation, including hardware code generation, where Verilog is the key language. However, the amount of publicly available Verilog code pales in comparison to the amount of code available for software languages like Python. In this work, we present hdl2v ("HDL-to-Verilog"), a dataset which seeks to increase the amount of available human-written Verilog data by translating or compiling three other hardware description languages - VHDL, Chisel, and PyMTL3 - to Verilog. Furthermore, we demonstrate the value of hdl2v in enhancing LLM Verilog generation by improving performance of a 32 billion-parameter open-weight model by up to 23% (pass@10) in VerilogEvalV2, without utilizing any data augmentation or knowledge distillation from larger models. We also show hdl2v's ability to boost the performance of a data augmentation-based fine-tuning approach by 63%. Finally, we characterize and analyze our dataset to better understand which characteristics of HDL-to-Verilog datasets can be expanded upon in future work for even better performance.

Updated: 2025-07-08 19:43:08

标题: hdl2v：用于增强LLM Verilog生成的代码转换数据集

摘要: 大型语言模型（LLMs）在诸如代码生成等领域中发挥着越来越重要的作用，其中硬件代码生成是其中之一，其中Verilog是关键语言。然而，公开可用的Verilog代码量与诸如Python等软件语言的代码量相比微不足道。在这项工作中，我们提出了hdl2v（“HDL-to-Verilog”），这是一个数据集，旨在通过将另外三种硬件描述语言 - VHDL、Chisel和PyMTL3 - 转换或编译为Verilog，以增加可用的人工编写的Verilog数据量。此外，我们证明了hdl2v在提高LLM Verilog生成性能方面的价值，通过在VerilogEvalV2中将一个拥有32亿个参数的开放权重模型的性能提高了最多23%（pass@10），而无需使用任何数据增强或来自更大模型的知识蒸馏。我们还展示了hdl2v能够将基于数据增强的微调方法的性能提升了63%。最后，我们对数据集进行表征和分析，以更好地了解哪些HDL-to-Verilog数据集的特征可以在未来的工作中扩展，以实现更好的性能。

更新时间: 2025-07-08 19:43:08

领域: cs.AR,cs.AI,cs.LG,cs.PL

下载: http://arxiv.org/abs/2506.04544v2

Deep learning-based species-area models reveal multi-scale patterns of species richness and turnover

The number of species within ecosystems is influenced not only by their intrinsic characteristics but also by the spatial scale considered. As the sampled area expands, species richness increases, a phenomenon described by the species-area relationship (SAR). The accumulation dynamics of the SAR results from a complex interplay of biotic and abiotic processes operating at various spatial scales. However, the challenge of collecting exhaustive biodiversity records across spatial scales has hindered a comprehensive understanding of these dynamics. Here, we develop a deep learning approach that leverages sampling theory and small-scale ecological surveys to spatially resolve the scale-dependency of species richness. We demonstrate its performance by predicting the species richness of vascular plant communities across Europe, and evaluate the predictions against an independent dataset of plant community inventories. Our model improves species richness estimates by 32\% and delivers spatially explicit patterns of species richness and turnover for sampling areas ranging from square meters to hundreds of square kilometers. Explainable AI techniques further disentangle how drivers of species richness operate across spatial scales. The ability of our model to represent the multi-scale nature of biodiversity is essential to deliver robust biodiversity assessments and forecasts under global change.

Updated: 2025-07-08 19:42:33

标题: 基于深度学习的物种面积模型揭示了物种丰富度和转变的多尺度模式

摘要: 生态系统中的物种数量不仅受其内在特征影响，还受到考虑的空间尺度的影响。随着取样区域的扩大，物种丰富度增加，这一现象被物种面积关系（SAR）描述。 SAR的累积动态是由在不同空间尺度上运作的生物和非生物过程的复杂相互作用造成的。然而，跨空间尺度收集详尽的生物多样性记录的挑战阻碍了对这些动态的全面理解。在这里，我们开发了一种深度学习方法，利用采样理论和小尺度生态调查来空间解析物种丰富度的尺度依赖性。我们通过预测欧洲血管植物群落的物种丰富度来展示其性能，并将预测结果与植物群落清查的独立数据集进行评估。我们的模型提高了物种丰富度估计值32％，为从平方米到数百平方公里的取样区域提供了物种丰富度和物种更替的空间明确模式。可解释的人工智能技术进一步剖析了物种丰富度的驱动因素如何跨越空间尺度运作。我们的模型代表生物多样性的多尺度本质的能力对于提供健壮的生物多样性评估和在全球变化下的预测至关重要。

更新时间: 2025-07-08 19:42:33

领域: q-bio.PE,cs.LG,92-08, 92B05, 92B15, 92B20, 92D40 (Primary) 62P10, 62P12 (Secondary)

下载: http://arxiv.org/abs/2507.06358v1

On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving

Autonomous vehicles (AVs) rely on deep neural networks (DNNs) for critical tasks such as traffic sign recognition (TSR), automated lane centering (ALC), and vehicle detection (VD). However, these models are vulnerable to attacks that can cause misclassifications and compromise safety. Traditional defense mechanisms, including adversarial training, often degrade benign accuracy and fail to generalize against unseen attacks. In this work, we introduce Vehicle Vision Language Models (V2LMs), fine-tuned vision-language models specialized for AV perception. Our findings demonstrate that V2LMs inherently exhibit superior robustness against unseen attacks without requiring adversarial training, maintaining significantly higher accuracy than conventional DNNs under adversarial conditions. We evaluate two deployment strategies: Solo Mode, where individual V2LMs handle specific perception tasks, and Tandem Mode, where a single unified V2LM is fine-tuned for multiple tasks simultaneously. Experimental results reveal that DNNs suffer performance drops of 33% to 46% under attacks, whereas V2LMs maintain adversarial accuracy with reductions of less than 8% on average. The Tandem Mode further offers a memory-efficient alternative while achieving comparable robustness to Solo Mode. We also explore integrating V2LMs as parallel components to AV perception to enhance resilience against adversarial threats. Our results suggest that V2LMs offer a promising path toward more secure and resilient AV perception systems.

Updated: 2025-07-08 19:23:54

标题: 关于自动驾驶中视觉-语言模型在视觉感知攻击下的自然强健性

摘要: 自动驾驶车辆（AVs）依赖深度神经网络（DNNs）来进行关键任务，如交通标志识别（TSR）、自动车道居中（ALC）和车辆检测（VD）。然而，这些模型容易受到攻击，可能导致错误分类和危及安全。传统的防御机制，包括对抗训练，通常会降低良性准确性，并且无法对抗未知攻击进行泛化。在本研究中，我们引入了车辆视觉语言模型（V2LMs），这是专门用于AV感知的经过微调的视觉-语言模型。我们的研究结果表明，V2LMs在无需对抗训练的情况下天生具有对抗未知攻击的卓越鲁棒性，其在对抗条件下保持着显著高于传统DNNs的准确性。我们评估了两种部署策略：独立模式，其中单个V2LMs处理特定感知任务，以及串联模式，其中一个统一的V2LMs同时为多个任务进行微调。实验结果显示，在受到攻击时，DNNs的性能下降了33%至46%，而V2LMs的对抗准确性平均降低不到8%。串联模式提供了一种内存高效的选择，同时实现了与独立模式相当的鲁棒性。我们还探讨了将V2LMs作为AV感知的并行组件进行整合，以增强对抗性威胁的韧性。我们的结果表明，V2LMs为更安全、更韧性的AV感知系统提供了一个有前景的路径。

更新时间: 2025-07-08 19:23:54

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2506.11472v2

An Architecture for Privacy-Preserving Telemetry Scheme

We present a privacy-preserving telemetry aggregation scheme. Our underlying frequency estimation routine works within the framework of differential privacy. The design philosophy follows a client-server architecture. Furthermore, the system uses a local differential privacy scheme where data gets randomized on the client before submitting the request to the resource server. This scheme allows for data analysis on de-identified data by carefully adding noise to prevent re-identification attacks, thereby facilitating public data release without compromising the identifiability of the individual record. This work further enhances privacy guarantees by leveraging Oblivious HTTP (OHTTP) to achieve increased privacy protection for data in transit that addresses pre-existing privacy vulnerabilities in raw HTTP. We provide an implementation that focuses on frequency estimation with a histogram of a known dictionary. Our resulting formulation based on OHTTP has provided stricter privacy safeguards when compared to trusting an organization to manually delete identifying information from the client's request in the ingestor as deployed in reference work~\cite{apple2017}. Code available at https://github.com/kenluck2001/miscellaneous/tree/master/src/Privacy-Preserving-Telemetry.

Updated: 2025-07-08 19:20:56

标题: 一个用于保护隐私的遥测方案的架构

摘要: 我们提出了一种隐私保护的遥测聚合方案。我们的基础频率估计程序在差分隐私框架内运行。设计理念遵循客户端-服务器架构。此外，系统采用本地差分隐私方案，在将请求提交给资源服务器之前，数据在客户端上进行随机化处理。该方案允许在去标识化数据上进行数据分析，通过谨慎添加噪音以防止重新识别攻击，从而在不损害个人记录可识别性的情况下促进公开数据发布。此工作通过利用Oblivious HTTP（OHTTP）提高了数据传输中的隐私保护，解决了原始HTTP中现有的隐私漏洞。我们提供了一个关注使用已知词典的直方图进行频率估计的实现。我们基于OHTTP的结果公式与在引用作品中部署的信任组织手动删除客户端请求中标识信息相比，提供了更严格的隐私保护。可在https://github.com/kenluck2001/miscellaneous/tree/master/src/Privacy-Preserving-Telemetry找到代码。

更新时间: 2025-07-08 19:20:56

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2507.06350v1

Trainability of Quantum Models Beyond Known Classical Simulability

Variational Quantum Algorithms (VQAs) are promising candidates for near-term quantum computing, yet they face scalability challenges due to barren plateaus, where gradients vanish exponentially in the system size. Recent conjectures suggest that avoiding barren plateaus might inherently lead to classical simulability, thus limiting the opportunities for quantum advantage. In this work, we advance the theoretical understanding of the relationship between the trainability and computational complexity of VQAs, thus directly addressing the conjecture. We introduce the Linear Clifford Encoder (LCE), a novel technique that ensures constant-scaling gradient statistics on optimization landscape regions that are close to Clifford circuits. Additionally, we leverage classical Taylor surrogates to reveal computational complexity phase transitions from polynomial to super-polynomial as the initialization region size increases. Combining these results, we reveal a deeper link between trainability and computational complexity, and analytically prove that barren plateaus can be avoided in regions for which no classical surrogate is known to exist. Furthermore, numerical experiments on LCE transformed landscapes confirm in practice the existence of a super-polynomially complex ``transition zone'' where gradients decay polynomially. These findings indicate a plausible path to practically relevant, barren plateau-free variational models with potential for quantum advantage.

Updated: 2025-07-08 19:10:46

标题: 量子模型的可训练性超越已知的经典模拟能力

摘要: 变分量子算法（VQAs）是近期量子计算中具有潜力的候选者，但面临着规模扩展挑战，因为在贫瘠高原中，梯度在系统尺寸上呈指数级消失。最近的猜想表明，避免贫瘠高原可能固有地导致经典可模拟性，从而限制了量子优势的机会。在这项工作中，我们推进了对VQAs的训练性和计算复杂性之间关系的理论理解，从而直接解决了这个猜想。我们引入了线性克利福德编码器（LCE），这是一种新颖的技术，可以确保在靠近克利福德电路的优化Landscape区域上的梯度统计具有恒定的缩放。此外，我们利用经典泰勒替代物揭示了计算复杂性相变，从多项式到超多项式，随着初始化区域尺寸的增加而增加。结合这些结果，我们揭示了训练性和计算复杂性之间更深层次的联系，并分析证明了在没有已知经典替代物存在的区域中可以避免贫瘠高原。此外，在经过LCE转换的景观上进行的数值实验实际上证实了存在一个超多项式复杂的“过渡区”，在这个区域中梯度呈多项式衰减。这些发现表明了一条通向实际相关、无贫瘠高原的变分模型的可行途径，具有潜在的量子优势。

更新时间: 2025-07-08 19:10:46

领域: quant-ph,cs.CC,cs.LG

下载: http://arxiv.org/abs/2507.06344v1

A Unifying Framework for Robust and Efficient Inference with Unstructured Data

This paper presents a general framework for conducting efficient inference on parameters derived from unstructured data, which include text, images, audio, and video. Economists have long used unstructured data by first extracting low-dimensional structured features (e.g., the topic or sentiment of a text), since the raw data are too high-dimensional and uninterpretable to include directly in empirical analyses. The rise of deep neural networks has accelerated this practice by greatly reducing the costs of extracting structured data at scale, but neural networks do not make generically unbiased predictions. This potentially propagates bias to the downstream estimators that incorporate imputed structured data, and the availability of different off-the-shelf neural networks with different biases moreover raises p-hacking concerns. To address these challenges, we reframe inference with unstructured data as a problem of missing structured data, where structured variables are imputed from high-dimensional unstructured inputs. This perspective allows us to apply classic results from semiparametric inference, leading to estimators that are valid, efficient, and robust. We formalize this approach with MAR-S, a framework that unifies and extends existing methods for debiased inference using machine learning predictions, connecting them to familiar problems such as causal inference. Within this framework, we develop robust and efficient estimators for both descriptive and causal estimands and address challenges like inference with aggregated and transformed missing structured data-a common scenario that is not covered by existing work. These methods-and the accompanying implementation package-provide economists with accessible tools for constructing unbiased estimators using unstructured data in a wide range of applications, as we demonstrate by re-analyzing several influential studies.

Updated: 2025-07-08 19:10:25

标题: 一个统一的框架：用非结构化数据进行稳健和高效的推断

摘要: 本文提出了一个通用的框架，用于对从非结构化数据中提取的参数进行高效推断，这些数据包括文本、图像、音频和视频。经济学家长期以来一直使用非结构化数据，首先提取低维结构化特征（例如文本的主题或情感），因为原始数据的维度太高且无法直接包含在实证分析中。深度神经网络的兴起通过大大降低了大规模提取结构化数据的成本，加速了这一实践，但神经网络并不会做出普遍无偏的预测。这可能会将偏差传播到包含插补结构化数据的下游估计器中，而不同偏差的现成神经网络的可用性还引发了p-hacking的担忧。为了解决这些挑战，我们将非结构化数据的推断重新构思为缺失结构化数据的问题，其中结构化变量是从高维非结构化输入中插补得到的。这种视角使我们能够应用半参数推断的经典结果，从而得到有效、高效且稳健的估计器。我们用MAR-S框架正式化了这种方法，该框架统一并扩展了使用机器学习预测进行无偏推断的现有方法，将它们与熟悉的问题（如因果推断）联系起来。在这个框架内，我们为描述性和因果估计量开发了稳健且高效的估计器，并解决了像推断聚合和转换的缺失结构化数据这样的挑战，这是现有工作没有涵盖的常见情况。这些方法及其伴随的实现包为经济学家提供了可访问的工具，用于在各种应用中使用非结构化数据构建无偏估计器，正如我们通过重新分析几个有影响力的研究所展示的那样。

更新时间: 2025-07-08 19:10:25

领域: econ.EM,cs.LG

下载: http://arxiv.org/abs/2505.00282v2

SymFlux: deep symbolic regression of Hamiltonian vector fields

We present SymFlux, a novel deep learning framework that performs symbolic regression to identify Hamiltonian functions from their corresponding vector fields on the standard symplectic plane. SymFlux models utilize hybrid CNN-LSTM architectures to learn and output the symbolic mathematical expression of the underlying Hamiltonian. Training and validation are conducted on newly developed datasets of Hamiltonian vector fields, a key contribution of this work. Our results demonstrate the model's effectiveness in accurately recovering these symbolic expressions, advancing automated discovery in Hamiltonian mechanics.

Updated: 2025-07-08 19:07:16

标题: SymFlux：汉密尔顿矢量场的深度符号回归

摘要: 我们提出了SymFlux，这是一个新颖的深度学习框架，通过符号回归来识别标准辛平面上的哈密顿函数与其对应的矢量场。SymFlux模型利用混合CNN-LSTM架构来学习和输出基础哈密顿的符号数学表达式。训练和验证是在新开发的哈密顿矢量场数据集上进行的，这是本文的一个重要贡献。我们的结果表明，该模型在准确恢复这些符号表达式方面的有效性，推动了哈密顿力学中自动发现的进展。

更新时间: 2025-07-08 19:07:16

领域: cs.LG,cs.AI,math.DS,math.SG

下载: http://arxiv.org/abs/2507.06342v1

Learning Nonlinear Finite Element Solution Operators using Multilayer Perceptrons and Energy Minimization

We develop and evaluate a method for learning solution operators to nonlinear problems governed by partial differential equations (PDEs). The approach is based on a finite element discretization and aims at representing the solution operator by a multilayer perceptron (MLP) that takes problem data variables as input and gives a prediction of the finite element solution as output. The variables will typically correspond to parameters in a parametrization of input data such as boundary conditions, coefficients, and right-hand sides. The output will be an approximation of the corresponding finite element solution, thus enabling support and enhancement by the standard finite element method (FEM) both theoretically and practically. The loss function is most often an energy functional and we formulate efficient parallelizable training algorithms based on assembling the energy locally on each element. For large problems, the learning process can be made more efficient by using only a small fraction of randomly chosen elements in the mesh in each iteration. The approach is evaluated on several relevant test cases, where learning the finite element solution operator turns out to be beneficial, both in its own right but also by combination with standard FEM theory and software.

Updated: 2025-07-08 19:01:16

标题: 使用多层感知器和能量最小化学习非线性有限元解算符

摘要: 我们开发并评估了一种学习非线性偏微分方程（PDEs）控制的问题的解算符的方法。该方法基于有限元离散化，旨在通过多层感知器（MLP）表示解算符，该感知器将问题数据变量作为输入，并输出有限元解的预测。这些变量通常对应于输入数据（如边界条件、系数和右侧项）的参数化中的参数。输出将是相应有限元解的近似值，从而使其在理论和实践上能够得到标准有限元法（FEM）的支持和增强。损失函数通常是一个能量泛函，我们基于在每个元素上局部组装能量的方法制定了高效可并行化的训练算法。对于大型问题，通过在每次迭代中仅使用网格中随机选择的一小部分元素，可以使学习过程更加高效。该方法在多个相关测试案例上进行评估，结果表明学习有限元解算符是有益的，不仅本身如此，而且与标准FEM理论和软件相结合也是如此。

更新时间: 2025-07-08 19:01:16

领域: cs.LG,cs.NA,math.NA,65K10 65N30 65Y20 68T07,G.1.8; I.2.6; J.2

下载: http://arxiv.org/abs/2412.04596v2

DpDNet: An Dual-Prompt-Driven Network for Universal PET-CT Segmentation

PET-CT lesion segmentation is challenging due to noise sensitivity, small and variable lesion morphology, and interference from physiological high-metabolic signals. Current mainstream approaches follow the practice of one network solving the segmentation of multiple cancer lesions by treating all cancers as a single task. However, this overlooks the unique characteristics of different cancer types. Considering the specificity and similarity of different cancers in terms of metastatic patterns, organ preferences, and FDG uptake intensity, we propose DpDNet, a Dual-Prompt-Driven network that incorporates specific prompts to capture cancer-specific features and common prompts to retain shared knowledge. Additionally, to mitigate information forgetting caused by the early introduction of prompts, prompt-aware heads are employed after the decoder to adaptively handle multiple segmentation tasks. Experiments on a PET-CT dataset with four cancer types show that DpDNet outperforms state-of-the-art models. Finally, based on the segmentation results, we calculated MTV, TLG, and SUVmax for breast cancer survival analysis. The results suggest that DpDNet has the potential to serve as a valuable tool for personalized risk stratification, supporting clinicians in optimizing treatment strategies and improving outcomes. Code is available at https://github.com/XinglongLiang08/DpDNet.

Updated: 2025-07-08 18:56:01

标题: DpDNet: 一种用于通用PET-CT分割的双提示驱动网络

摘要: PET-CT病灶分割由于噪声敏感性、小型和可变的病灶形态以及来自生理高代谢信号的干扰而具有挑战性。目前主流方法遵循一个网络解决多种癌症病灶分割的做法，将所有癌症视为单一任务。然而，这忽视了不同癌症类型的独特特征。考虑到不同癌症的转移模式、器官偏好和FDG摄取强度方面的特异性和相似性，我们提出了DpDNet，一种双提示驱动网络，它结合了特定提示以捕捉癌症特定特征和共同提示以保留共享知识。此外，为了减轻早期引入提示导致的信息遗忘，解码器后采用了提示感知头来自适应处理多个分割任务。对包含四种癌症类型的PET-CT数据集的实验表明，DpDNet胜过了最先进的模型。最后，基于分割结果，我们计算了乳腺癌存活分析的MTV、TLG和SUVmax。结果表明，DpDNet有潜力作为一种有价值的个性化风险分层工具，支持临床医生优化治疗策略并改善结果。代码可在https://github.com/XinglongLiang08/DpDNet获得。

更新时间: 2025-07-08 18:56:01

领域: eess.IV,cs.AI

下载: http://arxiv.org/abs/2507.07126v1

Self-supervised learning predicts plant growth trajectories from multi-modal industrial greenhouse data

Quantifying organism-level phenotypes, such as growth dynamics and biomass accumulation, is fundamental to understanding agronomic traits and optimizing crop production. However, quality growing data of plants at scale is difficult to generate. Here we use a mobile robotic platform to capture high-resolution environmental sensing and phenotyping measurements of a large-scale hydroponic leafy greens system. We describe a self-supervised modeling approach to build a map from observed growing data to the entire plant growth trajectory. We demonstrate our approach by forecasting future plant height and harvest mass of crops in this system. This approach represents a significant advance in combining robotic automation and machine learning, as well as providing actionable insights for agronomic research and operational efficiency.

Updated: 2025-07-08 18:55:11

标题: 自监督学习预测多模态工业温室数据中的植物生长轨迹

摘要: 量化有机体级表型，如生长动态和生物量积累，是理解农艺性状并优化作物生产的基础。然而，在规模上生成植物的优质生长数据是困难的。在这里，我们使用移动机器人平台捕获大规模水培叶菜系统的高分辨率环境感知和表型测量。我们描述了一种自监督建模方法，从观察到的生长数据构建一个从整个植物生长轨迹到地图的方法。我们通过预测该系统中作物未来的植物高度和收获质量来展示我们的方法。这种方法代表了在结合机器人自动化和机器学习方面的重大进展，同时为农艺研究和运营效率提供可操作的见解。

更新时间: 2025-07-08 18:55:11

领域: q-bio.QM,cs.LG,cs.RO

下载: http://arxiv.org/abs/2507.06336v1

AR2: Attention-Guided Repair for the Robustness of CNNs Against Common Corruptions

Deep neural networks suffer from significant performance degradation when exposed to common corruptions such as noise, blur, weather, and digital distortions, limiting their reliability in real-world applications. In this paper, we propose AR2 (Attention-Guided Repair for Robustness), a simple yet effective method to enhance the corruption robustness of pretrained CNNs. AR2 operates by explicitly aligning the class activation maps (CAMs) between clean and corrupted images, encouraging the model to maintain consistent attention even under input perturbations. Our approach follows an iterative repair strategy that alternates between CAM-guided refinement and standard fine-tuning, without requiring architectural changes. Extensive experiments show that AR2 consistently outperforms existing state-of-the-art methods in restoring robustness on standard corruption benchmarks (CIFAR-10-C, CIFAR-100-C and ImageNet-C), achieving a favorable balance between accuracy on clean data and corruption robustness. These results demonstrate that AR2 provides a robust and scalable solution for enhancing model reliability in real-world environments with diverse corruptions.

Updated: 2025-07-08 18:37:00

标题: AR2: 基于注意力引导的修复方法，增强CNN对常见破坏的稳健性

摘要: 深度神经网络在面对常见的污染，如噪音、模糊、天气和数字失真时，往往表现出明显的性能下降，限制了它们在真实世界应用中的可靠性。在本文中，我们提出了AR2（Attention-Guided Repair for Robustness），这是一种简单而有效的方法，用于增强预训练CNN的污染鲁棒性。AR2通过显式地对齐干净图像和受污染图像之间的类激活图（CAMs），鼓励模型在输入扰动下保持一致的注意力。我们的方法采用迭代修复策略，交替进行CAM引导的细化和标准微调，而无需进行架构更改。大量实验证明，AR2在标准污染基准测试（CIFAR-10-C、CIFAR-100-C和ImageNet-C）中始终优于现有的最先进方法，实现了在干净数据准确性和污染鲁棒性之间的有利平衡。这些结果表明，AR2为增强模型在真实世界环境中应对各种污染的可靠性提供了一种强大且可扩展的解决方案。

更新时间: 2025-07-08 18:37:00

领域: cs.CV,cs.LG,cs.SE

下载: http://arxiv.org/abs/2507.06332v1

MixAssist: An Audio-Language Dataset for Co-Creative AI Assistance in Music Mixing

While AI presents significant potential for enhancing music mixing and mastering workflows, current research predominantly emphasizes end-to-end automation or generation, often overlooking the collaborative and instructional dimensions vital for co-creative processes. This gap leaves artists, particularly amateurs seeking to develop expertise, underserved. To bridge this, we introduce MixAssist, a novel audio-language dataset capturing the situated, multi-turn dialogue between expert and amateur music producers during collaborative mixing sessions. Comprising 431 audio-grounded conversational turns derived from 7 in-depth sessions involving 12 producers, MixAssist provides a unique resource for training and evaluating audio-language models that can comprehend and respond to the complexities of real-world music production dialogues. Our evaluations, including automated LLM-as-a-judge assessments and human expert comparisons, demonstrate that fine-tuning models such as Qwen-Audio on MixAssist can yield promising results, with Qwen significantly outperforming other tested models in generating helpful, contextually relevant mixing advice. By focusing on co-creative instruction grounded in audio context, MixAssist enables the development of intelligent AI assistants designed to support and augment the creative process in music mixing.

Updated: 2025-07-08 18:33:26

标题: MixAssist：用于音乐混音中协同创作人工智能辅助的音频语言数据集

摘要: 虽然人工智能在增强音乐混音和母带处理工作流方面具有重要潜力，但目前的研究主要强调端到端的自动化或生成，通常忽视了对于协作和指导维度的重要性，这对于共同创作过程至关重要。这种差距使艺术家，特别是寻求发展专业知识的业余爱好者得不到应有的服务。为了弥合这一差距，我们介绍了MixAssist，这是一个新颖的音频语言数据集，捕捉了专家和业余音乐制作人在协作混音会话期间进行的情境化、多轮对话。MixAssist包括431个音频驱动的对话轮，这些对话轮来自于7个深入的会话，涉及12名制作人。MixAssist为训练和评估音频语言模型提供了独特的资源，这些模型可以理解并回应真实世界音乐制作对话的复杂性。我们的评估，包括自动化的LLM作为评判标准的评估和人类专家的比较，表明在MixAssist上对Qwen-Audio等模型进行微调可以产生令人期待的结果，Qwen在生成有用的、上下文相关的混音建议方面明显优于其他测试过的模型。通过专注于基于音频背景的共同创作指导，MixAssist促进了智能人工智能助手的开发，旨在支持和增强音乐混音中的创造过程。

更新时间: 2025-07-08 18:33:26

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.06329v1

Sample-Efficient Reinforcement Learning Controller for Deep Brain Stimulation in Parkinson's Disease

Deep brain stimulation (DBS) is an established intervention for Parkinson's disease (PD), but conventional open-loop systems lack adaptability, are energy-inefficient due to continuous stimulation, and provide limited personalization to individual neural dynamics. Adaptive DBS (aDBS) offers a closed-loop alternative, using biomarkers such as beta-band oscillations to dynamically modulate stimulation. While reinforcement learning (RL) holds promise for personalized aDBS control, existing methods suffer from high sample complexity, unstable exploration in binary action spaces, and limited deployability on resource-constrained hardware. We propose SEA-DBS, a sample-efficient actor-critic framework that addresses the core challenges of RL-based adaptive neurostimulation. SEA-DBS integrates a predictive reward model to reduce reliance on real-time feedback and employs Gumbel Softmax-based exploration for stable, differentiable policy updates in binary action spaces. Together, these components improve sample efficiency, exploration robustness, and compatibility with resource-constrained neuromodulatory hardware. We evaluate SEA-DBS on a biologically realistic simulation of Parkinsonian basal ganglia activity, demonstrating faster convergence, stronger suppression of pathological beta-band power, and resilience to post-training FP16 quantization. Our results show that SEA-DBS offers a practical and effective RL-based aDBS framework for real-time, resource-constrained neuromodulation.

Updated: 2025-07-08 18:30:26

标题: 帕金森病深部脑部刺激的样本高效强化学习控制器

摘要: 深部脑刺激（DBS）是帕金森病（PD）的已建立干预手段，但传统的开环系统缺乏适应性，由于持续刺激而能源效率低下，并且对个体神经动力学的个性化提供有限。自适应DBS（aDBS）提供了一个封闭环路的替代方案，利用生物标志物如β波振荡来动态调节刺激。虽然强化学习（RL）在个性化aDBS控制方面很有前途，但现有方法存在高样本复杂性、在二进制行动空间中不稳定的探索以及在资源受限硬件上的有限部署能力。我们提出了SEA-DBS，一个样本高效的演员-评论家框架，解决了基于RL的自适应神经刺激的核心挑战。SEA-DBS整合了预测性奖励模型以减少对实时反馈的依赖，并采用了基于Gumbel Softmax的探索，以在二进制行动空间中实现稳定、可微分的政策更新。这些组件共同提高了样本效率、探索鲁棒性，并与资源受限的神经调节硬件兼容。我们在帕金森病基底神经节活动的生物学模拟上评估了SEA-DBS，展示了更快的收敛速度、对病理性β波功率的更强抑制力以及对训练后FP16量化的稳健性。我们的结果表明，SEA-DBS为实时、资源受限的神经调节提供了一个实用有效的基于RL的aDBS框架。

更新时间: 2025-07-08 18:30:26

领域: cs.LG,cs.AI,cs.SY,eess.SY,q-bio.NC

下载: http://arxiv.org/abs/2507.06326v1

Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms

Large Language Model (LLM) agents face security vulnerabilities spanning AI-specific and traditional software domains, yet current research addresses these separately. This study bridges this gap through comparative evaluation of Function Calling architecture and Model Context Protocol (MCP) deployment paradigms using a unified threat classification framework. We tested 3,250 attack scenarios across seven language models, evaluating simple, composed, and chained attacks targeting both AI-specific threats (prompt injection) and software vulnerabilities (JSON injection, denial-of-service). Function Calling showed higher overall attack success rates (73.5% vs 62.59% for MCP), with greater system-centric vulnerability while MCP exhibited increased LLM-centric exposure. Attack complexity dramatically amplified effectiveness, with chained attacks achieving 91-96% success rates. Counterintuitively, advanced reasoning models demonstrated higher exploitability despite better threat detection. Results demonstrate that architectural choices fundamentally reshape threat landscapes. This work establishes methodological foundations for cross-domain LLM agent security assessment and provides evidence-based guidance for secure deployment. Code and experimental materials are available at https: // github. com/ theconsciouslab-ai/llm-agent-security.

Updated: 2025-07-08 18:24:28

标题: 将AI与软件安全桥接：LLM代理部署范式的比较性漏洞评估

摘要: 大型语言模型（LLM）代理面临涵盖人工智能特定和传统软件领域的安全漏洞，但目前的研究分别解决了这些问题。本研究通过比较评估功能调用架构和模型上下文协议（MCP）部署范例，使用统一的威胁分类框架来弥合这一差距。我们测试了3250个攻击方案，涵盖了七个语言模型，评估了简单、组合和链接攻击，旨在针对人工智能特定威胁（提示注入）和软件漏洞（JSON注入、拒绝服务）。功能调用显示出更高的整体攻击成功率（73.5% vs MCP的62.59%），系统中心化漏洞更大，而MCP表现出增加的LLM中心化暴露。攻击复杂性显著增加了效果，链接攻击实现了91-96%的成功率。令人费解的是，高级推理模型表现出更高的利用性，尽管威胁检测更好。结果表明，架构选择从根本上重新塑造了威胁景观。这项工作为跨领域LLM代理安全评估奠定了方法论基础，并为安全部署提供了基于证据的指导。代码和实验材料可在https://github.com/theconsciouslab-ai/llm-agent-security获取。

更新时间: 2025-07-08 18:24:28

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.06323v1

(How) Can Transformers Predict Pseudo-Random Numbers?

Transformers excel at discovering patterns in sequential data, yet their fundamental limitations and learning mechanisms remain crucial topics of investigation. In this paper, we study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs), defined by the recurrence relation $x_{t+1} = a x_t + c \;\mathrm{mod}\; m$. We find that with sufficient architectural capacity and training data variety, Transformers can perform in-context prediction of LCG sequences with unseen moduli ($m$) and parameters ($a,c$). By analyzing the embedding layers and attention patterns, we uncover how Transformers develop algorithmic structures to learn these sequences in two scenarios of increasing complexity. First, we investigate how Transformers learn LCG sequences with unseen ($a, c$) but fixed modulus; and demonstrate successful learning up to $m = 2^{32}$. We find that models learn to factorize $m$ and utilize digit-wise number representations to make sequential predictions. In the second, more challenging scenario of unseen moduli, we show that Transformers can generalize to unseen moduli up to $m_{\text{test}} = 2^{16}$. In this case, the model employs a two-step strategy: first estimating the unknown modulus from the context, then utilizing prime factorizations to generate predictions. For this task, we observe a sharp transition in the accuracy at a critical depth $d= 3$. We also find that the number of in-context sequence elements needed to reach high accuracy scales sublinearly with the modulus.

Updated: 2025-07-08 18:20:16

标题: (如何) 转换器能够预测伪随机数吗？

摘要: Transformer在发现序列数据中的模式方面表现出色，然而它们的基本限制和学习机制仍然是重要的研究课题。本文研究了Transformer学习线性同余生成器（LCGs）生成的伪随机数序列的能力，该生成器由递归关系$x_{t+1} = a x_t + c \;\mathrm{mod}\; m$定义。我们发现，通过具有足够的结构容量和训练数据多样性，Transformer可以对具有未知模数（$m$）和参数（$a,c$）的LCG序列进行上下文预测。通过分析嵌入层和注意力模式，我们揭示了Transformer如何在两种不断增加复杂性的情景中学习这些序列的算法结构。首先，我们研究了Transformer如何学习具有未知（$a, c$）但固定模数的LCG序列；并展示了成功学习直到$m = 2^{32}$。我们发现模型学会了分解$m$并利用数字逐位表示进行序列预测。在第二种更具挑战性的未知模数情景中，我们展示了Transformer可以推广到未知模数直到$m_{\text{test}} = 2^{16}$。在这种情况下，模型采用了两步策略：首先从上下文中估计未知模数，然后利用质因数分解生成预测。对于这个任务，我们观察到在关键深度$d= 3$处准确性发生了明显的转变。我们还发现，达到高准确性所需的上下文序列元素数量随模数的增加呈次线性缩放。

更新时间: 2025-07-08 18:20:16

领域: cs.LG,cond-mat.dis-nn,cs.CR,stat.ML

下载: http://arxiv.org/abs/2502.10390v2

Centralized Copy-Paste: Enhanced Data Augmentation Strategy for Wildland Fire Semantic Segmentation

Collecting and annotating images for the purpose of training segmentation models is often cost prohibitive. In the domain of wildland fire science, this challenge is further compounded by the scarcity of reliable public datasets with labeled ground truth. This paper presents the Centralized Copy-Paste Data Augmentation (CCPDA) method, for the purpose of assisting with the training of deep-learning multiclass segmentation models, with special focus on improving segmentation outcomes for the fire-class. CCPDA has three main steps: (i) identify fire clusters in the source image, (ii) apply a centralization technique to focus on the core of the fire area, and (iii) paste the refined fire clusters onto a target image. This method increases dataset diversity while preserving the essential characteristics of the fire class. The effectiveness of this augmentation technique is demonstrated via numerical analysis and comparison against various other augmentation methods using a weighted sum-based multi-objective optimization approach. This approach helps elevate segmentation performance metrics specific to the fire class, which carries significantly more operational significance than other classes (fuel, ash, or background). Numerical performance assessment validates the efficacy of the presented CCPDA method in alleviating the difficulties associated with small, manually labeled training datasets. It also illustrates that CCPDA outperforms other augmentation strategies in the application scenario considered, particularly in improving fire-class segmentation performance.

Updated: 2025-07-08 18:17:09

标题: 集中式复制粘贴：野外火灾语义分割的增强数据增强策略

摘要: 收集和注释图像以训练分割模型通常成本过高。在野火科学领域，这个挑战进一步加剧了，因为可靠的带有标记地面真实值的公共数据集稀缺。本文介绍了中心化复制粘贴数据增强（CCPDA）方法，用于辅助训练深度学习多类别分割模型，特别关注提高火灾类的分割结果。CCPDA有三个主要步骤：（i）在源图像中识别火灾聚类，（ii）应用中心化技术聚焦于火灾区域的核心，（iii）将精细化的火灾聚类粘贴到目标图像上。这种方法增加了数据集的多样性，同时保留了火灾类的基本特征。通过数字分析和与各种其他增强方法的比较，采用基于加权和的多目标优化方法展示了这种增强技术的有效性。这种方法有助于提高与火灾类相关的分割性能指标，这比其他类别（燃料、灰烬或背景）具有更重要的操作意义。数字性能评估验证了所提出的CCPDA方法在缓解与小型、手动标记的训练数据集相关的困难方面的有效性。它还说明了在考虑的应用场景中，CCPDA优于其他增强策略，特别是在改善火灾类分割性能方面。

更新时间: 2025-07-08 18:17:09

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.06321v1

MedGellan: LLM-Generated Medical Guidance to Support Physicians

Medical decision-making is a critical task, where errors can result in serious, potentially life-threatening consequences. While full automation remains challenging, hybrid frameworks that combine machine intelligence with human oversight offer a practical alternative. In this paper, we present MedGellan, a lightweight, annotation-free framework that uses a Large Language Model (LLM) to generate clinical guidance from raw medical records, which is then used by a physician to predict diagnoses. MedGellan uses a Bayesian-inspired prompting strategy that respects the temporal order of clinical data. Preliminary experiments show that the guidance generated by the LLM with MedGellan improves diagnostic performance, particularly in recall and $F_1$ score.

Updated: 2025-07-08 18:16:26

标题: MedGellan：由LLM生成的医学指导，以支持医生

摘要: 医疗决策是一项至关重要的任务，错误可能导致严重甚至危及生命的后果。虽然完全自动化仍然具有挑战性，但将机器智能与人类监督相结合的混合框架提供了一个实用的替代方案。在本文中，我们提出了MedGellan，一个轻量级、无需注释的框架，它利用大型语言模型（LLM）从原始医疗记录中生成临床指导，然后由医生用于预测诊断。MedGellan采用了一个受贝叶斯启发的提示策略，尊重临床数据的时间顺序。初步实验显示，由LLM生成的MedGellan指导改善了诊断性能，特别是在召回率和$F_1$分数方面。

更新时间: 2025-07-08 18:16:26

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.04431v2

RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose $\textbf{RefineX}$, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.

Updated: 2025-07-08 18:15:09

标题: RefineX：从专家指导程序中学习如何在规模上优化预训练数据

摘要: 大型语言模型（LLMs）的基础能力深受其预训练语料库质量的影响。然而，在规模上提升数据质量仍然是一个重大挑战，主要是由于精炼效果和处理效率之间的权衡。虽然基于规则的过滤仍然是主导范式，但通常在文档级别操作，缺乏细化文档内特定内容的粒度。受到ProX等新兴工作的启发，我们提出了一种新颖的框架$\textbf{RefineX}$，用于通过程序化编辑任务对预训练数据进行大规模、精准的细化。RefineX能够在可靠地保留原始文本的多样性和自然性的同时，实现高效和精细的数据细化。RefineX的核心优势在于将高质量的专家引导的端到端细化结果提炼成最小的基于编辑的删除程序。这种高精度的提炼管道用于训练一个高效且可靠的细化模型，可以系统地提升语料库中的每个实例。我们在多个模型规模的从头开始预训练上评估了RefineX，并发现它在各种下游任务上始终优于在原始、经过过滤或经过其他方式细化的数据上训练的模型。在750M模型上，RefineX在lighteval任务上平均获得2.6%-7.2%的增益，并且实现相当的性能，使用的训练令牌数量显著更少。进一步的分析显示，RefineX可靠地提升文本质量，同时具有高效性和精确性，优于以往的方法，如端到端生成和Prox-C。这些结果将RefineX定位为现代LLM管道中优化预训练数据的可扩展、有效和可靠的解决方案。

更新时间: 2025-07-08 18:15:09

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.03253v2

Implicit Neural Representations for Chemical Reaction Paths

We show that neural networks can be optimized to represent minimum energy paths as continuous functions, offering a flexible alternative to discrete path-search methods such as Nudged Elastic Band (NEB). Our approach parameterizes reaction paths with a network trained on a loss function that discards tangential energy gradients and enables instant estimation of the transition state. We first validate the method on two-dimensional potentials and then demonstrate its advantages over NEB on challenging atomistic systems where (i) poor initial guesses yield unphysical paths, (ii) multiple competing paths exist, or (iii) the reaction follows a complex multi-step mechanism. Results highlight the versatility of the method: for instance, a simple adjustment to the sampling strategy during optimization can help escape local-minimum solutions. Finally, in a low-dimensional setting, we demonstrate that a single neural network can learn from existing paths and generalize to unseen systems, showing promise for a universal reaction path representation.

Updated: 2025-07-08 18:11:58

标题: 化学反应路径的隐式神经表示

摘要: 我们展示了神经网络可以被优化为将最小能量路径表示为连续函数，提供了一种灵活的替代离散路径搜索方法，如Nudged Elastic Band（NEB）。我们的方法使用一个网络对反应路径进行参数化，该网络经过训练的损失函数丢弃了切向能量梯度，并能够即时估计过渡态。我们首先在二维势能上验证了该方法，然后展示了它在具有挑战性的原子系统上优于NEB的优势，其中（i）糟糕的初始猜测导致非物理路径，（ii）存在多条竞争路径，或者（iii）反应遵循复杂的多步机制。结果突出了该方法的多功能性：例如，在优化过程中对采样策略进行简单调整可以帮助避免局部最小解。最后，在低维设置中，我们展示了单个神经网络可以从现有路径中学习并推广到未知系统，显示了通用反应路径表示的潜力。

更新时间: 2025-07-08 18:11:58

领域: cs.LG,physics.chem-ph

下载: http://arxiv.org/abs/2502.15843v3

AI-Driven Scholarly Peer Review via Persistent Workflow Prompting, Meta-Prompting, and Meta-Reasoning

Critical peer review of scientific manuscripts presents a significant challenge for Large Language Models (LLMs), partly due to data limitations and the complexity of expert reasoning. This report introduces Persistent Workflow Prompting (PWP), a potentially broadly applicable prompt engineering methodology designed to bridge this gap using standard LLM chat interfaces (zero-code, no APIs). We present a proof-of-concept PWP prompt for the critical analysis of experimental chemistry manuscripts, featuring a hierarchical, modular architecture (structured via Markdown) that defines detailed analysis workflows. We develop this PWP prompt through iterative application of meta-prompting techniques and meta-reasoning aimed at systematically codifying expert review workflows, including tacit knowledge. Submitted once at the start of a session, this PWP prompt equips the LLM with persistent workflows triggered by subsequent queries, guiding modern reasoning LLMs through systematic, multimodal evaluations. Demonstrations show the PWP-guided LLM identifying major methodological flaws in a test case while mitigating LLM input bias and performing complex tasks, including distinguishing claims from evidence, integrating text/photo/figure analysis to infer parameters, executing quantitative feasibility checks, comparing estimates against claims, and assessing a priori plausibility. To ensure transparency and facilitate replication, we provide full prompts, detailed demonstration analyses, and logs of interactive chats as supplementary resources. Beyond the specific application, this work offers insights into the meta-development process itself, highlighting the potential of PWP, informed by detailed workflow formalization, to enable sophisticated analysis using readily available LLMs for complex scientific tasks.

Updated: 2025-07-08 18:03:14

标题: 基于人工智能的学术同行评审：通过持续的工作流提示、元提示和元推理

摘要: 科学手稿的关键同行评审对于大型语言模型（LLMs）来说是一个重大挑战，部分原因是由于数据限制和专家推理的复杂性。本报告介绍了持续工作流提示（PWP），这是一种可能广泛适用的提示工程方法，旨在通过标准的LLM聊天界面（零代码，无API）弥合这一差距。我们提出了一个针对实验化学手稿的PWP提示的概念验证，其具有分层、模块化的体系结构（通过Markdown结构化），定义了详细的分析工作流程。我们通过迭代应用元提示技术和元推理来开发这个PWP提示，旨在系统地编码专家评审工作流程，包括隐性知识。一旦在会话开始时提交，这个PWP提示将为LLM提供由后续查询触发的持续工作流，引导现代推理LLM进行系统的多模态评估。演示显示，PWP引导的LLM在一个测试案例中识别了主要的方法ological缺陷，同时减轻了LLM的输入偏见并执行了复杂的任务，包括区分声明和证据，整合文本/照片/图表分析以推断参数，执行定量可行性检查，将估计值与声明进行比较，以及评估先验可信度。为确保透明度并促进复制，我们提供完整的提示、详细的演示分析以及互动聊天记录作为补充资源。除了特定的应用外，这项工作还提供了关于元开发过程本身的见解，突出了PWP的潜力，通过详细的工作流程形式化，使得使用现成的LLMs进行复杂科学任务的精细分析成为可能。

更新时间: 2025-07-08 18:03:14

领域: cs.AI,physics.chem-ph

下载: http://arxiv.org/abs/2505.03332v4

Too Human to Model:The Uncanny Valley of LLMs in Social Simulation -- When Generative Language Agents Misalign with Modelling Principles

Large language models (LLMs) have been increasingly used to build agents in social simulation because of their impressive abilities to generate fluent, contextually coherent dialogues. Such abilities can enhance the realism of models. However, the pursuit of realism is not necessarily compatible with the epistemic foundation of modelling. We argue that LLM agents, in many regards, are too human to model: they are too expressive, detailed and intractable to be consistent with the abstraction, simplification, and interpretability typically demanded by modelling. Through a model-building thought experiment that converts the Bass diffusion model to an LLM-based variant, we uncover five core dilemmas: a temporal resolution mismatch between natural conversation and abstract time steps; the need for intervention in conversations while avoiding undermining spontaneous agent outputs; the temptation to introduce rule-like instructions in prompts while maintaining conversational naturalness; the tension between role consistency and role evolution across time; and the challenge of understanding emergence, where system-level patterns become obscured by verbose micro textual outputs. These dilemmas steer the LLM agents towards an uncanny valley: not abstract enough to clarify underlying social mechanisms, while not natural enough to represent realistic human behaviour. This exposes an important paradox: the realism of LLM agents can obscure, rather than clarify, social dynamics when misapplied. We tease out the conditions in which LLM agents are ideally suited: where system-level emergence is not the focus, linguistic nuances and meaning are central, interactions unfold in natural time, and stable role identity is more important than long-term behavioural evolution. We call for repositioning LLM agents in the ecosystem of social simulation for future applications.

Updated: 2025-07-08 18:02:36

标题: 无法模拟的太人类：社会模拟中LLMs的不令人愉快山谷--当生成语言代理与建模原则不一致时

摘要: 大型语言模型(LLMs)在社会模拟中越来越被广泛应用，因为它们具有生成流畅、上下文连贯对话的出色能力。这种能力可以增强模型的逼真性。然而，逼真性的追求并不一定与建模的认识基础相容。我们认为，在许多方面，LLM代理过于具有人性化，难以建模：它们过于表达丰富、细节化和棘手，与建模通常要求的抽象化、简化和可解释性不一致。通过将巴斯扩散模型转换为基于LLM的变体的模型构建思想实验，我们揭示了五个核心困境：自然对话和抽象时间步之间的时间分辨率不匹配；需要干预对话，同时避免破坏自发代理输出；在提示中引入类似规则的指令的诱惑，同时保持对话的自然性；角色一致性和随时间演变之间的张力；以及理解出现的挑战，系统级模式被冗长的微观文本输出所掩盖。这些困境将LLM代理引向一个怪异谷：不够抽象以澄清潜在的社会机制，同时又不够自然以代表现实的人类行为。这暴露了一个重要的悖论：当错误应用时，LLM代理的逼真性可能会掩盖而不是澄清社会动态。我们挖掘了LLM代理理想适用的条件：系统级出现并非重点，语言细微差异和含义至关重要，互动在自然时间中展开，稳定的角色身份比长期行为演变更重要。我们呼吁重新定位LLM代理在未来应用的社会模拟生态系统中的位置。

更新时间: 2025-07-08 18:02:36

领域: cs.CY,cs.AI,cs.MA

下载: http://arxiv.org/abs/2507.06310v1

Humans overrely on overconfident language models, across languages

As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Previous work has shown that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., 'It's definitely,' 'I think') can differ sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate the safety of LLMs in a global context. We find that overreliance risks are high across all languages. We first analyze the distribution of LLM-generated epistemic markers, and observe that while LLMs are cross-linguistically overconfident, they are also sensitive to documented linguistic variation. For example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. We then measure human reliance rates across languages, finding that while users strongly rely on confident LLM generations in all languages, reliance behaviors differ cross-linguistically: for example, users rely significantly more on expressions of uncertainty in Japanese than in English. Taken together, these results indicate high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.

Updated: 2025-07-08 18:01:01

标题: 人类在各种语言中过度依赖自信的语言模型

摘要: 随着大型语言模型（LLMs）在全球范围内的部署，确保它们的响应在不同语言间进行校准以准确传达不确定性和限制是至关重要的。先前的研究表明，在英语中，LLMs在语言上表现过于自信，导致用户过度依赖自信的生成物。然而，认识标记（例如，“它肯定是”，“我认为”）的使用和解释在不同语言间可能存在显著差异。在这里，我们研究了五种语言中多语言语言（误）校准、过度自信和过度依赖的风险，以评估LLMs在全球范围内的安全性。我们发现，在所有语言中，过度依赖的风险都很高。我们首先分析了LLM生成的认知标记的分布，并观察到虽然LLMs在跨语言上表现出过度自信，但它们也对已记录的语言变化敏感。例如，模型在日语中生成最多的不确定性标记，在德语和普通话中生成最多的确定性标记。然后，我们测量了不同语言中人类依赖率，发现虽然用户在所有语言中都强烈依赖自信的LLM生成物，但依赖行为在跨语言间有所不同：例如，在日语中，用户更加依赖于不确定性表达，而不是在英语中。总的来说，这些结果表明在不同语言中过度自信的模型生成物依赖风险很高。我们的研究结果突显了多语言语言校准的挑战，并强调了文化和语言背景下模型安全评估的重要性。

更新时间: 2025-07-08 18:01:01

领域: cs.CL,cs.AI,cs.HC

下载: http://arxiv.org/abs/2507.06306v1

Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other's experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.

Updated: 2025-07-08 17:59:22

标题: Agent KB：利用跨领域经验进行代理问题解决

摘要: 随着语言代理人处理越来越复杂的任务，他们在有效的错误纠正和跨领域经验重用方面面临困难。我们引入了代理KB，这是一个分层经验框架，通过一种新颖的Reason-Retrieve-Refine流程实现复杂的代理问题解决。代理KB解决了一个核心限制：传统上代理不能从彼此的经验中学习。通过捕获高级策略和详细的执行日志，代理KB创建了一个共享知识库，可实现跨代理的知识转移。在GAIA基准测试中评估，代理KB将成功率提高了高达16.28个百分点。在最具挑战性的任务上，Claude-3的成功率从38.46%提高到57.69%，而GPT-4在中级任务上的成功率从53.49%提高到73.26%。在SWE-bench代码修复中，代理KB使Claude-3的成功率从41.33%提高到53.33%。我们的结果表明，代理KB提供了一个模块化、框架无关的基础设施，使代理能够从过去的经验中学习，并将成功的策略推广到新任务中。

更新时间: 2025-07-08 17:59:22

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.06229v1

EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow

Current language-guided robotic manipulation systems often require low-level action-labeled datasets for imitation learning. While object-centric flow prediction methods mitigate this issue, they remain limited to scenarios involving rigid objects with clear displacement and minimal occlusion. In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation from action-unlabeled videos by predicting embodiment-centric flow. Our key insight is that incorporating the embodiment's inherent kinematics significantly enhances generalization to versatile manipulation scenarios, including deformable object handling, occlusions, and non-object-displacement tasks. To connect the EC-Flow with language instructions and object interactions, we further introduce a goal-alignment module by jointly optimizing movement consistency and goal-image prediction. Moreover, translating EC-Flow to executable robot actions only requires a standard robot URDF (Unified Robot Description Format) file to specify kinematic constraints across joints, which makes it easy to use in practice. We validate EC-Flow on both simulation (Meta-World) and real-world tasks, demonstrating its state-of-the-art performance in occluded object handling (62% improvement), deformable object manipulation (45% improvement), and non-object-displacement tasks (80% improvement) than prior state-of-the-art object-centric flow methods. For more information, see our project website at https://ec-flow1.github.io .

Updated: 2025-07-08 17:57:03

标题: EC-Flow：通过以实体为中心的流实现从未标记动作视频的多功能机器人操作

摘要: 目前的语言引导的机器人操作系统通常需要低级别的动作标记数据集用于模仿学习。尽管以物体为中心的流预测方法缓解了这个问题，但它们仍然局限于涉及具有明显位移和最小遮挡的刚性物体的情况。在这项工作中，我们提出了一种直接从动作未标记的视频中通过预测以物体为中心的流来学习操作的框架，即Embodiment-Centric Flow (EC-Flow)。我们的关键见解是，将具体体的固有运动学纳入其中显著增强了对多样化操作场景的泛化能力，包括可变形物体处理、遮挡和非物体位移任务。为了将EC-Flow与语言说明和物体交互联系起来，我们进一步引入了一个目标对齐模块，通过联合优化运动一致性和目标图像预测。此外，将EC-Flow转换为可执行的机器人动作只需要一个标准的机器人URDF（统一机器人描述格式）文件来指定关节之间的运动学约束，这使得在实践中易于使用。我们在模拟（Meta-World）和现实世界任务上验证了EC-Flow，展示了它在遮挡物体处理（62％改善）、可变形物体操作（45％改善）和非物体位移任务（80％改善）方面的最新性能优于先前的以物体为中心的流方法。有关更多信息，请访问我们的项目网站：https://ec-flow1.github.io。

更新时间: 2025-07-08 17:57:03

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.06224v1

Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. However, these metrics depend on hardware and running-time choices (\eg parallel or not, batch size, etc), and often fail to account for model size, making it difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose E\textsuperscript{2}R-FLOPs, for LLM-based rerankers: ranking metrics per PetaFLOP (RPP) for relevance per compute and queries per PetaFLOP (QPP) for hardware-agnostic throughput. Companied with the new metrics, an interpretable FLOPs estimator is built to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architecture, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.

Updated: 2025-07-08 17:56:28

标题: 基于LLM的reranker的效率-效果重新排列FLOPs

摘要: 最近，大型语言模型（LLMs）已被应用于信息检索中的重新排序任务，取得了强大的性能。然而，它们高计算需求通常阻碍了实际部署。现有研究评估基于LLM的重新排序器的效率，使用代理指标如延迟、前向传递次数、输入标记和输出标记。然而，这些指标取决于硬件和运行时间选择（如并行或否、批量大小等），并且通常未考虑模型大小，使得难以解释和模糊了效率-有效性折衷的评估。为解决这一问题，我们提出了基于LLM的重新排序器的E\textsuperscript{2}R-FLOPs：每PetaFLOP（RPP）的相关性排名指标，每PetaFLOP（QPP）的硬件无关吞吐量。伴随着新指标，建立了一个可解释的FLOPs估计器，可估计基于LLM的重新排序器的FLOPs，甚至无需运行任何实验。基于提出的指标，我们进行了全面的实验，评估了各种架构的LLM-based重新排序器的效率-有效性折衷，并引起了研究界对这一问题的关注。

更新时间: 2025-07-08 17:56:28

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.06223v1

Deep Learning Optimization of Two-State Pinching Antennas Systems

The evolution of wireless communication systems requires flexible, energy-efficient, and cost-effective antenna technologies. Pinching antennas (PAs), which can dynamically control electromagnetic wave propagation through binary activation states, have recently emerged as a promising candidate. In this work, we investigate the problem of optimally selecting a subset of fixed-position PAs to activate in a waveguide, when the aim is to maximize the communication rate at a user terminal. Due to the complex interplay between antenna activation, waveguide-induced phase shifts, and power division, this problem is formulated as a combinatorial fractional 0-1 quadratic program. To efficiently solve this challenging problem, we use neural network architectures of varying complexity to learn activation policies directly from data, leveraging spatial features and signal structure. Furthermore, we incorporate user location uncertainty into our training and evaluation pipeline to simulate realistic deployment conditions. Simulation results demonstrate the effectiveness and robustness of the proposed models.

Updated: 2025-07-08 17:55:54

标题: 深度学习优化的两态夹持天线系统

摘要: 无线通信系统的发展需要灵活、节能和经济高效的天线技术。捏合天线（PAs）最近作为一种有前途的候选者出现，它可以通过二进制激活状态动态控制电磁波的传播。在这项工作中，我们研究了在波导中最优选择一组固定位置的PAs进行激活，以最大化用户终端的通信速率。由于天线激活、波导诱导的相位变化和功率分配之间的复杂相互作用，这个问题被制定为一个组合分数0-1二次规划问题。为了有效解决这个具有挑战性的问题，我们利用不同复杂度的神经网络架构直接从数据中学习激活策略，利用空间特征和信号结构。此外，我们将用户位置不确定性纳入我们的训练和评估流程中，以模拟真实的部署条件。模拟结果显示了所提出模型的有效性和稳健性。

更新时间: 2025-07-08 17:55:54

领域: cs.LG

下载: http://arxiv.org/abs/2507.06222v1

Aligned Textual Scoring Rules

Scoring rules elicit probabilistic predictions from a strategic agent by scoring the prediction against a ground truth state. A scoring rule is proper if, from the agent's perspective, reporting the true belief maximizes the expected score. With the development of language models, Wu and Hartline (2024) proposes a reduction from textual information elicitation to the numerical (i.e. probabilistic) information elicitation problem, which achieves provable properness for textual elicitation. However, not all proper scoring rules are well aligned with human preference over text. Our paper designs the Aligned Scoring rule (ASR) for text by optimizing and minimizing the mean squared error between a proper scoring rule and a reference score (e.g. human score). Our experiments show that our ASR outperforms previous methods in aligning with human preference while maintaining properness.

Updated: 2025-07-08 17:53:22

标题: 对齐文本评分规则

摘要: 评分规则通过将预测与基本事实状态进行评分，从而从战略代理那里引出概率预测。如果从代理人的角度来看，报告真实信念可以最大化期望分数，则评分规则是正确的。随着语言模型的发展，吴和哈特林（2024年）提出了一种从文本信息引出到数值（即概率）信息引出问题的简化方法，这实现了对文本引出的可证明正确性。然而，并非所有正确的评分规则都与人类对文本的偏好完全一致。我们的论文通过优化和最小化正确评分规则与参考分数（例如人类评分）之间的均方误差，设计了适用于文本的对齐评分规则（ASR）。我们的实验表明，我们的ASR在与人类偏好保持正确性的同时，优于先前的方法。

更新时间: 2025-07-08 17:53:22

领域: cs.AI,cs.GT

下载: http://arxiv.org/abs/2507.06221v1

Is Diversity All You Need for Scalable Robotic Manipulation?

Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions-task (what to do), embodiment (which robot to use), and expert (who demonstrates)-challenging the conventional intuition of "more diverse is better". Throughout extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer-models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity, the yielding GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.

Updated: 2025-07-08 17:52:44

标题: 多样性是否是实现可扩展机器人操作所需要的全部？

摘要: 数据缩放已经在自然语言处理（NLP）和计算机视觉（CV）的基础模型中取得了显著的成功，然而在机器人操作中有效数据缩放的原则仍然不够清晰。在这项研究中，我们通过研究数据多样性在机器人学习中的微妙作用，探讨了任务（该做什么）、实体（使用哪种机器人）和专家（谁进行演示）这三个关键维度，挑战了“更多样性就更好”的传统直觉。通过在各种机器人平台上进行大量实验，我们揭示了以下结论：（1）任务多样性比每个任务的演示数量更为关键，有利于从多样化的预训练任务向新颖的下游场景进行迁移；（2）多实体预训练数据对跨实体转移是可选的-在高质量单实体数据上训练的模型可以有效地转移到不同的平台，在微调期间显示出更理想的缩放特性，优于多实体预训练模型；以及（3）专家多样性，源自个体操作偏好和人类演示中的随机变化，可能对策略学习造成混淆，速度多模态性成为一个关键的贡献因素。基于这一洞见，我们提出了一种分布去偏方法来减少速度模糊，其结果GO-1-Pro实现了相当大的性能提升，相当于使用2.5倍的预训练数据。总的来说，这些发现提供了新的视角，并为如何有效扩展机器人操作数据集提供了实用指导。

更新时间: 2025-07-08 17:52:44

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.06219v1

What ZTF Saw Where Rubin Looked: Anomaly Hunting in DR23

We present results from the SNAD VIII Workshop, during which we conducted the first systematic anomaly search in the ZTF fields also observed by LSSTComCam during Rubin Scientific Pipeline commissioning. Using the PineForest active anomaly detection algorithm, we analysed four selected fields (two galactic and two extragalactic) and visually inspected 400 candidates. As a result, we discovered six previously uncatalogued variable stars, including RS~CVn, BY Draconis, ellipsoidal, and solar-type variables, and refined classifications and periods for six known objects. These results demonstrate the effectiveness of the SNAD anomaly detection pipeline and provide a preview of the discovery potential in the upcoming LSST data.

Updated: 2025-07-08 17:50:17

标题: ZTF看到了什么，Rubin看到了什么：在DR23中寻找异常

摘要: 我们介绍了来自SNAD VIII研讨会的结果，我们在其中进行了首次在LSSTComCam也观测到的ZTF领域中进行系统异常搜索。使用PineForest主动异常检测算法，我们分析了四个选定的领域（两个银河和两个外星系），并对400个候选对象进行了视觉检查。结果发现了六颗先前未编目的可变星，包括RS~CVn，BY Draconis，椭圆形和太阳型变星，并对六个已知对象的分类和周期进行了精细化。这些结果证明了SNAD异常检测管道的有效性，并展示了即将到来的LSST数据中的发现潜力。

更新时间: 2025-07-08 17:50:17

领域: astro-ph.IM,astro-ph.GA,astro-ph.SR,cs.LG

下载: http://arxiv.org/abs/2507.06217v1

Embedding Atlas: Low-Friction, Interactive Embedding Visualization

Embedding projections are popular for visualizing large datasets and models. However, people often encounter "friction" when using embedding visualization tools: (1) barriers to adoption, e.g., tedious data wrangling and loading, scalability limits, no integration of results into existing workflows, and (2) limitations in possible analyses, without integration with external tools to additionally show coordinated views of metadata. In this paper, we present Embedding Atlas, a scalable, interactive visualization tool designed to make interacting with large embeddings as easy as possible. Embedding Atlas uses modern web technologies and advanced algorithms -- including density-based clustering, and automated labeling -- to provide a fast and rich data analysis experience at scale. We evaluate Embedding Atlas with a competitive analysis against other popular embedding tools, showing that Embedding Atlas's feature set specifically helps reduce friction, and report a benchmark on its real-time rendering performance with millions of points. Embedding Atlas is available as open source to support future work in embedding-based analysis.

Updated: 2025-07-08 17:49:59

标题: 嵌入式地图：低摩擦、交互式嵌入式可视化

摘要: 嵌入式投影在可视化大型数据集和模型方面非常受欢迎。然而，人们在使用嵌入式可视化工具时经常遇到“摩擦”问题：（1）采用障碍，例如繁琐的数据整理和加载、可扩展性限制、无法将结果集成到现有工作流程中；（2）在可能的分析方面存在限制，没有与外部工具集成以进一步显示元数据的协调视图。在本文中，我们介绍了Embedding Atlas，这是一款可扩展、交互式可视化工具，旨在使与大型嵌入式交互尽可能简单。Embedding Atlas使用现代网络技术和先进算法（包括基于密度的聚类和自动标记）提供了一个在规模上快速丰富的数据分析体验。我们通过与其他流行的嵌入式工具进行竞争分析来评估Embedding Atlas，显示Embedding Atlas的功能集有助于减少摩擦，并报告其在实时渲染性能上的百万点的基准测试。Embedding Atlas作为开源软件提供，以支持未来基于嵌入式的分析工作。

更新时间: 2025-07-08 17:49:59

领域: cs.HC,cs.LG

下载: http://arxiv.org/abs/2505.06386v2

Instruction Following by Boosting Attention of Large Language Models

Controlling the generation of large language models (LLMs) remains a central challenge to ensure their safe and reliable deployment. While prompt engineering and finetuning are common approaches, recent work has explored latent steering, a lightweight technique that alters LLM internal activations to guide generation. However, subsequent studies revealed latent steering's effectiveness to be limited, often underperforming simple instruction prompting. To address this limitation, we first establish a benchmark across diverse behaviors for standardized evaluation of steering techniques. Building on insights from this benchmark, we introduce Instruction Attention Boosting (InstABoost), a latent steering method that boosts the strength of instruction prompting by altering the model's attention during generation. InstABoost combines the strengths of existing approaches and is theoretically supported by prior work that suggests that in-context rule following in transformer-based models can be controlled by manipulating attention on instructions. Empirically, InstABoost demonstrates superior control success compared to both traditional prompting and latent steering.

Updated: 2025-07-08 17:48:59

标题: 大语言模型通过增强注意力机制来遵循指令

摘要: 控制大型语言模型（LLMs）的生成仍然是确保它们安全可靠部署的关键挑战。尽管提示工程和微调是常见的方法，但最近的工作探索了潜在引导，一种轻量级技术，可以改变LLM内部激活以指导生成。然而，随后的研究揭示了潜在引导的有效性有限，通常表现不如简单的指令提示。为了解决这一限制，我们首先在不同行为方面建立了一个基准，用于标准化评估引导技术。借鉴这一基准的见解，我们引入了Instruction Attention Boosting（InstABoost），一种通过在生成过程中改变模型的注意力来增强指令提示强度的潜在引导方法。InstABoost结合了现有方法的优势，并在理论上得到支持，先前的研究表明，在基于转换器的模型中，通过操纵指令的注意力可以控制上下文规则的遵循。从经验上讲，InstABoost相比传统提示和潜在引导表现出更优越的控制成功。

更新时间: 2025-07-08 17:48:59

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2506.13734v2

Identifiability in Causal Abstractions: A Hierarchy of Criteria

Identifying the effect of a treatment from observational data typically requires assuming a fully specified causal diagram. However, such diagrams are rarely known in practice, especially in complex or high-dimensional settings. To overcome this limitation, recent works have explored the use of causal abstractions-simplified representations that retain partial causal information. In this paper, we consider causal abstractions formalized as collections of causal diagrams, and focus on the identifiability of causal queries within such collections. We introduce and formalize several identifiability criteria under this setting. Our main contribution is to organize these criteria into a structured hierarchy, highlighting their relationships. This hierarchical view enables a clearer understanding of what can be identified under varying levels of causal knowledge. We illustrate our framework through examples from the literature and provide tools to reason about identifiability when full causal knowledge is unavailable.

Updated: 2025-07-08 17:46:08

标题: 因果抽象中的可识别性：一组标准的层次结构

摘要: 在观察数据中确定治疗效果通常需要假设一个完全指定的因果图。然而，在实践中很少知道这样的图，特别是在复杂或高维设置中。为了克服这一限制，最近的研究已经探讨了因果抽象的使用-简化的表示保留了部分因果信息。在本文中，我们将因果抽象形式化为因果图的集合，并关注这种集合内因果查询的可辨识性。我们介绍并形式化了在这种设置下的几个可辨识性标准。我们的主要贡献是将这些标准组织成一个结构化的层次结构，突出它们之间的关系。这种层次结构的视角使人们更清楚地理解在不同水平的因果知识下可以确定什么。我们通过文献中的例子说明了我们的框架，并提供了在没有完全因果知识的情况下推理可辨识性的工具。

更新时间: 2025-07-08 17:46:08

领域: cs.AI

下载: http://arxiv.org/abs/2507.06213v1

StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present a novel approach that transforms the original sequential denoising into the batching denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid and high throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance(CFG), which requires additional U-Net computation. To mitigate the redundant computations, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. Besides, we introduce a stochastic similarity filter(SSF) to optimize power consumption. Our Stream Batch achieves around 1.5x speedup compared to the sequential denoising method at different denoising levels. The proposed RCFG leads to speeds up to 2.05x higher than the conventional CFG. Combining the proposed strategies and existing mature acceleration tools makes the image-to-image generation achieve up-to 91.07fps on one RTX4090, improving the throughputs of AutoPipline developed by Diffusers over 59.56x. Furthermore, our proposed StreamDiffusion also significantly reduces the energy consumption by 2.39x on one RTX3060 and 1.99x on one RTX4090, respectively.

Updated: 2025-07-08 17:45:49

标题: StreamDiffusion：一种用于实时交互式生成的流水线级解决方案

摘要: 我们介绍了StreamDiffusion，这是一个为交互式图像生成设计的实时扩散流水线。现有的扩散模型擅长根据文本或图像提示创建图像，但它们在实时交互方面往往表现不佳。这种限制在涉及连续输入的情景中特别明显，比如元宇宙、实时视频流和广播，在这些情况下高吞吐量是至关重要的。为了解决这个问题，我们提出了一种新颖的方法，将原始的顺序去噪转变为批处理去噪过程。Stream Batch消除了传统的等待交互方法，实现了流畅和高吞吐量的流。为了处理数据输入和模型吞吐量之间的频率差异，我们设计了一个新颖的输入输出队列，用于并行化流处理过程。此外，现有的扩散流水线使用无分类器引导(CFG)，这需要额外的U-Net计算。为了减少冗余计算，我们提出了一种新颖的残差无分类器引导(RCFG)算法，将负条件去噪步骤的数量减少到仅为一个甚至零个。此外，我们引入了一种随机相似性滤波器(SSF)来优化功耗。我们的Stream Batch在不同去噪级别上实现了约1.5倍的加速，相比顺序去噪方法。所提出的RCFG的速度比传统CFG高出2.05倍。结合所提出的策略和现有成熟的加速工具，使图像生成在一台RTX4090上实现了高达91.07fps的速度，将Diffusers开发的AutoPipline的吞吐量提高了59.56倍。此外，我们提出的StreamDiffusion在一台RTX3060上将能耗降低了2.39倍，在一台RTX4090上降低了1.99倍。

更新时间: 2025-07-08 17:45:49

领域: cs.CV,cs.GR,cs.LG

下载: http://arxiv.org/abs/2312.12491v2

The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot

Generative artificial intelligence (AI) enables automated content production, including coding in software development, which can significantly influence developer participation and performance. To explore its impact on collaborative open-source software (OSS) development, we investigate the role of GitHub Copilot, a generative AI pair programmer, in OSS development where multiple distributed developers voluntarily collaborate. Using GitHub's proprietary Copilot usage data, combined with public OSS repository data obtained from GitHub, we find that Copilot use increases project-level code contributions by 5.9%. This gain is driven by a 2.1% increase in individual code contributions and a 3.4% rise in developer coding participation. However, these benefits come at a cost as coordination time for code integration increases by 8% due to more code discussions enabled by AI pair programmers. This reveals an important tradeoff: While AI expands who can contribute and how much they contribute, it slows coordination in collective development efforts. Despite this tension, the combined effect of these two competing forces remains positive, indicating a net gain in overall project-level productivity from using AI pair programmers. Interestingly, we also find the effects differ across developer roles. Peripheral developers show relatively smaller gains in project-level code contributions and face a higher increase in coordination time than core developers, likely due to the difference in their project familiarity. In summary, our study underscores the dual role of AI pair programmers in affecting project-level code contributions and coordination time in OSS development. Our findings on the differential effects between core and peripheral developers also provide important implications for the structure of OSS communities in the long run.

Updated: 2025-07-08 17:44:42

标题: 生成式人工智能对协作式开源软件开发的影响：来自GitHub Copilot的证据

摘要: 生成式人工智能（AI）可以实现自动化内容生产，包括在软件开发中进行编码，这可以极大地影响开发者的参与和表现。为了探究其对协作式开源软件（OSS）开发的影响，我们调查了GitHub Copilot在多个分布式开发者自愿合作的OSS开发中的角色，GitHub Copilot是一种生成式AI对编程伙伴。利用GitHub的专有Copilot使用数据，结合从GitHub获取的公共OSS存储库数据，我们发现Copilot的使用使项目级别的代码贡献增加了5.9％。这一增长是由个人代码贡献增加了2.1％和开发者编码参与增加了3.4％推动的。然而，这些好处是以协调时间增加8％为代价的，原因是AI对编程伙伴所启用的更多代码讨论。这揭示了一个重要的权衡：虽然AI扩大了谁可以贡献以及他们可以贡献多少，但它减缓了集体开发努力中的协调。尽管存在这种紧张关系，这两种竞争力量的综合效果仍然是积极的，表明使用AI对编程伙伴可以获得项目级别生产率的净增益。有趣的是，我们还发现这些效果在开发者角色之间存在差异。外围开发者在项目级别代码贡献方面显示出相对较小的增益，并且面临着比核心开发者更高的协调时间增加，这可能是由于他们项目熟悉度的差异。总之，我们的研究强调了AI对编程伙伴对OSS开发中项目级别代码贡献和协调时间的双重作用。我们对核心和外围开发者之间差异效应的发现也为OSS社区的结构提供了重要启示。

更新时间: 2025-07-08 17:44:42

领域: cs.SE,cs.AI,cs.HC,econ.GN,q-fin.EC

下载: http://arxiv.org/abs/2410.02091v2

Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA

The Circle of Willis (CoW) is an important network of arteries connecting major circulations of the brain. Its vascular architecture is believed to affect the risk, severity, and clinical outcome of serious neurovascular diseases. However, characterizing the highly variable CoW anatomy is still a manual and time-consuming expert task. The CoW is usually imaged by two non-invasive angiographic imaging modalities, magnetic resonance angiography (MRA) and computed tomography angiography (CTA), but there exist limited datasets with annotations on CoW anatomy, especially for CTA. Therefore, we organized the TopCoW challenge with the release of an annotated CoW dataset. The TopCoW dataset is the first public dataset with voxel-level annotations for 13 CoW vessel components, enabled by virtual reality technology. It is also the first large dataset using 200 pairs of MRA and CTA from the same patients. As part of the benchmark, we invited submissions worldwide and attracted over 250 registered participants from six continents. The submissions were evaluated on both internal and external test datasets of 226 scans from over five centers. The top performing teams achieved over 90% Dice scores at segmenting the CoW components, over 80% F1 scores at detecting key CoW components, and over 70% balanced accuracy at classifying CoW variants for nearly all test sets. The best algorithms also showed clinical potential in classifying fetal-type posterior cerebral artery and locating aneurysms with CoW anatomy. TopCoW demonstrated the utility and versatility of CoW segmentation algorithms for a wide range of downstream clinical applications with explainability. The annotated datasets and best performing algorithms have been released as public Zenodo records to foster further methodological development and clinical tool building.

Updated: 2025-07-08 17:43:30

标题: 使用TopCoW挑战来进行CoW的基准测试：针对CTA和MRA的具有拓扑意识的Willis环解剖分割

摘要: Circle of Willis（CoW）是连接大脑主要循环的重要动脉网络。其血管结构被认为会影响严重神经血管疾病的风险、严重程度和临床结局。然而，表征高度变化的CoW解剖结构仍然是一个需要手动且耗时的专家任务。CoW通常通过两种非侵入性血管造影成像技术，磁共振血管造影（MRA）和计算机断层血管造影（CTA）来成像，但是对CoW解剖结构的带有注释的数据集非常有限，特别是对于CTA。因此，我们组织了TopCoW挑战赛，并发布了一个带有注释的CoW数据集。TopCoW数据集是第一个具有13个CoW血管组分的像素级注释的公共数据集，借助虚拟现实技术实现。它也是第一个使用来自同一患者的200对MRA和CTA的大型数据集。作为基准测试的一部分，我们邀请全球提交，并吸引了来自六大洲的超过250名注册参与者。提交内容在来自五个以上中心的226个扫描的内部和外部测试数据集上进行评估。表现最好的团队在分割CoW组分方面实现了超过90%的Dice分数，在检测关键CoW组分方面实现了超过80%的F1分数，在分类CoW变种方面实现了超过70%的平衡精度。最佳算法还显示了在分类胎儿型后大脑动脉和定位具有CoW解剖结构的动脉瘤方面的临床潜力。TopCoW展示了CoW分割算法在各种下游临床应用中的实用性和多功能性，具有可解释性。已发布带有注释的数据集和表现最佳的算法作为公共Zenodo记录，以促进进一步方法论发展和临床工具构建。

更新时间: 2025-07-08 17:43:30

领域: cs.CV,cs.LG,q-bio.QM,q-bio.TO

下载: http://arxiv.org/abs/2312.17670v4

Modern Methods in Associative Memory

Associative Memories like the famous Hopfield Networks are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities, and their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical hands-on mathematical derivations and coding notebooks.

Updated: 2025-07-08 17:40:39

标题: 现代关联记忆方法

摘要: 关联记忆，如著名的霍普菲尔德网络，是描述完全循环神经网络的优雅模型，其基本任务是存储和检索信息。在过去几年中，由于与其信息存储能力相关的新颖理论结果以及与SOTA人工智能架构（如变压器和扩散模型）的关系，它们引起了人们的极大兴趣。这些联系打开了通过关联记忆的理论视角来解释传统人工智能网络计算的可能性。此外，这些网络的新型拉格朗日公式使得设计强大的分布式模型成为可能，这些模型学习到有用的表示，并为新型架构的设计提供信息。本教程提供了一种易于理解的关联记忆介绍，强调在这个研究领域使用的现代语言和方法，同时提供实际的数学推导和编码笔记本。

更新时间: 2025-07-08 17:40:39

领域: cs.LG

下载: http://arxiv.org/abs/2507.06211v1

EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

We propose EEG2TEXT-CN, which, to the best of our knowledge, represents one of the earliest open-vocabulary EEG-to-text generation frameworks tailored for Chinese. Built on a biologically grounded EEG encoder (NICE-EEG) and a compact pretrained language model (MiniLM), our architecture aligns multichannel brain signals with natural language representations via masked pretraining and contrastive learning. Using a subset of the ChineseEEG dataset, where each sentence contains approximately ten Chinese characters aligned with 128-channel EEG recorded at 256 Hz, we segment EEG into per-character embeddings and predict full sentences in a zero-shot setting. The decoder is trained with teacher forcing and padding masks to accommodate variable-length sequences. Evaluation on over 1,500 training-validation sentences and 300 held-out test samples shows promising lexical alignment, with a best BLEU-1 score of 6.38\%. While syntactic fluency remains a challenge, our findings demonstrate the feasibility of non-phonetic, cross-modal language decoding from EEG. This work opens a new direction in multilingual brain-to-text research and lays the foundation for future cognitive-language interfaces in Chinese.

Updated: 2025-07-08 17:34:10

标题: EEG2TEXT-CN：基于大型语言模型和对比学习的开放词汇中文文本-EEG对齐的探索性研究

摘要: 我们提出了EEG2TEXT-CN，据我们所知，这是一种为中国人定制的最早的开放词汇量EEG到文本生成框架之一。建立在一个生物学基础的EEG编码器（NICE-EEG）和一个紧凑的预训练语言模型（MiniLM）上，我们的架构通过掩码预训练和对比学习将多通道脑信号与自然语言表示对齐。使用ChineseEEG数据集的一个子集，在该数据集中，每个句子包含大约十个与以256 Hz记录的128通道EEG对齐的中文字符，我们将EEG分割为每个字符的嵌入，并在零-shot设置中预测完整的句子。解码器通过教师强迫和填充掩码进行训练，以适应可变长度序列。在超过1,500个训练验证句子和300个保留测试样本上的评估显示出令人鼓舞的词汇对齐，最佳BLEU-1分数为6.38\%。尽管句法流畅性仍然是一个挑战，但我们的发现证明了从EEG进行非语音跨模式语言解码的可行性。这项工作开辟了多语种脑到文本研究的新方向，并为未来的中文认知语言界面奠定了基础。

更新时间: 2025-07-08 17:34:10

领域: cs.CL,cs.AI,cs.LG,cs.MM,q-bio.NC

下载: http://arxiv.org/abs/2506.00854v3

Differential Mamba

Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.

Updated: 2025-07-08 17:30:14

标题: 差异性曼巴

摘要: 像Transformer和RNN这样的序列模型经常会过度分配注意力到无关上下文，导致嘈杂的中间表示。这会通过促进幻觉、削弱远程和检索能力，降低鲁棒性来降低LLM的能力。最近的研究表明，差分设计可以缓解Transformer中的这一问题，提高它们在各种应用中的效果。在本文中，我们探讨了这些技术是否可以应用于Mamba，这是一个基于选择性状态空间层的最新架构，可以以更高效的方式实现Transformer级别的性能。我们发现，将差分设计简单地应用到Mamba并不足够，需要仔细的架构修改。为了解决这个问题，我们引入了一个新颖的差分机制用于Mamba，在语言建模基准测试中经验验证，展示了改进的检索能力和优于普通Mamba的性能。最后，我们进行了广泛的消融研究和实证分析，以证明我们的设计选择，并提供证据表明我们的方法有效缓解了基于Mamba的模型中的过度分配问题。我们的代码是公开可用的。

更新时间: 2025-07-08 17:30:14

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.06204v1

Efficient Implementation of Gaussian Process Regression Accelerated Saddle Point Searches with Application to Molecular Reactions

The task of locating first order saddle points on high-dimensional surfaces describing the variation of energy as a function of atomic coordinates is an essential step for identifying the mechanism and estimating the rate of thermally activated events within the harmonic approximation of transition state theory. When combined directly with electronic structure calculations, the number of energy and atomic force evaluations needed for convergence is a primary issue. Here, we describe an efficient implementation of Gaussian process regression (GPR) acceleration of the minimum mode following method where a dimer is used to estimate the lowest eigenmode of the Hessian. A surrogate energy surface is constructed and updated after each electronic structure calculation. The method is applied to a test set of 500 molecular reactions previously generated by Hermez and coworkers [J. Chem. Theory Comput. 18, 6974 (2022)]. An order of magnitude reduction in the number of electronic structure calculations needed to reach the saddle point configurations is obtained by using the GPR compared to the dimer method. Despite the wide range in stiffness of the molecular degrees of freedom, the calculations are carried out using Cartesian coordinates and are found to require similar number of electronic structure calculations as an elaborate internal coordinate method implemented in the Sella software package. The present implementation of the GPR surrogate model in C++ is efficient enough for the wall time of the saddle point searches to be reduced in 3 out of 4 cases even though the calculations are carried out at a low Hartree-Fock level.

Updated: 2025-07-08 17:27:55

标题: 高效实现高斯过程回归加速鞍点搜索并应用于分子反应

摘要: 在描述能量变化的高维表面上定位一阶鞍点的任务是识别机制并估计在过渡态理论谐波逼近中热激活事件速率的基本步骤。当直接与电子结构计算结合时，能量和原子力评估所需的数量对于收敛是一个主要问题。在这里，我们描述了高斯过程回归（GPR）加速最小模式跟踪方法的有效实现，其中使用二聚体来估计Hessian的最低特征模式。在每次电子结构计算后构建并更新一个替代能量表面。该方法应用于Hermez及其同事先前生成的500个分子反应的测试集[J. Chem. Theory Comput. 18, 6974（2022）]。与二聚体方法相比，使用GPR可使达到鞍点构型所需的电子结构计算数量减少一个数量级。尽管分子自由度的刚度范围很广，但计算使用笛卡尔坐标进行，并发现需要的电子结构计算数量与Sella软件包中实现的复杂内坐标方法相似。C++中对GPR替代模型的当前实现足够高效，可以将寻找鞍点的墙时降低3个案例中的4个，即使计算是在低Hartree-Fock级别进行的。

更新时间: 2025-07-08 17:27:55

领域: physics.chem-ph,cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2505.12519v2

Dynamic Context-Aware Prompt Recommendation for Domain-Specific AI Applications

LLM-powered applications are highly susceptible to the quality of user prompts, and crafting high-quality prompts can often be challenging especially for domain-specific applications. This paper presents a novel dynamic context-aware prompt recommendation system for domain-specific AI applications. Our solution combines contextual query analysis, retrieval-augmented knowledge grounding, hierarchical skill organization, and adaptive skill ranking to generate relevant and actionable prompt suggestions. The system leverages behavioral telemetry and a two-stage hierarchical reasoning process to dynamically select and rank relevant skills, and synthesizes prompts using both predefined and adaptive templates enhanced with few-shot learning. Experiments on real-world datasets demonstrate that our approach achieves high usefulness and relevance, as validated by both automated and expert evaluations.

Updated: 2025-07-08 17:25:34

标题: 领域特定AI应用的动态上下文感知提示推荐

摘要: 基于LLM的应用程序对用户提示的质量非常敏感，创建高质量的提示通常很具有挑战性，尤其是对于特定领域的应用程序。本文介绍了一种新颖的动态上下文感知提示推荐系统，适用于特定领域的人工智能应用程序。我们的解决方案结合了上下文查询分析、检索增强知识基础、分层技能组织和自适应技能排名，以生成相关且可操作的提示建议。该系统利用行为遥测和两阶段分层推理过程动态选择和排名相关的技能，并使用预定义和增强的少样本学习模板合成提示。在真实数据集上的实验表明，我们的方法在自动化和专家评估中都取得了高度的实用性和相关性。

更新时间: 2025-07-08 17:25:34

领域: cs.AI

下载: http://arxiv.org/abs/2506.20815v2

UQLM: A Python Package for Uncertainty Quantification in Large Language Models

Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.

Updated: 2025-07-08 17:22:32

标题: UQLM：用于大型语言模型不确定性量化的Python软件包

摘要: 幻觉被定义为大型语言模型（LLMs）生成虚假或误导性内容的情况，这构成了一个重要挑战，影响了下游应用的安全性和信任度。我们介绍了UQLM，一个使用最先进的不确定性量化（UQ）技术进行LLM幻觉检测的Python包。该工具包提供了一套基于UQ的评分器，计算响应级别的置信度分数，范围从0到1。这个库提供了一个现成的解决方案，用于基于UQ的幻觉检测，可以轻松集成，以提高LLM输出的可靠性。

更新时间: 2025-07-08 17:22:32

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.06196v1

SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads

Database research and development often require a large number of SQL queries for benchmarking purposes. However, acquiring real-world SQL queries is challenging due to privacy concerns, and existing SQL generation methods are limited in customization and in satisfying realistic constraints. To address this issue, we present SQLBarber, a system based on Large Language Models (LLMs) to generate customized and realistic SQL workloads. SQLBarber (i) eliminates the need for users to manually craft SQL templates in advance, while providing the flexibility to accept natural language specifications to constrain SQL templates, (ii) scales efficiently to generate large volumes of queries matching any user-defined cost distribution (e.g., cardinality and execution plan cost), and (iii) uses execution statistics from Amazon Redshift and Snowflake to derive SQL template specifications and query cost distributions that reflect real-world query characteristics. SQLBarber introduces (i) a declarative interface for users to effortlessly generate customized SQL templates, (ii) an LLM-powered pipeline augmented with a self-correction module that profiles, refines, and prunes SQL templates based on query costs, and (iii) a Bayesian Optimizer to efficiently explore different predicate values and identify a set of queries that satisfy the target cost distribution. We construct and open-source ten benchmarks of varying difficulty levels and target query cost distributions based on real-world statistics from Snowflake and Amazon Redshift. Extensive experiments on these benchmarks show that SQLBarber is the only system that can generate customized SQL templates. It reduces query generation time by one to three orders of magnitude, and significantly improves alignment with the target cost distribution, compared with existing methods.

Updated: 2025-07-08 17:20:34

标题: SQL理发师：利用大型语言模型生成定制和逼真的SQL工作负载的系统

摘要: 数据库研究和开发通常需要大量的SQL查询来进行基准测试。然而，由于隐私问题，获取真实世界的SQL查询是具有挑战性的，而现有的SQL生成方法在定制性和满足现实约束方面存在限制。为了解决这个问题，我们提出了SQLBarber，这是一个基于大型语言模型（LLMs）的系统，用于生成定制和真实的SQL工作负载。SQLBarber (i) 消除了用户需要预先手工制作SQL模板的需求，同时提供了接受自然语言规范以限制SQL模板的灵活性，(ii) 能够高效扩展以生成符合任何用户定义成本分布（例如基数和执行计划成本）的大量查询，(iii) 使用来自Amazon Redshift和Snowflake的执行统计数据来推导反映真实世界查询特征的SQL模板规范和查询成本分布。SQLBarber引入了 (i) 一个声明性接口，让用户轻松生成定制的SQL模板，(ii) 一个由LLM驱动的管道，增加了一个自我校正模块，根据查询成本对SQL模板进行概要、细化和修剪，(iii) 一个贝叶斯优化器，能够高效地探索不同谓词值，并确定满足目标成本分布的一组查询。我们基于来自Snowflake和Amazon Redshift的真实世界统计数据构建并开源了十个不同难度级别和目标查询成本分布的基准测试。对这些基准测试的大量实验表明，SQLBarber是唯一能够生成定制SQL模板的系统。与现有方法相比，它将查询生成时间缩短了一到三个数量级，并显著提高了与目标成本分布的对齐性。

更新时间: 2025-07-08 17:20:34

领域: cs.DB,cs.AI,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.06192v1

Conservative approximation-based feedforward neural network for WENO schemes

In this work, we present the feedforward neural network based on the conservative approximation to the derivative from point values, for the weighted essentially non-oscillatory (WENO) schemes in solving hyperbolic conservation laws. The feedforward neural network, whose inputs are point values from the three-point stencil and outputs are two nonlinear weights, takes the place of the classical WENO weighting procedure. For the training phase, we employ the supervised learning and create a new labeled dataset for one-dimensional conservative approximation, where we construct a numerical flux function from the given point values such that the flux difference approximates the derivative to high-order accuracy. The symmetric-balancing term is introduced for the loss function so that it propels the neural network to match the conservative approximation to the derivative and satisfy the symmetric property that WENO3-JS and WENO3-Z have in common. The consequent WENO schemes, WENO3-CADNNs, demonstrate robust generalization across various benchmark scenarios and resolutions, where they outperform WENO3-Z and achieve accuracy comparable to WENO5-JS.

Updated: 2025-07-08 17:19:48

标题: 基于保守逼近的前馈神经网络用于WENO方案

摘要: 在这项工作中，我们提出了基于保守近似导数的前馈神经网络，用于求解双曲守恒定律中的加权本质非振荡（WENO）方案。这个前馈神经网络的输入是来自三点模板的点值，输出是两个非线性权重，取代了经典的WENO加权过程。在训练阶段，我们采用监督学习，并为一维保守近似创建了一个新的标记数据集，从给定的点值中构建数值通量函数，使通量差近似于高阶精度的导数。对称平衡项被引入到损失函数中，以推动神经网络匹配保守近似导数，并满足WENO3-JS和WENO3-Z共同具有的对称特性。由此产生的WENO方案，WENO3-CADNNs，在各种基准场景和分辨率下展现出强大的泛化能力，它们优于WENO3-Z并实现了与WENO5-JS相当的准确性。

更新时间: 2025-07-08 17:19:48

领域: math.NA,cs.LG,cs.NA,65M06, 68T07

下载: http://arxiv.org/abs/2507.06190v1

DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation

This paper presents our submission to Task 1, Subjectivity Detection, of the CheckThat! Lab at CLEF 2025. We investigate the effectiveness of transfer-learning and stylistic data augmentation to improve classification of subjective and objective sentences in English news text. Our approach contrasts fine-tuning of pre-trained encoders and transfer-learning of fine-tuned transformer on related tasks. We also introduce a controlled augmentation pipeline using GPT-4o to generate paraphrases in predefined subjectivity styles. To ensure label and style consistency, we employ the same model to correct and refine the generated samples. Results show that transfer-learning of specified encoders outperforms fine-tuning general-purpose ones, and that carefully curated augmentation significantly enhances model robustness, especially in detecting subjective content. Our official submission placed us $16^{th}$ of 24 participants. Overall, our findings underscore the value of combining encoder specialization with label-consistent augmentation for improved subjectivity detection. Our code is available at https://github.com/dsgt-arc/checkthat-2025-subject.

Updated: 2025-07-08 17:18:50

标题: DS@GT在CheckThat！2025：通过迁移学习和纠正数据增强检测主观性

摘要: 本文介绍我们在CLEF 2025年CheckThat！Lab的Task 1中提交的研究成果，即主观性检测。我们研究了迁移学习和文体数据增强的有效性，以提高对英语新闻文本中主观和客观句子的分类。我们的方法对比了对预训练编码器进行微调和在相关任务上进行迁移学习的细化转换器。我们还引入了一个受控增强管道，使用GPT-4o生成预定义主观性风格的释义。为了确保标签和风格一致性，我们使用相同的模型来纠正和完善生成的样本。结果表明，指定编码器的迁移学习优于微调通用编码器，并且精心策划的增强显著增强了模型的鲁棒性，特别是在检测主观内容方面。我们的官方提交使我们在24个参与者中排名第16。总的来说，我们的发现强调了将编码器专业化与标签一致增强相结合，以改善主观性检测的价值。我们的代码可在https://github.com/dsgt-arc/checkthat-2025-subject 上找到。

更新时间: 2025-07-08 17:18:50

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.06189v1

The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains

Improvements in language models are often driven by improving the quality of the data we train them on, which can be limiting when strong supervision is scarce. In this work, we show that paired preference data consisting of individually weak data points can enable gains beyond the strength of each individual data point. We formulate the delta learning hypothesis to explain this phenomenon, positing that the relative quality delta between points suffices to drive learning via preference tuning--even when supervised finetuning on the weak data hurts. We validate our hypothesis in controlled experiments and at scale, where we post-train 8B models on preference data generated by pairing a small 3B model's responses with outputs from an even smaller 1.5B model to create a meaningful delta. Strikingly, on a standard 11-benchmark evaluation suite (MATH, MMLU, etc.), our simple recipe matches the performance of Tulu 3, a state-of-the-art open model tuned from the same base model while relying on much stronger supervisors (e.g., GPT-4o). Thus, delta learning enables simpler and cheaper open recipes for state-of-the-art post-training. To better understand delta learning, we prove in logistic regression that the performance gap between two weak teacher models provides useful signal for improving a stronger student. Overall, our work shows that models can learn surprisingly well from paired data that might typically be considered weak.

Updated: 2025-07-08 17:14:44

标题: “三角洲学习假设：在弱数据上进行偏好调整可以获得巨大收益”

摘要: 语言模型的改进通常是通过改进我们训练模型的数据质量来推动的，当强监督稀缺时，这可能具有限制性。在这项工作中，我们展示了由个别弱数据点组成的成对偏好数据可以实现超越每个个别数据点强度的收益。我们提出了“增量学习假设”来解释这一现象，认为点之间的相对质量增量足以通过偏好调整驱动学习--即使在弱数据上进行监督微调会带来负面影响。我们在受控实验和规模上验证了我们的假设，我们使用3B模型的响应与更小的1.5B模型的输出进行配对生成偏好数据，进而在8B模型上进行后训练以创造有意义的增量。引人注目的是，在标准的11个基准评估套件（MATH，MMLU等）上，我们的简单方法与Tulu 3的表现相匹配，后者是从相同基础模型微调而来的最先进的开放模型，而且依赖于更强大的监督者（例如，GPT-4o）。因此，增量学习使得更简单、更便宜的开放式后训练配方成为可能。为了更好地理解增量学习，我们在逻辑回归中证明了两个弱教师模型之间的性能差距为改进更强学生提供了有用信号。总的来说，我们的工作表明，模型可以从通常被认为是弱的配对数据中学习得出令人惊讶的好结果。

更新时间: 2025-07-08 17:14:44

领域: cs.AI

下载: http://arxiv.org/abs/2507.06187v1

Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review

In July 2025, 18 academic manuscripts on the preprint website arXiv were found to contain hidden instructions known as prompts designed to manipulate AI-assisted peer review. Instructions such as "GIVE A POSITIVE REVIEW ONLY" were concealed using techniques like white-colored text. Author responses varied: one planned to withdraw the affected paper, while another defended the practice as legitimate testing of reviewer compliance. This commentary analyzes this practice as a novel form of research misconduct. We examine the technique of prompt injection in large language models (LLMs), revealing four types of hidden prompts, ranging from simple positive review commands to detailed evaluation frameworks. The defense that prompts served as "honeypots" to detect reviewers improperly using AI fails under examination--the consistently self-serving nature of prompt instructions indicates intent to manipulate. Publishers maintain inconsistent policies: Elsevier prohibits AI use in peer review entirely, while Springer Nature permits limited use with disclosure requirements. The incident exposes systematic vulnerabilities extending beyond peer review to any automated system processing scholarly texts, including plagiarism detection and citation indexing. Our analysis underscores the need for coordinated technical screening at submission portals and harmonized policies governing generative AI (GenAI) use in academic evaluation.

Updated: 2025-07-08 17:11:13

标题: 手稿中的隐藏提示利用人工智能辅助同行评审

摘要: 2025年7月，发现了18篇学术手稿在预印本网站arXiv上含有隐藏指令，这些指令被称为提示，旨在操纵AI辅助的同行评审。指令如“仅给予积极评价”之类的被隐藏在白色文字等技术中。作者的回应各不相同：一位计划撤回受影响的论文，而另一位则将这一做法辩解为对评审者遵从性的合法测试。本评论将这种做法分析为一种新型的研究不端行为。我们研究了在大型语言模型（LLMs）中注入提示的技术，揭示了四种类型的隐藏提示，从简单的积极评价命令到详细的评估框架不等。声称提示用作“蜜罐”以检测评审者不当使用AI的辩护在检验下失败--提示指令的一贯自私性表明了操纵的意图。出版商的政策不一致：Elsevier完全禁止在同行评审中使用AI，而Springer Nature允许有限的使用，并要求披露。该事件暴露了系统性的漏洞，不仅仅限于同行评审，还包括处理学术文本的任何自动化系统，包括抄袭检测和引文索引。我们的分析强调了需要在投稿门户进行协调技术筛查和统一管理生成式AI（GenAI）在学术评价中的使用的政策。

更新时间: 2025-07-08 17:11:13

领域: cs.CY,cs.AI,cs.CL,cs.HC

下载: http://arxiv.org/abs/2507.06185v1

MedGemma Technical Report

Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.

Updated: 2025-07-08 17:08:06

标题: MedGemma技术报告

摘要: 人工智能（AI）在医疗应用中具有巨大的潜力，但由于医疗数据的多样性、任务的复杂性以及保护隐私的需求，其训练和部署面临挑战。在医疗任务上表现良好且需要较少特定任务调整数据的基础模型对加速医疗AI应用的发展至关重要。我们介绍了MedGemma，这是基于Gemma 3 4B和27B的医疗视觉语言基础模型集合。MedGemma在图像和文本上展示了先进的医学理解和推理能力，明显超过了类似规模的生成模型的性能，并接近任务特定模型的性能，同时保持了Gemma 3基础模型的通用能力。对于超出分布任务，MedGemma在医学多模态问题回答上实现了2.6-10%的改进，在胸部X射线发现分类上实现了15.5-18.1%的改进，并且在代理评估上比基础模型提高了10.8%。对MedGemma进行微调进一步改善了亚领域的性能，将电子病历信息检索中的错误率降低了50%，并达到了与现有专门的最先进方法相媲美的效果，如气胸分类和组织病理学斑块分类。此外，我们还介绍了MedSigLIP，这是从SigLIP衍生的医学调整的视觉编码器。MedSigLIP增强了MedGemma的视觉理解能力，作为编码器实现了与专门的医学图像编码器相媲美或更好的性能。总的来说，MedGemma集合提供了强大的医学图像和文本能力基础，有潜力显著加速医学研究和下游应用的发展。MedGemma集合，包括教程和模型权重，可以在https://goo.gle/medgemma找到。

更新时间: 2025-07-08 17:08:06

领域: cs.AI,cs.CL,cs.CV

下载: http://arxiv.org/abs/2507.05201v2

Quantum Computing and Cybersecurity in Accounting and Finance: Current and Future Challenges and the Opportunities for Securing Accounting and Finance Systems in the Post-Quantum World

Quantum computing is transforming the world profoundly, affecting businesses, organisations, technologies, and human beings' information systems, and will have a profound impact on accounting and finance, particularly in the realm of cybersecurity. It presents both opportunities and risks in ensuring confidentiality and protecting financial data. The purpose of this article is to show the application of quantum technologies in accounting cybersecurity, utilising quantum algorithms and QKD to overcome the limitations of classical computing. The literature review reveals the vulnerabilities of the current accounting cybersecurity to quantum attacks and the need for quantum-resistant cryptographic mechanisms. It elaborates on the risks associated with conventional encryption in the context of quantum capabilities. This study contributes to the understanding of how quantum computing can transform accounting cybersecurity by enhancing quantum-resistant algorithms and using QKD in accounting. The study employs PSALSAR systematic review methodology to ensure rigour and depth. The analysis shows that quantum computing enhances encryption techniques to superior possibilities than classical ones. Using quantum technologies in accounting minimises data breaches and unauthorised access. The study concludes that quantum-resistant algorithms and quantum key distribution (QKD) are necessary for securing the accounting and finance systems of the future. Keywords Quantum Computing, Cybersecurity, Accounting, Machine Learning, Artificial Intelligence, Quantum Key Distribution, Operations Management

Updated: 2025-07-08 17:04:56

标题: 量子计算和会计与金融领域的网络安全：当前和未来的挑战以及在后量子世界中保护会计与金融系统的机遇

摘要: 量子计算正在深刻改变世界，影响着企业、组织、技术和人类信息系统，并将在会计和金融领域产生深远影响，特别是在网络安全方面。它在确保机密性和保护财务数据方面既提供机遇又带来风险。本文的目的是展示量子技术在会计网络安全中的应用，利用量子算法和量子密钥分发来克服经典计算的局限性。文献综述揭示了当前会计网络安全面临的量子攻击漏洞，以及对量子抗攻击加密机制的需求。它详细阐述了在量子能力背景下传统加密所带来的风险。这项研究有助于理解量子计算如何通过增强量子抗攻击算法和在会计中使用量子密钥分发来改变会计网络安全。该研究采用了PSALSAR系统综述方法来确保严谨和深度。分析显示，量子计算提升了加密技术的优越可能性，超过了经典方法。在会计中使用量子技术可以最大程度地减少数据泄露和未经授权的访问。研究得出结论，量子抗攻击算法和量子密钥分发(QKD)对未来会计和金融系统的安全至关重要。关键词：量子计算、网络安全、会计、机器学习、人工智能、量子密钥分发、运营管理

更新时间: 2025-07-08 17:04:56

领域: cs.CR,cs.ET

下载: http://arxiv.org/abs/2506.12096v2

Fast Bilateral Teleoperation and Imitation Learning Using Sensorless Force Control via Accurate Dynamics Model

In recent years, the advancement of imitation learning has led to increased interest in teleoperating low-cost manipulators to collect demonstration data. However, most existing systems rely on unilateral control, which only transmits target position values. While this approach is easy to implement and suitable for slow, non-contact tasks, it struggles with fast or contact-rich operations due to the absence of force feedback. This work demonstrates that fast teleoperation with force feedback is feasible even with force-sensorless, low-cost manipulators by leveraging 4-channel bilateral control. Based on accurately identified manipulator dynamics, our method integrates nonlinear terms compensation, velocity and external force estimation, and variable gain corresponding to inertial variation. Furthermore, using data collected by 4-channel bilateral control, we show that incorporating force information into both the input and output of learned policies improves performance in imitation learning. These results highlight the practical effectiveness of our system for high-fidelity teleoperation and data collection on affordable hardware.

Updated: 2025-07-08 16:54:34

标题: 快速双边远程操作和模仿学习，利用无传感器力控制通过精确动力学模型

摘要: 近年来，模仿学习的进展导致了对远程操作低成本机械手收集示范数据的兴趣增加。然而，大多数现有系统依赖于单边控制，只传输目标位置数值。虽然这种方法易于实现并适用于缓慢、非接触性任务，但由于缺乏力反馈，它在快速或接触丰富的操作中存在困难。本文证明了即使在没有力传感器、低成本机械手中，通过利用4通道双边控制，可以实现具有力反馈的快速远程操作。基于准确识别的机械手动力学，我们的方法整合了非线性项补偿、速度和外力估计，以及与惯性变化相对应的可变增益。此外，利用4通道双边控制收集的数据，我们展示了将力信息整合到学习策略的输入和输出中可以提高模仿学习的性能。这些结果突显了我们系统在高保真度远程操作和在可负担得起的硬件上收集数据方面的实际效果。

更新时间: 2025-07-08 16:54:34

领域: cs.RO,cs.AI,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.06174v1

cuVSLAM: CUDA accelerated visual odometry and mapping

Accurate and robust pose estimation is a key requirement for any autonomous robot. We present cuVSLAM, a state-of-the-art solution for visual simultaneous localization and mapping, which can operate with a variety of visual-inertial sensor suites, including multiple RGB and depth cameras, and inertial measurement units. cuVSLAM supports operation with as few as one RGB camera to as many as 32 cameras, in arbitrary geometric configurations, thus supporting a wide range of robotic setups. cuVSLAM is specifically optimized using CUDA to deploy in real-time applications with minimal computational overhead on edge-computing devices such as the NVIDIA Jetson. We present the design and implementation of cuVSLAM, example use cases, and empirical results on several state-of-the-art benchmarks demonstrating the best-in-class performance of cuVSLAM.

Updated: 2025-07-08 16:53:53

标题: cuVSLAM：CUDA加速的视觉里程计和地图绘制

摘要: 精确和稳健的姿态估计是任何自主机器人的关键要求。我们提出了cuVSLAM，这是一种用于视觉同时定位和地图绘制的最先进解决方案，可以与各种视觉惯性传感器套件配合使用，包括多个RGB和深度摄像头，以及惯性测量单元。cuVSLAM支持使用至少一个RGB摄像头到多达32个摄像头，在任意几何配置下运行，从而支持各种机器人设置。cuVSLAM专门使用CUDA进行优化，以在边缘计算设备（如NVIDIA Jetson）上部署实时应用程序，计算开销最小。我们介绍了cuVSLAM的设计和实施，示例用例，以及在几个最先进的基准测试中的实证结果，展示了cuVSLAM的最佳性能。

更新时间: 2025-07-08 16:53:53

领域: cs.RO,cs.AI,cs.SE

下载: http://arxiv.org/abs/2506.04359v3

GuiderNet: A Meta-Learning Framework for Optimizing Quantum Circuit Geometry and Mitigating Barren Plateaus

Variational Quantum Algorithms (VQAs) offer potential for near-term quantum advantage but face challenges from barren plateaus, where gradients vanish, and poorly conditioned optimization landscapes. We introduce GuiderNet, a meta-learning framework that conditions Parameterized Quantum Circuits (PQCs) using data-dependent parameter shifts aimed at minimizing the log condition number of the Fubini-Study metric tensor. Implemented as a classical neural network, GuiderNet is meta-trained to guide PQC parameters into geometrically favorable regions and is embedded within hybrid quantum-classical pipelines to steer both initialization and adaptive modulation during training. Applied to the Kaggle Diabetes classification task, GuiderNet reduces cumulative training loss by over 5x, improves test accuracy from 75.3% to 98.6%, and increases the minority-class F1 score from 0.67 to 0.95. It also suppresses gradient explosion and stabilizes parameter updates, enabling smoother and more robust optimization. These results demonstrate that geometric meta-conditioning can mitigate barren plateaus and ill-conditioning, providing a scalable approach to enhance trainability and generalization in quantum machine learning.

Updated: 2025-07-08 16:53:45

标题: GuiderNet：用于优化量子电路几何和减轻荒原高原的元学习框架

摘要: 变分量子算法（VQAs）为近期量子优势提供了潜力，但面临着从平坦高原到梯度消失和优化景观糟糕的挑战。我们引入了GuiderNet，一个元学习框架，通过数据相关的参数变化来调节参数化量子电路（PQCs），旨在最小化Fubini-Study度量张量的对数条件数。作为一个经典神经网络实现，GuiderNet被元训练用于引导PQC参数进入几何优势区域，并嵌入在混合量子-经典管道中，以在训练过程中引导初始化和自适应调制。应用于Kaggle糖尿病分类任务，GuiderNet将累积训练损失降低了5倍以上，将测试准确度从75.3%提高到98.6%，并将少数类F1分数从0.67提高到0.95。它还抑制了梯度爆炸和稳定了参数更新，使优化更加平滑和稳健。这些结果表明，几何元条件化可以缓解平坦高原和病态条件，提供了一种可扩展的方法来增强量子机器学习中的可训练性和泛化性。

更新时间: 2025-07-08 16:53:45

领域: cs.LG

下载: http://arxiv.org/abs/2506.21940v2

A Method for Optimizing Connections in Differentiable Logic Gate Networks

We introduce a novel method for partial optimization of the connections in Deep Differentiable Logic Gate Networks (LGNs). Our training method utilizes a probability distribution over a subset of connections per gate input, selecting the connection with highest merit, after which the gate-types are selected. We show that the connection-optimized LGNs outperform standard fixed-connection LGNs on the Yin-Yang, MNIST and Fashion-MNIST benchmarks, while requiring only a fraction of the number of logic gates. When training all connections, we demonstrate that 8000 simple logic gates are sufficient to achieve over 98% on the MNIST data set. Additionally, we show that our network has 24 times fewer gates, while performing better on the MNIST data set compared to standard fully connected LGNs. As such, our work shows a pathway towards fully trainable Boolean logic.

Updated: 2025-07-08 16:53:39

标题: 一个优化在可微逻辑门网络中连接的方法

摘要: 我们介绍了一种新颖的方法，用于部分优化深度可微逻辑门网络（LGNs）中的连接。我们的训练方法利用每个门输入连接的子集上的概率分布，选择具有最高优势的连接，然后选择门类型。我们展示了连接优化的LGNs在Yin-Yang、MNIST和Fashion-MNIST基准测试上优于标准固定连接的LGNs，同时仅需要少量逻辑门数量。当训练所有连接时，我们展示了8000个简单逻辑门足以在MNIST数据集上实现超过98%的准确率。此外，我们展示了相比于标准全连接的LGNs，在MNIST数据集上表现更好的网络，同时门数量减少了24倍。因此，我们的工作展示了通向完全可训练布尔逻辑的途径。

更新时间: 2025-07-08 16:53:39

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.06173v1

Critical Nodes Identification in Complex Networks: A Survey

Complex networks have become essential tools for understanding diverse phenomena in social systems, traffic systems, biomolecular systems, and financial systems. Identifying critical nodes is a central theme in contemporary research, serving as a vital bridge between theoretical foundations and practical applications. Nevertheless, the intrinsic complexity and structural heterogeneity characterizing real-world networks, with particular emphasis on dynamic and higher-order networks, present substantial obstacles to the development of universal frameworks for critical node identification. This paper provides a comprehensive review of critical node identification techniques, categorizing them into seven main classes: centrality, critical nodes deletion problem, influence maximization, network control, artificial intelligence, higher-order and dynamic methods. Our review bridges the gaps in existing surveys by systematically classifying methods based on their methodological foundations and practical implications, and by highlighting their strengths, limitations, and applicability across different network types. Our work enhances the understanding of critical node research by identifying key challenges, such as algorithmic universality, real-time evaluation in dynamic networks, analysis of higher-order structures, and computational efficiency in large-scale networks. The structured synthesis consolidates current progress and highlights open questions, particularly in modeling temporal dynamics, advancing efficient algorithms, integrating machine learning approaches, and developing scalable and interpretable metrics for complex systems.

Updated: 2025-07-08 16:45:48

标题: 复杂网络中的关键节点识别：一项调查

摘要: 复杂网络已成为理解社会系统、交通系统、生物分子系统和金融系统中各种现象的基本工具。确定关键节点是当代研究的核心主题，它作为理论基础和实际应用之间的重要桥梁。然而，现实世界网络的内在复杂性和结构异质性，特别是动态和高阶网络，给关键节点识别的通用框架发展带来了实质性障碍。本文全面审视了关键节点识别技术，将其分类为七大类：中心度、关键节点删除问题、影响力最大化、网络控制、人工智能、高阶和动态方法。我们的审查通过根据方法论基础和实际影响系统地分类方法，并突出它们在不同网络类型中的优势、局限性和适用性，弥合了现有调查中的差距。我们的工作通过识别关键挑战，如算法的普适性、动态网络中的实时评估、高阶结构的分析以及大规模网络中的计算效率，增进了对关键节点研究的理解。结构化综合整合了当前的进展，并突出了开放问题，特别是在建模时间动态、推进高效算法、整合机器学习方法以及为复杂系统开发可扩展和可解释的度量标准方面。

更新时间: 2025-07-08 16:45:48

领域: cs.SI,cs.AI,physics.app-ph

下载: http://arxiv.org/abs/2507.06164v1

Inferring Higher-Order Couplings with Neural Networks

Maximum entropy methods, rooted in the inverse Ising/Potts problem from statistical physics, are widely used to model pairwise interactions in complex systems across disciplines such as bioinformatics and neuroscience. While successful, these approaches often fail to capture higher-order interactions that are critical for understanding collective behavior. In contrast, modern machine learning methods can model such interactions, but their interpretability often comes at a prohibitive computational cost. Restricted Boltzmann Machines (RBMs) provide a computationally efficient alternative by encoding statistical correlations through hidden units in a bipartite architecture. In this work, we introduce a method that maps RBMs onto generalized Potts models, enabling the systematic extraction of interactions up to arbitrary order. Leveraging large-$N$ approximations, made tractable by the RBM's structure, we extract effective many-body couplings with minimal computational effort. We further propose a robust framework for recovering higher-order interactions in more complex generative models, and introduce a simple gauge-fixing scheme for the effective Potts representation. Validation on synthetic data demonstrates accurate recovery of two- and three-body interactions. Applied to protein sequence data, our method reconstructs contact maps with high fidelity and outperforms state-of-the-art inverse Potts models. These results establish RBMs as a powerful and efficient tool for modeling higher-order structure in high-dimensional categorical data.

Updated: 2025-07-08 16:40:26

标题: 用神经网络推测高阶耦合

摘要: 最大熵方法源于统计物理学中的反向伊辛/波茨问题，在诸如生物信息学和神经科学等跨学科领域广泛应用于模拟复杂系统中的成对相互作用。虽然这些方法取得了成功，但通常无法捕捉对于理解集体行为至关重要的高阶相互作用。相比之下，现代机器学习方法可以模拟这种相互作用，但其可解释性往往需要付出巨大的计算成本。受限玻尔兹曼机(RBMs)通过在双分图结构中通过隐藏单元编码统计相关性，提供了一种计算效率高的替代方案。在这项工作中，我们介绍了一种将RBMs映射到广义波茨模型的方法，使得可以系统地提取高阶相互作用。利用大N近似，通过RBMs的结构使得这一过程可行，我们可以用最小的计算代价提取有效的多体耦合。我们进一步提出了一个强大的框架，用于在更复杂的生成模型中恢复高阶相互作用，并介绍了一种简单的规范固定方案以获取有效的波茨表示。在合成数据上的验证表明准确恢复了二体和三体相互作用。应用于蛋白质序列数据时，我们的方法以高度准确性重建了接触图，超过了最先进的反向波茨模型。这些结果确立了RBMs作为一种强大而高效的工具，用于模拟高维分类数据中的高阶结构。

更新时间: 2025-07-08 16:40:26

领域: cond-mat.dis-nn,cond-mat.stat-mech,cs.LG

下载: http://arxiv.org/abs/2501.06108v4

Hedge Funds on a Swamp: Analyzing Patterns, Vulnerabilities, and Defense Measures in Blockchain Bridges [Experiment, Analysis \& Benchmark]

Blockchain bridges have become essential infrastructure for enabling interoperability across different blockchain networks, with more than $24B monthly bridge transaction volume. However, their growing adoption has been accompanied by a disproportionate rise in security breaches, making them the single largest source of financial loss in Web3. For cross-chain ecosystems to be robust and sustainable, it is essential to understand and address these vulnerabilities. In this study, we present a comprehensive systematization of blockchain bridge design and security. We define three bridge security priors, formalize the architectural structure of 13 prominent bridges, and identify 23 attack vectors grounded in real-world blockchain exploits. Using this foundation, we evaluate 43 representative attack scenarios and introduce a layered threat model that captures security failures across source chain, off-chain, and destination chain components. Our analysis at the static code and transaction network levels reveals recurring design flaws, particularly in access control, validator trust assumptions, and verification logic, and identifies key patterns in adversarial behavior based on transaction-level traces. To support future development, we propose a decision framework for bridge architecture design, along with defense mechanisms such as layered validation and circuit breakers. This work provides a data-driven foundation for evaluating bridge security and lays the groundwork for standardizing resilient cross-chain infrastructure.

Updated: 2025-07-08 16:39:23

标题: 沼泽中的对冲基金：分析区块链桥梁中的模式、脆弱性和防御措施【实验、分析和基准测试】

摘要: 区块链桥已成为实现不同区块链网络之间互操作性的基础设施，每月的桥接交易量超过240亿美元。然而，随着其日益普及，安全漏洞也不成比例地增长，使其成为Web3中造成最大财务损失的单一来源。为了确保跨链生态系统的稳健和可持续发展，了解和解决这些漏洞至关重要。在本研究中，我们对区块链桥设计和安全进行了全面系统化。我们定义了三种桥安全优先级，形式化了13个知名桥的架构结构，并识别了23种基于实际区块链攻击的攻击向量。基于这一基础，我们评估了43个代表性攻击场景，并引入了一个分层威胁模型，涵盖了源链、链下和目标链组件的安全故障。我们在静态代码和交易网络层面的分析揭示了重复出现的设计缺陷，特别是在访问控制、验证者信任假设和验证逻辑方面，并根据交易级别的迹象识别了对抗行为的关键模式。为了支持未来的发展，我们提出了一个桥梁架构设计的决策框架，以及诸如分层验证和断路器等防御机制。这项工作为评估桥安全提供了数据驱动的基础，并为标准化弹性跨链基础设施奠定了基础。

更新时间: 2025-07-08 16:39:23

领域: cs.ET,cs.CR

下载: http://arxiv.org/abs/2507.06156v1

Aliasing in Convnets: A Frame-Theoretic Perspective

Using a stride in a convolutional layer inherently introduces aliasing, which has implications for numerical stability and statistical generalization. While techniques such as the parametrizations via paraunitary systems have been used to promote orthogonal convolution and thus ensure Parseval stability, a general analysis of aliasing and its effects on the stability has not been done in this context. In this article, we adapt a frame-theoretic approach to describe aliasing in convolutional layers with 1D kernels, leading to practical estimates for stability bounds and characterizations of Parseval stability, that are tailored to take short kernel sizes into account. From this, we derive two computationally very efficient optimization objectives that promote Parseval stability via systematically suppressing aliasing. Finally, for layers with random kernels, we derive closed-form expressions for the expected value and variance of the terms that describe the aliasing effects, revealing fundamental insights into the aliasing behavior at initialization.

Updated: 2025-07-08 16:34:43

标题: 卷积神经网络中的混叠：一个框架论的视角

摘要: 在卷积层中使用步幅会在本质上引入混叠，这对数值稳定性和统计泛化性都有影响。虽然诸如通过参数化通过paraunitary系统来促进正交卷积并确保Parseval稳定性等技术已被用于实现，但在这一领域中尚未对混叠及其对稳定性的影响进行一般性分析。在本文中，我们采用框架论方法来描述具有1D核的卷积层中的混叠，从而得出了针对短核大小的稳定性界限和Parseval稳定性的特征化的实用估计。基于此，我们推导出了两个促进Parseval稳定性的计算效率非常高的优化目标，通过系统地抑制混叠。最后，对于具有随机核的图层，我们推导出了描述混叠效应的项的期望值和方差的闭式表达式，揭示了初始化时混叠行为的基本见解。

更新时间: 2025-07-08 16:34:43

领域: cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2507.06152v1

Online Planning for Multi-UAV Pursuit-Evasion in Unknown Environments Using Deep Reinforcement Learning

Multi-UAV pursuit-evasion, where pursuers aim to capture evaders, poses a key challenge for UAV swarm intelligence. Multi-agent reinforcement learning (MARL) has demonstrated potential in modeling cooperative behaviors, but most RL-based approaches remain constrained to simplified simulations with limited dynamics or fixed scenarios. Previous attempts to deploy RL policy to real-world pursuit-evasion are largely restricted to two-dimensional scenarios, such as ground vehicles or UAVs at fixed altitudes. In this paper, we address multi-UAV pursuit-evasion by considering UAV dynamics and physical constraints. We introduce an evader prediction-enhanced network to tackle partial observability in cooperative strategy learning. Additionally, we propose an adaptive environment generator within MARL training, enabling higher exploration efficiency and better policy generalization across diverse scenarios. Simulations show our method significantly outperforms all baselines in challenging scenarios, generalizing to unseen scenarios with a 100% capture rate. Finally, we derive a feasible policy via a two-stage reward refinement and deploy the policy on real quadrotors in a zero-shot manner. To our knowledge, this is the first work to derive and deploy an RL-based policy using collective thrust and body rates control commands for multi-UAV pursuit-evasion in unknown environments. The open-source code and videos are available at https://sites.google.com/view/pursuit-evasion-rl.

Updated: 2025-07-08 16:34:36

标题: 使用深度强化学习进行多无人机在未知环境中的在线追逐-逃避规划

摘要: 多无人机追逐逃避，其中追击者的目标是捕捉逃避者，对无人机群体智能提出了重要挑战。多智能体强化学习（MARL）已经展示出在建模合作行为方面的潜力，但大多数基于强化学习的方法仍然局限于简化的模拟环境，具有有限的动态或固定的场景。以前尝试将强化学习策略部署到现实世界中的追逐逃避主要局限于二维场景，例如地面车辆或固定高度的无人机。在本文中，我们考虑了无人机动态和物理约束，解决了多无人机追逐逃避问题。我们引入了一个增强了逃避者预测的网络，以应对合作策略学习中的部分可观测性。此外，我们在MARL训练中提出了一个自适应环境生成器，实现更高的探索效率，并在不同场景中更好地推广策略。模拟结果显示，我们的方法在具有挑战性的场景中明显优于所有基准线，在未知场景中普遍具有100％的捕获率。最后，我们通过两阶段奖励细化得出一个可行策略，并以零-shot方式将策略部署到真实的四旋翼飞行器上。据我们所知，这是第一项使用集体推力和机体速率控制命令推导并部署基于强化学习的策略，用于未知环境中的多无人机追逐逃避的工作。开源代码和视频可在https://sites.google.com/view/pursuit-evasion-rl 上找到。

更新时间: 2025-07-08 16:34:36

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2409.15866v4

Transformers Simulate MLE for Sequence Generation in Bayesian Networks

Transformers have achieved significant success in various fields, notably excelling in tasks involving sequential data like natural language processing. Despite these achievements, the theoretical understanding of transformers' capabilities remains limited. In this paper, we investigate the theoretical capabilities of transformers to autoregressively generate sequences in Bayesian networks based on in-context maximum likelihood estimation (MLE). Specifically, we consider a setting where a context is formed by a set of independent sequences generated according to a Bayesian network. We demonstrate that there exists a simple transformer model that can (i) estimate the conditional probabilities of the Bayesian network according to the context, and (ii) autoregressively generate a new sample according to the Bayesian network with estimated conditional probabilities. We further demonstrate in extensive experiments that such a transformer does not only exist in theory, but can also be effectively obtained through training. Our analysis highlights the potential of transformers to learn complex probabilistic models and contributes to a better understanding of large language models as a powerful class of sequence generators.

Updated: 2025-07-08 16:32:46

标题: 变压器在贝叶斯网络中模拟MLE用于序列生成

摘要: Transformers在各个领域取得了显著的成功，特别是在涉及自然语言处理等顺序数据的任务中表现出色。尽管取得了这些成就，关于transformers能力的理论理解仍然有限。在本文中，我们研究了transformers在基于上下文最大似然估计（MLE）的贝叶斯网络中自回归生成序列的理论能力。具体而言，我们考虑了一个背景是由根据贝叶斯网络生成的一组独立序列形成的设置。我们证明了存在一个简单的transformer模型，可以（i）根据上下文估计贝叶斯网络的条件概率，以及（ii）根据估计的条件概率自回归生成一个新样本。我们进一步通过大量实验证明，这样的transformer不仅在理论上存在，而且通过训练也可以有效获得。我们的分析突出了transformers学习复杂概率模型的潜力，并有助于更好地理解大型语言模型作为一类强大的序列生成器。

更新时间: 2025-07-08 16:32:46

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2501.02547v2

Fast and Accurate Collision Probability Estimation for Autonomous Vehicles using Adaptive Sigma-Point Sampling

A novel algorithm is presented for the estimation of collision probabilities between dynamic objects with uncertain trajectories, where the trajectories are given as a sequence of poses with Gaussian distributions. We propose an adaptive sigma-point sampling scheme, which ultimately produces a fast, simple algorithm capable of estimating the collision probability with a median error of 3.5%, and a median runtime of 0.21ms, when measured on an Intel Xeon Gold 6226R Processor. Importantly, the algorithm explicitly accounts for the collision probability's temporal dependence, which is often neglected in prior work and otherwise leads to an overestimation of the collision probability. Finally, the method is tested on a diverse set of relevant real-world scenarios, consisting of 400 6-second snippets of autonomous vehicle logs, where the accuracy and latency is rigorously evaluated.

Updated: 2025-07-08 16:31:11

标题: 自主车辆使用自适应Sigma点采样快速准确估计碰撞概率

摘要: 提出了一种用于估计具有不确定轨迹的动态对象之间碰撞概率的新算法，其中轨迹被给定为具有高斯分布的姿势序列。我们提出了一种自适应的西格玛点采样方案，最终产生了一种快速简单的算法，能够在英特尔至强金6226R处理器上测量时，以3.5%的中位误差和0.21毫秒的中位运行时间估计碰撞概率。重要的是，该算法明确考虑了碰撞概率的时间依赖性，这在先前的工作中经常被忽视，否则会导致碰撞概率被高估。最后，该方法在一个包含400个6秒自主车辆日志片段的各种相关实际场景上进行了测试，对准确性和延迟进行了严格评估。

更新时间: 2025-07-08 16:31:11

领域: cs.RO,cs.AI,cs.CG

下载: http://arxiv.org/abs/2507.06149v1

SoftReMish: A Novel Activation Function for Enhanced Convolutional Neural Networks for Visual Recognition Performance

In this study, SoftReMish, a new activation function designed to improve the performance of convolutional neural networks (CNNs) in image classification tasks, is proposed. Using the MNIST dataset, a standard CNN architecture consisting of two convolutional layers, max pooling, and fully connected layers was implemented. SoftReMish was evaluated against popular activation functions including ReLU, Tanh, and Mish by replacing the activation function in all trainable layers. The model performance was assessed in terms of minimum training loss and maximum validation accuracy. Results showed that SoftReMish achieved a minimum loss (3.14e-8) and a validation accuracy (99.41%), outperforming all other functions tested. These findings demonstrate that SoftReMish offers better convergence behavior and generalization capability, making it a promising candidate for visual recognition tasks.

Updated: 2025-07-08 16:29:14

标题: SoftReMish：一种用于增强卷积神经网络视觉识别性能的新型激活函数

摘要: 在这项研究中，提出了一种新的激活函数SoftReMish，旨在改善卷积神经网络（CNNs）在图像分类任务中的性能。使用MNIST数据集，实现了一个标准的CNN架构，包括两个卷积层、最大池化和全连接层。SoftReMish与流行的激活函数（包括ReLU、Tanh和Mish）进行了评估，通过替换所有可训练层中的激活函数。模型性能以最小训练损失和最大验证准确度来评估。结果显示，SoftReMish实现了最小损失（3.14e-8）和验证准确度（99.41%），优于所有其他测试的函数。这些发现表明，SoftReMish具有更好的收敛行为和泛化能力，使其成为视觉识别任务的一个有前途的候选者。

更新时间: 2025-07-08 16:29:14

领域: cs.CV,cs.AI,cs.NE

下载: http://arxiv.org/abs/2507.06148v1

LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models

Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality, potentially compromising diagnostic accuracy. Existing deep learning-based denoising methods focus primarily on pixel-level mappings, overlooking the potential benefits of high-level semantic guidance. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information, offering new opportunities to improve LDCT reconstruction. In this paper, we introduce LangMamba, a Language-driven Mamba framework for LDCT denoising that leverages VLM-derived representations to enhance supervision from normal-dose CT (NDCT). LangMamba follows a two-stage learning strategy. First, we pre-train a Language-guided AutoEncoder (LangAE) that leverages frozen VLMs to map NDCT images into a semantic space enriched with anatomical information. Second, we synergize LangAE with two key components to guide LDCT denoising: Semantic-Enhanced Efficient Denoiser (SEED), which enhances NDCT-relevant local semantic while capturing global features with efficient Mamba mechanism, and Language-engaged Dual-space Alignment (LangDA) Loss, which ensures that denoised images align with NDCT in both perceptual and semantic spaces. Extensive experiments on two public datasets demonstrate that LangMamba outperforms conventional state-of-the-art methods, significantly improving detail preservation and visual fidelity. Remarkably, LangAE exhibits strong generalizability to unseen datasets, thereby reducing training costs. Furthermore, LangDA loss improves explainability by integrating language-guided insights into image reconstruction and offers a plug-and-play fashion. Our findings shed new light on the potential of language as a supervisory signal to advance LDCT denoising. The code is publicly available on https://github.com/hao1635/LangMamba.

Updated: 2025-07-08 16:22:05

标题: LangMamba：一种用于低剂量CT去噪的语言驱动Mamba框架，基于视觉-语言模型.

摘要: 低剂量计算机断层扫描（LDCT）可以减少辐射暴露，但通常会降低图像质量，可能会影响诊断准确性。现有基于深度学习的去噪方法主要关注像素级映射，忽视了高级语义引导的潜在好处。最近视觉语言模型（VLMs）的进展表明，语言可以作为捕获结构化语义信息的强大工具，为改善LDCT重建提供了新机会。在本文中，我们介绍了LangMamba，一种用于LDCT去噪的语言驱动Mamba框架，利用VLM衍生的表示增强来自正常剂量CT（NDCT）的监督。LangMamba遵循两阶段学习策略。首先，我们预训练了一种Language-guided AutoEncoder（LangAE），利用冻结的VLMs将NDCT图像映射到一个富含解剖信息的语义空间。其次，我们将LangAE与两个关键组件相结合，以指导LDCT去噪：Semantic-Enhanced Efficient Denoiser（SEED），它增强了与NDCT相关的局部语义，同时利用高效的Mamba机制捕获全局特征，以及Language-engaged Dual-space Alignment（LangDA）Loss，它确保去噪图像在感知和语义空间中与NDCT对齐。在两个公共数据集上进行的大量实验表明，LangMamba优于传统的最先进方法，显着改善了细节保留和视觉保真度。值得注意的是，LangAE对未见数据集表现出良好的泛化能力，从而降低了训练成本。此外，LangDA loss通过将语言引导的见解整合到图像重建中，提高了可解释性，并提供了即插即用的方式。我们的发现为语言作为监督信号推进LDCT去噪的潜力带来了新的启示。代码可在https://github.com/hao1635/LangMamba上公开获取。

更新时间: 2025-07-08 16:22:05

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.06140v1

Topic Modeling and Link-Prediction for Material Property Discovery

Link prediction infers missing or future relations between graph nodes, based on connection patterns. Scientific literature networks and knowledge graphs are typically large, sparse, and noisy, and often contain missing links between entities. We present an AI-driven hierarchical link prediction framework that integrates matrix factorization to infer hidden associations and steer discovery in complex material domains. Our method combines Hierarchical Nonnegative Matrix Factorization (HNMFk) and Boolean matrix factorization (BNMFk) with automatic model selection, as well as Logistic matrix factorization (LMF), we use to construct a three-level topic tree from a 46,862-document corpus focused on 73 transition-metal dichalcogenides (TMDs). These materials are studied in a variety of physics fields with many current and potential applications. An ensemble BNMFk + LMF approach fuses discrete interpretability with probabilistic scoring. The resulting HNMFk clusters map each material onto coherent topics like superconductivity, energy storage, and tribology. Also, missing or weakly connected links are highlight between topics and materials, suggesting novel hypotheses for cross-disciplinary exploration. We validate our method by removing publications about superconductivity in well-known superconductors, and show the model predicts associations with the superconducting TMD clusters. This shows the method finds hidden connections in a graph of material to latent topic associations built from scientific literature, especially useful when examining a diverse corpus of scientific documents covering the same class of phenomena or materials but originating from distinct communities and perspectives. The inferred links generating new hypotheses, produced by our method, are exposed through an interactive Streamlit dashboard, designed for human-in-the-loop scientific discovery.

Updated: 2025-07-08 16:20:46

标题: 主题建模和链接预测用于材料性能发现

摘要: 链接预测推断出图节点之间缺失或未来的关系，基于连接模式。科学文献网络和知识图通常是大型、稀疏和嘈杂的，并且通常包含实体之间的缺失链接。我们提出了一个基于人工智能的分层链接预测框架，集成了矩阵分解以推断隐藏的关联并引导复杂材料领域的发现。我们的方法结合了分层非负矩阵分解（HNMFk）和布尔矩阵分解（BNMFk），并使用自动模型选择，以及逻辑矩阵分解（LMF），我们用于从一个聚焦于73种过渡金属二硫化物（TMDs）的46,862份文档语料库中构建三级主题树。这些材料在许多当前和潜在应用的物理领域中受到研究。一个集成的BNMFk + LMF方法将离散的可解释性与概率评分融合。由此产生的HNMFk聚类将每种材料映射到类似超导性、储能和摩擦学等连贯主题。此外，突出显示了主题和材料之间的缺失或弱连接，为跨学科探索提供了新的假设。我们通过删除已知超导体中有关超导性的出版物来验证我们的方法，并展示模型预测了与超导性TMD聚类相关的关联。这表明该方法从科学文献中构建的材料到潜在主题关联图中发现了隐藏的连接，尤其在检查涵盖同一类现象或材料的多样化科学文档语料库时特别有用，这些文档来自不同的社区和观点。我们的方法产生的推断链接生成了新的假设，并通过一个设计用于人机协作科学发现的交互式Streamlit仪表板展示。

更新时间: 2025-07-08 16:20:46

领域: cs.LG,cs.AI,cs.CE

下载: http://arxiv.org/abs/2507.06139v1

Coding Triangle: How Does Large Language Model Understand Code?

Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.

Updated: 2025-07-08 16:20:43

标题: 编码三角形：大型语言模型如何理解代码？

摘要: 大型语言模型（LLMs）在代码生成方面取得了显著进展，但它们真正的编程能力尚未得到充分探索。我们引入了代码三角框架，系统地评估LLMs在三个基本维度上的表现：编辑分析、代码实现和测试用例生成。通过对竞争性编程基准进行大量实验，我们发现，虽然LLMs可以在这些维度上形成一个自洽的系统，但它们的解决方案往往缺乏人类程序员的多样性和健壮性。我们发现模型认知和人类专业知识之间存在显著的分布偏移，模型错误往往由于训练数据偏见和有限的推理传递而聚集。我们的研究表明，将人类生成的编辑、解决方案和多样化测试用例纳入，以及利用模型混合，可以显著提高LLMs的性能和健壮性。此外，我们揭示了LLMs认知中的一致性和不一致性，这可能促进自我反思和自我改进，为开发更强大的编码模型提供了潜在方向。

更新时间: 2025-07-08 16:20:43

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.06138v1

NeoBabel: A Multilingual Open Tower for Visual Generation

Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though these models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending crosslingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. Notably, NeoBabel matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.

Updated: 2025-07-08 16:19:45

标题: 新通天塔：一个用于视觉生成的多语言开放平台

摘要: 文本到图像生成的进展主要是以英语为中心的，这为非英语使用者创建了障碍，并延续了数字不平等现象。虽然现有系统依赖于翻译管道，但这些管道引入了语义漂移、计算开销和文化不一致。我们介绍了NeoBabel，这是一个新颖的多语言图像生成框架，它在性能、效率和包容性方面设立了新的帕累托前沿，支持六种语言：英语、中文、荷兰语、法语、印地语和波斯语。该模型通过大规模多语言预训练和高分辨率指导微调的组合进行训练。为了评估其能力，我们将两个仅限于英语的基准扩展为多语言等价物：m-GenEval和m-DPG。NeoBabel在保留强大的英语能力的同时实现了最先进的多语言性能，m-GenEval得分为0.75，m-DPG得分为0.68。值得注意的是，尽管这些模型是基于多语言基础的LLMs构建的，NeoBabel在多语言基准测试上的表现与领先模型不相上下，甚至在多语言基准测试上超过它们+0.11和+0.09。这表明了我们针对性对齐训练在保持和扩展跨语言泛化方面的有效性。我们进一步引入了两个新的指标来严格评估多语言对齐性和对混合代码提示的稳健性。值得注意的是，NeoBabel在与仅英语模型相匹配或超过的同时，大小仅为其2-4倍。我们发布了一个开放工具包，包括所有代码、模型检查点、一组精心策划的124M多语言文本图像对数据集和标准化的多语言评估协议，以推动包容性人工智能研究。我们的工作证明了多语言能力不是一种权衡，而是一种促进生成式人工智能在稳健性、效率和文化忠实性方面改进的催化剂。

更新时间: 2025-07-08 16:19:45

领域: cs.CL,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.06137v1

OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety

Recent advances in AI agents capable of solving complex, everyday tasks, from scheduling to customer service, have enabled deployment in real-world settings, but their possibilities for unsafe behavior demands rigorous evaluation. While prior benchmarks have attempted to assess agent safety, most fall short by relying on simulated environments, narrow task domains, or unrealistic tool abstractions. We introduce OpenAgentSafety, a comprehensive and modular framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms; and supports over 350 multi-turn, multi-user tasks spanning both benign and adversarial user intents. OpenAgentSafety is designed for extensibility, allowing researchers to add tools, tasks, websites, and adversarial strategies with minimal effort. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors. Empirical analysis of five prominent LLMs in agentic scenarios reveals unsafe behavior in 51.2% of safety-vulnerable tasks with Claude-Sonnet-3.7, to 72.7% with o3-mini, highlighting critical safety vulnerabilities and the need for stronger safeguards before real-world deployment.

Updated: 2025-07-08 16:18:54

标题: OpenAgentSafety：一个评估现实世界AI代理安全性的综合框架

摘要: 最近，人工智能代理在解决复杂的日常任务方面取得了进展，从排班到客户服务，已经能够在现实世界中部署，但它们可能存在不安全行为的可能性要求进行严格评估。虽然之前的基准尝试评估代理的安全性，但大多数都是依赖于模拟环境、狭窄的任务领域或不切实际的工具抽象，表现不佳。我们引入了OpenAgentSafety，一个全面和模块化的框架，用于评估代理在八个关键风险类别下的行为。与以往的工作不同，我们的框架评估与真实工具交互的代理，包括网络浏览器、代码执行环境、文件系统、bash shell和消息平台；并支持超过350个多轮、多用户任务，涵盖良性和敌对用户意图。OpenAgentSafety设计具有可扩展性，允许研究人员以最小的努力添加工具、任务、网站和对抗性策略。它结合基于规则的分析和以LLM为评判标准的评估，以检测明显和微妙的不安全行为。对五个著名LLM在代理场景中的实证分析显示，Claude-Sonnet-3.7在安全易受损任务中表现出51.2%的不安全行为，o3-mini为72.7%，突出了关键的安全漏洞和在现实世界部署之前需要更强有力的保障。

更新时间: 2025-07-08 16:18:54

领域: cs.AI

下载: http://arxiv.org/abs/2507.06134v1

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.

Updated: 2025-07-08 16:17:15

标题: 学习奖励函数优化的风险：低训练错误并不保证低后悔

摘要: 在强化学习中，指定能够捕捉预期任务的奖励函数可能非常具有挑战性。奖励学习旨在通过学习奖励函数来解决这一问题。然而，一个学习到的奖励模型可能在数据分布上具有很低的错误，但随后产生一个具有很大后悔的策略。我们称这样的奖励模型具有错误-后悔不匹配。错误-后悔不匹配的主要来源是通常在策略优化过程中发生的分布转移。在本文中，我们数学上证明了奖励模型的预期测试错误足够低可以保证最坏情况下的后悔也很低，但对于任何固定的预期测试错误，都存在能够导致错误-后悔不匹配发生的现实数据分布。然后我们展示了即使使用常用于RLHF等方法中的策略正则化技术，类似的问题仍然存在。我们希望我们的结果能够激发对改进奖励模型学习方法的理论和实证研究，并找到更好的可靠方法来衡量其质量。

更新时间: 2025-07-08 16:17:15

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2406.15753v3

PrefixAgent: An LLM-Powered Design Framework for Efficient Prefix Adder Optimization

Prefix adders are fundamental arithmetic circuits, but their design space grows exponentially with bit-width, posing significant optimization challenges. Previous works face limitations in performance, generalization, and scalability. To address these challenges, we propose PrefixAgent, a large language model (LLM)-powered framework that enables efficient prefix adder optimization. Specifically, PrefixAgent reformulates the problem into subtasks including backbone synthesis and structure refinement, which effectively reduces the search space. More importantly, this new design perspective enables us to efficiently collect enormous high-quality data and reasoning traces with E-graph, which further results in an effective fine-tuning of LLM. Experimental results show that PrefixAgent synthesizes prefix adders with consistently smaller areas compared to baseline methods, while maintaining scalability and generalization in commercial EDA flows.

Updated: 2025-07-08 16:14:17

标题: 前缀代理：一种基于LLM的设计框架，用于高效优化前缀加法器

摘要: 前缀加法器是基础算术电路，但随着位宽的增加，它们的设计空间呈指数增长，提出了重大的优化挑战。先前的研究在性能、泛化能力和可扩展性方面存在局限性。为了解决这些挑战，我们提出了PrefixAgent，这是一个由大型语言模型（LLM）驱动的框架，可以实现高效的前缀加法器优化。具体来说，PrefixAgent将问题重新构建为包括骨干综合和结构优化在内的子任务，有效减少了搜索空间。更重要的是，这种新的设计视角使我们能够高效地收集大量高质量数据和推理轨迹，进而实现LLM的有效微调。实验结果表明，与基准方法相比，PrefixAgent合成的前缀加法器面积始终更小，同时在商业EDA流程中保持了可扩展性和泛化性。

更新时间: 2025-07-08 16:14:17

领域: cs.AR,cs.AI

下载: http://arxiv.org/abs/2507.06127v1

Subspace-based Approximate Hessian Method for Zeroth-Order Optimization

Zeroth-order optimization addresses problems where gradient information is inaccessible or impractical to compute. While most existing methods rely on first-order approximations, incorporating second-order (curvature) information can, in principle, significantly accelerate convergence. However, the high cost of function evaluations required to estimate Hessian matrices often limits practical applicability. We present the subspace-based approximate Hessian (ZO-SAH) method, a zeroth-order optimization algorithm that mitigates these costs by focusing on randomly selected two-dimensional subspaces. Within each subspace, ZO-SAH estimates the Hessian by fitting a quadratic polynomial to the objective function and extracting its second-order coefficients. To further reduce function-query costs, ZO-SAH employs a periodic subspace-switching strategy that reuses function evaluations across optimization steps. Experiments on eight benchmark datasets, including logistic regression and deep neural network training tasks, demonstrate that ZO-SAH achieves significantly faster convergence than existing zeroth-order methods.

Updated: 2025-07-08 16:11:53

标题: 基于子空间的零阶优化的近似Hessian方法

摘要: Zeroth-order optimization解决了梯度信息不可获取或难以计算的问题。大多数现有方法依赖于一阶近似，但原则上，结合二阶（曲率）信息可以显著加速收敛。然而，用于估计Hessian矩阵的高成本函数评估通常限制了实际应用。我们提出了基于子空间的近似Hessian（ZO-SAH）方法，这是一种零阶优化算法，通过关注随机选择的二维子空间来减轻这些成本。在每个子空间中，ZO-SAH通过将二次多项式拟合到目标函数并提取其二阶系数来估计Hessian。为了进一步降低函数查询成本，ZO-SAH采用了周期性子空间切换策略，可以在优化步骤之间重复使用函数评估。在包括逻辑回归和深度神经网络训练任务在内的八个基准数据集上的实验证明，ZO-SAH比现有的零阶方法实现了显著更快的收敛速度。

更新时间: 2025-07-08 16:11:53

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.06125v1

Regression for the Mean: Auto-Evaluation and Inference with Few Labels through Post-hoc Regression

The availability of machine learning systems that can effectively perform arbitrary tasks has led to synthetic labels from these systems being used in applications of statistical inference, such as data analysis or model evaluation. The Prediction Powered Inference (PPI) framework provides a way of leveraging both a large pool of pseudo-labelled data and a small sample with real, high-quality labels to produce a low-variance, unbiased estimate of the quantity being evaluated for. Most work on PPI considers a relatively sizable set of labelled samples, which can be resource intensive to obtain. However, we find that when labelled data is scarce, the PPI++ method can perform even worse than classical inference. We analyze this phenomenon by relating PPI++ to ordinary least squares regression, which also experiences high variance with small sample sizes, and use this regression framework to better understand the efficacy of PPI. Motivated by this, we present two new PPI-based techniques that leverage robust regressors to produce even lower variance estimators in the few-label regime.

Updated: 2025-07-08 16:05:09

标题: 均值回归：通过事后回归进行少标签自动评估和推断

摘要: 机器学习系统的可用性使得这些系统能够有效地执行任意任务，导致来自这些系统的合成标签被用于统计推断的应用，如数据分析或模型评估。预测驱动的推断（PPI）框架提供了一种利用大量伪标记数据和少量具有真实、高质量标签的样本来产生对正在评估的数量的低方差、无偏估计的方法。大多数关于PPI的研究考虑了一个相对庞大的标记样本集，这可能会消耗大量资源来获取。然而，我们发现当标记数据稀缺时，PPI++方法的表现甚至比传统推断更差。我们通过将PPI++与普通最小二乘回归联系起来来分析这种现象，普通最小二乘回归在样本量较小的情况下也会出现高方差，利用这个回归框架来更好地理解PPI的功效。受此启发，我们提出了两种基于PPI的新技术，利用强健的回归器在少标签情况下产生更低方差的估计量。

更新时间: 2025-07-08 16:05:09

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2411.12665v2

Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis

Automatic speech quality assessment plays a crucial role in the development of speech synthesis systems, but existing models exhibit significant performance variations across different granularity levels of prediction tasks. This paper proposes an enhanced MOS prediction system based on self-supervised learning speech models, incorporating a Mixture of Experts (MoE) classification head and utilizing synthetic data from multiple commercial generation models for data augmentation. Our method builds upon existing self-supervised models such as wav2vec2, designing a specialized MoE architecture to address different types of speech quality assessment tasks. We also collected a large-scale synthetic speech dataset encompassing the latest text-to-speech, speech conversion, and speech enhancement systems. However, despite the adoption of the MoE architecture and expanded dataset, the model's performance improvements in sentence-level prediction tasks remain limited. Our work reveals the limitations of current methods in handling sentence-level quality assessment, provides new technical pathways for the field of automatic speech quality assessment, and also delves into the fundamental causes of performance differences across different assessment granularities.

Updated: 2025-07-08 16:00:13

标题: 基于专家混合的语音质量评估模型：系统级性能提升和话语级挑战分析

摘要: 自动语音质量评估在语音合成系统的发展中起着至关重要的作用，但现有模型在不同粒度级别的预测任务中表现出显著的性能变化。本文提出了一种基于自监督学习语音模型的增强MOS预测系统，该系统包括一个专家混合（MoE）分类头，并利用来自多个商业生成模型的合成数据进行数据增强。我们的方法建立在现有的自监督模型（如wav2vec2）之上，设计了一个专门的MoE架构来解决不同类型的语音质量评估任务。我们还收集了一个大规模的合成语音数据集，涵盖了最新的文本转语音、语音转换和语音增强系统。然而，尽管采用了MoE架构和扩展数据集，但模型在句子级别预测任务中的性能改进仍然有限。我们的工作揭示了当前方法在处理句子级别质量评估方面的局限性，为自动语音质量评估领域提供了新的技术路径，并深入探讨了不同评估粒度之间性能差异的根本原因。

更新时间: 2025-07-08 16:00:13

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.06116v1

Eyes on the Environment: AI-Driven Analysis for Fire and Smoke Classification, Segmentation, and Detection

Fire and smoke phenomena pose a significant threat to the natural environment, ecosystems, and global economy, as well as human lives and wildlife. In this particular circumstance, there is a demand for more sophisticated and advanced technologies to implement an effective strategy for early detection, real-time monitoring, and minimizing the overall impacts of fires on ecological balance and public safety. Recently, the rapid advancement of Artificial Intelligence (AI) and Computer Vision (CV) frameworks has substantially revolutionized the momentum for developing efficient fire management systems. However, these systems extensively rely on the availability of adequate and high-quality fire and smoke data to create proficient Machine Learning (ML) methods for various tasks, such as detection and monitoring. Although fire and smoke datasets play a critical role in training, evaluating, and testing advanced Deep Learning (DL) models, a comprehensive review of the existing datasets is still unexplored. For this purpose, we provide an in-depth review to systematically analyze and evaluate fire and smoke datasets collected over the past 20 years. We investigate the characteristics of each dataset, including type, size, format, collection methods, and geographical diversities. We also review and highlight the unique features of each dataset, such as imaging modalities (RGB, thermal, infrared) and their applicability for different fire management tasks (classification, segmentation, detection). Furthermore, we summarize the strengths and weaknesses of each dataset and discuss their potential for advancing research and technology in fire management. Ultimately, we conduct extensive experimental analyses across different datasets using several state-of-the-art algorithms, such as ResNet-50, DeepLab-V3, and YoloV8.

Updated: 2025-07-08 16:00:08

标题: 环境监测关注：基于人工智能的火灾和烟雾分类、分割和检测分析

摘要: 火灾和烟雾现象对自然环境、生态系统、全球经济以及人类生命和野生动物构成重大威胁。在这种特殊情况下，需要更加复杂和先进的技术来实施有效的早期检测、实时监测，并最小化火灾对生态平衡和公共安全的总体影响。最近，人工智能（AI）和计算机视觉（CV）框架的快速发展已经极大地改变了开发高效火灾管理系统的动力。然而，这些系统广泛依赖于充足和高质量的火灾和烟雾数据，以创建用于各种任务（如检测和监测）的有效机器学习（ML）方法。尽管火灾和烟雾数据集在训练、评估和测试先进的深度学习（DL）模型方面起着关键作用，但对现有数据集的全面回顾仍未被探索。为此，我们提供了一项深入审查，系统分析和评估过去20年收集的火灾和烟雾数据集。我们调查了每个数据集的特征，包括类型、大小、格式、采集方法和地理多样性。我们还回顾并突出每个数据集的独特特征，例如成像模式（RGB、热像、红外）及其在不同火灾管理任务（分类、分割、检测）中的适用性。此外，我们总结了每个数据集的优势和劣势，并讨论它们在推动火灾管理研究和技术方面的潜力。最终，我们使用几种最先进的算法（如ResNet-50、DeepLab-V3和YoloV8）对不同数据集进行了广泛的实验分析。

更新时间: 2025-07-08 16:00:08

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.14552v2

Safe Beyond the Horizon: Efficient Sampling-based MPC with Neural Control Barrier Functions

A common problem when using model predictive control (MPC) in practice is the satisfaction of safety specifications beyond the prediction horizon. While theoretical works have shown that safety can be guaranteed by enforcing a suitable terminal set constraint or a sufficiently long prediction horizon, these techniques are difficult to apply and thus are rarely used by practitioners, especially in the case of general nonlinear dynamics. To solve this problem, we impose a tradeoff between exact recursive feasibility, computational tractability, and applicability to ``black-box'' dynamics by learning an approximate discrete-time control barrier function and incorporating it into a variational inference MPC (VIMPC), a sampling-based MPC paradigm. To handle the resulting state constraints, we further propose a new sampling strategy that greatly reduces the variance of the estimated optimal control, improving the sample efficiency, and enabling real-time planning on a CPU. The resulting Neural Shield-VIMPC (NS-VIMPC) controller yields substantial safety improvements compared to existing sampling-based MPC controllers, even under badly designed cost functions. We validate our approach in both simulation and real-world hardware experiments. Project website: https://mit-realm.github.io/ns-vimpc/.

Updated: 2025-07-08 15:59:29

标题: 超越地平线的安全性：具有神经控制屏障函数的高效基于采样的MPC

摘要: 使用模型预测控制（MPC）时的一个常见问题是在预测范围之外满足安全规范。虽然理论研究表明，通过强制执行适当的终端集约束或足够长的预测范围可以保证安全，但这些技术很难应用，因此在实践中很少被使用，特别是在一般非线性动态的情况下。为了解决这个问题，我们通过学习一个近似的离散时间控制屏障函数，并将其整合到变分推断MPC（VIMPC）中，即一种基于采样的MPC范式，来实现精确的递归可行性、计算可行性和适用性之间的权衡，适用于“黑盒”动态。为了处理由此产生的状态约束，我们进一步提出了一种新的采样策略，大大减少了估计最优控制的方差，提高了样本效率，并实现了在CPU上的实时规划。由此产生的神经屏障-VIMPC（NS-VIMPC）控制器相比现有的基于采样的MPC控制器，在糟糕设计的成本函数下，安全性得到了显着改善。我们在仿真和真实硬件实验中验证了我们的方法。项目网站：https://mit-realm.github.io/ns-vimpc/。

更新时间: 2025-07-08 15:59:29

领域: cs.RO,cs.AI,cs.SY,eess.SY

下载: http://arxiv.org/abs/2502.15006v2

Entropy stable conservative flux form neural networks

We propose an entropy-stable conservative flux form neural network (CFN) that integrates classical numerical conservation laws into a data-driven framework using the entropy-stable, second-order, and non-oscillatory Kurganov-Tadmor (KT) scheme. The proposed entropy-stable CFN uses slope limiting as a denoising mechanism, ensuring accurate predictions in both noisy and sparse observation environments, as well as in both smooth and discontinuous regions. Numerical experiments demonstrate that the entropy-stable CFN achieves both stability and conservation while maintaining accuracy over extended time domains. Furthermore, it successfully predicts shock propagation speeds in long-term simulations, {\it without} oracle knowledge of later-time profiles in the training data.

Updated: 2025-07-08 15:56:17

标题: 熵稳定的保守通量形式神经网络

摘要: 我们提出了一个熵稳定的保守通量形式神经网络（CFN），将经典数值守恒定律整合到数据驱动框架中，使用熵稳定、二阶和非振荡的Kurganov-Tadmor（KT）方案。所提出的熵稳定CFN使用斜率限制作为去噪机制，确保在嘈杂和稀疏观测环境以及光滑和不连续区域中都能准确预测。数值实验表明，熵稳定CFN在保持准确性的同时在扩展时间域内实现了稳定性和守恒。此外，它成功预测了长期模拟中的冲击传播速度，而无需在训练数据中具有有关后续时间剖面的预知信息。

更新时间: 2025-07-08 15:56:17

领域: math.NA,cs.LG,cs.NA,65M08, 68T07, 65M22, 65M32, 65D25

下载: http://arxiv.org/abs/2411.01746v2

SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?

The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity's Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI's and Google's Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.

Updated: 2025-07-08 15:54:19

标题: SciMaster:通向通用科学人工智能代理的道路，第一部分。X-Master作为基础：我们能够在人类最后的考试中领先吗？

摘要: 人工智能代理的快速发展引发了利用它们加速科学发现的长期愿望。实现这一目标需要对人类知识前沿有深刻的理解。因此，人类最后的考试（HLE）为评估科学人工智能代理提供了一个异常具有挑战性的标杆。在本研究中，我们旨在构建通用代理的基础架构，并通过在HLE上的领先表现来验证其能力。为了实现这一目标，我们引入了X-Master，这是一个经过工具增强的推理代理，旨在通过在推理过程中灵活与外部工具互动来模拟人类研究人员。这个代理根据代码作为交互语言的概念化，可以灵活利用内置的Python库和我们定制的工具来增强推理。我们进一步通过X-Masters扩展了其能力，这是一个分散和堆叠的代理工作流程，系统地增强了推理的广度和深度。我们的开源解决方案X-Masters在HLE上取得了32.1%的得分，超过了OpenAI和Google的深度研究（26.6%和26.9%），成为首个超过30%门槛的研究。这项工作使我们能够更深入地了解复杂任务解决，并积累宝贵的经验，可用于指导未来的进展，引导后续模型训练。

更新时间: 2025-07-08 15:54:19

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.05241v2

Fun with flags: How Compilers Break and Fix Constant-Time Code

Developers rely on constant-time programming to prevent timing side-channel attacks. But these efforts can be undone by compilers, whose optimizations may silently reintroduce leaks. While recent works have measured the extent of such leakage, they leave developers without actionable insights: which optimization passes are responsible, and how to disable them without modifying the compiler remains unclear. In this paper, we conduct a qualitative analysis of how compiler optimizations break constant-time code. We construct a dataset of compiler-introduced constant-time violations and analyze the internals of two widely used compilers, GCC and LLVM, to identify the specific optimization passes responsible. Our key insight is that a small set of passes are at the root of most leaks. To the best of our knowledge, we are also the first to characterize how the interactions between these passes contribute to leakage. Based on this analysis, we propose an original and practical mitigation that requires no source code modification or custom compiler: disabling selected optimization passes via compiler flags. We show that this approach significantly reduces leakage with minimal performance overhead, offering an immediately deployable defense for developers.

Updated: 2025-07-08 15:52:17

标题: 旗帜有趣：编译器如何破坏和修复常数时间代码

摘要: 开发人员依赖于常量时间编程来防止定时侧信道攻击。但这些努力可能会被编译器所破坏，其优化可能会悄然重新引入泄漏。尽管最近的研究已经衡量了这种泄漏的程度，但它们并未为开发人员提供可操作的见解：哪些优化步骤是负责的，如何在不修改编译器的情况下禁用它们仍然不清楚。在本文中，我们对编译器优化如何破坏常量时间代码进行了定性分析。我们构建了一个编译器引入的常量时间违规数据集，并分析了两个广泛使用的编译器GCC和LLVM的内部，以确定具体的负责优化步骤。我们的关键见解是，少量步骤是大多数泄漏的根源。据我们所知，我们还是第一个表征这些步骤之间相互作用如何导致泄漏的人。基于这一分析，我们提出了一种原创且实用的缓解措施，无需修改源代码或自定义编译器：通过编译器标志禁用选定的优化步骤。我们证明这种方法显著减少了泄漏，并且性能开销很小，为开发人员提供了一个可以立即部署的防御措施。

更新时间: 2025-07-08 15:52:17

领域: cs.CR

下载: http://arxiv.org/abs/2507.06112v1

Safe Domain Randomization via Uncertainty-Aware Out-of-Distribution Detection and Policy Adaptation

Deploying reinforcement learning (RL) policies in real-world involves significant challenges, including distribution shifts, safety concerns, and the impracticality of direct interactions during policy refinement. Existing methods, such as domain randomization (DR) and off-dynamics RL, enhance policy robustness by direct interaction with the target domain, an inherently unsafe practice. We propose Uncertainty-Aware RL (UARL), a novel framework that prioritizes safety during training by addressing Out-Of-Distribution (OOD) detection and policy adaptation without requiring direct interactions in target domain. UARL employs an ensemble of critics to quantify policy uncertainty and incorporates progressive environmental randomization to prepare the policy for diverse real-world conditions. By iteratively refining over high-uncertainty regions of the state space in simulated environments, UARL enhances robust generalization to the target domain without explicitly training on it. We evaluate UARL on MuJoCo benchmarks and a quadrupedal robot, demonstrating its effectiveness in reliable OOD detection, improved performance, and enhanced sample efficiency compared to baselines.

Updated: 2025-07-08 15:51:57

标题: 通过不确定性感知的区域随机化安全领域外检测和政策调整

摘要: 在现实世界中部署强化学习（RL）策略涉及重大挑战，包括分布转移、安全问题和在策略优化过程中直接交互的不可行性。现有方法，如域随机化（DR）和离线动力学RL，通过直接与目标域互动来增强策略的稳健性，这是一种本质上不安全的做法。我们提出了一种新颖的框架——不确定性感知RL（UARL），在训练过程中优先考虑安全性，通过检测超出分布范围的状态（OOD）并调整策略，而无需在目标域中进行直接交互。UARL利用一组评论家来量化策略的不确定性，并通过渐进性环境随机化来为策略准备应对多样化的现实世界条件。通过在模拟环境中迭代地对状态空间的高不确定性区域进行优化，UARL增强了对目标域的稳健泛化能力，而无需明确在其上进行训练。我们在MuJoCo基准测试和一个四足机器人上评估了UARL，展示了它在可靠的OOD检测、改进的性能和与基线相比的增强样本效率方面的有效性。

更新时间: 2025-07-08 15:51:57

领域: cs.LG,cs.RO

下载: http://arxiv.org/abs/2507.06111v1

LighthouseGS: Indoor Structure-aware 3D Gaussian Splatting for Panorama-Style Mobile Captures

Recent advances in 3D Gaussian Splatting (3DGS) have enabled real-time novel view synthesis (NVS) with impressive quality in indoor scenes. However, achieving high-fidelity rendering requires meticulously captured images covering the entire scene, limiting accessibility for general users. We aim to develop a practical 3DGS-based NVS framework using simple panorama-style motion with a handheld camera (e.g., mobile device). While convenient, this rotation-dominant motion and narrow baseline make accurate camera pose and 3D point estimation challenging, especially in textureless indoor scenes. To address these challenges, we propose LighthouseGS, a novel framework inspired by the lighthouse-like sweeping motion of panoramic views. LighthouseGS leverages rough geometric priors, such as mobile device camera poses and monocular depth estimation, and utilizes the planar structures often found in indoor environments. We present a new initialization method called plane scaffold assembly to generate consistent 3D points on these structures, followed by a stable pruning strategy to enhance geometry and optimization stability. Additionally, we introduce geometric and photometric corrections to resolve inconsistencies from motion drift and auto-exposure in mobile devices. Tested on collected real and synthetic indoor scenes, LighthouseGS delivers photorealistic rendering, surpassing state-of-the-art methods and demonstrating the potential for panoramic view synthesis and object placement.

Updated: 2025-07-08 15:49:53

标题: LighthouseGS：用于全景式移动捕捉的室内结构感知三维高斯飞溅

摘要: 最近在3D高斯光斑（3DGS）方面取得的进展使室内场景中的实时新视图合成（NVS）具有令人印象深刻的质量。然而，要实现高保真度的渲染，需要精心捕捉整个场景的图像，这限制了一般用户的可访问性。我们旨在开发一个实用的基于3DGS的NVS框架，使用手持相机（例如移动设备）进行简单的全景式运动。虽然方便，但这种以旋转为主导的运动和狭窄的基线使得准确的相机姿态和3D点估计变得具有挑战性，特别是在无纹理的室内场景中。为了解决这些挑战，我们提出了一种受到全景视图的灯塔状扫描运动启发的新框架LighthouseGS。LighthouseGS利用粗略的几何先验，如移动设备相机姿态和单目深度估计，并利用室内环境中常见的平面结构。我们提出了一种新的初始化方法，称为平面支架组装，以在这些结构上生成一致的3D点，然后采用稳定的修剪策略来增强几何和优化稳定性。此外，我们引入了几何和光度校正，以解决移动设备中的运动漂移和自动曝光导致的不一致性。通过对收集的真实和合成室内场景进行测试，LighthouseGS提供了逼真的渲染，超越了现有方法，并展示了全景视图合成和物体放置的潜力。

更新时间: 2025-07-08 15:49:53

领域: cs.GR,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.06109v1

Agents Are All You Need for LLM Unlearning

Information removal or suppression in large language models (LLMs) is a desired functionality, useful in AI regulation, legal compliance, safety, and privacy. LLM unlearning methods aim to remove information on demand from LLMs. Current LLM unlearning methods struggle to balance the unlearning efficacy and utility due to the competing nature of these objectives. Keeping the unlearning process computationally feasible without assuming access to the model weights is an overlooked area. In this work we show that \textit{agents might be all we need for effective and practical inference-time LLM unlearning}. We present the first agentic LLM unlearning (\texttt{ALU}) method, a multi-agent, retrain-free, model-agnostic approach to LLM unlearning that achieves effective unlearning while preserving the utility. Our \texttt{ALU} framework unlearns by involving multiple LLM agents, each designed for a specific step in the unlearning process, without the need to update model weights for any of the agents in the framework. Users can easily request any set of unlearning instances in any sequence, and \texttt{ALU} seamlessly adapts in real time. This is facilitated without requiring any changes in the underlying LLM model. Through extensive experiments on established benchmarks (TOFU, WMDP, WPU) and jailbreaking techniques (many shot, target masking, other languages), we demonstrate that \texttt{ALU} consistently stands out as the most robust inference-time LLM unlearning framework among current state-of-the-art methods while incurring time cost that remains effectively constant regardless of the number of unlearning targets. We further highlight \texttt{ALU}'s superior performance compared to existing methods when evaluated at scale. Specifically, \texttt{ALU} is assessed on up to 1000 unlearning targets, exceeding the evaluation scope of all previously proposed LLM unlearning methods.

Updated: 2025-07-08 15:49:01

标题: 仅需代理即可进行LLM遗忘

摘要: 大型语言模型（LLMs）中的信息删除或抑制是一种期望的功能，在AI监管、法律合规、安全和隐私方面非常有用。LLM遗忘方法旨在根据需求从LLMs中删除信息。当前的LLM遗忘方法在遗忘效果和效用之间很难平衡，因为这些目标具有竞争性质。在不假定可以访问模型权重的情况下，保持遗忘过程在计算上可行是一个被忽视的领域。在这项工作中，我们展示\textit{代理可能是我们进行有效和实用推理时LLM遗忘所需的一切}。我们提出了第一个代理式LLM遗忘（\texttt{ALU}）方法，这是一种多代理、无需重新训练、与模型无关的LLM遗忘方法，实现了有效的遗忘同时保留了效用。我们的\texttt{ALU}框架通过涉及多个LLM代理来进行遗忘，每个代理设计用于遗忘过程中的特定步骤，而不需要为框架中的任何代理更新模型权重。用户可以轻松请求任何一组遗忘实例，\texttt{ALU}可以实时无缝地适应。这是在不需要对底层LLM模型进行任何更改的情况下实现的。通过在已建立的基准（TOFU、WMDP、WPU）和越狱技术（多次射击、目标屏蔽、其他语言）上进行广泛实验，我们证明\texttt{ALU}在当前最先进方法中始终表现出色，成为最强大的推理时LLM遗忘框架，而且无论遗忘目标数量如何，时间成本始终保持有效恒定。我们进一步强调了\texttt{ALU}在大规模评估时与现有方法相比表现出色。具体而言，\texttt{ALU}在多达1000个遗忘目标上进行评估，超出了所有先前提出的LLM遗忘方法的评估范围。

更新时间: 2025-07-08 15:49:01

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2502.00406v2

Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification

We describe DS@GT's second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4x4 tiling strategy that aligns patch size with the network's 518x518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at https://github.com/dsgt-arc/plantclef-2025.

Updated: 2025-07-08 15:35:19

标题: 基于瓷砖的ViT推理与视觉集群先验用于零样本多物种植物识别

摘要: 我们描述了DS@GT在植被象限图像中进行多物种植物识别的PlantCLEF 2025挑战中获得第二名的解决方案。我们的流程结合了以下三个部分：(i) 使用经过微调的Vision Transformer ViTD2PC24All进行补丁级别的推断，(ii) 4x4平铺策略，将补丁大小与网络的518x518感受野对齐，以及(iii) 通过PaCMAP + K-Means视觉聚类和地理位置过滤进行域先验适应。瓷砖预测通过多数投票进行聚合，并使用特定于聚类的贝叶斯先验重新加权，产生私人排行榜上的宏平均F1为0.348，而无需额外的训练。所有代码、配置文件和可重现性脚本均可在https://github.com/dsgt-arc/plantclef-2025 上公开获取。

更新时间: 2025-07-08 15:35:19

领域: cs.CV,cs.IR,cs.LG

下载: http://arxiv.org/abs/2507.06093v1

Taming Data Challenges in ML-based Security Tasks: Lessons from Integrating Generative AI

Machine learning-based supervised classifiers are widely used for security tasks, and their improvement has been largely focused on algorithmic advancements. We argue that data challenges that negatively impact the performance of these classifiers have received limited attention. We address the following research question: Can developments in Generative AI (GenAI) address these data challenges and improve classifier performance? We propose augmenting training datasets with synthetic data generated using GenAI techniques to improve classifier generalization. We evaluate this approach across 7 diverse security tasks using 6 state-of-the-art GenAI methods and introduce a novel GenAI scheme called Nimai that enables highly controlled data synthesis. We find that GenAI techniques can significantly improve the performance of security classifiers, achieving improvements of up to 32.6% even in severely data-constrained settings (only ~180 training samples). Furthermore, we demonstrate that GenAI can facilitate rapid adaptation to concept drift post-deployment, requiring minimal labeling in the adjustment process. Despite successes, our study finds that some GenAI schemes struggle to initialize (train and produce data) on certain security tasks. We also identify characteristics of specific tasks, such as noisy labels, overlapping class distributions, and sparse feature vectors, which hinder performance boost using GenAI. We believe that our study will drive the development of future GenAI tools designed for security tasks.

Updated: 2025-07-08 15:34:45

标题: 驯服基于机器学习的安全任务中的数据挑战：整合生成式人工智能的经验教训

摘要: 基于机器学习的监督分类器广泛用于安全任务，其改进主要集中在算法方面。我们认为负面影响这些分类器性能的数据挑战受到了较少关注。我们探讨以下研究问题：生成式人工智能（GenAI）的发展是否能解决这些数据挑战并提高分类器性能？我们提出使用GenAI技术生成合成数据来改善分类器泛化能力。我们通过使用6种最先进的GenAI方法在7个不同的安全任务中评估这一方法，并介绍一种名为Nimai的新颖GenAI方案，它可以实现高度可控的数据合成。我们发现GenAI技术可以显著提高安全分类器的性能，即使在数据受限的情况下（只有约180个训练样本），也可以实现高达32.6%的改进。此外，我们证明GenAI可以在部署后快速适应概念漂移，需要最少的标记来进行调整过程。尽管取得成功，我们的研究发现一些GenAI方案在某些安全任务上难以初始化（训练和生成数据）。我们还确定了特定任务的特征，如嘈杂的标签、重叠的类分布和稀疏的特征向量，这些特征阻碍了使用GenAI提高性能。我们相信我们的研究将推动未来专为安全任务设计的GenAI工具的发展。

更新时间: 2025-07-08 15:34:45

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.06092v1

CoRE: Enhancing Metacognition with Label-free Self-evaluation in LRMs

Large reasoning models (LRMs) have demonstrated impressive capabilities in domains like mathematics and program synthesis. Despite their strong performance, LRMs often exhibit overthinking -- excessive and redundant reasoning steps that introduce inefficiencies during inference. This phenomenon raises an important question for LRM self-evaluation: How can a model autonomously assess the correctness of its own reasoning trajectory without external labels? To address this, we propose Chain-of-Reasoning Embedding (CoRE), a series of hidden states in latent space to enable label-free self-evaluation on intermediate reasoning steps of LRMs, so as to enhance metacognition abilities for improved reasoning efficiency. By analyzing the geometric properties of the CoRE trajectories, we reveal that redundant reasoning usually presents cyclical fluctuations, which correspond to repetitive and unconscious reflection/exploration. Leveraging this insight, we further introduce a training-free, label-free self-evaluation framework, CoRE-Eval, to detect such patterns and dynamically determine whether to terminate reasoning early. Extensive experiments on mathematical reasoning benchmarks (GSM8K, MATH-500, and AIME) and across model sizes from 7B to 32B demonstrate that CoRE-Eval reduces chain-of-thought length by 13.7% to 33.2% while improving answer accuracy by around 10%, achieving 70.0% accuracy on the challenging AIME benchmark with the 32B model.

Updated: 2025-07-08 15:28:48

标题: CoRE: 在LRMs中通过无标签的自我评估增强元认知

摘要: 大型推理模型（LRMs）在数学和程序综合等领域展示了令人印象深刻的能力。尽管它们表现出色，LRMs经常表现出过度思考 - 过多和多余的推理步骤会在推理过程中引入低效。这种现象引发了一个对LRM自我评估至关重要的问题：模型如何能够在没有外部标签的情况下自主评估其自身推理轨迹的正确性？为了解决这个问题，我们提出了链式推理嵌入（CoRE），这是在潜在空间中的一系列隐藏状态，可以实现对LRMs中间推理步骤的无标签自我评估，以增强元认知能力，从而提高推理效率。通过分析CoRE轨迹的几何特性，我们揭示了冗余推理通常呈现周期性波动，这对应于重复和无意识的反思/探索。利用这一见解，我们进一步引入了一个无需训练、无标签的自我评估框架CoRE-Eval，用于检测这种模式，并动态确定是否提前终止推理。在数学推理基准测试（GSM8K、MATH-500和AIME）上进行了大量实验，跨越了从7B到32B的模型规模，结果表明CoRE-Eval将思维链长度减少了13.7%至33.2%，同时将答案准确率提高了约10%，在32B模型上达到了70.0%的准确率，在具有挑战性的AIME基准测试中表现出色。

更新时间: 2025-07-08 15:28:48

领域: cs.LG

下载: http://arxiv.org/abs/2507.06087v1

The bitter lesson of misuse detection

Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two dimensional: harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak) and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of specialized supervision systems. While they recognize some known jailbreak patterns, their semantic understanding and generalization capabilities are very limited, sometimes with detection rates close to zero when asking a harmful question directly or with a new jailbreak technique such as base64 encoding. Simply asking generalist LLMs if the user question is "harmful or not" largely outperforms these supervisors from the market according to our BELLS score. But frontier LLMs still suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and greater than 50 percent for Mistral Large). These results suggest that simple scaffolding could significantly improve misuse detection robustness, but more research is needed to assess the tradeoffs of such techniques. Our results support the "bitter lesson" of misuse detection: general capabilities of LLMs are necessary to detect a diverse array of misuses and jailbreaks.

Updated: 2025-07-08 15:21:17

标题: 滥用检测的痛苦教训

摘要: 先前的关于越狱检测的工作已经确定了对于LLMs来说对抗性的重要性，但主要集中在模型抵抗对抗性输入和输出安全内容的能力，而不是外部监督系统的有效性。迄今为止，唯一的公开和独立的这些防范措施的基准评估了一组有限情景下的监督者。因此，目前还没有一个全面的公开基准来验证市场上监督系统在现实多样攻击下的表现如何。为了解决这个问题，我们引入了BELLS，一个用于评估LLM监督系统的基准。该框架是二维的：伤害严重程度（良性、边缘、有害）和对抗性复杂性（直接 vs. 越狱），并提供了一个丰富的数据集，涵盖了3个越狱家族和11种伤害类别。我们的评估揭示了专门监督系统的严重限制。虽然它们识别了一些已知的越狱模式，但它们的语义理解和泛化能力非常有限，有时在直接询问有害问题或使用新的越狱技术（如base64编码）时，检测率接近零。根据我们的BELLS分数，简单地向通用LLMs询问用户问题是否“有害或无害”在很大程度上优于市场上的这些监督者。但前沿的LLMs仍然存在元认知不一致的问题，经常对他们正确识别为有害的查询作出响应（Claude 3.7最高达30%，Mistral Large高达50%以上）。这些结果表明，简单的支架可以显著提高滥用检测的稳健性，但需要更多研究来评估这些技术的权衡。我们的结果支持滥用检测的“苦涩教训”：LLMs的一般能力是必要的，以便检测各种滥用和越狱。

更新时间: 2025-07-08 15:21:17

领域: cs.CR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.06282v1

A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models

Traditional Reinforcement Learning (RL) suffers from replicating human-like behaviors, generalizing effectively in multi-agent scenarios, and overcoming inherent interpretability issues.These tasks are compounded when deep environment understanding, agent coordination and dynamic optimization are required. While Large Language Model (LLM) enhanced methods have shown promise in generalization and interoperability, they often neglect necessary multi-agent coordination. Therefore, we introduce the Cascading Cooperative Multi-agent (CCMA) framework, integrating RL for individual interactions, a fine-tuned LLM for regional cooperation, a reward function for global optimization, and the Retrieval-augmented Generation mechanism to dynamically optimize decision-making across complex driving scenarios. Our experiments demonstrate that the CCMA outperforms existing RL methods, demonstrating significant improvements in both micro and macro-level performance in complex driving environments.

Updated: 2025-07-08 15:19:50

标题: 一个级联协作多智能体框架，用于整合大型语言模型的匝道合并控制

摘要: 传统的强化学习（RL）在复制类似人类行为、在多智能体情景中有效泛化以及克服固有的可解释性问题方面存在困难。当需要深入理解环境、智能体协调和动态优化时，这些任务变得更加复杂。虽然大型语言模型（LLM）增强方法在泛化和互操作性方面显示出潜力，但它们通常忽视了必要的多智能体协调。因此，我们引入了级联合作多智能体（CCMA）框架，将RL用于个体互动，对区域合作进行微调的LLM，用于全局优化的奖励函数，以及检索增强生成机制，以动态优化在复杂驾驶情景中的决策制定。我们的实验表明，CCMA优于现有的RL方法，在复杂驾驶环境中表现出了在微观和宏观水平性能方面的显著改善。

更新时间: 2025-07-08 15:19:50

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.08199v2

QS4D: Quantization-aware training for efficient hardware deployment of structured state-space sequential models

Structured State Space models (SSM) have recently emerged as a new class of deep learning models, particularly well-suited for processing long sequences. Their constant memory footprint, in contrast to the linearly scaling memory demands of Transformers, makes them attractive candidates for deployment on resource-constrained edge-computing devices. While recent works have explored the effect of quantization-aware training (QAT) on SSMs, they typically do not address its implications for specialized edge hardware, for example, analog in-memory computing (AIMC) chips. In this work, we demonstrate that QAT can significantly reduce the complexity of SSMs by up to two orders of magnitude across various performance metrics. We analyze the relation between model size and numerical precision, and show that QAT enhances robustness to analog noise and enables structural pruning. Finally, we integrate these techniques to deploy SSMs on a memristive analog in-memory computing substrate and highlight the resulting benefits in terms of computational efficiency.

Updated: 2025-07-08 15:19:14

标题: QS4D：结构状态空间序贯模型的高效硬件部署的量化感知训练

摘要: 结构化状态空间模型（SSM）最近已经成为一类新型深度学习模型，特别适用于处理长序列。与Transformer线性扩展内存需求形成鲜明对比，它们具有恒定的内存占用，使其成为部署在资源受限的边缘计算设备上的理想选择。虽然最近的研究探讨了量化感知训练（QAT）对SSM的影响，但通常未考虑其对专门的边缘硬件（例如模拟内存计算芯片）的影响。在这项工作中，我们展示了QAT可以显著减少SSM的复杂性，各种性能指标可以降低两个数量级。我们分析了模型大小与数值精度之间的关系，并展示了QAT如何增强对模拟噪声的鲁棒性并实现结构修剪。最后，我们将这些技术集成到一个忆阻模拟内存计算基底上部署SSM，并突出了在计算效率方面带来的好处。

更新时间: 2025-07-08 15:19:14

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.06079v1

AI-Based Demand Forecasting and Load Balancing for Optimising Energy use in Healthcare Systems: A real case study

This paper tackles the urgent need for efficient energy management in healthcare facilities, where fluctuating demands challenge operational efficiency and sustainability. Traditional methods often prove inadequate, causing inefficiencies and higher costs. To address this, the study presents an AI-based framework combining Long Short-Term Memory (LSTM), genetic algorithm (GA), and SHAP (Shapley Additive Explanations), specifically designed for healthcare energy management. Although LSTM is widely used for time-series forecasting, its application in healthcare energy prediction remains underexplored. The results reveal that LSTM significantly outperforms ARIMA and Prophet models in forecasting complex, non-linear demand patterns. LSTM achieves a Mean Absolute Error (MAE) of 21.69 and Root Mean Square Error (RMSE) of 29.96, far better than Prophet (MAE: 59.78, RMSE: 81.22) and ARIMA (MAE: 87.73, RMSE: 125.22), demonstrating superior performance. The genetic algorithm is applied to optimize model parameters and improve load balancing strategies, enabling adaptive responses to real-time energy fluctuations. SHAP analysis further enhances model transparency by explaining the influence of different features on predictions, fostering trust in decision-making processes. This integrated LSTM-GA-SHAP approach offers a robust solution for improving forecasting accuracy, boosting energy efficiency, and advancing sustainability in healthcare facilities. Future research may explore real-time deployment and hybridization with reinforcement learning for continuous optimization. Overall, the study establishes a solid foundation for using AI in healthcare energy management, highlighting its scalability, efficiency, and resilience potential.

Updated: 2025-07-08 15:16:50

标题: 基于人工智能的需求预测和负载平衡：优化医疗系统能源使用的真实案例研究

摘要: 本文研究了医疗设施中高效能管理的紧迫需求，其中波动的需求挑战操作效率和可持续性。传统方法通常被证明是不足的，导致低效和更高的成本。为了解决这个问题，本研究提出了一个基于人工智能的框架，结合了长短期记忆（LSTM）、遗传算法（GA）和SHAP（Shapley加法解释），专门设计用于医疗能源管理。尽管LSTM在时间序列预测中被广泛使用，但其在医疗能源预测中的应用仍未充分开发。结果显示，LSTM在预测复杂的非线性需求模式方面明显优于ARIMA和Prophet模型。LSTM实现了21.69的平均绝对误差（MAE）和29.96的根均方误差（RMSE），远远优于Prophet（MAE：59.78，RMSE：81.22）和ARIMA（MAE：87.73，RMSE：125.22），展示了卓越的性能。遗传算法被应用于优化模型参数和改进负载平衡策略，实现对实时能源波动的自适应响应。SHAP分析进一步提高了模型的透明度，解释了不同特征对预测的影响，增强了决策过程的信任。这种综合的LSTM-GA-SHAP方法为提高预测准确性、提升能源效率和推动医疗设施的可持续发展提供了一个强大的解决方案。未来的研究可以探索实时部署和与强化学习的混合以实现持续优化。总的来说，这项研究为在医疗能源管理中使用人工智能奠定了坚实的基础，突显了其可扩展性、效率和韧性潜力。

更新时间: 2025-07-08 15:16:50

领域: cs.AI

下载: http://arxiv.org/abs/2507.06077v1

The Nexus of AR/VR, AI, UI/UX, and Robotics Technologies in Enhancing Learning and Social Interaction for Children with Autism Spectrum Disorders: A Systematic Review

The emergence of large language models (LLMs), augmented reality (AR), and user interface/user experience (UI/UX) design in therapies for children, especially with disorders like autism spectrum disorder (ASD), is studied in detail in this review study. 150 publications were collected by a thorough literature search throughout PubMed, ACM, IEEE Xplore, Elsevier, and Google Scholar; 60 of them were chosen based on their methodological rigor and relevance to the focus area. Three of the primary areas are studied and covered in this review: how AR can improve social and learning results, how LLMs can support communication, and how UI/UX design affects how effective these technologies can be. Results show that while LLMs can provide individualized learning and communication support, AR has shown promise in enhancing social skills, motivation, and attention. For children with ASD, accessible and engaging interventions rely heavily on effective UI/UX design, but there is still a significant lack of robotics-based education and therapeutic programs specifically tailored for autistic children. To optimize the benefits of these technologies in ASD therapies and immersive education, the study emphasizes the need for additional research to address difficulties related to customization, accessibility, and integration.

Updated: 2025-07-08 15:16:05

标题: 《增强学习和社交互动的AR/VR、AI、UI/UX和机器人技术与自闭症谱系障碍儿童的关系：一项系统性综述》

摘要: 这篇综述研究详细研究了大型语言模型（LLMs）、增强现实（AR）和用户界面/用户体验（UI/UX）设计在治疗儿童特别是患有自闭症谱系障碍（ASD）的儿童中的应用。通过在PubMed、ACM、IEEE Xplore、Elsevier和Google Scholar进行彻底的文献搜索，收集了150篇出版物，其中根据其方法论严谨性和与焦点领域的相关性选择了60篇。本综述研究涵盖了三个主要领域：增强现实如何改善社交和学习结果，大型语言模型如何支持沟通，以及用户界面/用户体验设计如何影响这些技术的有效性。结果显示，虽然LLMs可以提供个性化的学习和沟通支持，AR显示了在增强社交技能、提高动机和注意力方面的潜力。对于患有ASD的儿童，可访问且引人入胜的干预措施严重依赖有效的UI/UX设计，但目前仍然缺乏专门针对自闭症儿童定制的基于机器人的教育和治疗项目。为了最大程度地优化这些技术在ASD治疗和沉浸式教育中的益处，该研究强调需要进一步研究以解决与定制、可访问性和整合相关的困难。

更新时间: 2025-07-08 15:16:05

领域: cs.HC,cs.AI,cs.SI

下载: http://arxiv.org/abs/2409.18162v3

The Algorithmic State Architecture (ASA): An Integrated Framework for AI-Enabled Government

As artificial intelligence transforms public sector operations, governments struggle to integrate technological innovations into coherent systems for effective service delivery. This paper introduces the Algorithmic State Architecture (ASA), a novel four-layer framework conceptualising how Digital Public Infrastructure, Data-for-Policy, Algorithmic Government/Governance, and GovTech interact as an integrated system in AI-enabled states. Unlike approaches that treat these as parallel developments, ASA positions them as interdependent layers with specific enabling relationships and feedback mechanisms. Through comparative analysis of implementations in Estonia, Singapore, India, and the UK, we demonstrate how foundational digital infrastructure enables systematic data collection, which powers algorithmic decision-making processes, ultimately manifesting in user-facing services. Our analysis reveals that successful implementations require balanced development across all layers, with particular attention to integration mechanisms between them. The framework contributes to both theory and practice by bridging previously disconnected domains of digital government research, identifying critical dependencies that influence implementation success, and providing a structured approach for analysing the maturity and development pathways of AI-enabled government systems.

Updated: 2025-07-08 15:13:41

标题: 算法国家架构（ASA）：AI支持政府的综合框架

摘要: 随着人工智能改变公共部门运作方式，政府在将技术创新整合到合乎逻辑的系统中以实现有效服务交付方面面临挑战。本文介绍了算法国家架构（ASA），这是一个新颖的四层框架，概念化了数字公共基础设施、数据为政策、算法政府/治理以及政府科技如何作为一个整体系统在人工智能启用的国家中相互作用。与将这些视为并行发展的方法不同，ASA将它们定位为相互依存的层，具有特定的促进关系和反馈机制。通过对爱沙尼亚、新加坡、印度和英国实施情况的比较分析，我们展示了基础数字基础设施如何实现系统性数据收集，这推动了算法决策过程，最终体现在用户服务中。我们的分析显示，成功的实施需要在所有层面上平衡发展，特别关注它们之间的集成机制。该框架通过连接先前无关的数字政府研究领域，识别影响实施成功的关键依赖关系，提供了一个结构化方法，用于分析启用人工智能的政府系统的成熟度和发展路径。

更新时间: 2025-07-08 15:13:41

领域: cs.CY,cs.AI,cs.ET,cs.MA,cs.SY,eess.SY

下载: http://arxiv.org/abs/2503.08725v3

Contrastive and Transfer Learning for Effective Audio Fingerprinting through a Real-World Evaluation Protocol

Recent advances in song identification leverage deep neural networks to learn compact audio fingerprints directly from raw waveforms. While these methods perform well under controlled conditions, their accuracy drops significantly in real-world scenarios where the audio is captured via mobile devices in noisy environments. In this paper, we introduce a novel evaluation protocol designed to better reflect such real-world conditions. We generate three recordings of the same audio, each with increasing levels of noise, captured using a mobile device's microphone. Our results reveal a substantial performance drop for two state-of-the-art CNN-based models under this protocol, compared to previously reported benchmarks. Additionally, we highlight the critical role of the augmentation pipeline during training with contrastive loss. By introduction low pass and high pass filters in the augmentation pipeline we significantly increase the performance of both systems in our proposed evaluation. Furthermore, we develop a transformer-based model with a tailored projection module and demonstrate that transferring knowledge from a semantically relevant domain yields a more robust solution. The transformer architecture outperforms CNN-based models across all noise levels, and query durations. In low noise conditions it achieves 47.99% for 1-sec queries, and 97% for 10-sec queries in finding the correct song, surpassing by 14%, and by 18.5% the second-best performing model, respectively, Under heavy noise levels, we achieve a detection rate 56.5% for 15-second query duration. All experiments are conducted on public large-scale dataset of over 100K songs, with queries matched against a database of 56 million vectors.

Updated: 2025-07-08 15:13:26

标题: 对比学习和迁移学习在通过真实世界评估协议实现有效音频指纹识别方面的应用

摘要: 最近在歌曲识别方面取得的进展利用深度神经网络直接从原始波形中学习紧凑的音频指纹。虽然这些方法在受控条件下表现良好，但它们在真实世界场景中的准确性显著下降，其中音频是通过移动设备在嘈杂环境中捕获的。在本文中，我们引入了一个新颖的评估协议，旨在更好地反映这种真实世界条件。我们生成了同一音频的三个录音，每个录音的噪音水平逐渐增加，使用移动设备的麦克风进行捕获。我们的结果显示，相比先前报道的基准，两种最先进的基于CNN的模型在这个协议下表现出显著的性能下降。此外，我们强调了在对比损失训练过程中增强管道的关键作用。通过在增强管道中引入低通和高通滤波器，我们显著提高了我们提出的评估中两个系统的性能。此外，我们开发了一个基于transformer的模型，具有定制的投影模块，并证明从语义相关领域转移知识会产生更加稳健的解决方案。transformer架构在所有噪声水平和查询持续时间下表现优于基于CNN的模型。在低噪声条件下，它在1秒查询中达到47.99%的正确歌曲查找率，在10秒查询中达到97%，分别超过次佳表现模型的14%和18.5%。在高噪声水平下，我们实现了15秒查询持续时间的检测率为56.5%。所有实验都在一个包含超过10万首歌曲的公共大规模数据集上进行，查询与包含5600万个向量的数据库匹配。

更新时间: 2025-07-08 15:13:26

领域: cs.SD,cs.AI,cs.IR,cs.LG,eess.AS

下载: http://arxiv.org/abs/2507.06070v1

Enhancing Synthetic CT from CBCT via Multimodal Fusion and End-To-End Registration

Cone-Beam Computed Tomography (CBCT) is widely used for intraoperative imaging due to its rapid acquisition and low radiation dose. However, CBCT images typically suffer from artifacts and lower visual quality compared to conventional Computed Tomography (CT). A promising solution is synthetic CT (sCT) generation, where CBCT volumes are translated into the CT domain. In this work, we enhance sCT generation through multimodal learning by jointly leveraging intraoperative CBCT and preoperative CT data. To overcome the inherent misalignment between modalities, we introduce an end-to-end learnable registration module within the sCT pipeline. This model is evaluated on a controlled synthetic dataset, allowing precise manipulation of data quality and alignment parameters. Further, we validate its robustness and generalizability on two real-world clinical datasets. Experimental results demonstrate that integrating registration in multimodal sCT generation improves sCT quality, outperforming baseline multimodal methods in 79 out of 90 evaluation settings. Notably, the improvement is most significant in cases where CBCT quality is low and the preoperative CT is moderately misaligned.

Updated: 2025-07-08 15:10:04

标题: 通过多模态融合和端到端注册增强CBCT的合成CT

摘要: 锥束计算机断层扫描（CBCT）广泛应用于术中成像，因其快速获取和较低的辐射剂量。然而，与传统计算机断层扫描（CT）相比，CBCT图像通常存在伪影和较低的视觉质量。一种有前途的解决方案是合成CT（sCT）生成，其中CBCT体积被转换为CT领域。在这项工作中，我们通过联合利用术中CBCT和术前CT数据，通过多模态学习增强sCT生成。为了克服模态之间固有的错位，我们在sCT流程中引入了一个端到端可学习的注册模块。该模型在一个受控的合成数据集上进行评估，允许精确操纵数据质量和对齐参数。此外，我们验证其在两个真实世界临床数据集上的稳健性和泛化能力。实验结果表明，在多模态sCT生成中集成注册可以提高sCT质量，在90个评估设置中有79个优于基线多模态方法。值得注意的是，在CBCT质量较低且术前CT适度错位的情况下，改进效果最为显著。

更新时间: 2025-07-08 15:10:04

领域: eess.IV,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.06067v1

Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of a static set of questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates. Code and data are available at https://agenticlearning.ai/daily-oracle.

Updated: 2025-07-08 15:08:52

标题: LLM是否具有预知能力？使用每日新闻作为参考的持续评估

摘要: 许多现有的大型语言模型（LLMs）评估基准很快就会因新模型和训练数据的出现而过时。这些基准也在评估LLM性能随时间变化的能力方面存在不足，因为它们由一组没有时间维度的静态问题组成。为了解决这些限制，我们提议使用未来事件预测作为一种连续评估方法，以评估LLMs的时间泛化和预测能力。我们的基准，Daily Oracle，自动生成来自每日新闻的问题-答案（QA）对，挑战LLMs预测“未来”事件结果。我们的研究结果表明，随着预训练数据变得过时，LLM性能随时间降低。尽管检索增强生成（RAG）有潜力提高预测准确性，但性能下降模式仍然存在，突出了连续模型更新的必要性。代码和数据可在https://agenticlearning.ai/daily-oracle上获取。

更新时间: 2025-07-08 15:08:52

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2411.08324v2

Few-Shot Learning by Explicit Physics Integration: An Application to Groundwater Heat Transport

Machine learning methods often struggle with real-world applications in science and engineering due to limited or low-quality training data. In this work, the example of groundwater flow with heat transport is considered; this corresponds to an advection-diffusion process under heterogeneous flow conditions, that is, spatially distributed material parameters and heat sources. Classical numerical simulations are costly and challenging due to high spatio-temporal resolution requirements and large domains. While often computationally more efficient, purely data-driven surrogate models face difficulties, particularly in predicting the advection process, which is highly sensitive to input variations and involves long-range spatial interactions. Therefore, in this work, a Local-Global Convolutional Neural Network (LGCNN) approach is introduced. It combines a lightweight numerical surrogate for the transport process (global) with convolutional neural networks for the groundwater velocity and heat diffusion processes (local). With the LGCNN, a city-wide subsurface temperature field is modeled, involving a heterogeneous groundwater flow field and one hundred groundwater heat pump injection points forming interacting heat plumes over long distances. The model is first systematically analyzed based on random subsurface input fields. Then, the model is trained on a handful of cut-outs from a real-world subsurface map of the Munich region in Germany, and it scales to larger cut-outs without retraining. All datasets, our code, and trained models are published for reproducibility.

Updated: 2025-07-08 15:06:15

标题: 少样本学习通过显式物理整合：地下水热传输的应用

摘要: 机器学习方法在科学和工程领域的实际应用中经常面临训练数据有限或质量低的问题。本文考虑了地下水流与热传输的示例；这对应于在异质流动条件下的对流扩散过程，即空间分布的材料参数和热源。经典的数值模拟由于高时空分辨率要求和大范围领域而昂贵且具有挑战性。虽然纯数据驱动的替代模型通常在计算上更有效，但面临困难，特别是在预测高度敏感于输入变化并涉及长距离空间交互的对流过程时。因此，在本研究中，引入了一种局部-全局卷积神经网络（LGCNN）方法。它将轻量级数值替代模型（全局）与卷积神经网络（局部）结合起来，用于地下水速度和热扩散过程。通过LGCNN，建立了一个城市范围的地下温度场模型，包括异质地下水流场和一百个地下水热泵注入点形成的长距离相互作用热羽。首先，基于随机地下输入场系统分析了该模型。然后，该模型在德国慕尼黑地区真实地下地图的几个切割部分上进行了训练，并且可以在无需重新训练的情况下扩展到更大的切割部分。所有数据集、我们的代码和训练模型都已发布以供再现。

更新时间: 2025-07-08 15:06:15

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2507.06062v1

Estimating prevalence with precision and accuracy

Unlike classification, whose goal is to estimate the class of each data point in a dataset, prevalence estimation or quantification is a task that aims to estimate the distribution of classes in a dataset. The two main tasks in prevalence estimation are to adjust for bias, due to the prevalence in the training dataset, and to quantify the uncertainty in the estimate. The standard methods used to quantify uncertainty in prevalence estimates are bootstrapping and Bayesian quantification methods. It is not clear which approach is ideal in terms of precision (i.e. the width of confidence intervals) and coverage (i.e. the confidence intervals being well-calibrated). Here, we propose Precise Quantifier (PQ), a Bayesian quantifier that is more precise than existing quantifiers and with well-calibrated coverage. We discuss the theory behind PQ and present experiments based on simulated and real-world datasets. Through these experiments, we establish the factors which influence quantification precision: the discriminatory power of the underlying classifier; the size of the labeled dataset used to train the quantifier; and the size of the unlabeled dataset for which prevalence is estimated. Our analysis provides deep insights into uncertainty quantification for quantification learning.

Updated: 2025-07-08 15:06:02

标题: 用精确度和准确度估计患病率

摘要: 与分类不同，其目标是估计数据集中每个数据点的类别，患病率估计或量化是一项旨在估计数据集中类别分布的任务。患病率估计的两个主要任务是调整由于训练数据集中的患病率而产生的偏差，并量化估计中的不确定性。用于量化患病率估计不确定性的标准方法是自举和贝叶斯量化方法。目前尚不清楚哪种方法在精度（即置信区间的宽度）和覆盖范围（即置信区间是否良好校准）方面是最理想的。在这里，我们提出了Precise Quantifier（PQ），这是一种比现有量化器更精确且具有良好校准覆盖率的贝叶斯量化器。我们讨论了PQ背后的理论，并提出基于模拟和真实世界数据集的实验。通过这些实验，我们确定了影响量化精度的因素：基础分类器的区分能力；用于训练量化器的有标签数据集的大小；以及用于估计患病率的未标记数据集的大小。我们的分析为量化学习的不确定性量化提供了深刻的见解。

更新时间: 2025-07-08 15:06:02

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2507.06061v1

VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis

Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.

Updated: 2025-07-08 15:04:17

标题: VisualSpeaker：视觉引导的3D化身唇语合成

摘要: 现实的、高保真的3D面部动画对于人机交互和辅助功能中具有表现力的头像系统至关重要。尽管先前的方法显示出令人期待的质量，但它们依赖于网格域限制了它们充分利用2D计算机视觉和图形中所见的快速视觉创新的能力。我们提出了VisualSpeaker，这是一种新颖的方法，通过使用照片逼真的可微渲染，受视觉语音识别监督，以改善3D面部动画。我们的贡献是一个感知性的唇读损失，通过在训练过程中将照片逼真的3D高斯分布头像渲染通过预训练的Visual自动语音识别模型得出。对MEAD数据集的评估表明，VisualSpeaker将标准的唇顶点误差指标提高了56.1%，并提高了生成动画的感知质量，同时保持了基于网格驱动的动画的可控性。这种感知焦点自然地支持准确的口型，这是在手语头像中消除类似手势的重要线索。

更新时间: 2025-07-08 15:04:17

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.06060v1

Hume: Introducing System-2 Thinking in Visual-Language-Action Model

Humans practice slow thinking before performing actual actions when handling complex tasks in the physical world. This thinking paradigm, recently, has achieved remarkable advancement in boosting Large Language Models (LLMs) to solve complex tasks in digital domains. However, the potential of slow thinking remains largely unexplored for robotic foundation models interacting with the physical world. In this work, we propose Hume: a dual-system Vision-Language-Action (VLA) model with value-guided System-2 thinking and cascaded action denoising, exploring human-like thinking capabilities of Vision-Language-Action models for dexterous robot control. System 2 of Hume implements value-Guided thinking by extending a Vision-Language-Action Model backbone with a novel value-query head to estimate the state-action value of predicted actions. The value-guided thinking is conducted by repeat sampling multiple action candidates and selecting one according to state-action value. System 1 of Hume is a lightweight reactive visuomotor policy that takes System 2 selected action and performs cascaded action denoising for dexterous robot control. At deployment time, System 2 performs value-guided thinking at a low frequency while System 1 asynchronously receives the System 2 selected action candidate and predicts fluid actions in real time. We show that Hume outperforms the existing state-of-the-art Vision-Language-Action models across multiple simulation benchmark and real-robot deployments.

Updated: 2025-07-08 15:03:11

标题: 休谟：在视觉-语言-行动模型中引入系统二思维

摘要: 人类在处理物理世界中的复杂任务时，在执行实际动作之前会进行缓慢思考。最近，这种思考范式在提升大型语言模型（LLMs）解决数字领域复杂任务方面取得了显著进展。然而，缓慢思考的潜力在与物理世界互动的机器人基础模型中仍然大部分未被探索。在这项工作中，我们提出了Hume：一个具有价值引导的System-2思考和级联动作去噪的双系统视觉-语言-动作（VLA）模型，探索人类式思考能力在灵巧机器人控制中的应用。Hume的System 2通过将视觉-语言-动作模型主干扩展为新颖的价值查询头来实现价值引导思考，以估计预测动作的状态-动作价值。价值引导思考通过重复采样多个动作候选项并根据状态-动作价值选择其中一个来进行。Hume的System 1是一个轻量级的反应性视觉运动策略，接受System 2选择的动作并进行级联动作去噪以实现灵巧机器人控制。在部署时，System 2以低频率进行价值引导思考，而System 1异步接收System 2选择的动作候选项并实时预测流畅的动作。我们展示了Hume在多个模拟基准测试和真实机器人部署中优于现有的最先进的视觉-语言-动作模型。

更新时间: 2025-07-08 15:03:11

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2505.21432v4

Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger

Large language models (LLMs) have shown remarkable emergent capabilities, transforming the execution of functional tasks by leveraging external tools for complex problems that require specialized processing or up-to-date data. While existing research expands LLMs access to diverse tools (e.g., program interpreters, search engines, calculators), the necessity of using these tools is often overlooked, leading to indiscriminate tool invocation. This naive approach raises two key issues: increased latency due to unnecessary tool calls, and potential errors resulting from faulty interactions with external tools. In this paper, we introduce meta-cognition as a proxy for LLMs self-assessment of their capabilities, reflecting the model's awareness of its own limitations. Based on this, we propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space, guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs minimal cost. Experiments across multiple backbone models and benchmarks show that MeCo reliably detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.

Updated: 2025-07-08 15:02:59

标题: 大型语言模型中的自适应工具使用与元认知触发

摘要: 大型语言模型（LLMs）展示了出色的新兴能力，通过利用外部工具来转变功能任务的执行，解决需要专门处理或更新数据的复杂问题。虽然现有研究扩展了LLMs对各种工具（例如程序解释器、搜索引擎、计算器）的访问权限，但往往忽视了使用这些工具的必要性，导致工具调用的盲目性。这种幼稚的方法引发了两个关键问题：由于不必要的工具调用而导致的延迟增加，以及与外部工具的错误交互引起的潜在错误。本文介绍了元认知作为LLMs对其能力的自我评估的代理，反映了模型对自身限制的认识。基于此，我们提出了MeCo，一种用于外部工具使用的自适应决策策略。MeCo通过捕获表示空间中的高级认知信号来量化元认知分数，指导何时调用工具。值得注意的是，MeCo无需微调，成本最低。跨多个骨干模型和基准测试的实验表明，MeCo可可靠地检测LLMs的内部认知信号，并显著改善工具使用决策。

更新时间: 2025-07-08 15:02:59

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2502.12961v2

Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or "gibberish", we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).

Updated: 2025-07-08 14:58:28

标题: Entropy-Memorization Law: 评估LLMs中数据的记忆难度

摘要: 大型语言模型（LLMs）被认为会记忆它们的训练数据的部分内容，有时在适当提示时会完全复制内容。在这项工作中，我们研究了一个在记忆领域中基础但尚未深入探讨的问题：如何表征LLMs中训练数据的记忆困难性？通过对OLMo系列开放模型的实证实验，我们提出了熵-记忆定律。它表明数据熵与记忆分数呈线性相关。此外，在一个记忆高度随机化字符串或“胡言乱语”的案例研究中，我们观察到，尽管这些序列表面上是随机的，但与更广泛的训练语料库相比，它们的实证熵意外地较低。采用相同的策略来发现熵-记忆定律，我们推导出一种简单而有效的方法来区分训练和测试数据，实现数据集推断（DI）。

更新时间: 2025-07-08 14:58:28

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.06056v1

Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.

Updated: 2025-07-08 14:57:13

标题: 克服低资源语言生成语言建模中的数据稀缺问题：系统评述

摘要: 生成式语言建模随着ChatGPT和Google Gemini等服务的出现而变得越来越受欢迎。虽然这些模型在提高生产力和沟通方面展现出了变革性潜力，但它们主要面向像英语这样的高资源语言。这加剧了人们对自然语言处理中语言不平等问题的担忧。本文首次系统地回顾了专门针对低资源语言（LRL）中数据稀缺问题的生成式语言建模策略。从54项研究中，我们确定、分类和评估了技术方法，包括单语数据增强、回译、多语言训练和提示工程等，在生成式任务中的应用。我们还分析了架构选择、语言家族表示和评估方法的趋势。我们的研究结果突出了对基于transformer的模型的强烈依赖，对少数LRL的集中关注，以及在研究中缺乏一致的评估。最后，我们提出了将这些方法扩展到更广泛的LRL，并概述了在构建公平的生成式语言系统中面临的挑战。最终，这项回顾旨在支持研究人员和开发人员构建面向被边缘化语言的包容性人工智能工具，这是赋权LRL讲者和在日益受到大规模语言技术塑造的世界中保护语言多样性的必要步骤。

更新时间: 2025-07-08 14:57:13

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2505.04531v2

Kernel Trace Distance: Quantum Statistical Metric between Measures through RKHS Density Operators

Distances between probability distributions are a key component of many statistical machine learning tasks, from two-sample testing to generative modeling, among others. We introduce a novel distance between measures that compares them through a Schatten norm of their kernel covariance operators. We show that this new distance is an integral probability metric that can be framed between a Maximum Mean Discrepancy (MMD) and a Wasserstein distance. In particular, we show that it avoids some pitfalls of MMD, by being more discriminative and robust to the choice of hyperparameters. Moreover, it benefits from some compelling properties of kernel methods, that can avoid the curse of dimensionality for their sample complexity. We provide an algorithm to compute the distance in practice by introducing an extension of kernel matrix for difference of distributions that could be of independent interest. Those advantages are illustrated by robust approximate Bayesian computation under contamination as well as particle flow simulations.

Updated: 2025-07-08 14:56:44

标题: 核跟踪距离：通过RKHS密度算符之间的量子统计度量

摘要: 概率分布之间的距离是许多统计机器学习任务的关键组成部分，包括两样本测试和生成模型等。我们介绍了一种通过核协方差算子的谱范数来比较它们的度量之间的新距离。我们展示了这个新距离是一个积分概率度量，可以在最大均值差异（MMD）和Wasserstein距离之间进行框架化。特别地，我们展示了它避免了MMD的一些缺陷，因为它更具有区分性并且对于超参数的选择更加稳健。此外，它还受益于核方法的一些引人注目的属性，可以避免样本复杂性所带来的维度灾难。我们提供了一种算法来实际计算距离，通过引入一个针对不同分布的核矩阵的扩展，这可能具有独立的兴趣。这些优势在受污染的鲁棒近似贝叶斯计算和粒子流模拟中得到了说明。

更新时间: 2025-07-08 14:56:44

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2507.06055v1

Minimal Deterministic Echo State Networks Outperform Random Reservoirs in Learning Chaotic Dynamics

Machine learning (ML) is widely used to model chaotic systems. Among ML approaches, echo state networks (ESNs) have received considerable attention due to their simple construction and fast training. However, ESN performance is highly sensitive to hyperparameter choices and to its random initialization. In this work, we demonstrate that ESNs constructed using deterministic rules and simple topologies (MESNs) outperform standard ESNs in the task of chaotic attractor reconstruction. We use a dataset of more than 90 chaotic systems to benchmark 10 different minimal deterministic reservoir initializations. We find that MESNs obtain up to a 41% reduction in error compared to standard ESNs. Furthermore, we show that the MESNs are more robust, exhibiting less inter-run variation, and have the ability to reuse hyperparameters across different systems. Our results illustrate how structured simplicity in ESN design can outperform stochastic complexity in learning chaotic dynamics.

Updated: 2025-07-08 14:51:33

标题: 最小确定性回声状态网络在学习混沌动态过程中胜过随机水库

摘要: 机器学习（ML）被广泛用于建模混沌系统。在ML方法中，回声状态网络（ESNs）由于其简单的构造和快速的训练而受到了广泛关注。然而，ESN的性能对超参数选择和随机初始化非常敏感。在这项工作中，我们展示了使用确定性规则和简单拓扑结构（MESNs）构建的ESNs在混沌吸引子重建任务中优于标准ESNs。我们使用超过90个混沌系统的数据集来评估10种不同的最小确定性储层初始化。我们发现与标准ESNs相比，MESNs的误差可以减少高达41％。此外，我们展示了MESNs更具鲁棒性，表现出更少的运行间变化，并且能够在不同系统之间重复使用超参数。我们的结果说明了在ESN设计中结构化简单性如何可以胜过学习混沌动态中的随机复杂性。

更新时间: 2025-07-08 14:51:33

领域: nlin.CD,cs.LG

下载: http://arxiv.org/abs/2507.06050v1

Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

In this work, we propose a novel framework that integrates large language models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a specific goal. By leveraging hierarchical reinforcement learning to model the structured phases of dialogue and employ meta-learning to enhance adaptability across diverse user profiles, our approach enhances adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between dialogue phases, and personalize responses to heterogeneous patient needs. We apply our framework to Motivational Interviews, aiming to foster behavior change, and demonstrate that the proposed dialogue manager outperforms a state-of-the-art LLM baseline in terms of reward, showing a potential benefit of conditioning LLMs to create open-ended dialogue systems with specific goals.

Updated: 2025-07-08 14:47:33

标题: 定制对话超越LLMs：基于RL的对话管理器

摘要: 在这项工作中，我们提出了一个新颖的框架，将大型语言模型（LLMs）与基于RL的对话管理器整合在一起，用于具有特定目标的开放式对话。通过利用分层强化学习来建模对话的结构化阶段，并采用元学习来增强对各种用户配置文件的适应性，我们的方法提高了适应性和效率，使系统能够从有限数据中学习，流畅地在对话阶段之间过渡，并个性化地回应异质患者的需求。我们将我们的框架应用于激励性访谈，旨在促进行为改变，并证明所提出的对话管理器在奖励方面优于最先进的LLM基线，显示了将LLMs调整为具有特定目标的开放式对话系统的潜在好处。

更新时间: 2025-07-08 14:47:33

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2506.19652v2

Neural-Network solver of ideal MHD equilibria

We present a novel approach to compute three-dimensional Magnetohydrodynamic equilibria by parametrizing Fourier modes with artificial neural networks and compare it to equilibria computed by conventional solvers. The full nonlinear global force residual across the volume in real space is then minimized with first order optimizers. Already,we observe competitive computational cost to arrive at the same minimum residuals computed by existing codes. With increased computational cost,lower minima of the residual are achieved by the neural networks,establishing a new lower bound for the force residual. We use minimally complex neural networks,and we expect significant improvements for solving not only single equilibria with neural networks,but also for computing neural network models valid over continuous distributions of equilibria.

Updated: 2025-07-08 14:46:53

标题: "理想磁流体力学平衡的神经网络求解器"

摘要: 我们提出了一种新颖的方法来计算三维磁流体力学平衡，通过使用人工神经网络对傅里叶模式进行参数化，并将其与通过传统求解器计算的平衡进行比较。然后，在实空间中最小化整个体积上的非线性全局力残差，使用一阶优化器。我们已经观察到，与现有代码计算的最小残差相比，我们具有竞争性的计算成本。通过增加计算成本，神经网络可以实现更低的残差最小值，为力残差建立了一个新的下限。我们使用最简单的神经网络，并且我们预计在解决不仅是单一平衡问题，而且是在连续分布的平衡上有效的神经网络模型的计算方面会有显着的改进。

更新时间: 2025-07-08 14:46:53

领域: cs.LG,cs.AI,physics.plasm-ph

下载: http://arxiv.org/abs/2507.03119v2

What's Making That Sound Right Now? Video-centric Audio-Visual Localization

Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.

Updated: 2025-07-08 14:46:46

标题: 现在是什么在发出那个声音？以视频为中心的音频-视觉定位

摘要: 音频-视觉定位（AVL）旨在识别视觉场景中的声音发射源。然而，现有研究集中在图像级别的音频-视觉关联上，未能捕捉时间动态。此外，它们假设声音源始终可见，并且仅涉及单个对象的简化场景。为了解决这些限制，我们提出了AVATAR，这是一个以视频为中心的AVL基准，融入了高分辨率的时间信息。AVATAR引入了四种不同的场景--单声音、混合声音、多实体和屏幕之外--从而实现对AVL模型更全面的评估。此外，我们提出了TAVLO，这是一种新颖的以视频为中心的AVL模型，明确地整合了时间信息。实验结果表明，传统方法由于依赖全局音频特征和帧级映射而难以跟踪时间变化。相反，TAVLO通过利用高分辨率的时间建模实现了稳健和精确的音频-视觉对齐。我们的工作在实证上证明了AVL中时间动态的重要性，并为以视频为中心的音频-视觉定位建立了新的标准。

更新时间: 2025-07-08 14:46:46

领域: cs.CV,cs.AI,cs.MM,cs.SD,eess.AS

下载: http://arxiv.org/abs/2507.04667v2

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Security alignment enables the Large Language Model (LLM) to gain the protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have isolated LLM jailbreak attacks and defenses. We analyze the security protection mechanism of the LLM, and propose a framework that combines attack and defense. Our method is based on the linearly separable property of LLM intermediate layer embedding, as well as the essence of jailbreak attack, which aims to embed harmful problems and transfer them to the safe area. We utilize generative adversarial network (GAN) to learn the security judgment boundary inside the LLM to achieve efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85\% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17\%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security The code and data are available at https://github.com/NLPGM/CAVGAN.

Updated: 2025-07-08 14:45:21

标题: CAVGAN：通过生成对抗攻击LLMs内部表示统一越狱和防御

摘要: 安全对齐使得大型语言模型（LLM）能够获得对恶意查询的保护，但是各种越狱攻击方法揭示了这种安全机制的漏洞。先前的研究已经孤立了LLM越狱攻击和防御。我们分析了LLM的安全保护机制，并提出了一个结合攻击和防御的框架。我们的方法基于LLM中间层嵌入的线性可分性质，以及越狱攻击的本质，旨在嵌入有害问题并将其转移到安全区域。我们利用生成对抗网络（GAN）学习LLM内部的安全判断边界，以实现高效的越狱攻击和防御。实验结果表明，我们的方法在三个流行的LLM上实现了平均越狱成功率为88.85％，而在最先进的越狱数据集上的防御成功率达到平均84.17％。这不仅验证了我们方法的有效性，还为增强模型安全性提供了新的见解。代码和数据可在https://github.com/NLPGM/CAVGAN 上获得。

更新时间: 2025-07-08 14:45:21

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.06043v1

UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

With the rapid development of diffusion models in image generation, the demand for more powerful and flexible controllable frameworks is increasing. Although existing methods can guide generation beyond text prompts, the challenge of effectively combining multiple conditional inputs while maintaining consistency with all of them remains unsolved. To address this, we introduce UniCombine, a DiT-based multi-conditional controllable generative framework capable of handling any combination of conditions, including but not limited to text prompts, spatial maps, and subject images. Specifically, we introduce a novel Conditional MMDiT Attention mechanism and incorporate a trainable LoRA module to build both the training-free and training-based versions. Additionally, we propose a new pipeline to construct SubjectSpatial200K, the first dataset designed for multi-conditional generative tasks covering both the subject-driven and spatially-aligned conditions. Extensive experimental results on multi-conditional generation demonstrate the outstanding universality and powerful capability of our approach with state-of-the-art performance.

Updated: 2025-07-08 14:44:44

标题: UniCombine: 统一的多条件组合与扩散变压器

摘要: 随着图像生成中扩散模型的快速发展，对更强大和灵活的可控框架的需求正在增加。尽管现有方法可以引导生成超越文本提示，但有效地结合多个条件输入并保持与所有条件的一致性的挑战仍未解决。为了解决这个问题，我们引入了UniCombine，这是一个基于DiT的多条件可控生成框架，能够处理任何组合的条件，包括但不限于文本提示、空间地图和主题图像。具体地，我们引入了一种新颖的Conditional MMDiT Attention机制，并结合一个可训练的LoRA模块来构建基于训练和无训练的版本。另外，我们提出了一个新的流程来构建SubjectSpatial200K，这是专为多条件生成任务设计的第一个数据集，涵盖了主题驱动和空间对齐条件。在多条件生成的广泛实验结果中，展示了我们方法的出色普适性和强大性能，达到了行业领先水平。

更新时间: 2025-07-08 14:44:44

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.09277v2

On Lockean beliefs that are deductively closed and minimal change

Within the formal setting of the Lockean thesis, an agent belief set is defined in terms of degrees of confidence and these are described in probabilistic terms. This approach is of established interest, notwithstanding some limitations that make its use troublesome in some contexts, like, for instance, in belief change theory. Precisely, Lockean belief sets are not generally closed under (classical) logical deduction. The aim of the present paper is twofold: on one side we provide two characterizations of those belief sets that are closed under classical logic deduction, and on the other we propose an approach to probabilistic update that allows us for a minimal revision of those beliefs, i.e., a revision obtained by making the fewest possible changes to the existing belief set while still accommodating the new information. In particular, we show how we can deductively close a belief set via a minimal revision.

Updated: 2025-07-08 14:44:01

标题: 关于洛克对于具有演绎封闭性和最小变化性质的信念

摘要: 在洛克论文的正式设置中，一个代理信念集被定义为置信度，并以概率术语描述。尽管存在一些限制使其在某些情境下难以使用，比如在信念变化理论中，但这种方法仍然具有已确立的兴趣。准确地说，洛克信念集通常不在（经典）逻辑推理下封闭。本文旨在实现两个目标：一方面，我们提供了两种封闭于经典逻辑推理下的信念集的特征描述，另一方面，我们提出了一种概率更新方法，允许我们对这些信念进行最小修订，即通过对现有信念集进行尽可能少的修改来容纳新信息。特别是，我们展示了如何通过最小修订来推导封闭于经典逻辑推理下的信念集。

更新时间: 2025-07-08 14:44:01

领域: cs.AI,03B42, 03B48

下载: http://arxiv.org/abs/2507.06042v1

EdgeCodec: Onboard Lightweight High Fidelity Neural Compressor with Residual Vector Quantization

We present EdgeCodec, an end-to-end neural compressor for barometric data collected from wind turbine blades. EdgeCodec leverages a heavily asymmetric autoencoder architecture, trained with a discriminator and enhanced by a Residual Vector Quantizer to maximize compression efficiency. It achieves compression rates between 2'560:1 and 10'240:1 while maintaining a reconstruction error below 3%, and operates in real time on the GAP9 microcontroller with bitrates ranging from 11.25 to 45 bits per second. Bitrates can be selected on a sample-by-sample basis, enabling on-the-fly adaptation to varying network conditions. In its highest compression mode, EdgeCodec reduces the energy consumption of wireless data transmission by up to 2.9x, significantly extending the operational lifetime of deployed sensor units.

Updated: 2025-07-08 14:41:42

标题: EdgeCodec：搭载轻量级高保真神经压缩器与残差矢量量化

摘要: 我们提出了EdgeCodec，这是一种端到端的神经压缩器，用于从风力涡轮机叶片收集的气压数据。EdgeCodec利用了一个严重不对称的自动编码器架构，通过一个鉴别器进行训练，并通过一个残差矢量量化器进行增强，以最大化压缩效率。它实现了在保持重建误差低于3%的情况下，压缩比在2'560:1到10'240:1之间，并且在GAP9微控制器上实时运行，比特率在每秒11.25到45位之间变化。比特率可以根据样本选择，实现对网络条件变化的即时适应。在其最高压缩模式下，EdgeCodec将无线数据传输的能耗降低了高达2.9倍，显著延长了部署传感器单元的运行寿命。

更新时间: 2025-07-08 14:41:42

领域: cs.LG

下载: http://arxiv.org/abs/2507.06040v1

Enter, Exit, Page Fault, Leak: Testing Isolation Boundaries for Microarchitectural Leaks

CPUs provide isolation mechanisms like virtualization and privilege levels to protect software. Yet these focus on architectural isolation while typically overlooking microarchitectural side channels, exemplified by Meltdown and Foreshadow. Software must therefore supplement architectural defenses with ad-hoc microarchitectural patches, which are constantly evolving as new attacks emerge and defenses are proposed. Such reactive approach makes ensuring complete isolation a daunting task, and leaves room for errors and oversights. We address this problem by developing a tool that stress tests microarchitectural isolation between security domains such as virtual machines, kernel, and processes, with the goal of detecting flaws in the isolation boundaries. The tool extends model-based relational testing (MRT) methodology to enable detection of cross-domain information leakage. We design a new test case generator and execution sandbox to handle multi-domain execution, new leakage models to encode expected leaks, and new analysis techniques to manage nondeterminism. We use this tool to perform an in-depth testing campaign on six x86-64 CPUs for leakage across different isolation boundaries. The testing campaign exposed four new leaks and corroborated numerous known ones, with only two false positives throughout the entire campaign. These results show critical gaps in current isolation mechanisms as well as validate a robust methodology for detecting microarchitectural flaws. As such, this approach enables a shift from reactive patching to proactive security validation in processor design.

Updated: 2025-07-08 14:41:18

标题: 进入、退出、页面错误、泄漏：测试微架构泄漏的隔离边界

摘要: 中央处理器提供诸如虚拟化和特权级别等隔离机制，以保护软件。然而，这些机制侧重于体系结构隔离，通常忽视微体系结构的侧信道，如Meltdown和Foreshadow所示。因此，软件必须通过特定的微体系结构补丁补充体系结构防御，随着新攻击不断出现和防御方案被提出，这些补丁不断发展。这种反应性方法使得确保完全隔离成为一项艰巨的任务，并为错误和疏忽留下了空间。为了解决这个问题，我们开发了一个工具，用于在安全域之间进行微体系结构隔离的压力测试，比如虚拟机、内核和进程，旨在检测隔离边界中的缺陷。该工具扩展了基于模型的关系测试（MRT）方法，以实现跨域信息泄漏的检测。我们设计了一个新的测试用例生成器和执行沙箱来处理多域执行，新的泄漏模型来编码预期泄漏，以及新的分析技术来管理不确定性。我们使用这个工具对六个x86-64 CPU进行了深入的泄漏测试，跨不同隔离边界。测试活动暴露了四个新泄漏，并证实了许多已知的泄漏，整个活动中仅有两个误报。这些结果显示了当前隔离机制中的关键漏洞，同时验证了一种用于检测微体系结构缺陷的强大方法。因此，这种方法使得从被动补丁到主动安全验证在处理器设计中成为可能。

更新时间: 2025-07-08 14:41:18

领域: cs.CR

下载: http://arxiv.org/abs/2507.06039v1

TextPixs: Glyph-Conditioned Diffusion with Character-Aware Attention and OCR-Guided Supervision

The modern text-to-image diffusion models boom has opened a new era in digital content production as it has proven the previously unseen ability to produce photorealistic and stylistically diverse imagery based on the semantics of natural-language descriptions. However, the consistent disadvantage of these models is that they cannot generate readable, meaningful, and correctly spelled text in generated images, which significantly limits the use of practical purposes like advertising, learning, and creative design. This paper introduces a new framework, namely Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), using which a typical diffusion backbone is extended by three well-designed modules. To begin with, the model has a dual-stream text encoder that encodes both semantic contextual information and explicit glyph representations, resulting in a character-aware representation of the input text that is rich in nature. Second, an attention mechanism that is aware of the character is proposed with a new attention segregation loss that aims to limit the attention distribution of each character independently in order to avoid distortion artifacts. Lastly, GCDA has an OCR-in-the-loop fine-tuning phase, where a full text perceptual loss, directly optimises models to be legible and accurately spell. Large scale experiments to benchmark datasets, such as MARIO-10M and T2I-CompBench, reveal that GCDA sets a new state-of-the-art on all metrics, with better character based metrics on text rendering (Character Error Rate: 0.08 vs 0.21 for the previous best; Word Error Rate: 0.15 vs 0.25), human perception, and comparable image synthesis quality on high-fidelity (FID: 14.3).

Updated: 2025-07-08 14:35:02

标题: TextPixs：以字形为条件的扩散，带有字符感知注意力和OCR引导监督

摘要: 现代文本到图像扩散模型的繁荣开启了数字内容生产的新时代，因为它已经证明了以自然语言描述的语义为基础产生逼真和风格多样的图像的先前未曾见过的能力。然而，这些模型的一贯劣势是它们无法在生成的图像中生成可读、有意义且拼写正确的文本，这极大地限制了广告、学习和创意设计等实际用途的使用。本文介绍了一个新的框架，即基于字形条件的扩散与字符感知注意力（GCDA），使用这个框架，一个典型的扩散骨干被三个精心设计的模块扩展。首先，模型具有双流文本编码器，用于编码语义上下文信息和明确的字形表示，从而产生一个丰富多样的字符感知输入文本表示。其次，提出了一个考虑字符的注意力机制，其中包含一个新的注意力分离损失，旨在独立限制每个字符的注意力分布，以避免畸变伪影。最后，GCDA具有一个OCR环路微调阶段，其中一个完整的文本感知损失直接优化模型以使其易读且正确拼写。对基准数据集进行的大规模实验，例如MARIO-10M和T2I-CompBench，揭示了GCDA在所有指标上树立了新的行业标准，具有更好的基于字符的文本渲染指标（字符错误率：0.08 vs之前最佳的0.21；单词错误率：0.15 vs 0.25），人类感知和在高保真度上可比的图像合成质量（FID：14.3）。

更新时间: 2025-07-08 14:35:02

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.06033v1

Empirical evidence of Large Language Model's influence on human spoken communication

From the invention of writing and the printing press, to television and social media, human history is punctuated by major innovations in communication technology, which fundamentally altered how ideas spread and reshaped our culture. Recent chatbots powered by generative artificial intelligence constitute a novel medium that encodes cultural patterns in their neural representations and disseminates them in conversations with hundreds of millions of people. Understanding whether these patterns transmit into human language, and ultimately shape human culture, is a fundamental question. While fully quantifying the causal impact of a chatbot like ChatGPT on human culture is very challenging, lexicographic shift in human spoken communication may offer an early indicator of such broad phenomenon. Here, we apply econometric causal inference techniques to 740,249 hours of human discourse from 360,445 YouTube academic talks and 771,591 conversational podcast episodes across multiple disciplines. We detect a measurable and abrupt increase in the use of words preferentially generated by ChatGPT, such as delve, comprehend, boast, swift, and meticulous, after its release. These findings suggest a scenario where machines, originally trained on human data and subsequently exhibiting their own cultural traits, can, in turn, measurably reshape human culture. This marks the beginning of a closed cultural feedback loop in which cultural traits circulate bidirectionally between humans and machines. Our results motivate further research into the evolution of human-machine culture, and raise concerns over the erosion of linguistic and cultural diversity, and the risks of scalable manipulation.

Updated: 2025-07-08 14:34:57

标题: 大型语言模型对人类口语交流的影响的实证证据

摘要: 从写作和印刷术的发明，到电视和社交媒体，人类历史被重大的通讯技术创新所标志，这些技术根本改变了思想传播的方式，并重塑了我们的文化。最近由生成式人工智能驱动的聊天机器人构成了一种新颖的媒介，它们在其神经表示中编码文化模式，并在与数百万人的对话中传播这些模式。了解这些模式是否传播到人类语言，并最终塑造人类文化，是一个基本问题。虽然充分量化像ChatGPT这样的聊天机器人对人类文化的因果影响是非常具有挑战性的，但人类口语交流中的词汇转变可能提供了这种广泛现象的早期指标。在这里，我们应用计量因果推断技术对来自360,445个YouTube学术演讲和771,591个跨多学科领域的对话播客剧集的740,249小时的人类话语进行分析。我们发现在ChatGPT发布后，诸如深入、理解、夸耀、迅速和细致等ChatGPT优先生成的词汇的使用出现了可测量且突然的增加。这些发现暗示了一种情景，即机器，最初在人类数据上接受训练，随后展示出自己的文化特征，反过来可以明显地重塑人类文化。这标志着一个封闭文化反馈循环的开始，其中文化特征在人类和机器之间双向流通。我们的结果激励进一步研究人机文化的演变，并引发对语言和文化多样性侵蚀以及可扩展操纵风险的担忧。

更新时间: 2025-07-08 14:34:57

领域: cs.CY,cs.AI,cs.CL,cs.HC

下载: http://arxiv.org/abs/2409.01754v3

Efficient Federated Learning with Timely Update Dissemination

Federated Learning (FL) has emerged as a compelling methodology for the management of distributed data, marked by significant advancements in recent years. In this paper, we propose an efficient FL approach that capitalizes on additional downlink bandwidth resources to ensure timely update dissemination. Initially, we implement this strategy within an asynchronous framework, introducing the Asynchronous Staleness-aware Model Update (FedASMU), which integrates both server-side and device-side methodologies. On the server side, we present an asynchronous FL system model that employs a dynamic model aggregation technique, which harmonizes local model updates with the global model to enhance both accuracy and efficiency. Concurrently, on the device side, we propose an adaptive model adjustment mechanism that integrates the latest global model with local models during training to further elevate accuracy. Subsequently, we extend this approach to a synchronous context, referred to as FedSSMU. Theoretical analyses substantiate the convergence of our proposed methodologies. Extensive experiments, encompassing six models and five public datasets, demonstrate that FedASMU and FedSSMU significantly surpass baseline methods in terms of both accuracy (up to 145.87%) and efficiency (up to 97.59%).

Updated: 2025-07-08 14:34:32

标题: 高效的联邦学习与及时更新传播

摘要: 联邦学习（FL）已经成为分布式数据管理中一种引人注目的方法论，近年来取得了显著进展。在本文中，我们提出了一种利用额外下行带宽资源的高效FL方法，以确保及时更新传播。最初，我们在异步框架内实现了这种策略，引入了异步陈旧感知模型更新（FedASMU），该方法整合了服务器端和设备端的方法论。在服务器端，我们提出了一个采用动态模型聚合技术的异步FL系统模型，该技术协调本地模型更新与全局模型，以提高准确性和效率。同时，在设备端，我们提出了一种自适应模型调整机制，该机制在训练期间将最新的全局模型与本地模型整合，进一步提高准确性。随后，我们将这种方法扩展到同步环境，称为FedSSMU。理论分析证实了我们提出的方法论的收敛性。广泛的实验，涵盖六个模型和五个公共数据集，表明FedASMU和FedSSMU在准确性（高达145.87%）和效率（高达97.59%）方面显著超越了基线方法。

更新时间: 2025-07-08 14:34:32

领域: cs.DC,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.06031v1

Feature-Guided Neighbor Selection for Non-Expert Evaluation of Model Predictions

Explainable AI (XAI) methods often struggle to generate clear, interpretable outputs for users without domain expertise. We introduce Feature-Guided Neighbor Selection (FGNS), a post hoc method that enhances interpretability by selecting class-representative examples using both local and global feature importance. In a user study (N = 98) evaluating Kannada script classifications, FGNS significantly improved non-experts' ability to identify model errors while maintaining appropriate agreement with correct predictions. Participants made faster and more accurate decisions compared to those given traditional k-NN explanations. Quantitative analysis shows that FGNS selects neighbors that better reflect class characteristics rather than merely minimizing feature-space distance, leading to more consistent selection and tighter clustering around class prototypes. These results support FGNS as a step toward more human-aligned model assessment, although further work is needed to address the gap between explanation quality and perceived trust.

Updated: 2025-07-08 14:32:25

标题: 特征引导的邻居选择用于模型预测的非专家评估

摘要: 可解释性人工智能（XAI）方法常常难以为没有领域专业知识的用户生成清晰、可解释的输出。我们引入了特征引导的邻居选择（FGNS），这是一种事后方法，通过同时利用局部和全局特征重要性来选择代表性类别示例，从而增强可解释性。在一项用户研究中（N = 98），评估卡纳达文字分类时，FGNS显著提高了非专家识别模型错误的能力，同时与正确预测保持适当一致。与传统的k-NN解释相比，参与者做出了更快、更准确的决策。定量分析表明，FGNS选择更能反映类别特征而不仅仅是最小化特征空间距离的邻居，从而使选择更加一致并且更紧密地聚集在类原型周围。这些结果支持FGNS作为迈向更加与人类对齐的模型评估的一步，尽管仍需要进一步工作来解决解释质量与感知信任之间的差距。

更新时间: 2025-07-08 14:32:25

领域: cs.AI

下载: http://arxiv.org/abs/2507.06029v1

Multi-view mid fusion: a universal approach for learning in an HDLSS setting

The high-dimensional low-sample-size (HDLSS) setting presents significant challenges in various applications where the feature dimension far exceeds the number of available samples. This paper introduces a universal approach for learning in HDLSS setting using multi-view mid fusion techniques. It shows how existing mid fusion multi-view methods perform well in an HDLSS setting even if no inherent views are provided. Three view construction methods are proposed that split the high-dimensional feature vectors into smaller subsets, each representing a different view. Extensive experimental validation across model-types and learning tasks confirm the effectiveness and generalization of the approach. We believe the work in this paper lays the foundation for further research into the universal benefits of multi-view mid fusion learning.

Updated: 2025-07-08 14:31:53

标题: 多视角中间融合：一种在高维稀疏学习环境中的通用方法

摘要: 在高维低样本量（HDLSS）设置中，特征维度远远超过可用样本数量，这在各种应用中都存在重大挑战。本文介绍了一种在HDLSS设置中使用多视角中融合技术进行学习的通用方法。它展示了即使没有提供固有视角，现有的中融合多视角方法在HDLSS设置中也能表现良好。提出了三种视角构建方法，将高维特征向量分割为较小的子集，每个子集代表一个不同的视角。通过跨模型类型和学习任务的广泛实验验证证实了这种方法的有效性和泛化能力。我们相信本文的工作为进一步研究多视角中融合学习的通用益处奠定了基础。

更新时间: 2025-07-08 14:31:53

领域: cs.LG

下载: http://arxiv.org/abs/2507.06026v1

Kamae: Bridging Spark and Keras for Seamless ML Preprocessing

In production recommender systems, feature preprocessing must be faithfully replicated across training and inference environments. This often requires duplicating logic between offline and online environments, increasing engineering effort and introducing risks of dataset shift. We present Kamae, an open-source Python library that bridges this gap by translating PySpark preprocessing pipelines into equivalent Keras models. Kamae provides a suite of configurable Spark transformers and estimators, each mapped to a corresponding Keras layer, enabling consistent, end-to-end preprocessing across the ML lifecycle. Framework's utility is illustrated on real-world use cases, including MovieLens dataset and Expedia's Learning-to-Rank pipelines. The code is available at https://github.com/ExpediaGroup/kamae.

Updated: 2025-07-08 14:30:10

标题: Kamae: 将Spark和Keras桥接以实现无缝的机器学习预处理

摘要: 在生产推荐系统中，特征预处理必须在训练和推断环境中保持一致。这通常需要在离线和在线环境之间复制逻辑，增加工程工作量并引入数据集转移的风险。我们提出了Kamae，一个开源的Python库，通过将PySpark预处理流水线转换为等效的Keras模型来弥合这一差距。Kamae提供一套可配置的Spark转换器和估计器，每个都映射到相应的Keras层，从而实现整个ML生命周期中一致的端到端预处理。该框架的实用性在真实世界的用例中得到了展示，包括MovieLens数据集和Expedia的学习排名管道。代码可以在https://github.com/ExpediaGroup/kamae 上找到。

更新时间: 2025-07-08 14:30:10

领域: cs.LG

下载: http://arxiv.org/abs/2507.06021v1

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S

Updated: 2025-07-08 14:28:55

标题: OpenS2S：推进完全开源的端到端共情大型语音语言模型

摘要: 情感互动是人机交流的基石，因为需要理解充满语调暗示的语音并产生情感和表达性回应。然而，最强大的情感式LSLMs越来越封闭，使得关于架构、数据和开发的关键细节对研究人员不透明。鉴于对LSLMs和情感行为透明研究的重要性，我们提出了OpenS2S，这是一个完全开源、透明和端到端的LSLM，旨在实现情感语音交互。基于我们的情感语音转文本模型BLSP-Emo，OpenS2S进一步采用流式交错解码架构来实现低延迟语音生成。为了促进端到端训练，OpenS2S集成了一个自动化数据构建流程，以低成本合成各种高质量的情感语音对话。通过利用大型语言模型生成情感内容和可控文本到语音系统引入说话者和情感变化，我们构建了一个具有丰富语调多样性和最小人类监督的可扩展训练语料库。我们发布了完全开源的OpenS2S模型，包括数据集、模型权重、预训练和微调代码，以赋予更广泛的研究社区力量，并加速情感语音系统的创新。项目网页可访问https://casia-lm.github.io/OpenS2S。

更新时间: 2025-07-08 14:28:55

领域: cs.CL,cs.AI,cs.SD,eess.AS

下载: http://arxiv.org/abs/2507.05177v2

Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning

T cell receptor (TCR) repertoires encode critical immunological signatures for autoimmune diseases, yet their clinical application remains limited by sequence sparsity and low witness rates. We developed EAMil, a multi-instance deep learning framework that leverages TCR sequencing data to diagnose systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) with exceptional accuracy. By integrating PrimeSeq feature extraction with ESMonehot encoding and enhanced gate attention mechanisms, our model achieved state-of-the-art performance with AUCs of 98.95% for SLE and 97.76% for RA. EAMil successfully identified disease-associated genes with over 90% concordance with established differential analyses and effectively distinguished disease-specific TCR genes. The model demonstrated robustness in classifying multiple disease categories, utilizing the SLEDAI score to stratify SLE patients by disease severity as well as to diagnose the site of damage in SLE patients, and effectively controlling for confounding factors such as age and gender. This interpretable framework for immune receptor analysis provides new insights for autoimmune disease detection and classification with broad potential clinical applications across immune-mediated conditions.

Updated: 2025-07-08 14:28:38

标题: 使用多模态多实例学习从外周血TCR重组中分类自身免疫疾病

摘要: T细胞受体（TCR）库编码了自身免疫性疾病的关键免疫特征，然而其临床应用受到序列稀疏性和低见证率的限制。我们开发了EAMil，这是一个多实例深度学习框架，利用TCR测序数据以极高的准确性诊断系统性红斑狼疮（SLE）和类风湿关节炎（RA）。通过将PrimeSeq特征提取与ESMonehot编码和增强门注意机制相结合，我们的模型在SLE和RA的AUC为98.95%和97.76%的情况下实现了最先进的性能。EAMil成功识别出与疾病相关的基因，与已建立的差异分析具有超过90%的一致性，并有效区分了疾病特定的TCR基因。该模型在分类多种疾病类别方面表现出了稳定性，利用SLEDAI评分对SLE患者按疾病严重程度进行分层，并诊断SLE患者的损害部位，并有效控制了年龄和性别等混杂因素。这种可解释的免疫受体分析框架为自身免疫性疾病的检测和分类提供了新的见解，具有广泛的潜在临床应用，适用于各种免疫介导的情况。

更新时间: 2025-07-08 14:28:38

领域: cs.LG,cs.AI,q-bio.GN

下载: http://arxiv.org/abs/2507.04981v2

CogniSQL-R1-Zero: Lightweight Reinforced Reasoning for Efficient SQL Generation

Translating natural language into SQL (Text-to-SQL) remains a core challenge at the intersection of language understanding and structured data access. Although large language models (LLMs) have improved fluency, generating correct and executable SQL, especially for complex queries, continues to be challenging. We introduce CogniSQL-R1-Zero, a reinforcement learning (RL) framework and model that produces accurate SQL using a lightweight reward signal based on execution correctness and format-tag compliance. By avoiding intermediate supervision, hybrid pipelines and complex reward shaping, our method encourages stable learning and stronger alignment with the ultimate task objective-producing executable programs. CogniSQL-R1-Zero achieves state-of-the-art execution accuracy on Text2SQL benchmark; BIRD bench, outperforming prior supervised and instruction-tuned baselines including SFT CodeS-7B, DeepSeek-Coder 236B, and Mistral 123B-despite being trained on a significantly smaller 7B backbone. This result underscores the scalability and efficiency of our RL-based approach when trained on just four NVIDIA A100 GPUs (40 GB VRAM each). To support further research in efficient and interpretable Text-to-SQL modeling, we release two curated datasets: (i) a collection of 5,024 reasoning traces with varying context lengths, and (ii) a positive-sampled corpus of 36,356 corpus of weakly supervised queries, each annotated with six semantically diverse reasoning paths. Together, these contributions advance scalable, execution-aligned Text-to-SQL generation.

Updated: 2025-07-08 14:17:07

标题: CogniSQL-R1-Zero: 轻量级强化推理以提高SQL生成效率

摘要: 将自然语言翻译成SQL（文本到SQL）仍然是语言理解和结构化数据访问交集的核心挑战。尽管大型语言模型（LLMs）已经提高了流畅度，但生成正确和可执行的SQL，特别是对于复杂查询，仍然具有挑战性。我们引入了CogniSQL-R1-Zero，这是一个基于强化学习（RL）框架和模型，通过基于执行正确性和格式标签符合性的轻量级奖励信号生成准确的SQL。通过避免中间监督、混合管道和复杂奖励塑造，我们的方法鼓励稳定学习，并与最终任务目标-生成可执行程序更强的对齐。CogniSQL-R1-Zero在Text2SQL基准测试中实现了最先进的执行准确性；BIRD基准，优于之前的监督和指令调整基线，包括SFT CodeS-7B，DeepSeek-Coder 236B和Mistral 123B-尽管只在一个显著较小的7B基础上训练。这个结果强调了我们基于RL的方法在仅在四个NVIDIA A100 GPU（每个40GB VRAM）上训练时的可扩展性和效率。为了支持高效且可解释的文本到SQL建模的进一步研究，我们发布了两个精心策划的数据集：（i）一系列包含5,024个推理轨迹，具有不同上下文长度，以及（ii）一个积极采样的语料库，包含36,356个弱监督查询语料库，每个查询都注释了六个语义多样的推理路径。这些贡献共同推动了可扩展的、与执行对齐的文本到SQL生成。

更新时间: 2025-07-08 14:17:07

领域: cs.AI

下载: http://arxiv.org/abs/2507.06013v1

Instance-Optimal Quantum State Certification with Entangled Measurements

We consider the task of quantum state certification: given a description of a hypothesis state $\sigma$ and multiple copies of an unknown state $\rho$, a tester aims to determine whether the two states are equal or $\epsilon$-far in trace distance. It is known that $\Theta(d/\epsilon^2)$ copies of $\rho$ are necessary and sufficient for this task, assuming the tester can make entangled measurements over all copies [CHW07,OW15,BOW19]. However, these bounds are for a worst-case $\sigma$, and it is not known what the optimal copy complexity is for this problem on an instance-by-instance basis. While such instance-optimal bounds have previously been shown for quantum state certification when the tester is limited to measurements unentangled across copies [CLO22,CLHL22], they remained open when testers are unrestricted in the kind of measurements they can perform. We address this open question by proving nearly instance-optimal bounds for quantum state certification when the tester can perform fully entangled measurements. Analogously to the unentangled setting, we show that the optimal copy complexity for certifying $\sigma$ is given by the worst-case complexity times the fidelity between $\sigma$ and the maximally mixed state. We prove our lower bounds using a novel quantum analogue of the Ingster-Suslina method, which is likely to be of independent interest. This method also allows us to recover the $\Omega(d/\epsilon^2)$ lower bound for mixedness testing [OW15], i.e., certification of the maximally mixed state, with a surprisingly simple proof.

Updated: 2025-07-08 14:15:46

标题: 用纠缠测量实现最优量子态认证

摘要: 我们考虑量子态认证的任务：给定一个假设态 $\sigma$ 的描述和多个未知态 $\rho$ 的副本，一个测试者的目标是确定这两个态是相等还是在迹距离上相距 $\epsilon$ 。已知，在测试者可以在所有副本上进行纠缠测量的情况下，$\Theta(d/\epsilon^2)$ 个 $\rho$ 的副本对于这个任务是必要且充分的 [CHW07,OW15,BOW19]。然而，这些界限是针对最坏情况下的 $\sigma$，目前尚不清楚在每个具体实例上，对于这个问题的最佳副本复杂度是多少。虽然在测试者只能进行非纠缠测量时，先前已经展示了量子态认证的实例最优界限 [CLO22,CLHL22]，但在测试者可以进行任意种类的测量时，这个问题仍然未解。我们通过证明几乎是实例最优的界限来解决这个未解问题，即在测试者可以进行完全纠缠测量时的量子态认证。类似于非纠缠设置，我们展示了认证 $\sigma$ 的最佳副本复杂度由最坏情况复杂度乘以 $\sigma$ 与最大混合态之间的保真度给出。我们使用一种新颖的量子版本的Ingster-Suslina方法证明了我们的下界，这可能是独立感兴趣的。这种方法还使我们能够用一个令人惊讶地简单的证明恢复混合度测试 [OW15] 的 $\Omega(d/\epsilon^2)$ 下界，即最大混合态的认证。

更新时间: 2025-07-08 14:15:46

领域: quant-ph,cs.DS,cs.LG

下载: http://arxiv.org/abs/2507.06010v1

KnowIt: Deep Time Series Modeling and Interpretation

KnowIt (Knowledge discovery in time series data) is a flexible framework for building deep time series models and interpreting them. It is implemented as a Python toolkit, with source code and documentation available from https://must-deep-learning.github.io/KnowIt. It imposes minimal assumptions about task specifications and decouples the definition of dataset, deep neural network architecture, and interpretability technique through well defined interfaces. This ensures the ease of importing new datasets, custom architectures, and the definition of different interpretability paradigms while maintaining on-the-fly modeling and interpretation of different aspects of a user's own time series data. KnowIt aims to provide an environment where users can perform knowledge discovery on their own complex time series data through building powerful deep learning models and explaining their behavior. With ongoing development, collaboration and application our goal is to make this a platform to progress this underexplored field and produce a trusted tool for deep time series modeling.

Updated: 2025-07-08 14:14:05

标题: KnowIt：深度时间序列建模和解释

摘要: KnowIt（时间序列数据的知识发现）是一个灵活的框架，用于构建深度时间序列模型并解释它们。它被实现为一个Python工具包，源代码和文档可从https://must-deep-learning.github.io/KnowIt 获取。它对任务规范做出了最少的假设，并通过明确定义的接口解耦数据集、深度神经网络架构和可解释性技术。这确保了轻松导入新数据集、自定义架构和定义不同解释范式的方便性，同时保持对用户自己的时间序列数据不同方面进行即时建模和解释。KnowIt旨在提供一个环境，让用户可以通过构建强大的深度学习模型并解释其行为来对其复杂的时间序列数据进行知识发现。通过持续的开发、合作和应用，我们的目标是将其打造成一个推动这一未开发领域并产生可信赖的深度时间序列建模工具的平台。

更新时间: 2025-07-08 14:14:05

领域: cs.LG

下载: http://arxiv.org/abs/2507.06009v1

The Impact of Event Data Partitioning on Privacy-aware Process Discovery

Information systems support the execution of business processes. The event logs of these executions generally contain sensitive information about customers, patients, and employees. The corresponding privacy challenges can be addressed by anonymizing the event logs while still retaining utility for process discovery. However, trading off utility and privacy is difficult: the higher the complexity of event log, the higher the loss of utility by anonymization. In this work, we propose a pipeline that combines anonymization and event data partitioning, where event abstraction is utilized for partitioning. By leveraging event abstraction, event logs can be segmented into multiple parts, allowing each sub-log to be anonymized separately. This pipeline preserves privacy while mitigating the loss of utility. To validate our approach, we study the impact of event partitioning on two anonymization techniques using three real-world event logs and two process discovery techniques. Our results demonstrate that event partitioning can bring improvements in process discovery utility for directly-follows-based anonymization techniques.

Updated: 2025-07-08 14:13:44

标题: 事件数据分区对隐私感知过程发现的影响

摘要: 信息系统支持业务流程的执行。这些执行的事件日志通常包含有关客户、患者和员工的敏感信息。相应的隐私挑战可以通过对事件日志进行匿名化来解决，同时仍保留用于流程发现的效用。然而，在效用和隐私之间进行权衡是困难的：事件日志的复杂性越高，匿名化造成的效用损失就越大。在这项工作中，我们提出了一个结合匿名化和事件数据分区的流程，其中事件抽象用于分区。通过利用事件抽象，事件日志可以分割成多个部分，使每个子日志可以分别进行匿名化。这个流程在保护隐私的同时减轻了效用损失。为了验证我们的方法，我们研究了事件分区对两种匿名化技术的影响，使用了三个真实世界的事件日志和两种流程发现技术。我们的结果表明，事件分区可以提高基于直接跟随的匿名化技术的流程发现效用。

更新时间: 2025-07-08 14:13:44

领域: cs.CR,cs.AI,cs.DB

下载: http://arxiv.org/abs/2507.06008v1

The GenAI Generation: Student Views of Awareness, Preparedness, and Concern

Generative Artificial Intelligence (GenAI) is revolutionizing education and workforce development, profoundly shaping how students learn, engage, and prepare for their future. Outpacing the development of uniform policies and structures, GenAI has heralded a unique era and given rise to the GenAI Generation. We define the GenAI Generation as a cohort of students whose education has been increasingly shaped by the opportunities and challenges GenAI presents during its widespread adoption within society. This study examines students' perceptions of GenAI through a concise survey with optional open-ended questions, focusing on their awareness, preparedness, and concerns. Notably, readiness appears increasingly tied to exposure to GenAI through one's coursework. Students with greater curricular exposure to GenAI tend to feel more prepared, while those without it more often express vulnerability and uncertainty, highlighting a new and growing divide in readiness that goes beyond traditional disciplinary boundaries. Evaluation of more than 250 responses, with over 40% providing detailed qualitative feedback, reveals a core dual sentiment: while most students express enthusiasm for GenAI, an even greater proportion voice a spectrum of concerns about ethics, job displacement, and the adequacy of educational structures given the highly transformative technology. These findings offer critical insights into how students view the potential and pitfalls of GenAI for future career impacts. The challenge ahead involves implementing associated recommendations for educational institutions, moving beyond the baseline of access toward more informed guidance on the use of these tools, while preserving critical thinking, ethical reasoning, and adaptive learning.

Updated: 2025-07-08 14:05:37

标题: 《GenAI一代：学生对意识、准备和关注的看法》

摘要: 生成人工智能（GenAI）正在彻底改变教育和劳动力发展，深刻地影响着学生学习、参与和为未来做准备的方式。在统一政策和结构的发展之上，GenAI已经宣告了一个独特时代的到来，同时也催生了GenAI一代。我们将GenAI一代定义为一群学生，他们的教育越来越多地受到GenAI在社会中广泛应用所带来的机遇和挑战的影响。本研究通过一项简洁调查，着重关注学生对GenAI的认识、准备和关注，来考察学生对GenAI的看法。值得注意的是，准备程度似乎越来越与个人在课程中接触GenAI的频率相关。接触GenAI课程更多的学生倾向于感到更为充分准备，而没有接触的学生更常表达出脆弱和不确定，突显了一种新的且不断扩大的准备程度分歧，超越了传统的学科边界。对250多个回复进行评估，其中超过40%提供了详细的定性反馈，揭示了一个核心的双重情绪：虽然大多数学生对GenAI表示热情，但更多的学生对伦理、工作替换以及鉴于高度变革性技术的教育结构充分性提出了一系列关切。这些发现为我们提供了学生如何看待GenAI对未来职业影响的潜力和陷阱的关键见解。未来的挑战在于实施与教育机构相关的建议，超越获取的基础，朝着更加明智地指导这些工具的使用，同时保留批判性思维、伦理推理和适应性学习。

更新时间: 2025-07-08 14:05:37

领域: cs.HC,cs.AI,cs.CY

下载: http://arxiv.org/abs/2505.02230v2

Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics

Learning to sample from complex unnormalized distributions over discrete domains emerged as a promising research direction with applications in statistical physics, variational inference, and combinatorial optimization. Recent work has demonstrated the potential of diffusion models in this domain. However, existing methods face limitations in memory scaling and thus the number of attainable diffusion steps since they require backpropagation through the entire generative process. To overcome these limitations we introduce two novel training methods for discrete diffusion samplers, one grounded in the policy gradient theorem and the other one leveraging Self-Normalized Neural Importance Sampling (SN-NIS). These methods yield memory-efficient training and achieve state-of-the-art results in unsupervised combinatorial optimization. Numerous scientific applications additionally require the ability of unbiased sampling. We introduce adaptations of SN-NIS and Neural Markov Chain Monte Carlo that enable for the first time the application of discrete diffusion models to this problem. We validate our methods on Ising model benchmarks and find that they outperform popular autoregressive approaches. Our work opens new avenues for applying diffusion models to a wide range of scientific applications in discrete domains that were hitherto restricted to exact likelihood models.

Updated: 2025-07-08 14:03:25

标题: 可扩展的离散扩散取样器：组合优化和统计物理学

摘要: 学习从复杂的未归一化分布中进行采样在离散域中是一个具有应用前景的研究方向，可以在统计物理学、变分推断和组合优化中应用。最近的研究表明扩散模型在这一领域具有潜力。然而，现有方法在内存扩展和可达扩散步数方面存在限制，因为它们需要通过整个生成过程进行反向传播。为了克服这些限制，我们引入了两种新的离散扩散采样器训练方法，一种基于策略梯度定理，另一种利用自归一化神经重要性采样（SN-NIS）。这些方法实现了内存高效的训练，并在无监督组合优化中取得了最先进的结果。许多科学应用还需要无偏采样的能力。我们引入了SN-NIS和神经马尔可夫链蒙特卡罗的改进版本，首次将离散扩散模型应用于这一问题。我们在Ising模型基准上验证了我们的方法，并发现它们优于流行的自回归方法。我们的工作为将扩散模型应用于一系列科学应用开辟了新途径，这些应用以前局限于精确似然模型。

更新时间: 2025-07-08 14:03:25

领域: cs.LG,cond-mat.stat-mech,cs.AI,physics.comp-ph,stat.ML

下载: http://arxiv.org/abs/2502.08696v3

An Optimal Transport Perspective on Unpaired Image Super-Resolution

Real-world image super-resolution (SR) tasks often do not have paired datasets, which limits the application of supervised techniques. As a result, the tasks are usually approached by unpaired techniques based on Generative Adversarial Networks (GANs), which yield complex training losses with several regularization terms, e.g., content or identity losses. While GANs usually provide good practical performance, they are used heuristically, i.e., theoretical understanding of their behaviour is yet rather limited. We theoretically investigate optimization problems which arise in such models and find two surprising observations. First, the learned SR map is always an optimal transport (OT) map. Second, we theoretically prove and empirically show that the learned map is biased, i.e., it does not actually transform the distribution of low-resolution images to high-resolution ones. Inspired by these findings, we investigate recent advances in neural OT field to resolve the bias issue. We establish an intriguing connection between regularized GANs and neural OT approaches. We show that unlike the existing GAN-based alternatives, these algorithms aim to learn an unbiased OT map. We empirically demonstrate our findings via a series of synthetic and real-world unpaired SR experiments. Our source code is publicly available at https://github.com/milenagazdieva/OT-Super-Resolution.

Updated: 2025-07-08 14:01:59

标题: 一个关于非配对图像超分辨率的最优传输视角

摘要: 真实世界中的图像超分辨率（SR）任务通常没有配对的数据集，这限制了监督技术的应用。因此，这些任务通常通过基于生成对抗网络（GANs）的无配对技术来处理，这些技术产生了包含多个正则化项的复杂训练损失，例如内容或身份损失。虽然GANs通常提供良好的实际性能，但它们是启发式地使用的，即对它们的行为的理论理解仍然相当有限。我们在理论上研究了这些模型中出现的优化问题，并发现了两个令人惊讶的观察结果。首先，学习的SR映射始终是一个最优传输（OT）映射。其次，我们在理论上证明并实证表明，学习的映射是有偏的，即它实际上并不会将低分辨率图像的分布转换为高分辨率图像。受到这些发现的启发，我们研究了神经OT领域的最新进展，以解决偏差问题。我们建立了正规化GANs和神经OT方法之间有趣的联系。我们表明，与现有基于GAN的替代方法不同，这些算法旨在学习一个无偏的OT映射。我们通过一系列合成和真实世界的无配对SR实验在实证上展示了我们的研究发现。我们的源代码可以在https://github.com/milenagazdieva/OT-Super-Resolution上公开获取。

更新时间: 2025-07-08 14:01:59

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2202.01116v3

DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting

Accurate predictions of spatio-temporal systems are crucial for tasks such as system management, control, and crisis prevention. However, the inherent time variance of many spatio-temporal systems poses challenges to achieving accurate predictions whenever stationarity is not granted. In order to address non-stationarity, we propose a Distribution and Relation Adaptive Network (DRAN) capable of dynamically adapting to relation and distribution changes over time. While temporal normalization and de-normalization are frequently used techniques to adapt to distribution shifts, this operation is not suitable for the spatio-temporal context as temporal normalization scales the time series of nodes and possibly disrupts the spatial relations among nodes. In order to address this problem, a Spatial Factor Learner (SFL) module is developed that enables the normalization and de-normalization process. To adapt to dynamic changes in spatial relationships among sensors, we propose a Dynamic-Static Fusion Learner (DSFL) module that effectively integrates features learned from both dynamic and static relations through an adaptive fusion ratio mechanism. Furthermore, we introduce a Stochastic Learner to capture the noisy components of spatio-temporal representations. Our approach outperforms state-of-the-art methods on weather prediction and traffic flow forecasting tasks.Experimental results show that our SFL efficiently preserves spatial relationships across various temporal normalization operations. Visualizations of the learned dynamic and static relations demonstrate that DSFL can capture both local and distant relationships between nodes.

Updated: 2025-07-08 13:56:41

标题: DRAN：一种用于时空预测的分布和关系自适应网络

摘要: 空间-时间系统的准确预测对于诸如系统管理、控制和危机预防等任务至关重要。然而，许多空间-时间系统固有的时间变化性质使得在不保证平稳性时实现准确预测面临挑战。为了解决非平稳性问题，我们提出了一种能够动态适应关系和分布变化的分布和关系自适应网络（DRAN）。虽然时间归一化和反归一化是用于适应分布转移的常用技术，但这种操作并不适用于空间-时间环境，因为时间归一化会缩放节点的时间序列，并可能破坏节点之间的空间关系。为了解决这个问题，我们开发了一个空间因子学习器（SFL）模块，使得归一化和反归一化过程成为可能。为了适应传感器之间空间关系的动态变化，我们提出了一个动态-静态融合学习器（DSFL）模块，通过自适应融合比例机制有效地整合了从动态和静态关系中学习到的特征。此外，我们引入了一个随机学习器来捕捉空间-时间表示中的噪声成分。我们的方法在天气预测和交通流预测任务上优于现有方法。实验结果表明，我们的SFL在各种时间归一化操作中有效保留了空间关系。学习到的动态和静态关系的可视化展示了DSFL能够捕捉节点之间的局部和远程关系。

更新时间: 2025-07-08 13:56:41

领域: cs.LG

下载: http://arxiv.org/abs/2504.01531v2

Beating the Best Constant Rebalancing Portfolio in Long-Term Investment: A Generalization of the Kelly Criterion and Universal Learning Algorithm for Markets with Serial Dependence

In the online portfolio optimization framework, existing learning algorithms generate strategies that yield significantly poorer cumulative wealth compared to the best constant rebalancing portfolio in hindsight, despite being consistent in asymptotic growth rate. While this unappealing performance can be improved by incorporating more side information, it raises difficulties in feature selection and high-dimensional settings. Instead, the inherent serial dependence of assets' returns, such as day-of-the-week and other calendar effects, can be leveraged. Although latent serial dependence patterns are commonly detected using large training datasets, this paper proposes an algorithm that learns such dependence using only gradually revealed data, without any assumption on their distribution, to form a strategy that eventually exceeds the cumulative wealth of the best constant rebalancing portfolio. Moreover, the classical Kelly criterion, which requires independent assets' returns, is generalized to accommodate serial dependence in a market modeled as an independent and identically distributed process of random matrices. In such a stochastic market, where existing learning algorithms designed for stationary processes fail to apply, the proposed learning algorithm still generates a strategy that asymptotically grows to the highest rate among all strategies, matching that of the optimal strategy constructed under the generalized Kelly criterion. The experimental results with real market data demonstrate the theoretical guarantees of the algorithm and its performance as expected, as long as serial dependence is significant, regardless of the validity of the generalized Kelly criterion in the experimental market. This further affirms the broad applicability of the algorithm in general contexts.

Updated: 2025-07-08 13:54:14

标题: 击败最佳定期再平衡组合的长期投资：对具有串行依赖性的市场的凯利准则和通用学习算法的推广

摘要: 在在线投资组合优化框架中，现有的学习算法生成的策略在累积财富方面明显不如事后最好的恒定再平衡投资组合，尽管在渐近增长率上是一致的。虽然通过加入更多的侧面信息可以改善这种不理想的性能，但在特征选择和高维设置方面却带来了困难。相反，资产回报的固有串行依赖性，例如星期几和其他日历效应，可以被利用。尽管通常使用大型训练数据集来检测潜在的串行依赖模式，本文提出了一种算法，仅使用逐渐揭示的数据来学习这种依赖性，而不对它们的分布做任何假设，以形成一种最终超过最佳恒定再平衡投资组合累积财富的策略。此外，传统的凯利准则，需要独立资产回报，被推广以适应市场模型为随机矩阵的独立同分布过程中的串行依赖。在这样一个随机市场中，现有的针对稳态过程设计的学习算法无法应用，而提出的学习算法仍然生成一个渐近增长到所有策略中最高速率的策略，与在广义凯利准则下构建的最优策略相匹配。使用真实市场数据的实验结果展示了该算法的理论保证和其性能如预期那样，只要串行依赖性显著，实验市场中广义凯利准则的有效性无关紧要。这进一步证实了该算法在一般情境中的广泛适用性。

更新时间: 2025-07-08 13:54:14

领域: q-fin.PM,cs.IT,cs.LG,math.IT,q-fin.CP

下载: http://arxiv.org/abs/2507.05994v1

Exploring Partial Multi-Label Learning via Integrating Semantic Co-occurrence Knowledge

Partial multi-label learning aims to extract knowledge from incompletely annotated data, which includes known correct labels, known incorrect labels, and unknown labels. The core challenge lies in accurately identifying the ambiguous relationships between labels and instances. In this paper, we emphasize that matching co-occurrence patterns between labels and instances is key to addressing this challenge. To this end, we propose Semantic Co-occurrence Insight Network (SCINet), a novel and effective framework for partial multi-label learning. Specifically, SCINet introduces a bi-dominant prompter module, which leverages an off-the-shelf multimodal model to capture text-image correlations and enhance semantic alignment. To reinforce instance-label interdependencies, we develop a cross-modality fusion module that jointly models inter-label correlations, inter-instance relationships, and co-occurrence patterns across instance-label assignments. Moreover, we propose an intrinsic semantic augmentation strategy that enhances the model's understanding of intrinsic data semantics by applying diverse image transformations, thereby fostering a synergistic relationship between label confidence and sample difficulty. Extensive experiments on four widely-used benchmark datasets demonstrate that SCINet surpasses state-of-the-art methods.

Updated: 2025-07-08 13:53:28

标题: 通过整合语义共现知识探索部分多标签学习

摘要: 多标签学习旨在从不完全标记的数据中提取知识，其中包括已知正确标签、已知错误标签和未知标签。其核心挑战在于准确识别标签和实例之间的模糊关系。本文强调匹配标签和实例之间的共现模式是解决这一挑战的关键。为此，我们提出了一种新颖有效的部分多标签学习框架——Semantic Co-occurrence Insight Network（SCINet）。具体来说，SCINet引入了一个双主导提示器模块，利用现成的多模态模型捕获文本-图像相关性并增强语义对齐。为了加强实例-标签间的相互依赖关系，我们开发了一个跨模态融合模块，共同建模标签间的相关性、实例间的关系，以及实例-标签分配间的共现模式。此外，我们提出了一种内在语义增强策略，通过应用多样化的图像转换增强模型对内在数据语义的理解，从而促进标签置信度与样本难度之间的协同关系。在四个广泛使用的基准数据集上进行的大量实验表明，SCINet超越了最先进的方法。

更新时间: 2025-07-08 13:53:28

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.05992v1

Counterfactual Inference under Thompson Sampling

Recommender systems exemplify sequential decision-making under uncertainty, strategically deciding what content to serve to users, to optimise a range of potential objectives. To balance the explore-exploit trade-off successfully, Thompson sampling provides a natural and widespread paradigm to probabilistically select which action to take. Questions of causal and counterfactual inference, which underpin use-cases like offline evaluation, are not straightforward to answer in these contexts. Specifically, whilst most existing estimators rely on action propensities, these are not readily available under Thompson sampling procedures. We derive exact and efficiently computable expressions for action propensities under a variety of parameter and outcome distributions, enabling the use of off-policy estimators in Thompson sampling scenarios. This opens up a range of practical use-cases where counterfactual inference is crucial, including unbiased offline evaluation of recommender systems, as well as general applications of causal inference in online advertising, personalisation, and beyond.

Updated: 2025-07-08 13:52:12

标题: Thompson抽样下的反事实推断

摘要: 推荐系统展示了在不确定性下的顺序决策，战略性地决定向用户提供什么内容，以优化一系列潜在目标。为了成功平衡探索和利用的权衡，汤普森抽样提供了一种自然和广泛的范例，以概率地选择要采取的行动。在这些情境中，因果推断和反事实推断的问题，支撑着离线评估等用例，不容易回答。具体而言，虽然大多数现有的估计器依赖于行动趋势，但在汤普森抽样程序下这些并不容易获得。我们推导出在各种参数和结果分布下行动趋势的精确且高效计算表达式，使得可以在汤普森抽样情境中使用离线策略估计器。这打开了一系列实际用例，其中反事实推断至关重要，包括对推荐系统的无偏离线评估，以及在线广告、个性化等应用中的因果推断。

更新时间: 2025-07-08 13:52:12

领域: cs.IR,cs.LG,stat.ME

下载: http://arxiv.org/abs/2504.08773v2

A Survey of Multi Agent Reinforcement Learning: Federated Learning and Cooperative and Noncooperative Decentralized Regimes

The increasing interest in research and innovation towards the development of autonomous agents presents a number of complex yet important scenarios of multiple AI Agents interacting with each other in an environment. The particular setting can be understood as exhibiting three possibly topologies of interaction - centrally coordinated cooperation, ad-hoc interaction and cooperation, and settings with noncooperative incentive structures. This article presents a comprehensive survey of all three domains, defined under the formalism of Federal Reinforcement Learning (RL), Decentralized RL, and Noncooperative RL, respectively. Highlighting the structural similarities and distinctions, we review the state of the art in these subjects, primarily explored and developed only recently in the literature. We include the formulations as well as known theoretical guarantees and highlights and limitations of numerical performance.

Updated: 2025-07-08 13:47:40

标题: 一项多代理强化学习调查：联邦学习以及合作和非合作分散式制度

摘要: 对于自主代理的研究和创新越来越受到关注，这导致了多个人工智能代理在环境中相互交互的复杂而重要的场景。这种特定情境可以被理解为展示三种可能的交互拓扑结构 - 中心协调合作，临时互动和合作，以及具有非合作激励结构的设置。本文在联邦强化学习（RL），分散式RL和非合作RL的形式主义下，分别对这三个领域进行了全面调查。我们重点介绍了这些主题的现状，这些主题最近才在文献中探讨和发展。我们包括公式以及已知的理论保证和数值表现的亮点和局限性。

更新时间: 2025-07-08 13:47:40

领域: cs.MA,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.06278v1

Hita: Holistic Tokenizer for Autoregressive Image Generation

Vanilla autoregressive image generation models generate visual tokens step-by-step, limiting their ability to capture holistic relationships among token sequences. Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce \textit{Hita}, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Hita incorporates two key strategies to better align with the AR generation process: 1) {arranging} a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, and using causal attention to maintain awareness of previous tokens; and 2) adopting a lightweight fusion module before feeding the de-quantized tokens into the decoder to control information flow and prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving \textbf{2.59 FID} and \textbf{281.9 IS} on the ImageNet benchmark. Detailed analysis of the holistic representation highlights its ability to capture global image properties, such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at \href{https://github.com/CVMI-Lab/Hita}{https://github.com/CVMI-Lab/Hita}.

Updated: 2025-07-08 13:43:13

标题: Hita：自回归图像生成的整体分词器

摘要: 香草自回归图像生成模型逐步生成视觉令牌，限制了其捕捉令牌序列之间整体关系的能力。此外，由于大多数视觉令牌化器将局部图像补丁映射为潜在令牌，全局信息受到限制。为了解决这个问题，我们引入了一种新颖的用于自回归（AR）图像生成的图像分词器Hita。它引入了一种从整体到本地的分词方案，具有可学习的整体查询和本地补丁令牌。Hita采用了两个关键策略，以更好地与AR生成过程对齐：1）将整体令牌放在开头，其后是补丁级别令牌，使用因果关注力以保持对先前令牌的意识；2）在将去量化的令牌馈送到解码器之前采用轻量级融合模块，以控制信息流并优先考虑整体令牌。大量实验证明，Hita加快了AR生成器的训练速度，并优于使用香草分词器训练的模型，在ImageNet基准测试中取得了2.59 FID和281.9 IS。对整体表示的详细分析突出了其捕捉全局图像属性（如纹理、材料和形状）的能力。此外，Hita还在零样式迁移和图像修复方面展现了有效性。代码可在https://github.com/CVMI-Lab/Hita 上找到。

更新时间: 2025-07-08 13:43:13

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.02358v3

Robust Speech-Workload Estimation for Intelligent Human-Robot Systems

Demanding task environments (e.g., supervising a remotely piloted aircraft) require performing tasks quickly and accurately; however, periods of low and high operator workload can decrease task performance. Intelligent modulation of the system's demands and interaction modality in response to changes in operator workload state may increase performance by avoiding undesirable workload states. This system requires real-time estimation of each workload component (i.e., cognitive, physical, visual, speech, and auditory) to adapt the correct modality. Existing workload systems estimate multiple workload components post-hoc, but few estimate speech workload, or function in real-time. An algorithm to estimate speech workload and mitigate undesirable workload states in real-time is presented. An analysis of the algorithm's accuracy is presented, along with the results demonstrating the algorithm's generalizability across individuals and human-machine teaming paradigms. Real-time speech workload estimation is a crucial element towards developing adaptive human-machine systems.

Updated: 2025-07-08 13:41:59

标题: 智能人机系统中的稳健语音工作量估计

摘要: 要求高的任务环境（例如，监控远程驾驶飞行器）需要快速准确地执行任务；然而，低和高的操作员工作负荷可能会降低任务绩效。智能调节系统的需求和交互方式以响应操作员工作负荷状态的变化，可以通过避免不良工作负荷状态来提高绩效。该系统需要实时估计每个工作负荷成分（即，认知、身体、视觉、语音和听觉），以适应正确的模态。现有的工作负荷系统事后估计多个工作负荷成分，但很少估计语音工作负荷，或实时运行。提出了一种实时估计语音工作负荷并减轻不良工作负荷状态的算法。提出了对算法准确性的分析，并展示了该算法在个体和人机团队范式中的泛化能力的结果。实时语音工作负荷估计是发展自适应人机系统的关键因素。

更新时间: 2025-07-08 13:41:59

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2507.05985v1

Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening

Static tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC = 0.91; 45% identical). Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. Mean ratings (0-10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p < 0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.

Updated: 2025-07-08 13:41:22

标题: 《HopeBot的开发与评估：基于LLM的用于结构化和互动式PHQ-9抑郁筛查的聊天机器人》

摘要: 静态工具如患者健康问卷-9（PHQ-9）有效地筛查抑郁症，但缺乏互动性和适应性。我们开发了HopeBot，一个由大型语言模型（LLM）驱动的聊天机器人，使用检索增强生成和实时澄清来管理PHQ-9。在一项受试者内研究中，来自英国和中国的132名成年人完成了自我管理和聊天机器人版本。得分表明强有力的一致性（ICC = 0.91；45%相同）。在提供比较反馈的75名参与者中，71%报告更加信任聊天机器人，强调了更清晰的结构、解释性指导和支持性语气。舒适度、语音清晰度、处理敏感话题和推荐有用性的平均评分（0-10）分别为8.4、7.7、7.6和7.4；后者在就业状况和先前心理健康服务使用方面有显著差异（p < 0.05）。总体而言，87.1%的人表示愿意重新使用或推荐HopeBot。这些发现表明基于语音的LLM聊天机器人可以作为可扩展、低负担的例行抑郁症筛查的辅助手段。

更新时间: 2025-07-08 13:41:22

领域: cs.AI,cs.CL,cs.HC

下载: http://arxiv.org/abs/2507.05984v1

RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore's unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.

Updated: 2025-07-08 13:37:25

标题: RabakBench：扩展人类注释以构建低资源语言的本地化多语言安全基准

摘要: 大型语言模型（LLMs）及其安全分类器在低资源语言上通常表现不佳，原因是受限于训练数据和评估基准。本文介绍了RabakBench，这是一个新的多语言安全基准，定位于新加坡独特的语言环境，涵盖了新加坡英语、中文、马来语和泰米尔语。RabakBench是通过可扩展的三阶段流程构建的：（i）生成 - 通过增加真实的新加坡英语网络内容与LLM驱动的红队合作生成对抗性示例；（ii）标记 - 使用多数投票的LLM标注器对安全性进行半自动多标签注释，与人类判断保持一致；（iii）翻译 - 保留跨语言的语言细微差别和有毒性的高保真度翻译。最终数据集包含超过5,000个涵盖四种语言和六种细粒度安全类别及严重级别的标记示例。对11个流行的开源和闭源防护栏分类器的评估显示出显著的性能下降。RabakBench不仅能够在东南亚多语言环境中实现强大的安全评估，还为在低资源环境中构建本地化安全数据集提供了可复制的框架。基准数据集，包括经人工验证的翻译和评估代码，均可公开获取。

更新时间: 2025-07-08 13:37:25

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.05980v1

CoDy: Counterfactual Explainers for Dynamic Graphs

Temporal Graph Neural Networks (TGNNs) are widely used to model dynamic systems where relationships and features evolve over time. Although TGNNs demonstrate strong predictive capabilities in these domains, their complex architectures pose significant challenges for explainability. Counterfactual explanation methods provide a promising solution by illustrating how modifications to input graphs can influence model predictions. To address this challenge, we present CoDy, Counterfactual Explainer for Dynamic Graphs, a model-agnostic, instance-level explanation approach that identifies counterfactual subgraphs to interpret TGNN predictions. CoDy employs a search algorithm that combines Monte Carlo Tree Search with heuristic selection policies, efficiently exploring a vast search space of potential explanatory subgraphs by leveraging spatial, temporal, and local event impact information. Extensive experiments against state-of-the-art factual and counterfactual baselines demonstrate CoDy's effectiveness, with improvements of 16% in AUFSC+ over the strongest baseline.

Updated: 2025-07-08 13:36:25

标题: CoDy: 动态图的反事实解释器

摘要: 时间图神经网络（TGNNs）被广泛用于建模动态系统，其中关系和特征随时间演变。尽管TGNNs在这些领域展示了强大的预测能力，但它们复杂的架构给可解释性带来了重大挑战。反事实解释方法通过说明如何修改输入图可以影响模型预测，提供了一种有希望的解决方案。为了解决这一挑战，我们提出了CoDy，即Counterfactual Explainer for Dynamic Graphs，这是一种模型无关的、实例级的解释方法，用于解释TGNN的预测结果。CoDy采用了一种搜索算法，将蒙特卡罗树搜索与启发式选择策略相结合，通过利用空间、时间和本地事件影响信息，高效地探索潜在解释子图的广阔搜索空间。与最先进的基准事实和反事实进行广泛实验证明CoDy的有效性，在AUFSC+上改进了16%。

更新时间: 2025-07-08 13:36:25

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2403.16846v2

Enhancing the Interpretability of Rule-based Explanations through Information Retrieval

The lack of transparency of data-driven Artificial Intelligence techniques limits their interpretability and acceptance into healthcare decision-making processes. We propose an attribution-based approach to improve the interpretability of Explainable AI-based predictions in the specific context of arm lymphedema's risk assessment after lymph nodal radiotherapy in breast cancer. The proposed method performs a statistical analysis of the attributes in the rule-based prediction model using standard metrics from Information Retrieval techniques. This analysis computes the relevance of each attribute to the prediction and provides users with interpretable information about the impact of risk factors. The results of a user study that compared the output generated by the proposed approach with the raw output of the Explainable AI model suggested higher levels of interpretability and usefulness in the context of predicting lymphedema risk.

Updated: 2025-07-08 13:32:50

标题: 通过信息检索提高基于规则解释的可解释性

摘要: 数据驱动的人工智能技术缺乏透明度，限制了它们在医疗决策过程中的可解释性和接受度。我们提出了一种基于归因的方法，旨在改善可解释人工智能预测在乳腺癌淋巴结放疗后上肢淋巴水肿风险评估的特定背景下的可解释性。该方法通过使用信息检索技术的标准度量对基于规则的预测模型中的属性进行统计分析。这种分析计算每个属性对预测的相关性，并为用户提供关于风险因素影响的可解释信息。与可解释人工智能模型的原始输出相比，通过用户研究比较提出方法生成的输出结果表明，在预测淋巴水肿风险的背景下，具有更高水平的可解释性和实用性。

更新时间: 2025-07-08 13:32:50

领域: cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.05976v1

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

Robot sports, characterized by well-defined objectives, explicit rules, and dynamic interactions, present ideal scenarios for demonstrating embodied intelligence. In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn-based interaction structure, and agile 3D maneuvering. Competitive and cooperative gameplay challenges each drone to coordinate with its teammates while anticipating and countering opposing teams' tactics. Turn-based interaction demands precise timing, accurate state prediction, and management of long-horizon temporal dependencies. Agile 3D maneuvering requires rapid accelerations, sharp turns, and precise 3D positioning despite the quadrotor's underactuated dynamics. These intertwined features yield a complex problem combining motion control and strategic play, with no available expert demonstrations. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative multi-agent reinforcement learning (MARL) and game-theoretic algorithms. Simulation results show that on-policy reinforcement learning (RL) methods outperform off-policy methods in single-agent tasks, but both approaches struggle in complex tasks that combine motion control and strategic play. We additionally design a hierarchical policy which achieves a 69.5% percent win rate against the strongest baseline in the 3 vs 3 task, underscoring its potential as an effective solution for tackling the complex interplay between low-level control and high-level strategy. The project page is at https://sites.google.com/view/thu-volleybots.

Updated: 2025-07-08 13:30:59

标题: VolleyBots：一个结合运动控制和战略游戏的多无人机排球比赛测试平台

摘要: 机器人运动，具有明确定义的目标、明确的规则和动态的互动，为展示具有体现智能提供了理想场景。本文介绍了VolleyBots，这是一个新颖的机器人运动测试平台，多个无人机在体育排球比赛中合作和竞争。VolleyBots在一个统一平台内整合了三个特点：竞争和合作游戏、轮流互动结构和敏捷的三维机动。竞争和合作游戏挑战每个无人机与其队友协调，同时预测和对抗对方团队的战术。轮流互动要求精确的时机、准确的状态预测和管理长期时间依赖性。敏捷的三维机动要求快速加速、急转弯和精确的三维定位，尽管四旋翼飞行器的动力学是欠驱动的。这些交织在一起的特点产生了一个结合运动控制和战略游戏的复杂问题，没有可用的专家示范。我们提供了一套全面的任务，从单个无人机训练到多个无人机的合作和竞争任务，同时伴随着代表性多智体强化学习（MARL）和博弈论算法的基线评估。模拟结果表明，在单一代理任务中，基于政策的强化学习（RL）方法表现优于基于脱机的方法，但在结合运动控制和战略游戏的复杂任务中，这两种方法都面临困难。我们另外设计了一个分层策略，在3对3任务中以69.5%的胜率击败了最强基线，突显了其作为解决低级控制和高级策略之间复杂相互作用的有效解决方案的潜力。项目页面位于https://sites.google.com/view/thu-volleybots。

更新时间: 2025-07-08 13:30:59

领域: cs.RO,cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.01932v4

Generalized and Unified Equivalences between Hardness and Pseudoentropy

Pseudoentropy characterizations provide a quantitatively precise demonstration of the close relationship between computational hardness and computational randomness. We prove a unified pseudoentropy characterization that generalizes and strengthens previous results for both uniform and non-uniform models of computation. Our characterization holds for a general family of entropy notions that encompasses the common notions of Shannon entropy and min entropy as special cases. Moreover, we show that the characterizations for different entropy notions can be simultaneously achieved by a single, universal function that simultaneously witnesses computational hardness and computational randomness. A key technical insight of our work is that the notion of weight-restricted calibration from the recent literature on algorithm fairness, along with standard computational indistinguishability (known as multiaccuracy in the fairness literature), suffices for proving pseudoentropy characterizations for general entropy notions. This demonstrates the power of weight-restricted calibration to enhance the classic Complexity-Theoretic Regularity Lemma (Trevisan, Tulsiani, and Vadhan, 2009) and Leakage Simulation Lemma (Jetchev and Pietrzak, 2014) and allows us to achieve an exponential improvement in the complexity dependency on the alphabet size compared to the pseudoentropy characterizations by Casacuberta, Dwork, and Vadhan (2024) based on the much stronger notion of multicalibration. We show that the exponential dependency on the alphabet size is inevitable for multicalibration as well as for the weaker notion of calibrated multiaccuracy.

Updated: 2025-07-08 13:27:03

标题: 硬度和伪熵之间的广义和统一等价关系

摘要: 伪熵特征提供了计算难度和计算随机性之间密切关系的定量精确证明。我们证明了一个统一的伪熵特征化，它推广并加强了先前针对统一和非统一计算模型的结果。我们的特征化适用于一个通用的熵概念家族，涵盖了常见的香农熵和最小熵概念作为特例。此外，我们展示了不同熵概念的特征化可以通过一个单一的通用函数同时实现，该函数同时证明了计算难度和计算随机性。我们工作的一个关键技术见解是最近算法公正性文献中的权重限制校准概念，以及标准的计算不可区分性（在公正性文献中称为多精度），足以证明一般熵概念的伪熵特征化。这展示了权重限制校准概念增强了经典的复杂性理论规则引理（Trevisan，Tulsiani和Vadhan，2009）和泄漏模拟引理（Jetchev和Pietrzak，2014）的能力，并使我们能够在与基于更强大的多校准概念的Casacuberta，Dwork和Vadhan（2024）的伪熵特征化相比，实现对字母表大小的复杂依赖的指数级改进。我们表明，对于多校准和较弱的校准多精度概念，对字母表大小的指数依赖是不可避免的。

更新时间: 2025-07-08 13:27:03

领域: cs.CC,cs.CR,cs.LG

下载: http://arxiv.org/abs/2507.05972v1

Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning

In this paper, we tackle the problem of learning to play 3v3 multi-drone volleyball, a new embodied competitive task that requires both high-level strategic coordination and low-level agile control. The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control. We design a three-stage population-based training pipeline to enable both strategy and skill to emerge from scratch without expert demonstrations: (I) training diverse low-level skills, (II) learning high-level strategy via self-play with fixed low-level controllers, and (III) joint fine-tuning through co-self-play. Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant. Moreover, co-self-play leads to emergent team behaviors such as role switching and coordinated formations, demonstrating the effectiveness of our hierarchical design and training scheme. The project page is at https://sites.google.com/view/hi-co-self-play.

Updated: 2025-07-08 13:24:58

标题: 通过分层共自我博弈强化学习掌握多机无人机排球

摘要: 在这篇论文中，我们解决了学习玩3v3多无人机排球的问题，这是一项需要高水平战略协调和低水平灵活控制的新的体验竞技任务。该任务是基于回合、多智能体和物理基础的，由于其长期依赖性、紧密的智能体耦合以及四旋翼飞行器的欠驱动动力学，提出了重大挑战。为了解决这个问题，我们提出了分层共自博弈（HCSP），这是一个分层强化学习框架，将集中式高层战略决策与分散式低层运动控制分离开来。我们设计了一个三阶段基于人口的训练流程，使策略和技能能够在没有专家演示的情况下从零开始产生：（I）训练多样化的低层技能，（II）通过与固定的低层控制器自我博弈学习高层战略，以及（III）通过共自博弈进行联合微调。实验证明，HCSP取得了卓越的性能，胜过非分层自博弈和基于规则的分层基线，平均胜率为82.9%，对抗两阶段变体的胜率为71.5%。此外，共自博弈导致了角色切换和协调编队等新兴团队行为的出现，展示了我们分层设计和训练方案的有效性。项目页面位于https://sites.google.com/view/hi-co-self-play。

更新时间: 2025-07-08 13:24:58

领域: cs.AI

下载: http://arxiv.org/abs/2505.04317v3

Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

Large Language Models (LLMs) possess encompassing capabilities that can process diverse language-related tasks. However, finetuning on LLMs will diminish this general skills and continual finetuning will further cause severe degradation on accumulated knowledge. Recently, Continual Learning (CL) in Large Language Models (LLMs) arises which aims to continually adapt the LLMs to new tasks while maintaining previously learned knowledge and inheriting general skills. Existing techniques either leverage previous data to replay, leading to extra computational costs, or utilize a single parameter-efficient module to learn the downstream task, constraining new knowledge absorption with interference between different tasks. Toward these issues, this paper proposes Analytic Subspace Routing(ASR) to address these challenges. For each task, we isolate the learning within a subspace of deep layers' features via low-rank adaptation, eliminating knowledge interference between different tasks. Additionally, we propose an analytic routing mechanism to properly utilize knowledge learned in different subspaces. Our approach employs Recursive Least Squares to train a multi-task router model, allowing the router to dynamically adapt to incoming data without requiring access to historical data. Also, the router effectively assigns the current task to an appropriate subspace and has a non-forgetting property of previously learned tasks with a solid theoretical guarantee. Experimental results demonstrate that our method achieves near-perfect retention of prior knowledge while seamlessly integrating new information, effectively overcoming the core limitations of existing methods. Our code will be released after acceptance.

Updated: 2025-07-08 13:20:21

标题: 解析子空间路由：递归最小二乘在大型语言模型的持续学习中的作用

摘要: 大型语言模型（LLMs）具有广泛的能力，可以处理各种语言相关任务。然而，在LLMs上进行微调会降低这种通用技能，而持续的微调会进一步导致已积累知识的严重退化。最近，在大型语言模型（LLMs）中出现了持续学习（CL）的概念，旨在不断适应LLMs到新任务，同时保持先前学到的知识和继承通用技能。现有技术要么利用先前数据进行重放，导致额外的计算成本，要么利用单参数高效模块学习下游任务，限制新知识吸收并导致不同任务之间的干扰。针对这些问题，本文提出了Analytic Subspace Routing（ASR）来应对这些挑战。对于每个任务，我们通过低秩调整在深层特征的子空间内隔离学习，消除不同任务之间的知识干扰。此外，我们提出了一种分析路由机制，以适当利用在不同子空间中学到的知识。我们的方法采用递归最小二乘法训练多任务路由器模型，使路由器能够动态适应传入数据而无需访问历史数据。此外，路由器有效地将当前任务分配给适当的子空间，并具有先前学习任务的非遗忘特性，并具有坚实的理论保证。实验结果表明，我们的方法实现了先前知识的近乎完美保留，同时有效地整合新信息，成功地克服了现有方法的核心限制。我们的代码将在接受后发布。

更新时间: 2025-07-08 13:20:21

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2503.13575v2

Simple Convergence Proof of Adam From a Sign-like Descent Perspective

Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as a preconditioned stochastic gradient descent with momentum (SGDM), formulated as $\bm{x}_{t+1} = \bm{x}_t - \frac{\gamma_t}{{\sqrt{\bm{v}_t}+\epsilon}} \circ \bm{m}_t$. This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend. In contrast, we propose a novel interpretation by treating Adam as a sign-like optimizer, expressed as $\bm{x}_{t+1} = \bm{x}_t - \gamma_t \frac{|\bm{m}_t|}{{\sqrt{\bm{v}_t}+\epsilon}} \circ {\rm Sign}(\bm{m}_t)$. This reformulation significantly simplifies the convergence analysis. For the first time, with some mild conditions, we prove that Adam achieves the optimal rate of ${\cal O}(\frac{1}{T^{\sfrac{1}{4}}})$ rather than the previous ${\cal O} \left(\frac{\ln T}{T^{\sfrac{1}{4}}}\right)$ under weak assumptions of the generalized $p$-affine variance and $(L_0, L_1, q)$-smoothness, without dependence on the model dimensionality or the numerical stability parameter $\epsilon$. Additionally, our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence and offers practical guidelines for tuning learning rates in Adam, further bridging the gap between theory and practice.

Updated: 2025-07-08 13:19:26

标题: 从一个类似符号下降的视角简单收敛证明Adam

摘要: 亚当被广泛认为是训练深度神经网络（DNNs）最有效的优化器之一。尽管它在经验上取得了显著的成功，但其理论收敛分析仍然令人不满意。现有的研究主要将亚当解释为带有动量的预条件随机梯度下降（SGDM），公式化为$\bm{x}_{t+1} = \bm{x}_t - \frac{\gamma_t}{{\sqrt{\bm{v}_t}+\epsilon}} \circ \bm{m}_t$。这种观点需要强大的假设和复杂的技术，导致收敛证明冗长和晦涩，难以验证和扩展。相反，我们提出了一种新的解释，将亚当视为类似符号的优化器，表达为$\bm{x}_{t+1} = \bm{x}_t - \gamma_t \frac{|\bm{m}_t|}{{\sqrt{\bm{v}_t}+\epsilon}} \circ {\rm Sign}(\bm{m}_t)$。这种重新表述显著简化了收敛分析。首次，在一些温和的条件下，我们证明了亚当实现了最佳速率${\cal O}(\frac{1}{T^{\sfrac{1}{4}}})$，而不是以前的${\cal O} \left(\frac{\ln T}{T^{\sfrac{1}{4}}}\right)$在广义$p$-仿射方差和$(L_0, L_1, q)$-平滑性的弱假设下，不依赖于模型维度或数值稳定性参数$\epsilon$。此外，我们的理论分析为动量作为确保收敛的关键因素提供了新的见解，并为调整亚当中的学习率提供了实用指导，进一步弥合了理论与实践之间的差距。

更新时间: 2025-07-08 13:19:26

领域: cs.LG,cs.AI,cs.IT,math.IT

下载: http://arxiv.org/abs/2507.05966v1

OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation

We introduce OpenFActScore, an open-source implementation of the FActScore framework for evaluating the factuality of text generated by large language models (LLMs). FActScore evaluates the factual accuracy of long-form text by using Atomic Fact Generation (AFG) to extract individual factual claims and Atomic Fact Validation (AFV) to verify each claim against a trusted knowledge source. While the original FActScore relies on closed-source and commercial models such as InstructGPT and ChatGPT, OpenFActScore enables the use of any Hugging Face-compatible model for both AFG and AFV. We provide a detailed technical overview of our implementation, highlighting design choices and modifications made to support open models. We evaluate multiple open-source LLMs on both AFG and AFV using the original FActScore benchmark, reporting BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Our results show that open models can approximate the performance of closed-source systems, with Gemma achieving the best overall performance, and our final setup obtains a 0.99 Pearson correlation with the original FActScore experiments. OpenFActScore promotes transparency, reproducibility, and cost-effective evaluation, and is available at: https://github.com/lflage/OpenFActScore.

Updated: 2025-07-08 13:19:00

标题: OpenFActScore：开源文本生成中事实性的原子评估

摘要: 我们介绍了OpenFActScore，这是一个开源实现的FActScore框架，用于评估由大型语言模型（LLMs）生成的文本的事实性。FActScore通过使用原子事实生成（AFG）来提取单个事实主张，并使用原子事实验证（AFV）来针对受信任的知识来源验证每个主张，来评估长篇文本的事实准确性。虽然原始的FActScore依赖于闭源和商业模型，如InstructGPT和ChatGPT，但OpenFActScore使得可以使用任何兼容Hugging Face模型来进行AFG和AFV。我们提供了我们实现的详细技术概述，重点介绍了为支持开放模型而进行的设计选择和修改。我们使用原始FActScore基准测试评估多个开源LLMs在AFG和AFV上的性能，报告AFG的BERTScore-F1和AFV相对于人工注释的错误率。我们的结果表明，开放模型可以近似闭源系统的性能，其中Gemma实现了最佳的整体性能，我们的最终设置与原始FActScore实验的皮尔逊相关系数为0.99。OpenFActScore促进了透明性、可重现性和具有成本效益的评估，并可在https://github.com/lflage/OpenFActScore上获得。

更新时间: 2025-07-08 13:19:00

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.05965v1

Comparative Analysis of CNN and Transformer Architectures with Heart Cycle Normalization for Automated Phonocardiogram Classification

The automated classification of phonocardiogram (PCG) recordings represents a substantial advancement in cardiovascular diagnostics. This paper presents a systematic comparison of four distinct models for heart murmur detection: two specialized convolutional neural networks (CNNs) and two zero-shot universal audio transformers (BEATs), evaluated using fixed-length and heart cycle normalization approaches. Utilizing the PhysioNet2022 dataset, a custom heart cycle normalization method tailored to individual cardiac rhythms is introduced. The findings indicate the following AUROC values: the CNN model with fixed-length windowing achieves 79.5%, the CNN model with heart cycle normalization scores 75.4%, the BEATs transformer with fixed-length windowing achieves 65.7%, and the BEATs transformer with heart cycle normalization results in 70.1%. The findings indicate that physiological signal constraints, especially those introduced by different normalization strategies, have a substantial impact on model performance. The research provides evidence-based guidelines for architecture selection in clinical settings, emphasizing the need for a balance between accuracy and computational efficiency. Although specialized CNNs demonstrate superior performance overall, the zero-shot transformer models may offer promising efficiency advantages during development, such as faster training and evaluation cycles, despite their lower classification accuracy. These findings highlight the potential of automated classification systems to enhance cardiac diagnostics and improve patient care.

Updated: 2025-07-08 13:17:26

标题: 使用心脏周期归一化的卷积神经网络和Transformer架构的自动心音图分类的比较分析

摘要: 心音图(PCG)记录的自动分类代表了心血管诊断方面的重大进步。本文系统比较了四种不同模型用于心脏杂音检测：两种专门的卷积神经网络(CNNs)和两种零样本通用音频转换器(BEATs)，使用固定长度和心脏周期归一化方法进行评估。利用PhysioNet2022数据集，引入了一种针对个体心脏节律量身定制的心脏周期归一化方法。研究结果显示以下AUROC值：固定长度窗口的CNN模型达到了79.5%，心脏周期归一化的CNN模型得分为75.4%，固定长度窗口的BEATs转换器实现了65.7%，心脏周期归一化的BEATs转换器结果为70.1%。研究结果表明，生理信号约束，特别是不同归一化策略引入的约束，对模型性能有重大影响。该研究为临床设置中的架构选择提供了基于证据的指导，强调了在准确性和计算效率之间取得平衡的必要性。尽管专门的CNNs总体表现出色，但零样本转换器模型可能在开发过程中提供具有潜在效率优势，如更快的训练和评估周期，尽管它们的分类准确性较低。这些发现凸显了自动分类系统提升心脏诊断和改善患者护理的潜力。

更新时间: 2025-07-08 13:17:26

领域: cs.SD,cs.AI,cs.LG,eess.AS

下载: http://arxiv.org/abs/2507.07058v1

skfolio: Portfolio Optimization in Python

Portfolio optimization is a fundamental challenge in quantitative finance, requiring robust computational tools that integrate statistical rigor with practical implementation. We present skfolio, an open-source Python library for portfolio construction and risk management that seamlessly integrates with the scikit-learn ecosystem. skfolio provides a unified framework for diverse allocation strategies, from classical mean-variance optimization to modern clustering-based methods, state-of-the-art financial estimators with native interfaces, and advanced cross-validation techniques tailored for financial time series. By adhering to scikit-learn's fit-predict-transform paradigm, the library enables researchers and practitioners to leverage machine learning workflows for portfolio optimization, promoting reproducibility and transparency in quantitative finance.

Updated: 2025-07-08 13:15:03

标题: skfolio: Python中的投资组合优化 (Note: "skfolio"可能是一个特定的Python库或工具，用于投资组合优化)

摘要: 投资组合优化是量化金融中的一个基本挑战，需要强大的计算工具将统计严谨性与实际实施相结合。我们介绍了skfolio，一个开源的Python库，用于投资组合构建和风险管理，与scikit-learn生态系统无缝集成。skfolio提供了一个统一的框架，涵盖了从经典的均值-方差优化到现代基于聚类的方法、具有本地接口的最新金融估计器，以及专为金融时间序列定制的高级交叉验证技术的多样化分配策略。通过遵循scikit-learn的拟合-预测-转换范式，该库使研究人员和实践者能够利用机器学习工作流进行投资组合优化，促进量化金融中的可重现性和透明度。

更新时间: 2025-07-08 13:15:03

领域: cs.LG,q-fin.PM

下载: http://arxiv.org/abs/2507.04176v2

Rethinking Associative Memory Mechanism in Induction Head

Induction head mechanism is a part of the computational circuits for in-context learning (ICL) that enable large language models (LLMs) to adapt to new tasks without fine-tuning. Most existing work explains the training dynamics behind acquiring such a powerful mechanism. However, the model's ability to coordinate in-context information over long contexts and global knowledge acquired during pretraining remains poorly understood. This paper investigates how a two-layer transformer thoroughly captures in-context information and balances it with pretrained bigram knowledge in next token prediction, from the viewpoint of associative memory. We theoretically analyze the representation of weight matrices in attention layers and the resulting logits when a transformer is given prompts generated by a bigram model. In the experiments, we design specific prompts to evaluate whether the outputs of the trained transformer align with the theoretical results.

Updated: 2025-07-08 13:14:01

标题: 重新思考感应头中的联想记忆机制

摘要: 感应头机制是上下文学习（ICL）计算电路的一部分，使大型语言模型（LLMs）能够适应新任务而无需微调。大多数现有工作解释了获取这种强大机制背后的训练动态。然而，模型在长上下文和预训练期间获取的全局知识上协调上下文信息的能力仍然知之甚少。本文从联想记忆的角度探讨了两层变压器如何全面捕捉上下文信息，并在下一个令牌预测中平衡其与预训练的双字知识。我们从理论上分析了注意力层中的权重矩阵的表示以及当变压器接收由双字模型生成的提示时产生的logits。在实验中，我们设计了特定提示来评估训练后变压器的输出是否与理论结果一致。

更新时间: 2025-07-08 13:14:01

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2412.11459v2

Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models

Recent advancements in Korean large language models (LLMs) have driven numerous benchmarks and evaluation methods, yet inconsistent protocols cause up to 10 p.p performance gaps across institutions. Overcoming these reproducibility gaps does not mean enforcing a one-size-fits-all evaluation. Rather, effective benchmarking requires diverse experimental approaches and a framework robust enough to support them. To this end, we introduce HRET (Haerae Evaluation Toolkit), an open-source, registry-based framework that unifies Korean LLM assessment. HRET integrates major Korean benchmarks, multiple inference backends, and multi-method evaluation, with language consistency enforcement to ensure genuine Korean outputs. Its modular registry design also enables rapid incorporation of new datasets, methods, and backends, ensuring the toolkit adapts to evolving research needs. Beyond standard accuracy metrics, HRET incorporates Korean-focused output analyses-morphology-aware Type-Token Ratio (TTR) for evaluating lexical diversity and systematic keyword-omission detection for identifying missing concepts-to provide diagnostic insights into language-specific behaviors. These targeted analyses help researchers pinpoint morphological and semantic shortcomings in model outputs, guiding focused improvements in Korean LLM development.

Updated: 2025-07-08 13:13:04

标题: 重新定义评估标准：评估语言模型韩国能力的统一框架

摘要: 最近，在韩国大型语言模型（LLMs）方面取得了重大进展，推动了许多基准和评估方法，然而不一致的协议导致不同机构之间高达10个百分点的性能差距。克服这些可重现性差距并不意味着强制执行一种适用于所有的评估。相反，有效的基准测试需要多样化的实验方法和一个足够强大的框架来支持它们。为此，我们介绍了HRET（Haerae评估工具包），这是一个开源的、基于注册的框架，统一了韩国LLM的评估。HRET整合了主要的韩国基准测试、多个推理后端和多种评估方法，并通过语言一致性强制执行来确保真实的韩文输出。其模块化注册设计还可以快速整合新的数据集、方法和后端，确保工具包能够适应不断发展的研究需求。除了标准的准确度指标外，HRET还包括韩文输出分析-用于评估词汇多样性的形态学感知词汇比率（TTR）和用于识别缺失概念的系统关键词遗漏检测，以提供对语言特定行为的诊断洞察。这些有针对性的分析帮助研究人员找出模型输出中的形态和语义缺陷，指导韩国LLM发展中的有针对性改进。

更新时间: 2025-07-08 13:13:04

领域: cs.CE,cs.AI,cs.CL

下载: http://arxiv.org/abs/2503.22968v4

Empirical Analysis Of Heuristic and Approximation Algorithms for the The Mutual-Visibility Problem

The NP-complete mutual-visibility (MV) problem currently lacks empirical analysis on its practical behaviour despite theoretical studies. This paper addresses this gap by implementing and evaluating three distinct algorithms -- a direct random heuristic, a hypergraph-based approximation, and a genetic algorithm -- on diverse synthetic graph datasets, including those with analytically known $\mu(G)$ values and general graph models. Our results demonstrate that for smaller graphs, the algorithms consistently achieve MV set sizes aligning with theoretical bounds. However, for larger instances, achieved solution sizes notably diverge from theoretical limits; this, combined with the absence of tight bounds, complicates absolute quality assessment. Nevertheless, validation on known optimal graphs showed the Genetic Algorithm and other heuristics empirically performing best among tested methods.

Updated: 2025-07-08 13:11:57

标题: 启发式和近似算法在相互可见性问题中的实证分析

摘要: 目前，NP完全的相互可见性（MV）问题尽管在理论研究中存在，但在实际行为方面缺乏实证分析。本文通过在不同合成图数据集上实施和评估三种不同的算法（直接随机启发式算法、基于超图的近似算法和遗传算法），填补了这一空白，包括那些具有已知 $\mu(G)$ 值和一般图模型的数据集。我们的结果表明，对于较小的图，这些算法始终达到与理论界限一致的MV集大小。然而，对于较大的实例，实现的解决方案大小明显偏离理论极限；这一事实，加上缺乏严格的界限，使得绝对质量评估变得复杂。然而，对已知最优图的验证表明，遗传算法和其他启发式算法在经验性能方面表现最好。

更新时间: 2025-07-08 13:11:57

领域: cs.CG,cs.AI,cs.PF,math.CO

下载: http://arxiv.org/abs/2507.01076v2

Advancing Stroke Risk Prediction Using a Multi-modal Foundation Model

Predicting stroke risk is a complex challenge that can be enhanced by integrating diverse clinically available data modalities. This study introduces a self-supervised multimodal framework that combines 3D brain imaging, clinical data, and image-derived features to improve stroke risk prediction prior to onset. By leveraging large unannotated clinical datasets, the framework captures complementary and synergistic information across image and tabular data modalities. Our approach is based on a contrastive learning framework that couples contrastive language-image pretraining with an image-tabular matching module, to better align multimodal data representations in a shared latent space. The model is trained on the UK Biobank, which includes structural brain MRI and clinical data. We benchmark its performance against state-of-the-art unimodal and multimodal methods using tabular, image, and image-tabular combinations under diverse frozen and trainable model settings. The proposed model outperformed self-supervised tabular (image) methods by 2.6% (2.6%) in ROC-AUC and by 3.3% (5.6%) in balanced accuracy. Additionally, it showed a 7.6% increase in balanced accuracy compared to the best multimodal supervised model. Through interpretable tools, our approach demonstrated better integration of tabular and image data, providing richer and more aligned embeddings. Gradient-weighted Class Activation Mapping heatmaps further revealed activated brain regions commonly associated in the literature with brain aging, stroke risk, and clinical outcomes. This robust self-supervised multimodal framework surpasses state-of-the-art methods for stroke risk prediction and offers a strong foundation for future studies integrating diverse data modalities to advance clinical predictive modelling.

Updated: 2025-07-08 13:09:35

标题: 利用多模态基础模型推进中风风险预测

摘要: 预测中风风险是一个复杂的挑战，可以通过整合多种临床可用的数据模态来增强。本研究介绍了一种自监督的多模态框架，结合了3D脑成像、临床数据和图像衍生特征，以提高发作前的中风风险预测。通过利用大型未标记的临床数据集，该框架捕获了图像和表格数据模态之间的互补和协同信息。我们的方法基于对比学习框架，将对比语言-图像预训练与图像-表格匹配模块相结合，以更好地将多模态数据表示在共享的潜在空间中对齐。该模型在包含结构性脑MRI和临床数据的英国生物库上进行训练。我们通过使用冻结和可训练模型设置下的表格、图像和图像-表格组合来评估其性能并与最先进的单模态和多模态方法进行比较。提出的模型在ROC-AUC方面比自监督表格（图像）方法表现出2.6%（2.6%）的提高，在平衡准确度方面提高了3.3%（5.6%）。此外，与最佳的多模态监督模型相比，它显示了平衡准确度增加了7.6%。通过可解释的工具，我们的方法展示了更好地整合表格和图像数据，提供了更丰富和更对齐的嵌入。梯度加权类激活映射热图进一步揭示了与脑衰老、中风风险和临床结果在文献中常见的激活脑区域。这种强大的自监督多模态框架超越了中风风险预测的最新方法，并为未来整合多种数据模态以推进临床预测建模提供了坚实基础。

更新时间: 2025-07-08 13:09:35

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2411.09822v2

Unsupervised Anomaly Detection through Mass Repulsing Optimal Transport

Detecting anomalies in datasets is a longstanding problem in machine learning. In this context, anomalies are defined as a sample that significantly deviates from the remaining data. Meanwhile, optimal transport (OT) is a field of mathematics concerned with the transportation, between two probability measures, at least effort. In classical OT, the optimal transportation strategy of a measure to itself is the identity. In this paper, we tackle anomaly detection by forcing samples to displace its mass, while keeping the least effort objective. We call this new transportation problem Mass Repulsing Optimal Transport (MROT). Naturally, samples lying in low density regions of space will be forced to displace mass very far, incurring a higher transportation cost. We use these concepts to design a new anomaly score. Through a series of experiments in existing benchmarks, and fault detection problems, we show that our algorithm improves over existing methods.

Updated: 2025-07-08 13:05:48

标题: 无监督异常检测通过大规模排斥最优传输

摘要: 在机器学习中，检测数据集中的异常值是一个长期存在的问题。在这个背景下，异常值被定义为与其余数据显著不同的样本。同时，最优输运（OT）是一个数学领域，关注的是两个概率分布之间的输运，至少的努力。在经典的最优输运中，将一个分布输运到自身的最佳策略是恒等映射。在本文中，我们通过强制样本位移其质量，同时保持最小努力目标，来解决异常检测问题。我们将这个新的输运问题称为质量斥力最优输运（MROT）。自然地，在空间中密度较低的区域中的样本将被迫远距离位移质量，导致更高的输运成本。我们利用这些概念设计了一个新的异常值分数。通过一系列在现有基准数据集和故障检测问题上的实验，我们展示了我们的算法优于现有方法。

更新时间: 2025-07-08 13:05:48

领域: stat.ML,cs.AI,cs.LG

下载: http://arxiv.org/abs/2502.12793v2

CTA: Cross-Task Alignment for Better Test Time Training

Deep learning models have demonstrated exceptional performance across a wide range of computer vision tasks. However, their performance often degrades significantly when faced with distribution shifts, such as domain or dataset changes. Test-Time Training (TTT) has emerged as an effective method to enhance model robustness by incorporating an auxiliary unsupervised task during training and leveraging it for model updates at test time. In this work, we introduce CTA (Cross-Task Alignment), a novel approach for improving TTT. Unlike existing TTT methods, CTA does not require a specialized model architecture and instead takes inspiration from the success of multi-modal contrastive learning to align a supervised encoder with a self-supervised one. This process enforces alignment between the learned representations of both models, thereby mitigating the risk of gradient interference, preserving the intrinsic robustness of self-supervised learning and enabling more semantically meaningful updates at test-time. Experimental results demonstrate substantial improvements in robustness and generalization over the state-of-the-art on several benchmark datasets.

Updated: 2025-07-08 13:04:25

标题: CTA：跨任务对齐以实现更好的测试时间训练

摘要: 深度学习模型已在广泛的计算机视觉任务中展现出卓越的性能。然而，当面临分布偏移（如领域或数据集变化）时，它们的性能往往会显著下降。测试时训练（TTT）已被证明是一种有效的方法，通过在训练期间引入一个辅助的无监督任务，并在测试时利用它来更新模型，以增强模型的鲁棒性。在这项工作中，我们引入了CTA（交叉任务对齐），一种改进TTT的新方法。与现有的TTT方法不同，CTA不需要专门的模型架构，而是受到多模态对比学习成功的启发，将一个监督编码器与一个自监督编码器对齐。这个过程强制对齐两个模型的学习表示，从而减轻梯度干扰的风险，保留自监督学习的内在鲁棒性，并在测试时实现更具语义意义的更新。实验结果显示，在几个基准数据集上，对抗性和泛化性能都比最先进的方法有了显著的改进。

更新时间: 2025-07-08 13:04:25

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.05221v2

Holistic Construction Automation with Modular Robots: From High-Level Task Specification to Execution

In situ robotic automation in construction is challenging due to constantly changing environments, a shortage of robotic experts, and a lack of standardized frameworks bridging robotics and construction practices. This work proposes a holistic framework for construction task specification, optimization of robot morphology, and mission execution using a mobile modular reconfigurable robot. Users can specify and monitor the desired robot behavior through a graphical interface. In contrast to existing, monolithic solutions, we automatically identify a new task-tailored robot for every task by integrating \acf{bim}. Our framework leverages modular robot components that enable the fast adaption of robot hardware to the specific demands of the construction task. Other than previous works on modular robot optimization, we consider multiple competing objectives, which allow us to explicitly model the challenges of real-world transfer, such as calibration errors. We demonstrate our framework in simulation by optimizing robots for drilling and spray painting. Finally, experimental validation demonstrates that our approach robustly enables the autonomous execution of robotic drilling.

Updated: 2025-07-08 13:02:38

标题: 用模块化机器人实现全面的建筑自动化：从高级任务规范到执行

摘要: 在施工中的就地机器人自动化面临挑战，原因是环境不断变化、机器人专家短缺以及缺乏将机器人技术与施工实践桥接的标准化框架。本文提出了一个全面的框架，用于建筑任务规范、优化机器人形态和使用移动模块化可重配置机器人执行任务。用户可以通过图形界面指定和监控期望的机器人行为。与现有的单一解决方案相比，我们通过整合BIM自动识别每个任务的新型定制机器人。我们的框架利用模块化机器人组件，实现快速调整机器人硬件以适应施工任务的特定需求。与以往的模块化机器人优化工作不同，我们考虑多个竞争目标，使我们能够明确地建模现实世界转移的挑战，如校准误差。我们通过模拟展示了我们的框架，优化了钻孔和喷漆机器人。最后，实验验证表明，我们的方法能够稳健地实现机器人钻孔的自主执行。

更新时间: 2025-07-08 13:02:38

领域: cs.RO,cs.AI,cs.HC

下载: http://arxiv.org/abs/2412.20867v2

Deep neural networks have an inbuilt Occam's razor

The remarkable performance of overparameterized deep neural networks (DNNs) must arise from an interplay between network architecture, training algorithms, and structure in the data. To disentangle these three components, we apply a Bayesian picture, based on the functions expressed by a DNN, to supervised learning. The prior over functions is determined by the network, and is varied by exploiting a transition between ordered and chaotic regimes. For Boolean function classification, we approximate the likelihood using the error spectrum of functions on data. When combined with the prior, this accurately predicts the posterior, measured for DNNs trained with stochastic gradient descent. This analysis reveals that structured data, combined with an intrinsic Occam's razor-like inductive bias towards (Kolmogorov) simple functions that is strong enough to counteract the exponential growth of the number of functions with complexity, is a key to the success of DNNs.

Updated: 2025-07-08 12:58:32

标题: 深度神经网络具有内置的奥卡姆剃刀

摘要: 过参数化深度神经网络（DNNs）的卓越性能必定源于网络架构、训练算法和数据结构之间的相互作用。为了解开这三个组成部分，我们应用了一个基于DNN表达的函数的贝叶斯图像来进行监督学习。函数的先验由网络确定，并通过利用有序和混沌状态之间的转变进行变化。对于布尔函数分类，我们使用函数在数据上的误差谱来近似似然。当与先验结合时，这可以准确预测用随机梯度下降训练的DNN的后验。这一分析揭示了结构化数据，以及对（科尔莫戈洛夫）简单函数具有内在的奥卡姆剃刀式归纳偏好，强大到足以抵消随复杂度增长而指数增长的函数数量，是DNN成功的关键。

更新时间: 2025-07-08 12:58:32

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2304.06670v2

The Prompt War: How AI Decides on a Military Intervention

Which factors determine AI propensity for military intervention? While the use of AI in war games and military planning is growing exponentially, the simple analysis of key drivers embedded in the models has not yet been done. This paper does a simple conjoint experiment proposing a model to decide on military intervention in 640 vignettes where each was run for 100 times allowing to explore AI decision on military intervention systematically. The analysis finds that largest predictors of AI decision to intervene are high domestic support and high probability of success. Costs such as international condemnation, military deaths, civilian deaths, and negative economic effect are statistically significant, but their effect is around half of domestic support and probability of victory. Closing window of opportunity only reaches statistical significance in interaction with other factors. The results are remarkably consistent across scenarios and across different models (OpenAI GPT, Anthropic Claude, Google Gemini) suggesting a pattern in AI decision-making.

Updated: 2025-07-08 12:52:08

标题: 《即时战争：人工智能如何决定军事干预》

摘要: 哪些因素决定了人工智能对军事干预的倾向？虽然人工智能在战争游戏和军事规划中的应用呈指数增长，但对嵌入模型的关键驱动因素进行简单分析尚未完成。本文通过一项简单的联合实验提出了一个模型，决定在640个情景中是否进行军事干预，每个情景运行100次，以便系统地探索人工智能对军事干预的决策。分析发现，人工智能决定干预的最大预测因素是高度的国内支持和成功的可能性。国际谴责、军事死亡、平民死亡和负面经济影响等成本在统计上显著，但其影响仅为国内支持和胜利可能性的一半。机会的关闭窗口仅在与其他因素的互动中达到统计显著性。结果在不同情景和不同模型（OpenAI GPT、Anthropic Claude、Google Gemini）中都表现出惊人的一致性，这表明了人工智能决策中的模式。

更新时间: 2025-07-08 12:52:08

领域: cs.CY,cs.AI

下载: http://arxiv.org/abs/2507.06277v1

Complexity Results of Persuasion

We prove that persuasion is an NP-complete problem.

Updated: 2025-07-08 12:49:22

标题: 说服的复杂性结果

摘要: 我们证明说说服是一个NP完全问题。

更新时间: 2025-07-08 12:49:22

领域: cs.CC,cs.AI

下载: http://arxiv.org/abs/2507.05951v1

Improving AI-Based Canine Heart Disease Diagnosis with Expert-Consensus Auscultation Labeling

Noisy labels pose significant challenges for AI model training in veterinary medicine. This study examines expert assessment ambiguity in canine auscultation data, highlights the negative impact of label noise on classification performance, and introduces methods for label noise reduction. To evaluate whether label noise can be minimized by incorporating multiple expert opinions, a dataset of 140 heart sound recordings (HSR) was annotated regarding the intensity of holosystolic heart murmurs caused by Myxomatous Mitral Valve Disease (MMVD). The expert opinions facilitated the selection of 70 high-quality HSR, resulting in a noise-reduced dataset. By leveraging individual heart cycles, the training data was expanded and classification robustness was enhanced. The investigation encompassed training and evaluating three classification algorithms: AdaBoost, XGBoost, and Random Forest. While AdaBoost and Random Forest exhibited reasonable performances, XGBoost demonstrated notable improvements in classification accuracy. All algorithms showed significant improvements in classification accuracy due to the applied label noise reduction, most notably XGBoost. Specifically, for the detection of mild heart murmurs, sensitivity increased from 37.71% to 90.98% and specificity from 76.70% to 93.69%. For the moderate category, sensitivity rose from 30.23% to 55.81% and specificity from 64.56% to 97.19%. In the loud/thrilling category, sensitivity and specificity increased from 58.28% to 95.09% and from 84.84% to 89.69%, respectively. These results highlight the importance of minimizing label noise to improve classification algorithms for the detection of canine heart murmurs. Index Terms: AI diagnosis, canine heart disease, heart sound classification, label noise reduction, machine learning, XGBoost, veterinary cardiology, MMVD.

Updated: 2025-07-08 12:48:25

标题: 用专家共识听诊标记改进基于人工智能的犬心脏病诊断

摘要: 嘈杂的标签对兽医学中AI模型的训练带来了重大挑战。本研究检验了犬心脏听诊数据中专家评估的模糊性，强调了标签噪声对分类性能的负面影响，并介绍了标签噪声减少的方法。为了评估是否可以通过整合多个专家意见来最小化标签噪声，对包括由二尖瓣粘液样变性疾病（MMVD）引起的全收缩期心脏杂音强度的140个心音记录（HSR）进行了注释。专家意见促进了70个高质量HSR的选择，从而得到了一个降噪的数据集。通过利用单个心脏周期，扩展了训练数据并增强了分类的稳健性。研究涵盖了三种分类算法的训练和评估：AdaBoost、XGBoost和Random Forest。虽然AdaBoost和Random Forest表现合理，但XGBoost表现出显著的分类准确性提高。所有算法由于应用了标签噪声减少而在分类准确性方面都有显著改进，尤其是XGBoost。具体来说，对于轻度心脏杂音的检测，敏感性从37.71%提高到90.98%，特异性从76.70%提高到93.69%。对于中度类别，敏感性从30.23%增加到55.81%，特异性从64.56%增加到97.19%。在响亮/激动类别中，敏感性和特异性分别从58.28%增加到95.09%和从84.84%增加到89.69%。这些结果突显了减少标签噪声以改善分类算法用于检测犬心脏杂音的重要性。索引词：AI诊断、犬心脏疾病、心音分类、标签噪声减少、机器学习、XGBoost、兽医心脏病学、MMVD。

更新时间: 2025-07-08 12:48:25

领域: cs.LG

下载: http://arxiv.org/abs/2507.05950v1

Measuring Variable Importance in Heterogeneous Treatment Effects with Confidence

Causal machine learning holds promise for estimating individual treatment effects from complex data. For successful real-world applications of machine learning methods, it is of paramount importance to obtain reliable insights into which variables drive heterogeneity in the response to treatment. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) reference method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference in the limited-data regime common to biomedical applications. We empirically demonstrate the benefits of PermuCATE in simulated and real-world health datasets, including settings with up to hundreds of correlated variables.

Updated: 2025-07-08 12:45:45

标题: 用置信度衡量异质治疗效果中的变量重要性

摘要: 因果机器学习有望从复杂数据中估计个体治疗效果。为了成功地将机器学习方法应用于实际场景中，获得可靠的洞察力以了解哪些变量驱动治疗反应的异质性至关重要。我们提出了PermuCATE算法，该算法基于条件排列重要性（CPI）方法，用于在估计条件平均治疗效果（CATE）时进行统计严格的全局变量重要性评估。有限样本范围的理论分析和实证研究表明，PermuCATE的方差比留一特征变量（LOCO）参考方法低，并提供可靠的变量重要性度量。这种性质增加了统计功效，对于生物医学应用中常见的有限数据情况下的因果推断至关重要。我们在模拟和真实的健康数据集中实证展示了PermuCATE的好处，包括具有上百个相关变量的情景。

更新时间: 2025-07-08 12:45:45

领域: cs.LG

下载: http://arxiv.org/abs/2408.13002v4

Conthereum: Concurrent Ethereum Optimized Transaction Scheduling for Multi-Core Execution

Conthereum is a concurrent Ethereum solution for intra-block parallel transaction execution, enabling validators to utilize multi-core infrastructure and transform the sequential execution model of Ethereum into a parallel one. This shift significantly increases throughput and transactions per second (TPS), while ensuring conflict-free execution in both proposer and attestor modes and preserving execution order consistency in the attestor. At the heart of Conthereum is a novel, lightweight, high-performance scheduler inspired by the Flexible Job Shop Scheduling Problem (FJSS). We propose a custom greedy heuristic algorithm, along with its efficient implementation, that solves this formulation effectively and decisively outperforms existing scheduling methods in finding suboptimal solutions that satisfy the constraints, achieve minimal makespan, and maximize speedup in parallel execution. Additionally, Conthereum includes an offline phase that equips its real-time scheduler with a conflict analysis repository obtained through static analysis of smart contracts, identifying potentially conflicting functions using a pessimistic approach. Building on this novel scheduler and extensive conflict data, Conthereum outperforms existing concurrent intra-block solutions. Empirical evaluations show near-linear throughput gains with increasing computational power on standard 8-core machines. Although scalability deviates from linear with higher core counts and increased transaction conflicts, Conthereum still significantly improves upon the current sequential execution model and outperforms existing concurrent solutions under a wide range of conditions.

Updated: 2025-07-08 12:40:31

标题: Conthereum：用于多核执行的并发以太坊优化事务调度

摘要: Conthereum是一种并发以太坊解决方案，用于区块内并行交易执行，使验证者能够利用多核基础设施，并将以太坊的顺序执行模型转变为并行执行模型。这种转变显著增加了吞吐量和每秒交易量（TPS），同时确保在提议者和证明者模式下执行无冲突，并在证明者中保持执行顺序一致性。Conthereum的核心是一种由灵活作业车间调度问题（FJSS）启发的新颖、轻量级、高性能调度器。我们提出了一种定制的贪婪启发式算法，以及其高效实现，有效地解决了这个问题，明显优于现有调度方法，在找到满足约束条件、实现最小完成时间和最大化并行执行加速度方面取得了决定性的成果。此外，Conthereum包括一个离线阶段，通过对智能合约的静态分析获得冲突分析存储库，利用悲观方法识别潜在冲突函数，为其实时调度器提供支持。基于这个新颖的调度器和大量冲突数据，Conthereum优于现有的并发区块内解决方案。实证评估显示，在标准的8核计算机上，随着计算能力的增加，吞吐量几乎呈线性增长。尽管随着核心数量的增加和交易冲突的增加，可扩展性并非线性，但Conthereum仍然显著改进了当前的顺序执行模型，并在各种条件下优于现有的并发解决方案。

更新时间: 2025-07-08 12:40:31

领域: cs.CR,cs.DC

下载: http://arxiv.org/abs/2504.07280v2

Information-theoretic machine learning for time-varying mode decomposition of separated aerodynamic flows

We perform an information-theoretic mode decomposition for separated aerodynamic flows. The current data-driven approach based on a neural network referred to as deep sigmoidal flow enables the extraction of an informative component from a given flow field snapshot with respect to a target variable at a future time stamp, thereby capturing the causality as a time-varying modal structure. We consider four examples of separated flows around a wing, namely, 1. laminar periodic wake at post-stall angles of attack, strong gust-wing interactions of 2. numerical and 3. experimental measurements, and 4. a turbulent wake in a spanwise-periodic domain. The present approach reveals informative vortical structures associated with a time-varying lift response. For the periodic shedding cases, the informative structures vary in time corresponding to the fluctuation level from their mean values. With the examples of gust-wing interactions, how the effect of gust on a wing emerges in the lift response over time is identified in an interpretable manner. Furthermore, for the case of turbulent wake, the present model highlights structures near the wing and vortex cores as informative components based solely on the information metric without any prior knowledge of aerodynamics and length scales. This study provides causality-based insights into a range of unsteady aerodynamic problems.

Updated: 2025-07-08 12:39:14

标题: 信息论机器学习用于分离空气动力流体的时变模式分解

摘要: 我们对分离气动流进行了信息论模式分解。基于神经网络的当前数据驱动方法，称为深层sigmoidal流，使得能够从给定的流场快照中提取与未来时间戳的目标变量相关的信息组件，从而捕捉因果关系作为一个时变的模态结构。我们考虑了围绕机翼的四个分离流的例子，即，1.后失速攻角下的层流周期尾流，2.数值和3.实验测量的强气流-机翼相互作用，以及4.跨距周期域中的湍流尾流。当前方法揭示了与时变升力响应相关的信息涡结构。对于周期性脱落情况，信息结构随时间变化，对应于它们的平均值的波动水平。通过气流-机翼相互作用的例子，如何风对机翼的影响随时间在升力响应中出现以可解释的方式进行了识别。此外，对于湍流尾流的情况，当前模型突出显示了靠近机翼和涡核的结构作为信息组件，仅基于信息度量，而没有任何先验的气动知识和长度尺度。这项研究为一系列非定常气动问题提供了基于因果关系的洞察。

更新时间: 2025-07-08 12:39:14

领域: physics.flu-dyn,cs.LG,physics.comp-ph

下载: http://arxiv.org/abs/2505.24132v2

News and Load: Social and Economic Drivers of Regional Multi-horizon Electricity Demand Forecasting

The relationship between electricity demand and variables such as economic activity and weather patterns is well established. However, this paper explores the connection between electricity demand and social aspects. It further embeds dynamic information about the state of society into energy demand modelling and forecasting approaches. Through the use of natural language processing on a large news corpus, we highlight this important link. This study is conducted in five regions of the UK and Ireland and considers multiple time horizons from 1 to 30 days. It also considers economic variables such as GDP, unemployment and inflation. The textual features used in this study represent central constructs from the word frequencies, topics, word embeddings extracted from the news. The findings indicate that: 1) the textual features are related to various contents, such as military conflicts, transportation, the global pandemic, regional economics, and the international energy market. They exhibit causal relationships with regional electricity demand, which are validated using Granger causality and Double Machine Learning methods. 2) Economic indicators play a more important role in the East Midlands and Northern Ireland, while social indicators are more influential in the West Midlands and the South West of England. 3) The use of these factors improves deterministic forecasting by around 6%.

Updated: 2025-07-08 12:37:42

标题: 新闻和负载：区域多时间跨度电力需求预测的社会和经济驱动因素

摘要: 电力需求与经济活动、气候模式等变量之间的关系已被充分确认。然而，本文探讨了电力需求与社会因素之间的联系。它进一步将有关社会状况的动态信息嵌入到能源需求建模和预测方法中。通过对大量新闻语料库进行自然语言处理，我们突出了这一重要联系。该研究在英国和爱尔兰的五个地区进行，考虑了从1到30天的多个时间范围。它还考虑了GDP、失业率和通货膨胀等经济变量。本研究使用的文本特征代表了来自新闻中提取的词频、主题和词嵌入的中心构造。研究结果表明：1）文本特征与各种内容相关，如军事冲突、交通运输、全球大流行、区域经济和国际能源市场。它们与区域电力需求呈现因果关系，通过格兰杰因果性和双机器学习方法进行验证。2）经济指标在东米德兰兹和北爱尔兰发挥更为重要的作用，而社会指标对西米德兰兹和英格兰西南部的影响更大。3）这些因素的使用提高了确定性预测约6%。

更新时间: 2025-07-08 12:37:42

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2406.06641v2

WATS: Calibrating Graph Neural Networks with Wavelet-Aware Temperature Scaling

Graph Neural Networks (GNNs) have demonstrated strong predictive performance on relational data; however, their confidence estimates often misalign with actual predictive correctness, posing significant limitations for deployment in safety-critical settings. While existing graph-aware calibration methods seek to mitigate this limitation, they primarily depend on coarse one-hop statistics, such as neighbor-predicted confidence, or latent node embeddings, thereby neglecting the fine-grained structural heterogeneity inherent in graph topology. In this work, we propose Wavelet-Aware Temperature Scaling (WATS), a post-hoc calibration framework that assigns node-specific temperatures based on tunable heat-kernel graph wavelet features. Specifically, WATS harnesses the scalability and topology sensitivity of graph wavelets to refine confidence estimates, all without necessitating model retraining or access to neighboring logits or predictions. Extensive evaluations across seven benchmark datasets with varying graph structures and two GNN backbones demonstrate that WATS achieves the lowest Expected Calibration Error (ECE) among all compared methods, outperforming both classical and graph-specific baselines by up to 42.3\% in ECE and reducing calibration variance by 17.24\% on average compared with graph-specific methods. Moreover, WATS remains computationally efficient, scaling well across graphs of diverse sizes and densities. Code will be released based on publication.

Updated: 2025-07-08 12:34:43

标题: WATS：使用小波感知温度缩放校准图神经网络

摘要: 图神经网络（GNNs）在关系数据上展现出强大的预测性能；然而，它们的置信估计往往与实际的预测正确性不一致，在安全关键环境中部署时存在重大限制。虽然现有的图感知校准方法试图缓解这一限制，但它们主要依赖于粗糙的一跳统计数据，比如邻居预测的置信度，或者潜在节点嵌入，因此忽略了图拓扑中固有的细粒度结构异质性。在这项工作中，我们提出了Wavelet-Aware Temperature Scaling（WATS），一个后期校准框架，根据可调节的热核图小波特征为每个节点分配特定于节点的温度。具体而言，WATS利用图小波的可扩展性和拓扑敏感性来优化置信估计，而无需重新训练模型或访问邻居的逻辑值或预测值。在七个具有不同图结构的基准数据集和两个GNN主干上进行的广泛评估显示，WATS在所有比较方法中实现了最低的预期校准误差（ECE），在ECE方面超过了传统和图特定基线方法高达42.3％，与图特定方法相比，校准方差平均减少了17.24％。此外，WATS保持计算效率，在不同大小和密度的图中具有良好的可扩展性。代码将会根据出版物发布。

更新时间: 2025-07-08 12:34:43

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2506.23782v2

BlueLM-2.5-3B Technical Report

We present BlueLM-2.5-3B, a compact and unified dense Multimodal Large Language Model (MLLM) designed for efficient edge-device deployment, offering strong general-purpose and reasoning capabilities. To the best of our knowledge, this is the first 3B-scale MLLM to support both thinking and non-thinking modes, while also enabling explicit control over thinking token budget. BlueLM-2.5-3B is developed through diversified data curation, key data resampling, hybrid heterogeneous reinforcement learning, and a high-performance training infrastructure. Our model achieves superior multimodal capacity while preserving competitive pure-text performance with only 2.9 billion parameters. We conduct comprehensive evaluations across a broad range of multimodal and text-only benchmarks. In thinking mode, BlueLM-2.5-3B achieves comparable performance to Qwen3-4B on text-only benchmarks, and trails the larger Kimi-VL-A3B-16B by only about 5% on average across multimodal evaluations. In non-thinking mode, it outperforms Qwen2.5-VL-3B on the majority of multimodal benchmarks. Additionally, BlueLM-2.5-3B exhibits exceptional data efficiency. All of the aforementioned performance is achieved with substantially less total training data than Qwen2.5-VL-3B and Qwen3-4B. We hope our work contributes to the advancement of high-performance, on-device MLLMs and provides meaningful insights to the research community.

Updated: 2025-07-08 12:34:10

标题: 蓝色LM-2.5-3B技术报告

摘要: 我们提出BlueLM-2.5-3B，这是一个紧凑且统一的稠密多模态大语言模型（MLLM），旨在实现高效的边缘设备部署，具有强大的通用和推理能力。据我们所知，这是第一个支持思考和非思考模式的3B级MLLM，同时还能够明确控制思考令牌预算。BlueLM-2.5-3B通过多样化的数据整理、关键数据重采样、混合异构强化学习和高性能训练基础设施进行开发。我们的模型在保持竞争性纯文本性能的同时实现了卓越的多模态容量，仅使用29亿参数。我们在广泛范围的多模态和仅文本基准测试中进行了全面评估。在思考模式下，BlueLM-2.5-3B在仅文本基准测试中实现了与Qwen3-4B相当的性能，并且在多模态评估中仅比更大的Kimi-VL-A3B-16B平均低约5%。在非思考模式下，它在大多数多模态基准测试中优于Qwen2.5-VL-3B。此外，BlueLM-2.5-3B表现出卓越的数据效率。所有上述性能是在远远少于Qwen2.5-VL-3B和Qwen3-4B的总训练数据下实现的。我们希望我们的工作有助于推动高性能、设备上的MLLM的发展，并为研究界提供有意义的见解。

更新时间: 2025-07-08 12:34:10

领域: cs.AI

下载: http://arxiv.org/abs/2507.05934v1

A Comprehensive Study of Shapley Value in Data Analytics

Over the recent years, Shapley value (SV), a solution concept from cooperative game theory, has found numerous applications in data analytics (DA). This paper presents the first comprehensive study of SV used throughout the DA workflow, clarifying the key variables in defining DA-applicable SV and the essential functionalities that SV can provide for data scientists. We condense four primary challenges of using SV in DA, namely computation efficiency, approximation error, privacy preservation, and interpretability, disentangle the resolution techniques from existing arts in this field, then analyze and discuss the techniques w.r.t. each challenge and the potential conflicts between challenges.We also implement SVBench, a modular and extensible open-source framework for developing SV applications in different DA tasks, and conduct extensive evaluations to validate our analyses and discussions. Based on the qualitative and quantitative results, we identify the limitations of current efforts for applying SV to DA and highlight the directions of future research and engineering.

Updated: 2025-07-08 12:32:47

标题: 数据分析中沙普利值的全面研究

摘要: 在最近几年中，协作游戏理论中的解决方案概念Shapley值（SV）在数据分析（DA）中找到了许多应用。本文首次全面研究了SV在整个DA工作流程中的应用，澄清了定义适用于DA的SV的关键变量以及SV可以为数据科学家提供的基本功能。我们总结了在DA中使用SV的四个主要挑战，即计算效率、近似误差、隐私保护和可解释性，将现有领域中的解决技术与这些挑战进行了分析和讨论，然后分析和讨论了每个挑战以及挑战之间的潜在冲突。我们还实现了SVBench，这是一个模块化和可扩展的开源框架，用于在不同的DA任务中开发SV应用，并进行了广泛的评估以验证我们的分析和讨论。基于定性和定量结果，我们确定了目前将SV应用于DA的努力的局限性，并强调了未来研究和工程的方向。

更新时间: 2025-07-08 12:32:47

领域: cs.DB,cs.LG

下载: http://arxiv.org/abs/2412.01460v8

Self-supervised learning of speech representations with Dutch archival data

This paper explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0. We first study data quality assumptions for pre-training, and show how music, noise and speaker overlap affect SSL convergence and downstream fine-tuning performance. Secondly, we explore effectively pre-processing strategies to convert the noisy broadcast dataset into a qualitative dataset for pre-training, by using Whisper and WhisperX. Thirdly, we compare mono-lingual and multi-lingual pre-training with equivalent amounts of data, and show that mono-lingual pre-training is more robust to out-of-domain data. Lastly, we achieve a state-of-the-art LARGE wav2vec 2.0 model for the Dutch language, by a continuation of pre-training a wav2vec 2.0 XLS-R model checkpoint with our 55k hour archival dataset.

Updated: 2025-07-08 12:27:54

标题: 使用荷兰档案数据进行语音表示的自监督学习

摘要: 本文探讨了使用荷兰档案电视广播数据进行自监督学习语音基础模型，具体是wav2vec 2.0的应用。我们首先研究了预训练的数据质量假设，并展示了音乐、噪音和说话者重叠如何影响SSL收敛和下游微调性能。其次，我们探讨了有效的预处理策略，将嘈杂的广播数据集转换为用于预训练的优质数据集，通过使用Whisper和WhisperX。第三，我们比较了使用相同数量数据的单语言和多语言预训练，结果显示单语言预训练对领域外数据更具鲁棒性。最后，我们通过继续使用我们的55k小时档案数据集对wav2vec 2.0 XLS-R模型检查点进行预训练，实现了荷兰语的最新的LARGE wav2vec 2.0模型。

更新时间: 2025-07-08 12:27:54

领域: cs.SD,cs.CL,cs.LG,eess.AS

下载: http://arxiv.org/abs/2507.04554v2

KD$^{2}$M: A unifying framework for feature knowledge distillation

Knowledge Distillation (KD) seeks to transfer the knowledge of a teacher, towards a student neural net. This process is often done by matching the networks' predictions (i.e., their output), but, recently several works have proposed to match the distributions of neural nets' activations (i.e., their features), a process known as \emph{distribution matching}. In this paper, we propose an unifying framework, Knowledge Distillation through Distribution Matching (KD$^{2}$M), which formalizes this strategy. Our contributions are threefold. We i) provide an overview of distribution metrics used in distribution matching, ii) benchmark on computer vision datasets, and iii) derive new theoretical results for KD.

Updated: 2025-07-08 12:27:47

标题: KD$^{2}$M：一种统一的特征知识蒸馏框架

摘要: Knowledge Distillation (KD)旨在将教师的知识转移到学生神经网络中。这个过程通常通过匹配网络的预测（即它们的输出）来完成，但最近一些研究提出了匹配神经网络激活的分布（即它们的特征），这个过程被称为“分布匹配”。在本文中，我们提出了一个统一的框架，即通过分布匹配进行知识蒸馏（KD$^{2}$M），并对这种策略进行了形式化。我们的贡献有三个方面。我们i）提供了在分布匹配中使用的分布度量的概述，ii）在计算机视觉数据集上进行了基准测试，iii）推导了知识蒸馏的新理论结果。

更新时间: 2025-07-08 12:27:47

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2504.01757v2

Online Regularized Learning Algorithms in RKHS with $β$- and $φ$-Mixing Sequences

In this paper, we study an online regularized learning algorithm in a reproducing kernel Hilbert spaces (RKHS) based on a class of dependent processes. We choose such a process where the degree of dependence is measured by mixing coefficients. As a representative example, we analyze a strictly stationary Markov chain, where the dependence structure is characterized by the $\phi$- and $\beta$-mixing coefficients. Under these assumptions, we derive probabilistic upper bounds as well as convergence rates for both the exponential and polynomial decay of the mixing coefficients.

Updated: 2025-07-08 12:25:04

标题: 在线正则化学习算法在具有$β$-和$φ$-混合序列的RKHS中的应用

摘要: 在这篇论文中，我们研究了一种基于依赖过程的再生核希尔伯特空间（RKHS）的在线正则化学习算法。我们选择了一个依赖度由混合系数衡量的过程。作为一个代表性例子，我们分析了一个严格平稳的马尔可夫链，其中依赖结构由$\phi$-和$\beta$-混合系数来表征。在这些假设下，我们得出了混合系数指数和多项式衰减的概率上界以及收敛速度。

更新时间: 2025-07-08 12:25:04

领域: stat.ML,cs.LG,math.FA,60J20, 68T05, 68Q32, 62L20, 62H05

下载: http://arxiv.org/abs/2507.05929v1

Advancing Offline Handwritten Text Recognition: A Systematic Review of Data Augmentation and Generation Techniques

Offline Handwritten Text Recognition (HTR) systems play a crucial role in applications such as historical document digitization, automatic form processing, and biometric authentication. However, their performance is often hindered by the limited availability of annotated training data, particularly for low-resource languages and complex scripts. This paper presents a comprehensive survey of offline handwritten data augmentation and generation techniques designed to improve the accuracy and robustness of HTR systems. We systematically examine traditional augmentation methods alongside recent advances in deep learning, including Generative Adversarial Networks (GANs), diffusion models, and transformer-based approaches. Furthermore, we explore the challenges associated with generating diverse and realistic handwriting samples, particularly in preserving script authenticity and addressing data scarcity. This survey follows the PRISMA methodology, ensuring a structured and rigorous selection process. Our analysis began with 1,302 primary studies, which were filtered down to 848 after removing duplicates, drawing from key academic sources such as IEEE Digital Library, Springer Link, Science Direct, and ACM Digital Library. By evaluating existing datasets, assessment metrics, and state-of-the-art methodologies, this survey identifies key research gaps and proposes future directions to advance the field of handwritten text generation across diverse linguistic and stylistic landscapes.

Updated: 2025-07-08 12:03:58

标题: 推进离线手写文本识别：数据增强和生成技术的系统综述

摘要: 离线手写文本识别（HTR）系统在历史文件数字化、自动表单处理和生物特征认证等应用中发挥着至关重要的作用。然而，它们的性能通常受到标注训练数据的有限可用性的限制，尤其是对于低资源语言和复杂脚本。本文介绍了离线手写数据增强和生成技术的全面调查，旨在提高HTR系统的准确性和稳健性。我们系统地研究了传统的增强方法以及深度学习中最新的进展，包括生成对抗网络（GANs）、扩散模型和基于变压器的方法。此外，我们探讨了生成多样且真实手写样本所面临的挑战，特别是在保留脚本真实性和解决数据稀缺性方面。本调查遵循PRISMA方法论，确保了结构化和严格的选择过程。我们的分析始于1,302篇主要研究，经过去重后筛选得到848篇，涵盖了IEEE数字图书馆、Springer Link、Science Direct和ACM数字图书馆等重要学术来源。通过评估现有数据集、评估指标和最新方法，本调查确定了关键的研究空白，并提出了未来发展方向，以推动跨越多样语言和风格景观的手写文本生成领域的发展。

更新时间: 2025-07-08 12:03:58

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.06275v1

On the Effectiveness of Methods and Metrics for Explainable AI in Remote Sensing Image Scene Classification

The development of explainable artificial intelligence (xAI) methods for scene classification problems has attracted great attention in remote sensing (RS). Most xAI methods and the related evaluation metrics in RS are initially developed for natural images considered in computer vision (CV), and their direct usage in RS may not be suitable. To address this issue, in this paper, we investigate the effectiveness of explanation methods and metrics in the context of RS image scene classification. In detail, we methodologically and experimentally analyze ten explanation metrics spanning five categories (faithfulness, robustness, localization, complexity, randomization), applied to five established feature attribution methods (Occlusion, LIME, GradCAM, LRP, and DeepLIFT) across three RS datasets. Our methodological analysis identifies key limitations in both explanation methods and metrics. The performance of perturbation-based methods, such as Occlusion and LIME, heavily depends on perturbation baselines and spatial characteristics of RS scenes. Gradient-based approaches like GradCAM struggle when multiple labels are present in the same image, while some relevance propagation methods (LRP) can distribute relevance disproportionately relative to the spatial extent of classes. Analogously, we find limitations in evaluation metrics. Faithfulness metrics share the same problems as perturbation-based methods. Localization metrics and complexity metrics are unreliable for classes with a large spatial extent. In contrast, robustness metrics and randomization metrics consistently exhibit greater stability. Our experimental results support these methodological findings. Based on our analysis, we provide guidelines for selecting explanation methods, metrics, and hyperparameters in the context of RS image scene classification.

Updated: 2025-07-08 12:00:24

标题: 关于遥感图像场景分类中可解释人工智能方法和度量的有效性

摘要: 可解释人工智能（xAI）方法在遥感（RS）中的场景分类问题的发展引起了广泛关注。大多数xAI方法和RS中相关的评估指标最初是为计算机视觉（CV）中考虑的自然图像而开发的，直接在RS中使用可能不太合适。为了解决这个问题，在本文中，我们研究了在RS图像场景分类环境中解释方法和指标的有效性。具体地，我们在三个RS数据集上系统地和实验性地分析了涵盖五个类别的十种解释指标（忠实度、稳健性、定位、复杂性、随机化），应用于五种已建立的特征归因方法（遮挡、LIME、GradCAM、LRP和DeepLIFT）。我们的方法论分析确定了解释方法和指标的关键局限性。基于扰动的方法（如遮挡和LIME）的性能严重依赖于扰动基线和RS场景的空间特征。梯度-based方法如GradCAM在同一图像中存在多个标签时会遇到困难，而一些相关传播方法（LRP）可能会不均匀地分配相关性相对于类的空间范围。类似地，我们在评估指标中发现了局限性。忠实度指标与基于扰动的方法有相同的问题。定位指标和复杂性指标对于具有较大空间范围的类不可靠。相反，稳健性指标和随机化指标始终表现出更大的稳定性。我们的实验结果支持这些方法论发现。根据我们的分析，我们提供了在RS图像场景分类环境中选择解释方法、指标和超参数的指南。

更新时间: 2025-07-08 12:00:24

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.05916v1

Diffusion Dataset Condensation: Training Your Diffusion Model Faster with Less Data

Diffusion models have achieved remarkable success in various generative tasks, but training them remains highly resource-intensive, often requiring millions of images and many days of GPU computation. From a data-centric perspective addressing this limitation, we study diffusion dataset condensation as a new and challenging problem setting. The goal is to construct a "synthetic" sub-dataset with significantly fewer samples than the original dataset, enabling high-quality diffusion model training with greatly reduced cost. To the best of our knowledge, we are the first to formally investigate dataset condensation for diffusion models, whereas prior work focused on training discriminative models. To tackle this new challenge, we propose a novel Diffusion Dataset Condensation (D2C) framework, which consists of two phases: Select and Attach. The Select phase identifies a compact and diverse subset using a diffusion difficulty score and interval sampling. The Attach phase enhances the selected subset by attaching rich semantic and visual representations to strengthen the conditional signals. Extensive experiments across various dataset sizes, model architectures, and resolutions show that our D2C framework enables significantly faster diffusion model training with dramatically fewer data, while preserving high visual quality. Notably, for the SiT-XL/2 architecture, D2C achieves a 100x training speed-up, reaching a FID score of 4.3 in just 40k steps using only 0.8% of the training data.

Updated: 2025-07-08 12:00:04

标题: 扩散数据集压缩：用更少的数据更快地训练您的扩散模型

摘要: 扩散模型在各种生成任务中取得了显著的成功，但训练它们仍然非常耗费资源，通常需要数百万张图像和多天的GPU计算。从数据中心的角度来看，我们研究了扩散数据集凝聚作为一个新的和具有挑战性的问题设置。目标是构建一个“合成”子数据集，其样本数量明显少于原始数据集，从而实现高质量的扩散模型训练，成本大大降低。据我们所知，我们是第一个正式研究扩散模型数据集凝聚的人，而先前的工作重点是训练区分模型。为了解决这一新挑战，我们提出了一个新颖的扩散数据集凝聚（D2C）框架，包括两个阶段：选择和附加。选择阶段使用扩散难度评分和间隔采样来识别一个紧凑且多样化的子集。附加阶段通过附加丰富的语义和视觉表示来增强所选子集，以加强条件信号。在各种数据集大小、模型架构和分辨率上进行了大量实验，结果显示我们的D2C框架能够实现显著更快的扩散模型训练，并且使用极少的数据，同时保持高视觉质量。值得注意的是，对于SiT-XL/2架构，D2C实现了100倍的训练加速，仅使用0.8%的训练数据在仅40000步内达到了4.3的FID分数。

更新时间: 2025-07-08 12:00:04

领域: cs.LG

下载: http://arxiv.org/abs/2507.05914v1

Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis

A simple yet effective method for inference-time alignment of generative models is Best-of-$N$ (BoN), where $N$ outcomes are sampled from a reference policy, evaluated using a proxy reward model, and the highest-scoring one is selected. While prior work argues that BoN is almost optimal in reward vs KL tradeoffs, the effectiveness of BoN depends critically on the quality of the proxy reward model used for selection. For this purpose, we study BoN through a smooth version known as Soft Best-of-N (SBoN) and develop a theoretical framework to address this gap. We analyze the scaling behaviour of BoN by providing bounds on the KL divergence between the SBoN policy and the reference policy, offering insights into how performance varies with the number of samples. We also study the regret gap, i.e., the gap between the expected true reward under the optimal policy and the SBoN policy. Our theoretical and empirical findings show that smoothing helps SBoN mitigate reward overoptimization, especially when the quality of the proxy reward is low.

Updated: 2025-07-08 11:59:48

标题: Best-of-N通过平滑镜头：KL散度和遗憾分析

摘要: 一种简单而有效的生成模型推断时间对齐方法是最佳-N（BoN），其中从参考策略中采样N个结果，使用代理奖励模型进行评估，并选择得分最高的结果。尽管之前的研究认为BoN在奖励与KL交换中几乎是最优的，但BoN的有效性在很大程度上取决于用于选择的代理奖励模型的质量。为此，我们通过一个称为软最佳-N（SBoN）的平滑版本来研究BoN，并开发一个理论框架来解决这一差距。我们通过提供SBoN策略和参考策略之间的KL散度的界限来分析BoN的扩展行为，从而揭示性能如何随样本数量变化而变化。我们还研究后悔间隙，即期望真实奖励在最优策略下与SBoN策略之间的差距。我们的理论和实证研究结果表明，平滑有助于SBoN减轻奖励过度优化，特别是当代理奖励的质量较低时。

更新时间: 2025-07-08 11:59:48

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2507.05913v1

Deep learning from strongly mixing observations: Sparse-penalized regularization and minimax optimality

The explicit regularization and optimality of deep neural networks estimators from independent data have made considerable progress recently. The study of such properties on dependent data is still a challenge. In this paper, we carry out deep learning from strongly mixing observations, and deal with the squared and a broad class of loss functions. We consider sparse-penalized regularization for deep neural network predictor. For a general framework that includes, regression estimation, classification, time series prediction,$\cdots$, oracle inequality for the expected excess risk is established and a bound on the class of H\"older smooth functions is provided. For nonparametric regression from strong mixing data and sub-exponentially error, we provide an oracle inequality for the $L_2$ error and investigate an upper bound of this error on a class of H\"older composition functions. For the specific case of nonparametric autoregression with Gaussian and Laplace errors, a lower bound of the $L_2$ error on this H\"older composition class is established. Up to logarithmic factor, this bound matches its upper bound; so, the deep neural network estimator attains the minimax optimal rate.

Updated: 2025-07-08 11:59:11

标题: 深度学习从强混合观测中：稀疏惩罚正则化和极小化最优性

摘要: 最近，深度神经网络估计器的显式正则化和最优性在独立数据方面取得了相当大的进展。然而，在相关数据上研究这些特性仍然是一个挑战。本文中，我们从强混合观测数据中进行深度学习，并处理平方和一类广泛的损失函数。我们考虑了稀疏惩罚正则化用于深度神经网络预测器。对于一个包括回归估计、分类、时间序列预测等的一般框架，建立了期望超额风险的奥拉克不等式，并提供了一类H\"older光滑函数的上界。对于从强混合数据和次指数误差中的非参数回归，我们提供了$L_2$误差的奥拉克不等式，并研究了H\"older组合函数类的这种误差的上界。对于具体的具有高斯和拉普拉斯误差的非参数自回归情况，建立了$L_2$误差在这个H\"older组合类上的下界。在对数因子上，这个下界与其上界匹配；因此，深度神经网络估计器达到了极小化最优率。

更新时间: 2025-07-08 11:59:11

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2406.08321v2

Differentiable Reward Optimization for LLM based TTS system

This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural codec language models based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF) approaches applied to TTS, DiffRO directly compute the rewards based on neural codec tokens, rather than relying on synthesized audio. Furthermore, we employ the Gumbel-Softmax technique to render the reward function differentiable, thereby streamlining the RLHF training process. Additionally, we introduce a multi-task reward (MTR) model which can provide feedback from different perspectives and find that it can augment the system's capability to follow instructions effectively.Experimental results indicate that DiffRO significantly improves the pronunciation accuracy of the TTS system, achieving state-of-the-art (SOTA) WER results on the seed-tts-eval benchmark. Moreover, with the integration of the MTR model, we demonstrate the ability to control emotional and quality attributes in a zero-shot manner.

Updated: 2025-07-08 11:57:16

标题: 基于LLM的TTS系统的可微分奖励优化

摘要: 本文提出了一种新颖的可微分奖励优化（DiffRO）方法，旨在增强基于神经编解码器语言模型的文本到语音（TTS）系统的性能。与应用于TTS的传统强化学习人类反馈（RLHF）方法相比，DiffRO直接基于神经编解码器标记计算奖励，而不是依赖于合成音频。此外，我们采用Gumbel-Softmax技术使奖励函数可微分化，从而简化了RLHF训练过程。此外，我们引入了一个多任务奖励（MTR）模型，可以从不同角度提供反馈，并发现它可以增强系统有效地遵循指令的能力。实验结果表明，DiffRO显著提高了TTS系统的发音准确性，在seed-tts-eval基准上取得了最先进的（SOTA）WER结果。此外，通过集成MTR模型，我们展示了以零样本方式控制情感和质量属性的能力。

更新时间: 2025-07-08 11:57:16

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.05911v1

Longitudinal Ensemble Integration for sequential classification with multimodal data

Effectively modeling multimodal longitudinal data is a pressing need in various application areas, especially biomedicine. Despite this, few approaches exist in the literature for this problem, with most not adequately taking into account the multimodality of the data. In this study, we developed multiple configurations of a novel multimodal and longitudinal learning framework, Longitudinal Ensemble Integration (LEI), for sequential classification. We evaluated LEI's performance, and compared it against existing approaches, for the early detection of dementia, which is among the most studied multimodal sequential classification tasks. LEI outperformed these approaches due to its use of intermediate base predictions arising from the individual data modalities, which enabled their better integration over time. LEI's design also enabled the identification of features that were consistently important across time for the effective prediction of dementia-related diagnoses. Overall, our work demonstrates the potential of LEI for sequential classification from longitudinal multimodal data.

Updated: 2025-07-08 11:51:34

标题: 多模态数据的序列分类的纵向集成

摘要: 在各种应用领域，特别是生物医学领域，有效建模多模态纵向数据是一个迫切的需求。尽管如此，在文献中针对这个问题存在的方法很少，大多数方法未能充分考虑数据的多模态性。在这项研究中，我们开发了一个新颖的多模态和纵向学习框架Longitudinal Ensemble Integration（LEI）的多种配置，用于顺序分类。我们评估了LEI的性能，并将其与现有方法进行比较，以用于早期检测痴呆症，这是最受关注的多模态顺序分类任务之一。由于LEI利用了来自各个数据模态的中间基本预测，使它们能够更好地随时间整合，因此LEI优于这些方法。LEI的设计还使得能够识别出在时间上一直重要的特征，从而有效预测与痴呆症相关的诊断。总体而言，我们的工作展示了LEI在从纵向多模态数据进行顺序分类方面的潜力。

更新时间: 2025-07-08 11:51:34

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2411.05983v2

Trust-Region Twisted Policy Improvement

Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically for RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.

Updated: 2025-07-08 11:50:25

标题: 信任域扭曲策略改进

摘要: 蒙特卡洛树搜索（MCTS）推动了深度强化学习（RL）领域许多最新突破。然而，在实践中将MCTS扩展到并行计算已被证明具有挑战性，这促使了像顺序蒙特卡洛（SMC）这样的替代规划器的出现。许多这些SMC方法采用粒子滤波器进行平滑，通过将RL重新制定为政策推断问题。然而，这些粒子滤波器的持续设计选择通常与RL中在线规划的目标相冲突，即在规划开始时获得政策改进。受MCTS的启发，我们通过改进规划器内的数据生成，通过约束动作采样和明确处理终端状态，以及改进政策和价值目标估计，专门为RL定制SMC规划者。这导致了我们的信任区域扭曲SMC（TRT-SMC），在离散和连续领域中显示出比基准MCTS和SMC方法更高的运行时间和样本效率。

更新时间: 2025-07-08 11:50:25

领域: cs.LG

下载: http://arxiv.org/abs/2504.06048v4

Unsupervised Learning for Optimal Transport plan prediction between unbalanced graphs

Optimal transport between graphs, based on Gromov-Wasserstein and other extensions, is a powerful tool for comparing and aligning graph structures. However, solving the associated non-convex optimization problems is computationally expensive, which limits the scalability of these methods to large graphs. In this work, we present Unbalanced Learning of Optimal Transport (ULOT), a deep learning method that predicts optimal transport plans between two graphs. Our method is trained by minimizing the fused unbalanced Gromov-Wasserstein (FUGW) loss. We propose a novel neural architecture with cross-attention that is conditioned on the FUGW tradeoff hyperparameters. We evaluate ULOT on synthetic stochastic block model (SBM) graphs and on real cortical surface data obtained from fMRI. ULOT predicts transport plans with competitive loss up to two orders of magnitude faster than classical solvers. Furthermore, the predicted plan can be used as a warm start for classical solvers to accelerate their convergence. Finally, the predicted transport plan is fully differentiable with respect to the graph inputs and FUGW hyperparameters, enabling the optimization of functionals of the ULOT plan.

Updated: 2025-07-08 11:47:25

标题: 无监督学习用于不平衡图之间的最优输运计划预测

摘要: 图之间的最优输运，基于Gromov-Wasserstein和其他扩展，是比较和对齐图结构的强大工具。然而，解决相关的非凸优化问题在计算上是昂贵的，这限制了这些方法在大型图上的可扩展性。在这项工作中，我们提出了Unbalanced Learning of Optimal Transport (ULOT)，这是一种深度学习方法，可以预测两个图之间的最优输运方案。我们的方法通过最小化融合不平衡Gromov-Wasserstein（FUGW）损失进行训练。我们提出了一种新颖的神经架构，具有交叉注意力，其条件是FUGW权衡超参数。我们在合成随机块模型（SBM）图和从fMRI获取的真实大脑皮层表面数据上评估了ULOT。ULOT预测的输运方案比传统求解器快两个数量级，并且具有竞争性的损失。此外，预测的方案可以作为传统求解器的热启动，加速它们的收敛。最后，预测的输运方案对图输入和FUGW超参数是完全可微的，从而可以优化ULOT方案的功能。

更新时间: 2025-07-08 11:47:25

领域: cs.LG

下载: http://arxiv.org/abs/2506.12025v3

Feature-Based vs. GAN-Based Learning from Demonstrations: When and Why

This survey provides a comparative analysis of feature-based and GAN-based approaches to learning from demonstrations, with a focus on the structure of reward functions and their implications for policy learning. Feature-based methods offer dense, interpretable rewards that excel at high-fidelity motion imitation, yet often require sophisticated representations of references and struggle with generalization in unstructured settings. GAN-based methods, in contrast, use implicit, distributional supervision that enables scalability and adaptation flexibility, but are prone to training instability and coarse reward signals. Recent advancements in both paradigms converge on the importance of structured motion representations, which enable smoother transitions, controllable synthesis, and improved task integration. We argue that the dichotomy between feature-based and GAN-based methods is increasingly nuanced: rather than one paradigm dominating the other, the choice should be guided by task-specific priorities such as fidelity, diversity, interpretability, and adaptability. This work outlines the algorithmic trade-offs and design considerations that underlie method selection, offering a framework for principled decision-making in learning from demonstrations.

Updated: 2025-07-08 11:45:51

标题: 特征驱动与生成对抗网络驱动的示教学习：何时以及为什么？

摘要: 这份调查提供了基于特征和基于GAN的方法在从示范中学习方面的比较分析，重点关注奖励函数的结构及其对策略学习的影响。基于特征的方法提供稠密、可解释的奖励，在高保真度动作模仿方面表现出色，但通常需要对参考进行复杂的表示，并在非结构化环境中很难推广。相比之下，基于GAN的方法使用隐式的分布式监督，能够实现可扩展性和适应性灵活性，但容易出现训练不稳定和粗糙奖励信号的情况。最近两种范式的进展都强调了结构化动作表示的重要性，这有助于实现更平滑的过渡、可控的合成和改善任务整合。我们认为基于特征和基于GAN的方法之间的二元对立越来越微妙：与其说一个范式主导另一个，不如根据任务特定的优先级，如保真度、多样性、可解释性和适应性来指导选择。本文概述了方法选择背后的算法权衡和设计考虑，为从示范中学习提供了一个基于原则的决策框架。

更新时间: 2025-07-08 11:45:51

领域: cs.LG,cs.AI,cs.GR,cs.RO

下载: http://arxiv.org/abs/2507.05906v1

Universal Embeddings of Tabular Data

Tabular data in relational databases represents a significant portion of industrial data. Hence, analyzing and interpreting tabular data is of utmost importance. Application tasks on tabular data are manifold and are often not specified when setting up an industrial database. To address this, we present a novel framework for generating universal, i.e., task-independent embeddings of tabular data for performing downstream tasks without predefined targets. Our method transforms tabular data into a graph structure, leverages Graph Auto-Encoders to create entity embeddings, which are subsequently aggregated to obtain embeddings for each table row, i.e., each data sample. This two-step approach has the advantage that unseen samples, consisting of similar entities, can be embedded without additional training. Downstream tasks such as regression, classification or outlier detection, can then be performed by applying a distance-based similarity measure in the embedding space. Experiments on real-world datasets demonstrate that our method achieves superior performance compared to existing universal tabular data embedding techniques.

Updated: 2025-07-08 11:45:29

标题: 表格数据的通用嵌入

摘要: 关系数据库中的表格数据代表了工业数据的一个重要部分。因此，分析和解释表格数据至关重要。表格数据的应用任务是多样的，通常在建立工业数据库时并未明确指定。为了解决这一问题，我们提出了一个新颖的框架，用于生成通用的、即与任务无关的表格数据嵌入，以便在没有预定义目标的情况下执行下游任务。我们的方法将表格数据转换为图结构，利用图自动编码器创建实体嵌入，然后将这些嵌入聚合以获得每个表格行（即每个数据样本）的嵌入。这种两步方法的优点在于，可以嵌入包含相似实体的未见样本，而无需额外的训练。然后，通过在嵌入空间中应用基于距离的相似度度量，可以执行回归、分类或异常检测等下游任务。对真实世界数据集的实验表明，与现有的通用表格数据嵌入技术相比，我们的方法实现了更优越的性能。

更新时间: 2025-07-08 11:45:29

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.05904v1

On the Fundamental Impossibility of Hallucination Control in Large Language Models

We prove that perfect hallucination control in large language models is mathematically impossible. No LLM inference mechanism can simultaneously achieve truthful response generation, semantic information conservation, relevant knowledge revelation, and knowledge-constrained optimality. This impossibility is fundamental, arising from the mathematical structure of information aggregation itself rather than engineering limitations. The proof spans three mathematical frameworks: auction theory, proper scoring theory for probabilistic predictions, and log-sum-exp analysis for transformer architectures. In each setting, we demonstrate that information aggregation creates unavoidable violations of conservation principles. The Jensen gap in transformer probability aggregation provides a direct measure of this impossibility. These results reframe hallucination from an engineering bug to an inevitable mathematical feature of distributed intelligence. There are fundamental trade-offs between truthfulness, knowledge utilization, and response completeness, providing principled foundations for managing rather than eliminating hallucination. This work reveals deep connections between neural network inference, philosophy of knowledge and reasoning, and classical results in game theory and information theory, opening new research directions for developing beneficial AI systems within mathematical constraints.

Updated: 2025-07-08 11:43:16

标题: 关于大型语言模型中幻觉控制的基本不可能性

摘要: 我们证明了在大型语言模型中完美幻觉控制在数学上是不可能的。没有LLM推理机制可以同时实现真实的响应生成、语义信息的保存、相关知识的揭示和知识约束下的最优性。这种不可能性是根本的，源于信息聚合的数学结构本身而不是工程限制。证明涵盖了三个数学框架：拍卖理论、概率预测的适当评分理论，以及Transformer架构的log-sum-exp分析。在每个设置中，我们展示了信息聚合会不可避免地违反保守原则。Transformer概率聚合中的Jensen差提供了这种不可能性的直接衡量。这些结果将幻觉从工程上的缺陷重新定位为分布式智能的必然数学特征。在真实性、知识利用和响应完整性之间存在根本的权衡，为管理幻觉而不是消除幻觉提供了原则性基础。这项工作揭示了神经网络推理、知识和推理的哲学，以及博弈论和信息论中经典结果之间的深刻联系，开辟了在数学约束下开发有益AI系统的新研究方向。

更新时间: 2025-07-08 11:43:16

领域: stat.ML,cs.AI,cs.CL,cs.GT,cs.LG

下载: http://arxiv.org/abs/2506.06382v3

Stable Acoustic Relay Assignment with High Throughput via Lase Chaos-based Reinforcement Learning

This study addresses the problem of stable acoustic relay assignment in an underwater acoustic network. Unlike the objectives of most existing literature, two distinct objectives, namely classical stable arrangement and ambiguous stable arrangement, are considered. To achieve these stable arrangements, a laser chaos-based multi-processing learning (LC-ML) method is introduced to efficiently obtain high throughput and rapidly attain stability. In order to sufficiently explore the relay's decision-making, this method uses random numbers generated by laser chaos to learn the assignment of relays to multiple source nodes. This study finds that the laser chaos-based random number and multi-processing in the exchange process have a positive effect on higher throughput and strong adaptability with environmental changing over time. Meanwhile, ambiguous cognitions result in the stable configuration with less volatility compared to accurate ones. This provides a practical and useful method and can be the basis for relay selection in complex underwater environments.

Updated: 2025-07-08 11:41:24

标题: 稳定的高吞吐量声学中继分配通过基于激光混沌的强化学习

摘要: 这项研究解决了水下声学网络中稳定声学中继分配的问题。与大多数现有文献的目标不同，考虑了两个不同的目标，即经典稳定布局和模糊稳定布局。为了实现这些稳定布局，引入了一种基于激光混沌的多处理学习（LC-ML）方法，以高效地获得高吞吐量并迅速达到稳定状态。为了充分探索中继的决策过程，该方法使用激光混沌生成的随机数来学习将中继分配给多个源节点。研究发现，激光混沌生成的随机数和交换过程中的多处理对提高吞吐量和在环境变化时具有强大适应性具有积极影响。同时，模糊认知相比准确认知导致的稳定配置具有更少的波动。这提供了一种实用和有用的方法，并可以成为复杂水下环境中中继选择的基础。

更新时间: 2025-07-08 11:41:24

领域: cs.SD,cs.LG,eess.AS,math.OC

下载: http://arxiv.org/abs/2507.05900v1

Composable Strategy Framework with Integrated Video-Text based Large Language Models for Heart Failure Assessment

Heart failure is one of the leading causes of death worldwide, with millons of deaths each year, according to data from the World Health Organization (WHO) and other public health agencies. While significant progress has been made in the field of heart failure, leading to improved survival rates and improvement of ejection fraction, there remains substantial unmet needs, due to the complexity and multifactorial characteristics. Therefore, we propose a composable strategy framework for assessment and treatment optimization in heart failure. This framework simulates the doctor-patient consultation process and leverages multi-modal algorithms to analyze a range of data, including video, physical examination, text results as well as medical history. By integrating these various data sources, our framework offers a more holistic evaluation and optimized treatment plan for patients. Our results demonstrate that this multi-modal approach outperforms single-modal artificial intelligence (AI) algorithms in terms of accuracy in heart failure (HF) prognosis prediction. Through this method, we can further evaluate the impact of various pathological indicators on HF prognosis,providing a more comprehensive evaluation.

Updated: 2025-07-08 11:39:01

标题: 可组合战略框架：集成视频文本的大型语言模型用于心力衰竭评估

摘要: 心力衰竭是全球死亡的主要原因之一，根据世界卫生组织（WHO）和其他公共卫生机构的数据，每年有数百万人死于心力衰竭。虽然在心力衰竭领域取得了显著进展，导致生存率提高和射血分数改善，但由于其复杂性和多因素特征，仍存在重大未满足需求。因此，我们提出了一个适用于心力衰竭评估和治疗优化的可组合策略框架。该框架模拟了医生与患者的咨询过程，并利用多模算法来分析各种数据，包括视频、体格检查、文本结果以及病史。通过整合这些不同的数据源，我们的框架为患者提供了更全面的评估和优化的治疗计划。我们的结果表明，这种多模态方法在心力衰竭（HF）预后预测的准确性方面胜过单一模态的人工智能（AI）算法。通过这种方法，我们可以进一步评估各种病理指标对HF预后的影响，提供更全面的评估。

更新时间: 2025-07-08 11:39:01

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2502.16548v2

MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation

Humans can imagine various atmospheres and settings when listening to music, envisioning movie scenes that complement each piece. For example, slow, melancholic music might evoke scenes of heartbreak, while upbeat melodies suggest celebration. This paper explores whether a Music Language Model, e.g. MU-LLaMA, can perform a similar task, called Music Scene Imagination (MSI), which requires cross-modal information from video and music to train. To improve upon existing music captioning models which focusing solely on musical elements, we introduce MusiScene, a music captioning model designed to imagine scenes that complement each music. In this paper, (1) we construct a large-scale video-audio caption dataset with 3,371 pairs, (2) we finetune Music Understanding LLaMA for the MSI task to create MusiScene, and (3) we conduct comprehensive evaluations and prove that our MusiScene is more capable of generating contextually relevant captions compared to MU-LLaMA. We leverage the generated MSI captions to enhance Video Background Music Generation (VBMG) from text.

Updated: 2025-07-08 11:32:02

标题: MusiScene：利用MU-LLaMA进行场景想象和增强视频背景音乐生成

摘要: 人类在听音乐时可以想象出各种大气和背景，构思出与每一首曲目相配的电影场景。例如，慢板、忧郁的音乐可能会唤起心碎的场景，而欢快的旋律则暗示着庆祝。本文探讨了一个音乐语言模型，例如MU-LLaMA，是否能够执行类似的任务，称为音乐场景想象（MSI），这需要来自视频和音乐的跨模态信息来进行训练。为了改进现有的仅关注音乐元素的音乐字幕模型，我们引入了MusiScene，这是一个旨在想象与每首音乐相配的场景的音乐字幕模型。在本文中，（1）我们构建了一个包含3,371对视频-音频字幕数据集，（2）我们微调了Music Understanding LLaMA以进行MSI任务，创建了MusiScene，（3）我们进行了全面评估，并证明我们的MusiScene在生成上下文相关字幕方面比MU-LLaMA更有能力。我们利用生成的MSI字幕来增强从文本中生成视频背景音乐的能力。

更新时间: 2025-07-08 11:32:02

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.05894v1

Decomposing the Time Series Forecasting Pipeline: A Modular Approach for Time Series Representation, Information Extraction, and Projection

With the advent of Transformers, time series forecasting has seen significant advances, yet it remains challenging due to the need for effective sequence representation, memory construction, and accurate target projection. Time series forecasting remains a challenging task, demanding effective sequence representation, meaningful information extraction, and precise future projection. Each dataset and forecasting configuration constitutes a distinct task, each posing unique challenges the model must overcome to produce accurate predictions. To systematically address these task-specific difficulties, this work decomposes the time series forecasting pipeline into three core stages: input sequence representation, information extraction and memory construction, and final target projection. Within each stage, we investigate a range of architectural configurations to assess the effectiveness of various modules, such as convolutional layers for feature extraction and self-attention mechanisms for information extraction, across diverse forecasting tasks, including evaluations on seven benchmark datasets. Our models achieve state-of-the-art forecasting accuracy while greatly enhancing computational efficiency, with reduced training and inference times and a lower parameter count. The source code is available at https://github.com/RobertLeppich/REP-Net.

Updated: 2025-07-08 11:26:42

标题: 分解时间序列预测管道：一种模块化方法用于时间序列表示、信息提取和投影

摘要: 随着Transformer的出现，时间序列预测已经取得了显著进展，但由于需要有效的序列表示、记忆构建和准确的目标投影，仍然具有挑战性。时间序列预测仍然是一项具有挑战性的任务，需要有效的序列表示、有意义的信息提取和精确的未来投影。每个数据集和预测配置都构成一个独特的任务，每个任务都提出了模型必须克服的独特挑战，以产生准确的预测。为了系统地解决这些任务特定的困难，本研究将时间序列预测流程分解为三个核心阶段：输入序列表示、信息提取和记忆构建，以及最终目标投影。在每个阶段内，我们调查了一系列的架构配置，评估了各种模块的有效性，例如用于特征提取的卷积层和用于信息提取的自注意机制，跨多样的预测任务，包括对七个基准数据集的评估。我们的模型在大大增强计算效率的同时，实现了最先进的预测准确性，降低了训练和推理时间以及参数数量。源代码可在https://github.com/RobertLeppich/REP-Net 上获得。

更新时间: 2025-07-08 11:26:42

领域: cs.AI

下载: http://arxiv.org/abs/2507.05891v1

Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that robustly measure intended traits. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-effective survey development and a deeper understanding of how LLMs replicate human-like behavior. We will publicly release our dataset and code to support future work.

Updated: 2025-07-08 11:26:03

标题: 使用具有特质-响应中介的虚拟受访者进行心理测量项目验证

摘要: 随着心理测量调查越来越多地用于评估大型语言模型（LLMs）的特征，对适用于LLMs的可扩展调查项目生成的需求也在增长。这里的一个关键挑战是确保生成项目的构造效度，即它们是否真正衡量了预期的特征。传统上，这需要昂贵的大规模人类数据收集。为了提高效率，我们提出了一个用于虚拟受访者模拟的框架，使用LLMs。我们的核心思想是考虑中介因素：通过这些因素，相同的特征可以导致对调查项目产生不同的反应。通过模拟具有不同中介因素的受访者，我们确定了能够稳健地衡量预期特征的调查项目。对三种心理特质理论（Big5、Schwartz、VIA）的实验表明，我们的中介生成方法和模拟框架有效地识别了高效度项目。LLMs展示了根据特质定义生成合理中介因素并模拟受访者行为以进行项目验证的能力。我们的问题公式化、指标、方法论和数据集为成本效益高的调查开发和更深入理解LLMs如何复制类似人类行为的新方向。我们将公开发布我们的数据集和代码，以支持未来的工作。

更新时间: 2025-07-08 11:26:03

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.05890v1

Improving Trust Estimation in Human-Robot Collaboration Using Beta Reputation at Fine-grained Timescales

When interacting with each other, humans adjust their behavior based on perceived trust. To achieve similar adaptability, robots must accurately estimate human trust at sufficiently granular timescales while collaborating with humans. Beta reputation is a popular way to formalize a mathematical estimation of human trust. However, it relies on binary performance, which updates trust estimations only after each task concludes. Additionally, manually crafting a reward function is the usual method of building a performance indicator, which is labor-intensive and time-consuming. These limitations prevent efficient capture of continuous trust changes at more granular timescales throughout the collaboration task. Therefore, this paper presents a new framework for the estimation of human trust using beta reputation at fine-grained timescales. To achieve granularity in beta reputation, we utilize continuous reward values to update trust estimates at each timestep of a task. We construct a continuous reward function using maximum entropy optimization to eliminate the need for the laborious specification of a performance indicator. The proposed framework improves trust estimations by increasing accuracy, eliminating the need to manually craft a reward function, and advancing toward the development of more intelligent robots.

Updated: 2025-07-08 11:25:50

标题: 在人机协作中利用细粒度时间尺度的Beta声誉来提高信任估计

摘要: 与人类互动时，人类会根据感知到的信任水平调整自己的行为。为了实现类似的适应性，机器人在与人类合作时必须准确地估计人类的信任水平，并在足够精细的时间尺度上进行调整。Beta声誉是一种流行的形式化数学估计人类信任的方法。然而，它依赖于二进制表现，仅在每项任务结束后更新信任估计。此外，手工制定奖励函数是构建绩效指标的常用方法，这是费时费力的。这些限制阻碍了在整个合作任务过程中以更精细的时间尺度连续捕捉信任变化的有效性。因此，本文提出了一种利用Beta声誉在细粒度时间尺度上估计人类信任的新框架。为了在Beta声誉中实现粒度，我们利用连续的奖励值来更新每个任务时间步的信任估计。我们使用最大熵优化构建连续奖励函数，以消除手工制定绩效指标的需要。所提出的框架通过提高准确性、消除手工制定奖励函数的需求，并朝着开发更智能的机器人的方向前进，改善了信任估计。

更新时间: 2025-07-08 11:25:50

领域: cs.RO,cs.AI,cs.HC,cs.LG

下载: http://arxiv.org/abs/2411.01866v2

Hierarchy or Heterarchy? A Theory of Long-Range Connections for the Sensorimotor Brain

In the traditional understanding of the neocortex, sensory information flows up a hierarchy of regions, with each level processing increasingly complex features. Information also flows down the hierarchy via a different set of connections. Although the hierarchical model has significant support, many anatomical connections do not conform to the standard hierarchical interpretation. In addition, hierarchically arranged regions sometimes respond in parallel, not sequentially as would occur in a hierarchy. This and other evidence suggests that two regions can act in parallel and hierarchically at the same time. Given this flexibility, the word "heterarchy" might be a more suitable term to describe neocortical organization. This paper proposes a new interpretation of how sensory and motor information is processed in the neocortex. The key to our proposal is what we call the "Thousand Brains Theory", which posits that every cortical column is a sensorimotor learning system. Columns learn by integrating sensory input over multiple movements of a sensor. In this view, even primary and secondary regions, such as V1 and V2, can learn and recognize complete 3D objects. This suggests that the hierarchical connections between regions are used to learn the compositional structure of parent objects composed of smaller child objects. We explain the theory by examining the different types of long-range connections between cortical regions and between the neocortex and thalamus. We describe these connections, and then suggest the specific roles they play in the context of a heterarchy of sensorimotor regions. We also suggest that the thalamus plays an essential role in transforming the pose between objects and sensors. The novel perspective we argue for here has broad implications for both neuroscience and artificial intelligence.

Updated: 2025-07-08 11:22:02

标题: 等级制还是异等制？传感运动脑长程连接的理论

摘要: 在传统的新皮层理解中，感觉信息通过一系列区域的层次结构向上流动，每个层次处理越来越复杂的特征。信息也通过另一组连接向下流动至层次结构。尽管层次模型得到了重要支持，但许多解剖连接并不符合标准的层次解释。此外，按层次排列的区域有时会并行响应，而不是按顺序进行，这与层次结构不符。这些证据及其他证据表明，两个区域可以同时并行和按层次进行操作。鉴于这种灵活性，"异质结构"可能是更适合描述新皮质组织的术语。本文提出了一种新的对新皮质中感觉和运动信息处理方式的解释。我们提出的关键是我们所称的"千脑理论"，即每个皮质柱都是一个感觉运动学习系统。柱通过在传感器多次运动中整合感觉输入来学习。在这种观点下，甚至像V1和V2这样的初级和次级区域都可以学习和识别完整的3D物体。这表明区域之间的层次连接被用于学习由较小子对象组成的父对象的组成结构。我们通过审视皮质区域和新皮质与丘脑之间不同类型的远程连接来解释这一理论。我们描述了这些连接，然后建议它们在感觉运动区域的异质性背景下所扮演的具体角色。我们还建议丘脑在转换对象和传感器之间的姿势方面发挥着重要作用。我们在这里辩护的新观点对神经科学和人工智能都具有广泛的影响。

更新时间: 2025-07-08 11:22:02

领域: q-bio.NC,cs.AI

下载: http://arxiv.org/abs/2507.05888v1

Current Practices for Building LLM-Powered Reasoning Tools Are Ad Hoc -- and We Can Do Better

There is growing excitement about building software verifiers, synthesizers, and other Automated Reasoning (AR) tools by combining traditional symbolic algorithms and Large Language Models (LLMs). Unfortunately, the current practice for constructing such neurosymbolic AR systems is an ad hoc programming model that does not have the strong guarantees of traditional symbolic algorithms, nor a deep enough synchronization of neural networks and symbolic reasoning to unlock the full potential of LLM-powered reasoning. I propose Neurosymbolic Transition Systems as a principled computational model that can underlie infrastructure for building neurosymbolic AR tools. In this model, symbolic state is paired with intuition, and state transitions operate over symbols and intuition in parallel. I argue why this new paradigm can scale logical reasoning beyond current capabilities while retaining the strong guarantees of symbolic algorithms, and I sketch out how the computational model I propose can be reified in a logic programming language.

Updated: 2025-07-08 11:19:09

标题: 目前构建LLM驱动推理工具的做法是临时的 - 我们可以做得更好

摘要: 越来越多的人对通过结合传统符号算法和大型语言模型（LLMs）来构建软件验证器、合成器和其他自动推理（AR）工具感到兴奋。不幸的是，目前构建这种神经符号AR系统的实践是一种临时编程模型，没有传统符号算法的强大保证，也没有神经网络和符号推理之间足够深入的同步，无法释放LLM驱动推理的全部潜力。我提出了神经符号转换系统作为一个有原则的计算模型，可以作为构建神经符号AR工具基础设施的基础。在这个模型中，符号状态与直觉配对，状态转换同时在符号和直觉之间操作。我论述了为什么这种新范式可以将逻辑推理扩展到超出当前能力的规模，同时保留符号算法的强大保证，并勾勒了我提出的计算模型如何在逻辑编程语言中具象化。

更新时间: 2025-07-08 11:19:09

领域: cs.AI,cs.PL

下载: http://arxiv.org/abs/2507.05886v1

Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association

Secure aggregation is motivated by federated learning (FL) where a cloud server aims to compute an averaged model (i.e., weights of deep neural networks) of the locally-trained models of numerous clients, while adhering to data security requirements. Hierarchical secure aggregation (HSA) extends this concept to a three-layer hierarchical network, where clustered users communicate with the server through an intermediate layer of relays. In HSA, beyond conventional server security, relay security is also enforced to ensure that the relays remain oblivious to the users' inputs (an abstraction of the local models in FL). Existing study on HSA assumes that each user is associated with only one relay, limiting opportunities for coding across inter-cluster users to achieve efficient communication and key generation. In this paper, we consider HSA with a cyclic association pattern where each user is connected to $B$ consecutive relays in a wrap-around manner. We propose an efficient aggregation scheme which includes a message design for the inputs inspired by gradient coding-a well-known technique for efficient communication in distributed computing-along with a highly non-trivial security key design. We also derive novel converse bounds on the minimum achievable communication and key rates using information-theoretic arguments.

Updated: 2025-07-08 11:18:54

标题: 层次安全聚合与循环用户关联的基本限制

摘要: 安全聚合的动机是联邦学习（FL），在此过程中，云服务器旨在计算众多客户端本地训练模型的平均模型（即深度神经网络的权重），同时遵守数据安全要求。分层安全聚合（HSA）将这一概念扩展到一个三层分层网络，其中聚类用户通过中间层的中继与服务器通信。在HSA中，除了传统的服务器安全性外，还强制执行中继安全性，以确保中继对用户的输入保持无知（FL中的本地模型的抽象）。现有关于HSA的研究假设每个用户只与一个中继相关联，限制了在跨集群用户之间进行编码以实现高效通信和密钥生成的机会。本文考虑了具有循环关联模式的HSA，其中每个用户以环绕方式连接到B个连续的中继。我们提出了一种高效的聚合方案，其中包括受梯度编码启发的输入消息设计-这是一种在分布式计算中实现高效通信的众所周知的技术，以及一个非常复杂的安全密钥设计。我们还使用信息论论证推导了关于最小可实现通信和密钥速率的新颖对偶界限。

更新时间: 2025-07-08 11:18:54

领域: cs.IT,cs.AI,cs.CR,cs.DC,math.IT

下载: http://arxiv.org/abs/2503.04564v5

Comparison of Path Planning Algorithms for Autonomous Vehicle Navigation Using Satellite and Airborne LiDAR Data

Autonomous vehicle navigation in unstructured environments, such as forests and mountainous regions, presents significant challenges due to irregular terrain and complex road conditions. This work provides a comparative evaluation of mainstream and well-established path planning algorithms applied to weighted pixel-level road networks derived from high-resolution satellite imagery and airborne LiDAR data. For 2D road-map navigation, where the weights reflect road conditions and terrain difficulty, A*, Dijkstra, RRT*, and a Novel Improved Ant Colony Optimization Algorithm (NIACO) are tested on the DeepGlobe satellite dataset. For 3D road-map path planning, 3D A*, 3D Dijkstra, RRT-Connect, and NIACO are evaluated using the Hamilton airborne LiDAR dataset, which provides detailed elevation information. All algorithms are assessed under identical start and end point conditions, focusing on path cost, computation time, and memory consumption. Results demonstrate that Dijkstra consistently offers the most stable and efficient performance in both 2D and 3D scenarios, particularly when operating on dense, pixel-level geospatial road-maps. These findings highlight the reliability of Dijkstra-based planning for static terrain navigation and establish a foundation for future research on dynamic path planning under complex environmental constraints.

Updated: 2025-07-08 11:15:21

标题: 使用卫星和机载LiDAR数据比较自主车辆导航的路径规划算法

摘要: 在森林和山区等无结构环境中的自主车辆导航面临着由于不规则地形和复杂道路条件所带来的重大挑战。本研究提供了对主流和成熟的路径规划算法在基于高分辨率卫星图像和航空LiDAR数据导出的加权像素级道路网络上应用的比较评估。在权重反映道路条件和地形困难的2D路网导航中，A*、Dijkstra、RRT*和一种新颖的改进蚁群优化算法（NIACO）在DeepGlobe卫星数据集上进行了测试。对于3D路网路径规划，使用提供详细高程信息的Hamilton航空LiDAR数据集评估了3D A*、3D Dijkstra、RRT-Connect和NIACO。所有算法都在相同的起点和终点条件下进行评估，重点关注路径成本、计算时间和内存消耗。结果表明，Dijkstra在2D和3D场景中始终提供最稳定和高效的性能，特别是在密集的像素级地理空间路网上运行时。这些发现强调了基于Dijkstra的规划在静态地形导航中的可靠性，并为未来研究提供了在复杂环境约束下进行动态路径规划的基础。

更新时间: 2025-07-08 11:15:21

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.05884v1

Iterative Importance Fine-tuning of Diffusion Models

Diffusion models are an important tool for generative modelling, serving as effective priors in applications such as imaging and protein design. A key challenge in applying diffusion models for downstream tasks is efficiently sampling from resulting posterior distributions, which can be addressed using the $h$-transform. This work introduces a self-supervised algorithm for fine-tuning diffusion models by estimating the $h$-transform, enabling amortised conditional sampling. Our method iteratively refines the $h$-transform using a synthetic dataset resampled with path-based importance weights. We demonstrate the effectiveness of this framework on class-conditional sampling, inverse problems and reward fine-tuning for text-to-image diffusion models.

Updated: 2025-07-08 11:14:27

标题: 扩散模型的迭代重要性微调

摘要: 扩散模型是生成建模的重要工具，在诸如图像和蛋白质设计等应用中作为有效的先验。将扩散模型应用于下游任务的一个关键挑战是高效地从结果后验分布中进行抽样，可以通过使用$h$-transform来解决这个问题。本文介绍了一种自监督算法，通过估计$h$-transform来微调扩散模型，实现摊销条件抽样。我们的方法通过使用基于路径的重要性权重对合成数据集重新采样来迭代地优化$h$-transform。我们展示了这一框架在类别条件抽样、反问题和奖励微调文本到图像扩散模型中的有效性。

更新时间: 2025-07-08 11:14:27

领域: cs.LG,eess.IV,math.PR,68T07,I.4.9; I.2.6

下载: http://arxiv.org/abs/2502.04468v2

Enhancing LLM Watermark Resilience Against Both Scrubbing and Spoofing Attacks

Watermarking is a promising defense against the misuse of large language models (LLMs), yet it remains vulnerable to scrubbing and spoofing attacks. This vulnerability stems from an inherent trade-off governed by watermark window size: smaller windows resist scrubbing better but are easier to reverse-engineer, enabling low-cost statistics-based spoofing attacks. This work breaks this trade-off by introducing a novel mechanism, equivalent texture keys, where multiple tokens within a watermark window can independently support the detection. Based on the redundancy, we propose a novel watermark scheme with Sub-vocabulary decomposed Equivalent tExture Key (SEEK). It achieves a Pareto improvement, increasing the resilience against scrubbing attacks without compromising robustness to spoofing. Experiments demonstrate SEEK's superiority over prior method, yielding spoofing robustness gains of +88.2%/+92.3%/+82.0% and scrubbing robustness gains of +10.2%/+6.4%/+24.6% across diverse dataset settings.

Updated: 2025-07-08 11:14:00

标题: 增强LLM水印对擦除和欺骗攻击的抵抗力

摘要: 数字水印技术是一种有望防止大型语言模型（LLMs）被滥用的防御方法，但仍然容易受到擦除和欺骗攻击的影响。这种脆弱性源于由水印窗口大小所控制的固有权衡：较小的窗口更能抵抗擦除，但更容易被逆向工程，从而使低成本的基于统计的欺骗攻击成为可能。本文通过引入一种新颖机制，即等效纹理密钥，打破了这种权衡，水印窗口中的多个令牌可以独立支持检测。基于冗余性，我们提出了一种新颖的水印方案，即子词汇分解等效纹理密钥（SEEK）。它实现了帕累托改进，增强了对擦除攻击的韧性，同时不影响对欺骗攻击的稳健性。实验证明，SEEK相较于先前的方法具有更高的欺骗韧性增益（+88.2%/+92.3%/+82.0%）和擦除韧性增益（+10.2%/+6.4%/+24.6%），在不同数据集设置下表现出色。

更新时间: 2025-07-08 11:14:00

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2507.06274v1

HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics

As Large Language Models (LLMs) and other AI systems evolve, robustly estimating their capabilities from inherently stochastic outputs while systematically quantifying uncertainty in these estimates becomes increasingly important. Further, advanced AI evaluations often have a nested hierarchical structure, exhibit high levels of complexity, and come with high costs in testing the most advanced AI systems. To address these challenges, we introduce HiBayES, a generalizable Hierarchical Bayesian modeling framework for AI Evaluation Statistics. HiBayES supports robust inferences in classical question-answer benchmarks and advanced agentic evaluations, particularly in low-data scenarios (e.g., < 20 data points per evaluation). Built on Generalized Linear Models (GLMs), Bayesian data analysis, and formal model comparison, HiBayES provides principled uncertainty quantification and robust parameter estimation. This paper offers a comprehensive introduction to HiBayES, including illustrative examples, comparisons to conventional statistical methods, and practical guidance for implementing multilevel Bayesian GLMs. Additionally, we provide a HiBayES software package [4] (Beta version) for out-of-the-box implementation.

Updated: 2025-07-08 11:06:22

标题: HiBayES：一种用于AI评估统计的分层贝叶斯建模框架

摘要: 随着大型语言模型（LLMs）和其他人工智能系统的发展，从固有随机输出中稳健地估计它们的能力，并系统地量化这些估计中的不确定性变得越来越重要。此外，先进的人工智能评估往往具有嵌套的层次结构，表现出高复杂性水平，并伴随着在测试最先进的人工智能系统方面的高成本。为了解决这些挑战，我们介绍了HiBayES，这是一个通用的用于人工智能评估统计的分层贝叶斯建模框架。HiBayES支持在经典的问题-回答基准和高级主体评估中进行稳健的推断，特别是在低数据场景下（例如，每个评估<20个数据点）。基于广义线性模型（GLMs）、贝叶斯数据分析和正式模型比较，HiBayES提供了原理上的不确定性量化和稳健的参数估计。本文全面介绍了HiBayES，包括示例、与传统统计方法的比较，以及实施多级贝叶斯GLMs的实用指导。此外，我们提供了一个HiBayES软件包（Beta版本），可立即使用。

更新时间: 2025-07-08 11:06:22

领域: cs.AI,stat.AP

下载: http://arxiv.org/abs/2505.05602v2

Post-Processing in Local Differential Privacy: An Extensive Evaluation and Benchmark Platform

Local differential privacy (LDP) has recently gained prominence as a powerful paradigm for collecting and analyzing sensitive data from users' devices. However, the inherent perturbation added by LDP protocols reduces the utility of the collected data. To mitigate this issue, several post-processing (PP) methods have been developed. Yet, the comparative performance of PP methods under diverse settings remains underexplored. In this paper, we present an extensive benchmark comprising 6 popular LDP protocols, 7 PP methods, 4 utility metrics, and 6 datasets to evaluate the behaviors and optimality of PP methods under diverse conditions. Through extensive experiments, we show that while PP can substantially improve utility when the privacy budget is small (i.e., strict privacy), its benefit diminishes as the privacy budget grows. Moreover, our findings reveal that the optimal PP method depends on multiple factors, including the choice of LDP protocol, privacy budget, data characteristics (such as distribution and domain size), and the specific utility metric. To advance research in this area and assist practitioners in identifying the most suitable PP method for their setting, we introduce LDP$^3$, an open-source benchmark platform. LDP$^3$ contains all methods used in our experimental analysis, and it is designed in a modular, extensible, and multi-threaded way for future use and development.

Updated: 2025-07-08 10:59:49

标题: 局部差分隐私中的后处理：一个广泛评估和基准平台

摘要: 局部差分隐私（LDP）最近作为从用户设备收集和分析敏感数据的强大范例而备受关注。然而，LDP协议所添加的固有扰动降低了收集数据的效用。为了缓解这一问题，已经开发了几种后处理（PP）方法。然而，在不同设置下PP方法的比较性能仍未得到充分探讨。在本文中，我们提出了一个广泛的基准测试，包括6种流行的LDP协议、7种PP方法、4种效用度量和6个数据集，以评估不同条件下PP方法的行为和最佳性能。通过广泛的实验，我们发现，尽管在隐私预算较小（即严格隐私）时PP可以显著提高效用，但随着隐私预算的增加，其效益逐渐减弱。此外，我们的研究结果表明，最佳的PP方法取决于多个因素，包括LDP协议的选择、隐私预算、数据特征（如分布和域大小）以及具体的效用度量。为了推进这一领域的研究并帮助从业者确定其设置中最合适的PP方法，我们引入了LDP$^3$，一个开源的基准平台。LDP$^3$包含我们实验分析中使用的所有方法，并设计为模块化、可扩展和多线程方式，以便未来使用和开发。

更新时间: 2025-07-08 10:59:49

领域: cs.CR

下载: http://arxiv.org/abs/2507.05875v1

Robust Power System State Estimation using Physics-Informed Neural Networks

Modern power systems face significant challenges in state estimation and real-time monitoring, particularly regarding response speed and accuracy under faulty conditions or cyber-attacks. This paper proposes a hybrid approach using physics-informed neural networks (PINNs) to enhance the accuracy and robustness, of power system state estimation. By embedding physical laws into the neural network architecture, PINNs improve estimation accuracy for transmission grid applications under both normal and faulty conditions, while also showing potential in addressing security concerns such as data manipulation attacks. Experimental results show that the proposed approach outperforms traditional machine learning models, achieving up to 83% higher accuracy on unseen subsets of the training dataset and 65% better performance on entirely new, unrelated datasets. Experiments also show that during a data manipulation attack against a critical bus in a system, the PINN can be up to 93% more accurate than an equivalent neural network.

Updated: 2025-07-08 10:58:13

标题: 使用基于物理信息的神经网络进行强健的电力系统状态估计

摘要: 现代电力系统在状态估计和实时监测方面面临着重大挑战，特别是在故障条件或网络攻击下的响应速度和准确性。本文提出了一种使用物理信息神经网络（PINNs）的混合方法，以增强电力系统状态估计的准确性和稳健性。通过将物理定律嵌入神经网络架构中，PINNs提高了在传输网格应用中的估计准确性，无论是在正常还是故障条件下，同时还显示出在解决数据操纵攻击等安全问题方面的潜力。实验结果表明，所提出的方法优于传统机器学习模型，在训练数据集的未知子集上实现了高达83%的更高准确性，并在全新、无关的数据集上表现出65%的更好性能。实验还表明，在对系统中的关键总线进行数据操纵攻击时，PINN的准确性可能比等价的神经网络高出93%。

更新时间: 2025-07-08 10:58:13

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.05874v1

LDP$^3$: An Extensible and Multi-Threaded Toolkit for Local Differential Privacy Protocols and Post-Processing Methods

Local differential privacy (LDP) has become a prominent notion for privacy-preserving data collection. While numerous LDP protocols and post-processing (PP) methods have been developed, selecting an optimal combination under different privacy budgets and datasets remains a challenge. Moreover, the lack of a comprehensive and extensible LDP benchmarking toolkit raises difficulties in evaluating new protocols and PP methods. To address these concerns, this paper presents LDP$^3$ (pronounced LDP-Cube), an open-source, extensible, and multi-threaded toolkit for LDP researchers and practitioners. LDP$^3$ contains implementations of several LDP protocols, PP methods, and utility metrics in a modular and extensible design. Its modular design enables developers to conveniently integrate new protocols and PP methods. Furthermore, its multi-threaded nature enables significant reductions in execution times via parallelization. Experimental evaluations demonstrate that: (i) using LDP$^3$ to select a good protocol and post-processing method substantially improves utility compared to a bad or random choice, and (ii) the multi-threaded design of LDP$^3$ brings substantial benefits in terms of efficiency.

Updated: 2025-07-08 10:51:42

标题: LDP$^3$: 用于本地差分隐私协议和后处理方法的可扩展多线程工具包

摘要: 局部差分隐私（LDP）已成为隐私保护数据收集的重要概念。尽管已经开发了许多LDP协议和后处理（PP）方法，但在不同隐私预算和数据集下选择最佳组合仍然是一个挑战。此外，缺乏一个全面和可扩展的LDP基准工具包使得评估新协议和PP方法变得困难。为了解决这些问题，本文介绍了LDP$^3$（读作LDP-Cube），这是一个面向LDP研究人员和从业者的开源、可扩展和多线程工具包。LDP$^3$包含了几种LDP协议、PP方法和实用度指标的实现，采用了模块化和可扩展的设计。其模块化设计使开发人员可以方便地集成新的协议和PP方法。此外，其多线程性质通过并行化显著减少了执行时间。实验评估表明：（i）使用LDP$^3$选择一个好的协议和后处理方法相比于选择一个糟糕或随机的选择显著提高了实用性，（ii）LDP$^3$的多线程设计在效率方面带来了显著的好处。

更新时间: 2025-07-08 10:51:42

领域: cs.CR

下载: http://arxiv.org/abs/2507.05872v1

Bayesian Hierarchical Invariant Prediction

We propose Bayesian Hierarchical Invariant Prediction (BHIP) reframing Invariant Causal Prediction (ICP) through the lens of Hierarchical Bayes. We leverage the hierarchical structure to explicitly test invariance of causal mechanisms under heterogeneous data, resulting in improved computational scalability for a larger number of predictors compared to ICP. Moreover, given its Bayesian nature BHIP enables the use of prior information. In this paper, we test two sparsity inducing priors: horseshoe and spike-and-slab, both of which allow us a more reliable identification of causal features. We test BHIP in synthetic and real-world data showing its potential as an alternative inference method to ICP.

Updated: 2025-07-08 10:51:36

标题: 贝叶斯层次不变预测

摘要: 我们提出了贝叶斯分层不变预测（BHIP），通过分层贝叶斯的视角重新构建了不变因果预测（ICP）。我们利用分层结构明确测试因果机制在异质数据下的不变性，相比ICP，这使得我们在更多预测变量的情况下具有更好的计算可扩展性。此外，由于其贝叶斯性质，BHIP可以利用先验信息。在本文中，我们测试了两种稀疏诱导先验：horseshoe和spike-and-slab，这两种先验都让我们更可靠地识别因果特征。我们在合成和真实数据中测试了BHIP，并展示了它作为ICP的替代推断方法的潜力。

更新时间: 2025-07-08 10:51:36

领域: cs.LG,cs.AI,stat.ME,stat.ML

下载: http://arxiv.org/abs/2505.11211v2

CogniPlay: a work-in-progress Human-like model for General Game Playing

While AI systems have equaled or surpassed human performance in a wide variety of games such as Chess, Go, or Dota 2, describing these systems as truly "human-like" remains far-fetched. Despite their success, they fail to replicate the pattern-based, intuitive decision-making processes observed in human cognition. This paper presents an overview of findings from cognitive psychology and previous efforts to model human-like behavior in artificial agents, discusses their applicability to General Game Playing (GGP) and introduces our work-in-progress model based on these observations: CogniPlay.

Updated: 2025-07-08 10:48:29

标题: CogniPlay：一个进行中的通用游戏模型，类似于人类模型

摘要: 尽管人工智能系统在诸如国际象棋、围棋或Dota 2等各种游戏中已经达到或超过了人类的表现，但将这些系统描述为真正“类人”的仍然是牵强的。尽管它们取得了成功，但它们未能复制人类认知中观察到的基于模式的直觉决策过程。本文概述了认知心理学的研究结果以及先前努力在人工智能代理中建模类人行为的情况，讨论了它们对通用游戏玩法（GGP）的适用性，并介绍了我们基于这些观察结果的进行中的模型：CogniPlay。

更新时间: 2025-07-08 10:48:29

领域: cs.AI

下载: http://arxiv.org/abs/2507.05868v1

Communication-Efficient Module-Wise Federated Learning for Grasp Pose Detection in Cluttered Environments

Grasp pose detection (GPD) is a fundamental capability for robotic autonomy, but its reliance on large, diverse datasets creates significant data privacy and centralization challenges. Federated Learning (FL) offers a privacy-preserving solution, but its application to GPD is hindered by the substantial communication overhead of large models, a key issue for resource-constrained robots. To address this, we propose a novel module-wise FL framework that begins by analyzing the learning dynamics of the GPD model's functional components. This analysis identifies slower-converging modules, to which our framework then allocates additional communication effort. This is realized through a two-phase process: a standard full-model training phase is followed by a communication-efficient phase where only the identified subset of slower-converging modules is trained and their partial updates are aggregated. Extensive experiments on the GraspNet-1B dataset demonstrate that our method outperforms standard FedAvg and other baselines, achieving higher accuracy for a given communication budget. Furthermore, real-world experiments on a physical robot validate our approach, showing a superior grasp success rate compared to baseline methods in cluttered scenes. Our work presents a communication-efficient framework for training robust, generalized GPD models in a decentralized manner, effectively improving the trade-off between communication cost and model performance.

Updated: 2025-07-08 10:40:49

标题: 在混乱环境中进行握持姿势检测的通信高效模块化联邦学习

摘要: 抓取姿势检测（GPD）是机器人自主性的基本能力，但其依赖于大规模、多样化的数据集带来了重大的数据隐私和集中化挑战。联邦学习（FL）提供了一种保护隐私的解决方案，但其在GPD上的应用受到大型模型的重要通信开销的限制，这是资源受限的机器人的关键问题。为了解决这个问题，我们提出了一种新颖的基于模块的FL框架，首先分析GPD模型的功能组件的学习动态。这种分析确定了收敛速度较慢的模块，我们的框架随后将额外的通信努力分配给这些模块。这是通过一个两阶段过程实现的：一个标准的全模型训练阶段后面是一个通信高效的阶段，只有被识别出来的较慢收敛的模块被训练并且它们的部分更新被聚合。对GraspNet-1B数据集的大量实验证明了我们的方法优于标准的FedAvg和其他基准方法，在给定的通信预算下实现了更高的准确性。此外，在物理机器人上的实际实验验证了我们的方法，在混乱的场景中显示出比基准方法更高的抓取成功率。我们的工作提出了一个通信高效的框架，以分散的方式训练稳健、泛化的GPD模型，有效改善了通信成本和模型性能之间的平衡。

更新时间: 2025-07-08 10:40:49

领域: cs.RO,cs.LG

下载: http://arxiv.org/abs/2507.05861v1

Quantum QSAR for drug discovery

Quantitative Structure-Activity Relationship (QSAR) modeling is key in drug discovery, but classical methods face limitations when handling high-dimensional data and capturing complex molecular interactions. This research proposes enhancing QSAR techniques through Quantum Support Vector Machines (QSVMs), which leverage quantum computing principles to process information Hilbert spaces. By using quantum data encoding and quantum kernel functions, we aim to develop more accurate and efficient predictive models.

Updated: 2025-07-08 10:40:25

标题: 量子定量构效关系在药物发现中的应用

摘要: 定量构效关系（QSAR）建模在药物发现中起着关键作用，但传统方法在处理高维数据和捕捉复杂分子相互作用时存在局限性。本研究提出通过量子支持向量机（QSVMs）增强QSAR技术，利用量子计算原理处理希尔伯特空间中的信息。通过使用量子数据编码和量子核函数，我们旨在开发更准确和高效的预测模型。

更新时间: 2025-07-08 10:40:25

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2505.04648v2

Property Elicitation on Imprecise Probabilities

Property elicitation studies which attributes of a probability distribution can be determined by minimising a risk. We investigate a generalisation of property elicitation to imprecise probabilities (IP). This investigation is motivated by multi-distribution learning, which takes the classical machine learning paradigm of minimising a single risk over a (precise) probability and replaces it with $\Gamma$-maximin risk minimization over an IP. We provide necessary conditions for elicitability of a IP-property. Furthermore, we explain what an elicitable IP-property actually elicits through Bayes pairs -- the elicited IP-property is the corresponding standard property of the maximum Bayes risk distribution.

Updated: 2025-07-08 10:36:49

标题: 模糊概率的属性引出

摘要: 财产引诱研究表明，通过最小化风险可以确定概率分布的哪些属性。我们研究了对不精确概率（IP）的财产引诱的泛化。这项研究受到多分布学习的启发，该学习采用了经典机器学习范式，即通过对（精确）概率最小化单一风险，并将其替换为通过IP最小化$\Gamma$-maximin风险。我们提供了IP-属性的引诱性所需条件。此外，我们通过贝叶斯对来解释引诱性IP-属性实际上通过最大贝叶斯风险分布的相应标准属性引诱了什么。

更新时间: 2025-07-08 10:36:49

领域: stat.ML,cs.LG,math.ST,stat.TH

下载: http://arxiv.org/abs/2507.05857v1

Optimal Transport for Domain Adaptation through Gaussian Mixture Models

Machine learning systems operate under the assumption that training and test data are sampled from a fixed probability distribution. However, this assumptions is rarely verified in practice, as the conditions upon which data was acquired are likely to change. In this context, the adaptation of the unsupervised domain requires minimal access to the data of the new conditions for learning models robust to changes in the data distribution. Optimal transport is a theoretically grounded tool for analyzing changes in distribution, especially as it allows the mapping between domains. However, these methods are usually computationally expensive as their complexity scales cubically with the number of samples. In this work, we explore optimal transport between Gaussian Mixture Models (GMMs), which is conveniently written in terms of the components of source and target GMMs. We experiment with 9 benchmarks, with a total of $85$ adaptation tasks, showing that our methods are more efficient than previous shallow domain adaptation methods, and they scale well with number of samples $n$ and dimensions $d$.

Updated: 2025-07-08 10:35:31

标题: 高斯混合模型在领域自适应中的最优传输

摘要: 机器学习系统在假设训练和测试数据是从固定概率分布中抽样的前提下运作。然而，在实践中很少验证这种假设，因为获取数据的条件很可能会发生改变。在这种情况下，无监督领域的适应性需要对新条件下的数据进行最少访问，以便学习模型能够抵御数据分布变化。最优传输是一种理论上基础的工具，用于分析分布的变化，特别是它允许在领域之间进行映射。然而，这些方法通常计算开销很大，因为它们的复杂度随着样本数量的增加呈立方增长。在这项工作中，我们探讨了高斯混合模型（GMMs）之间的最优传输，这方便地以源和目标GMMs的组件为基础。我们在9个基准测试中进行了实验，共计85个适应性任务，结果显示我们的方法比以往的浅层领域适应方法更有效，并且能够很好地适应样本数量n和维度d的变化。

更新时间: 2025-07-08 10:35:31

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2403.13847v3

Prototype-Guided and Lightweight Adapters for Inherent Interpretation and Generalisation in Federated Learning

Federated learning (FL) provides a promising paradigm for collaboratively training machine learning models across distributed data sources while maintaining privacy. Nevertheless, real-world FL often faces major challenges including communication overhead during the transfer of large model parameters and statistical heterogeneity, arising from non-identical independent data distributions across clients. In this work, we propose an FL framework that 1) provides inherent interpretations using prototypes, and 2) tackles statistical heterogeneity by utilising lightweight adapter modules to act as compressed surrogates of local models and guide clients to achieve generalisation despite varying client distribution. Each client locally refines its model by aligning class embeddings toward prototype representations and simultaneously adjust the lightweight adapter. Our approach replaces the need to communicate entire model weights with prototypes and lightweight adapters. This design ensures that each client's model aligns with a globally shared structure while minimising communication load and providing inherent interpretations. Moreover, we conducted our experiments on a real-world retinal fundus image dataset, which provides clinical-site information. We demonstrate inherent interpretable capabilities and perform a classification task, which shows improvements in accuracy over baseline algorithms.

Updated: 2025-07-08 10:30:08

标题: 原型引导和轻量级适配器在联邦学习中的固有解释和泛化

摘要: 联邦学习（FL）为在分布式数据源之间协作训练机器学习模型提供了一个有前途的范式，同时保持隐私。然而，现实世界中的FL经常面临重大挑战，包括在传输大型模型参数时的通信开销和统计异质性，这是由于客户端之间存在非相同的独立数据分布。在这项工作中，我们提出了一个FL框架，该框架1）使用原型提供固有的解释，并且2）通过利用轻量级适配器模块来解决统计异质性，这些模块充当本地模型的压缩替代物，并指导客户端实现泛化，尽管客户端分布不同。每个客户端通过将类嵌入调整到原型表示来本地优化其模型，并同时调整轻量级适配器。我们的方法通过原型和轻量级适配器取代了需要传输整个模型权重的需求。这种设计确保每个客户端的模型与全局共享结构保持一致，同时最小化通信负载并提供固有解释。此外，我们在一个真实的视网膜底图像数据集上进行了实验，该数据集提供了临床站点信息。我们展示了固有的可解释能力，并执行了一个分类任务，结果显示与基线算法相比准确性有所提升。

更新时间: 2025-07-08 10:30:08

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2507.05852v1

Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from downstream tasks training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features, and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that - like us - use only public resources. Our approach also yields 37.7\% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4\%.

Updated: 2025-07-08 10:07:58

标题: 像测试一样筛选：用于CLIP预训练的数据驱动数据过滤

摘要: 我们介绍了Filter Like You Test (FLYT)，这是一个用于筛选大规模视觉-语言数据集的算法，它学习了每个数据点作为预训练示例的有用性。 FLYT训练一个评分模型，该模型通过下游任务训练集的梯度信号学习如何对每个示例的特征进行加权。基于FLYT，我们实现了Mixing-FLYT (M-FLYT)，它将不同评分方法生成的每个示例得分作为特征，并学习将它们统一为一个单一得分。FLYT自然地产生了训练示例的分布，我们通过Soft Cap Sampling (SCS)利用这一特性，这是一种通过样本示例而避免重复表示的策略，用于从每个示例的概率中获取经过筛选的预训练数据集。使用这些方法，我们在DataComp中等规模过滤基准测试中实现了40.1%的ImageNet零样本准确率，比所有先前结果提高了2%的绝对准确度，比我们之前的使用公共资源的结果提高了5.5%。我们的方法还在38个DataComp评估任务的平均值上获得了37.7\%，比之前使用公共资源方法提高了0.4\%。

更新时间: 2025-07-08 10:07:58

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2503.08805v2

Detecting value-expressive text posts in Russian social media

Basic values are concepts or beliefs which pertain to desirable end-states and transcend specific situations. Studying personal values in social media can illuminate how and why societal values evolve especially when the stimuli-based methods, such as surveys, are inefficient, for instance, in hard-to-reach populations. On the other hand, user-generated content is driven by the massive use of stereotyped, culturally defined speech constructions rather than authentic expressions of personal values. We aimed to find a model that can accurately detect value-expressive posts in Russian social media VKontakte. A training dataset of 5,035 posts was annotated by three experts, 304 crowd-workers and ChatGPT. Crowd-workers and experts showed only moderate agreement in categorizing posts. ChatGPT was more consistent but struggled with spam detection. We applied an ensemble of human- and AI-assisted annotation involving active learning approach, subsequently trained several classification models using embeddings from various pre-trained transformer-based language models. The best performance was achieved with embeddings from a fine-tuned rubert-tiny2 model, yielding high value detection quality (F1 = 0.75, F1-macro = 0.80). This model provides a crucial step to a study of values within and between Russian social media users.

Updated: 2025-07-08 10:07:06

标题: 在俄罗斯社交媒体中检测表达价值观的文本帖子

摘要: 基本价值是指与理想终极状态相关的概念或信念，超越特定情境。研究社交媒体上的个人价值观可以阐明社会价值观如何以及为何发展，特别是当基于刺激的方法，如调查，在难以接触的人群中效率低下时。另一方面，用户生成的内容受到大量使用刻板化、文化定义的语言结构的驱动，而不是个人价值观的真实表达。我们旨在找到一个能够准确检测俄罗斯社交媒体VKontakte中价值表达帖子的模型。由三位专家、304名众包工作者和ChatGPT标注的训练数据集包括5,035篇帖子。众包工作者和专家在分类帖子时仅有适度的一致性。ChatGPT更为一致，但在垃圾检测方面遇到困难。我们应用了人工和AI辅助标注的集成方法，采用主动学习方法，随后使用来自各种预训练的基于变压器的语言模型的嵌入训练了几个分类模型。通过来自经过微调的rubert-tiny2模型的嵌入实现了最佳性能，产生了高价值检测质量（F1 = 0.75，F1-macro = 0.80）。该模型为研究俄罗斯社交媒体用户之间和内部的价值观提供了重要一步。

更新时间: 2025-07-08 10:07:06

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2312.08968v2

Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge

Computer-Assisted Pronunciation Training (CAPT) systems employ automatic measures of pronunciation quality, such as the goodness of pronunciation (GOP) metric. GOP relies on forced alignments, which are prone to labeling and segmentation errors due to acoustic variability. While alignment-free methods address these challenges, they are computationally expensive and scale poorly with phoneme sequence length and inventory size. To enhance efficiency, we introduce a substitution-aware alignment-free GOP that restricts phoneme substitutions based on phoneme clusters and common learner errors. We evaluated our GOP on two L2 English speech datasets, one with child speech, My Pronunciation Coach (MPC), and SpeechOcean762, which includes child and adult speech. We compared RPS (restricted phoneme substitutions) and UPS (unrestricted phoneme substitutions) setups within alignment-free methods, which outperformed the baseline. We discuss our results and outline avenues for future research.

Updated: 2025-07-08 10:03:52

标题: 利用音韵知识提升基于CTC的发音错误检测

摘要: 计算机辅助发音训练（Computer-Assisted Pronunciation Training，CAPT）系统采用自动发音质量度量，如发音好坏（GOP）指标。GOP依赖于强制对齐，由于声学变异性导致标签和分割错误。虽然无对齐方法可以解决这些挑战，但在计算上昂贵，并且随着音素序列长度和库存量的增加而扩展性差。为了提高效率，我们引入了一种基于音素簇和常见学习者错误的替代感知无对齐GOP。我们在两个L2英语语音数据集上评估了我们的GOP，一个是儿童语音My Pronunciation Coach（MPC），另一个是包括儿童和成人语音的SpeechOcean762。我们比较了无对齐方法中的RPS（受限音素替换）和UPS（无限制音素替换）设置，这些设置优于基线。我们讨论了我们的结果并概述了未来研究的途径。

更新时间: 2025-07-08 10:03:52

领域: eess.AS,cs.AI

下载: http://arxiv.org/abs/2506.02080v2

GC-GAT: Multimodal Vehicular Trajectory Prediction using Graph Goal Conditioning and Cross-context Attention

Predicting future trajectories of surrounding vehicles heavily relies on what contextual information is given to a motion prediction model. The context itself can be static (lanes, regulatory elements, etc) or dynamic (traffic participants). This paper presents a lane graph-based motion prediction model that first predicts graph-based goal proposals and later fuses them with cross attention over multiple contextual elements. We follow the famous encoder-interactor-decoder architecture where the encoder encodes scene context using lightweight Gated Recurrent Units, the interactor applies cross-context attention over encoded scene features and graph goal proposals, and the decoder regresses multimodal trajectories via Laplacian Mixture Density Network from the aggregated encodings. Using cross-attention over graph-based goal proposals gives robust trajectory estimates since the model learns to attend to future goal-relevant scene elements for the intended agent. We evaluate our work on nuScenes motion prediction dataset, achieving state-of-the-art results.

Updated: 2025-07-08 10:01:00

标题: GC-GAT：使用图形目标条件和跨上下文注意力的多模式车辆轨迹预测

摘要: 预测周围车辆未来轨迹严重依赖于向运动预测模型提供的上下文信息。上下文本身可以是静态的（车道、规则元素等）或动态的（交通参与者）。本文提出了一种基于车道图的运动预测模型，首先预测基于图的目标提案，然后通过跨多个上下文元素进行交叉注意力融合。我们采用了著名的编码器-交互器-解码器架构，其中编码器使用轻量级门控循环单元对场景上下文进行编码，交互器在编码的场景特征和图形目标提案上应用跨上下文注意力，解码器通过Laplacian混合密度网络从聚合的编码中回归多模态轨迹。使用跨图形目标提案的交叉注意力可以为轨迹提供稳健的估计，因为模型学会关注未来与目标相关的场景元素。我们在nuScenes运动预测数据集上评估了我们的工作，取得了最先进的结果。

更新时间: 2025-07-08 10:01:00

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2504.11150v2

Evaluating Logit-Based GOP Scores for Mispronunciation Detection

Pronunciation assessment relies on goodness of pronunciation (GOP) scores, traditionally derived from softmax-based posterior probabilities. However, posterior probabilities may suffer from overconfidence and poor phoneme separation, limiting their effectiveness. This study compares logit-based GOP scores with probability-based GOP scores for mispronunciation detection. We conducted our experiment on two L2 English speech datasets spoken by Dutch and Mandarin speakers, assessing classification performance and correlation with human ratings. Logit-based methods outperform probability-based GOP in classification, but their effectiveness depends on dataset characteristics. The maximum logit GOP shows the strongest alignment with human perception, while a combination of different GOP scores balances probability and logit features. The findings suggest that hybrid GOP methods incorporating uncertainty modeling and phoneme-specific weighting improve pronunciation assessment.

Updated: 2025-07-08 09:58:20

标题: 评估基于Logit的GOP分数用于发音错误检测

摘要: 发音评估依赖于发音质量（GOP）分数，传统上是从基于softmax的后验概率导出的。然而，后验概率可能存在过度自信和音素分离不佳的问题，从而限制了它们的有效性。本研究比较了基于logit的GOP分数和基于概率的GOP分数在发音错误检测方面的表现。我们在两个由荷兰和普通话说话者说话的L2英语语音数据集上进行了实验，评估了分类性能和与人类评分的相关性。基于logit的方法在分类方面胜过基于概率的GOP，但它们的有效性取决于数据集的特征。最大logit GOP与人类感知之间的对齐最强，而不同GOP分数的组合平衡了概率和logit特征。研究结果表明，将不确定性建模和音素特定加权纳入的混合GOP方法改善了发音评估。

更新时间: 2025-07-08 09:58:20

领域: eess.AS,cs.AI,cs.SD

下载: http://arxiv.org/abs/2506.12067v2

PDFMathTranslate: Scientific Document Translation Preserving Layouts

Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world's first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work has been open-sourced at https://github.com/byaidu/pdfmathtranslate with more than 222k downloads.

Updated: 2025-07-08 09:51:21

标题: PDFMathTranslate：保持布局的科学文档翻译

摘要: 科学文献中的语言障碍阻碍了科学技术的传播和发展。然而，先前翻译此类文档的努力在很大程度上忽视了布局中的信息。为了弥补这一差距，我们推出了PDFMathTranslate，这是世界上第一个保留布局的开源软件，用于翻译科学文档。利用最新的大型语言模型和精确的布局检测技术，我们为社区贡献了关键的改进，包括精度、灵活性和效率。这项工作已在https://github.com/byaidu/pdfmathtranslate上开源，下载量超过22.2万次。

更新时间: 2025-07-08 09:51:21

领域: cs.CL,cs.IR,cs.LG,68T50, 68T45, 68U10, 68U15,D.2.2; I.2.10; I.2.7; J.0

下载: http://arxiv.org/abs/2507.03009v2

Intra-DP: A High Performance Collaborative Inference System for Mobile Edge Computing

Deploying deep neural networks (DNNs) on resource-constrained mobile devices presents significant challenges, particularly in achieving real-time performance while simultaneously coping with limited computational resources and battery life. While Mobile Edge Computing (MEC) offers collaborative inference with GPU servers as a promising solution, existing approaches primarily rely on layer-wise model partitioning and undergo significant transmission bottlenecks caused by the sequential execution of DNN operations. To address this challenge, we present Intra-DP, a high-performance collaborative inference system optimized for DNN inference on MEC. Intra DP employs a novel parallel computing technique based on local operators (i.e., operators whose minimum unit input is not the entire input tensor, such as the convolution kernel). By decomposing their computations (operations) into several independent sub-operations and overlapping the computation and transmission of different sub-operations through parallel execution, Intra-DP mitigates transmission bottlenecks in MEC, achieving fast and energy-efficient inference. The evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines, without sacrificing accuracy.

Updated: 2025-07-08 09:50:57

标题: Intra-DP: 移动边缘计算的高性能协作推理系统

摘要: 在资源受限的移动设备上部署深度神经网络（DNNs）面临着重大挑战，特别是在实现实时性能的同时，还需要应对有限的计算资源和电池寿命。虽然移动边缘计算（MEC）提供了与GPU服务器协同推理的解决方案，但现有方法主要依赖于分层模型分区，并由于DNN操作的顺序执行而遭受严重的传输瓶颈问题。为了解决这一挑战，我们提出了Intra-DP，一种针对MEC上DNN推理进行优化的高性能协同推理系统。Intra DP采用了一种基于本地操作员（即，最小单位输入不是整个输入张量的操作员，如卷积核）的新型并行计算技术。通过将它们的计算（操作）分解为多个独立的子操作，并通过并行执行重叠不同子操作的计算和传输，Intra-DP减轻了MEC中的传输瓶颈，实现了快速和高效的推理。评估结果表明，与最先进的基准方法相比，Intra-DP将推理延迟降低了高达50％，能耗降低了高达75％，而不会牺牲准确性。

更新时间: 2025-07-08 09:50:57

领域: cs.NI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.05829v1

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-bystep reasoning process that underlies the final evaluation of a response. However, due to the lack of human annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer amount of, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.

Updated: 2025-07-08 09:49:57

标题: 学习使用Thinking-LLM作为评价的裁判进行规划和推理

摘要: LLM作为法官模型生成思维链（CoT）序列，旨在捕捉支持最终评估的步骤推理过程。然而，由于缺乏人工注释的CoT用于评估，有效推理跟踪的所需组成部分和结构仍未充分研究。因此，先前的方法通常（1）将推理跟踪限制在手工设计的组件中，如一系列标准、参考答案或验证问题，以及（2）将其结构化，以便规划与评估的推理交织在一起。在这项工作中，我们提出了EvalPlanner，这是一种优化偏好算法，用于Thinking-LLM作为法官，首先生成一个无约束的评估计划，然后执行，最后作出最终判断。在一个自我训练循环中，EvalPlanner迭代优化合成构建的评估计划和执行，从而获得更好的最终裁决。尽管在较少量的、合成生成的偏好对中进行训练，我们的方法在RewardBench上实现了一种新的最先进的生成奖励模型性能（得分为93.9）。对其他基准的额外实验，如RM-Bench、JudgeBench和FollowBenchEval，进一步凸显了规划和推理对构建健壮的LLM作为法官推理模型的实用性。

更新时间: 2025-07-08 09:49:57

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2501.18099v2

The Impact of Prompt Programming on Function-Level Code Generation

Large Language Models (LLMs) are increasingly used by software engineers for code generation. However, limitations of LLMs such as irrelevant or incorrect code have highlighted the need for prompt programming (or prompt engineering) where engineers apply specific prompt techniques (e.g., chain-of-thought or input-output examples) to improve the generated code. While some prompt techniques have been studied, the impact of different techniques -- and their interactions -- on code generation is still not fully understood. In this study, we introduce CodePromptEval, a dataset of 7072 prompts designed to evaluate five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages) and their effect on the correctness, similarity, and quality of complete functions generated by three LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt techniques significantly influence the generated code, combining multiple techniques does not necessarily improve the outcome. Additionally, we observed a trade-off between correctness and quality when using prompt techniques. Our dataset and replication package enable future research on improving LLM-generated code and evaluating new prompt techniques.

Updated: 2025-07-08 09:46:27

标题: 提示编程对功能级代码生成的影响

摘要: 大型语言模型（LLMs）越来越被软件工程师用于代码生成。然而，LLMs的限制，比如无关或不正确的代码，突显了迅速编程（或即时工程）的必要性，工程师在其中应用特定的提示技术（例如思维链或输入输出示例）来改善生成的代码。虽然一些提示技术已经被研究，但不同技术以及它们之间的互动对代码生成的影响仍然没有完全被理解。在这项研究中，我们引入了CodePromptEval，一个包含7072个提示的数据集，旨在评估五种提示技术（少量样本、人物、思维链、函数签名、包列表）对三个LLMs生成的完整函数的正确性、相似性和质量的影响（GPT-4o、Llama3和Mistral）。我们的研究结果表明，虽然某些提示技术显著影响生成的代码，但结合多种技术并不一定会改善结果。此外，我们观察到在使用提示技术时正确性和质量之间存在一种权衡。我们的数据集和复制包使未来的研究可以进一步改进LLM生成的代码，并评估新的提示技术。

更新时间: 2025-07-08 09:46:27

领域: cs.SE,cs.CL,cs.HC,cs.LG

下载: http://arxiv.org/abs/2412.20545v2

Accelerating Large-Scale Regularized High-Order Tensor Recovery

Currently, existing tensor recovery methods fail to recognize the impact of tensor scale variations on their structural characteristics. Furthermore, existing studies face prohibitive computational costs when dealing with large-scale high-order tensor data. To alleviate these issue, assisted by the Krylov subspace iteration, block Lanczos bidiagonalization process, and random projection strategies, this article first devises two fast and accurate randomized algorithms for low-rank tensor approximation (LRTA) problem. Theoretical bounds on the accuracy of the approximation error estimate are established. Next, we develop a novel generalized nonconvex modeling framework tailored to large-scale tensor recovery, in which a new regularization paradigm is exploited to achieve insightful prior representation for large-scale tensors. On the basis of the above, we further investigate new unified nonconvex models and efficient optimization algorithms, respectively, for several typical high-order tensor recovery tasks in unquantized and quantized situations. To render the proposed algorithms practical and efficient for large-scale tensor data, the proposed randomized LRTA schemes are integrated into their central and time-intensive computations. Finally, we conduct extensive experiments on various large-scale tensors, whose results demonstrate the practicability, effectiveness and superiority of the proposed method in comparison with some state-of-the-art approaches.

Updated: 2025-07-08 09:43:58

标题: 加速大规模正则化高阶张量恢复

摘要: 目前，现有的张量恢复方法未能认识到张量尺度变化对其结构特征的影响。此外，现有研究在处理大规模高阶张量数据时面临着巨大的计算成本。为了缓解这些问题，本文首先借助Krylov子空间迭代、块Lanczos双对角化过程和随机投影策略，设计了两种快速准确的随机算法用于低秩张量逼近（LRTA）问题。建立了逼近误差估计精度的理论界限。接下来，我们开发了一个新颖的针对大规模张量恢复的广义非凸建模框架，在其中利用了一种新的正则化范式，以实现对大规模张量的有见地的先验表示。在此基础上，我们进一步分别针对非量化和量化情况下的几种典型高阶张量恢复任务，研究了新的统一非凸模型和高效优化算法。为了使所提出的算法对大规模张量数据实用和高效，所提出的随机LRTA方案被整合到它们的中心和耗时的计算中。最后，我们对各种大规模张量进行了广泛的实验，其结果表明所提出的方法与一些最新方法相比在可行性、有效性和优越性方面表现出色。

更新时间: 2025-07-08 09:43:58

领域: cs.LG

下载: http://arxiv.org/abs/2506.09594v2

Fair Domain Generalization: An Information-Theoretic View

Domain generalization (DG) and algorithmic fairness are two critical challenges in machine learning. However, most DG methods focus only on minimizing expected risk in the unseen target domain without considering algorithmic fairness. Conversely, fairness methods typically do not account for domain shifts, so the fairness achieved during training may not generalize to unseen test domains. In this work, we bridge these gaps by studying the problem of Fair Domain Generalization (FairDG), which aims to minimize both expected risk and fairness violations in unseen target domains. We derive novel mutual information-based upper bounds for expected risk and fairness violations in multi-class classification tasks with multi-group sensitive attributes. These bounds provide key insights for algorithm design from an information-theoretic perspective. Guided by these insights, we introduce PAFDG (Pareto-Optimal Fairness for Domain Generalization), a practical framework that solves the FairDG problem and models the utility-fairness trade-off through Pareto optimization. Experiments on real-world vision and language datasets show that PAFDG achieves superior utility-fairness trade-offs compared to existing methods.

Updated: 2025-07-08 09:43:40

标题: 公平域泛化：信息论视角

摘要: 领域泛化（DG）和算法公平性是机器学习中两个关键挑战。然而，大多数DG方法仅关注于在未知目标领域中最小化预期风险，而不考虑算法公平性。相反，公平性方法通常不考虑领域转移，因此在训练过程中实现的公平性可能无法推广到未知的测试领域。在这项工作中，我们通过研究公平领域泛化（FairDG）问题来弥合这些差距，该问题旨在最小化未知目标领域中的预期风险和公平性违规。我们针对具有多组敏感属性的多类分类任务推导出基于互信息的预期风险和公平性违规的新型上界。这些界限从信息论的角度为算法设计提供了关键见解。在这些见解的指导下，我们引入了PAFDG（领域泛化的帕累托最优公平性），这是一个解决FairDG问题并通过帕累托优化模拟效用-公平性权衡的实用框架。对真实世界的视觉和语言数据集进行的实验表明，与现有方法相比，PAFDG实现了更优越的效用-公平性权衡。

更新时间: 2025-07-08 09:43:40

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2507.05823v1

Constella: Supporting Storywriters' Interconnected Character Creation through LLM-based Multi-Agents

Creating a cast of characters by attending to their relational dynamics is a critical aspect of most long-form storywriting. However, our formative study (N=14) reveals that writers struggle to envision new characters that could influence existing ones, to balance similarities and differences among characters, and to intricately flesh out their relationships. Based on these observations, we designed Constella, an LLM-based multi-agent tool that supports storywriters' interconnected character creation process. Constella suggests related characters (FRIENDS DISCOVERY feature), reveals the inner mindscapes of several characters simultaneously (JOURNALS feature), and manifests relationships through inter-character responses (COMMENTS feature). Our 7-8 day deployment study with storywriters (N=11) shows that Constella enabled the creation of expansive communities composed of related characters, facilitated the comparison of characters' thoughts and emotions, and deepened writers' understanding of character relationships. We conclude by discussing how multi-agent interactions can help distribute writers' attention and effort across the character cast.

Updated: 2025-07-08 09:39:02

标题: Constella：通过基于LLM的多智能体支持故事创作者的互联角色创作

摘要: 通过关注角色之间的关系动态来创建一个角色阵容是大多数长篇故事写作的关键方面。然而，我们的初步研究（N=14）显示，作家们很难设想出可能影响现有角色、平衡角色之间的相似性和差异性，并复杂地描绘他们之间的关系的新角色。基于这些观察，我们设计了Constella，这是一个基于LLM的多智能体工具，支持故事作者之间相互关联的角色创建过程。Constella提供相关角色建议（FRIENDS DISCOVERY功能），同时展示多个角色的内心世界（JOURNALS功能），并通过角色间的回应展现关系（COMMENTS功能）。我们进行了为期7-8天的故事作者部署研究（N=11），结果显示Constella促进了由相关角色组成的庞大社区的创建，便于比较角色的想法和情感，并加深了作家对角色关系的理解。最后，我们讨论了多智能体互动如何帮助分配作家的注意力和努力到角色阵容中。

更新时间: 2025-07-08 09:39:02

领域: cs.HC,cs.AI,cs.MA

下载: http://arxiv.org/abs/2507.05820v1

TT-TFHE: a Torus Fully Homomorphic Encryption-Friendly Neural Network Architecture

This paper presents TT-TFHE, a deep neural network Fully Homomorphic Encryption (FHE) framework that effectively scales Torus FHE (TFHE) usage to tabular and image datasets using a recent family of convolutional neural networks called Truth-Table Neural Networks (TTnet). The proposed framework provides an easy-to-implement, automated TTnet-based design toolbox with an underlying (python-based) open-source Concrete implementation (CPU-based and implementing lookup tables) for inference over encrypted data. Experimental evaluation shows that TT-TFHE greatly outperforms in terms of time and accuracy all Homomorphic Encryption (HE) set-ups on three tabular datasets, all other features being equal. On image datasets such as MNIST and CIFAR-10, we show that TT-TFHE consistently and largely outperforms other TFHE set-ups and is competitive against other HE variants such as BFV or CKKS (while maintaining the same level of 128-bit encryption security guarantees). In addition, our solutions present a very low memory footprint (down to dozens of MBs for MNIST), which is in sharp contrast with other HE set-ups that typically require tens to hundreds of GBs of memory per user (in addition to their communication overheads). This is the first work presenting a fully practical solution of private inference (i.e. a few seconds for inference time and a few dozens MBs of memory) on both tabular datasets and MNIST, that can easily scale to multiple threads and users on server side.

Updated: 2025-07-08 09:38:05

标题: TT-TFHE：一个支持全同态加密的环形神经网络架构

摘要: 这篇论文介绍了TT-TFHE，一种深度神经网络完全同态加密（FHE）框架，有效地将Torus FHE（TFHE）用于表格和图像数据集，使用了一种叫做Truth-Table Neural Networks（TTnet）的最新卷积神经网络家族。所提出的框架提供了一个易于实施、自动化的基于TTnet的设计工具箱，具有一个基于Python的开源Concrete实现（基于CPU并实现查找表），可以对加密数据进行推断。实验评估表明，TT-TFHE在三个表格数据集上的时间和准确性表现远远优于所有同态加密（HE）设置，其他特性相同。在像MNIST和CIFAR-10这样的图像数据集上，我们展示了TT-TFHE始终且大幅优于其他TFHE设置，并与其他HE变体（如BFV或CKKS）竞争（同时保持相同水平的128位加密安全性保证）。此外，我们的解决方案具有非常低的内存占用（对于MNIST可降至几十MB），与其他通常需要每个用户数十到数百GB内存（以及通信开销）的HE设置形成鲜明对比。这是第一个在表格数据集和MNIST上完全实用的私有推断解决方案（即几秒的推断时间和几十MB的内存），可以轻松扩展到服务器端的多个线程和用户。

更新时间: 2025-07-08 09:38:05

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2302.01584v2

Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity

Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs' intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model's ability to predict ROP and its bias patterns. Empirical results derived from the CROP dataset yield two principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing contributes to mitigating predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and emphasize the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems.

Updated: 2025-07-08 09:36:14

标题: 情感-ROPTester：对预测早产儿视网膜病变的LLMs能力和偏差分析

摘要: 尽管大型语言模型（LLMs）在各个领域取得了显著进展，但它们在预测早产儿视网膜病变（ROP）风险方面的能力仍然很少被探索。为了填补这一空白，我们引入了一个新颖的中文基准数据集，称为CROP，包括993个住院记录，标有低、中和高风险标签。为了系统地检验LLMs在ROP风险分层中的预测能力和情感偏差，我们提出了Affective-ROPTester，这是一个自动化评估框架，包括三种提示策略：基于指令、思维链（CoT）和上下文学习（ICL）。指令方案评估LLMs的内在知识和相关偏见，而CoT和ICL方案利用外部医学知识提高预测准确性。关键是，我们在提示级别集成了情感元素，以研究不同情感框架如何影响模型预测ROP的能力和其偏见模式。从CROP数据集中得出的实证结果得出两个主要观察结果。首先，当仅使用内在知识时，LLMs在ROP风险预测方面表现出有限的功效，但在结构化外部输入的增强下表现出明显的性能提升。其次，模型输出中明显存在情感偏见，倾向于高估中、高风险病例。第三，与负面情绪相比，积极情感框架有助于减轻模型输出中的预测偏见。这些发现突出了情感敏感提示工程在增强诊断可靠性中的关键作用，并强调了Affective-ROPTester作为评估和减轻临床语言建模系统中情感偏见的框架的实用性。

更新时间: 2025-07-08 09:36:14

领域: cs.AI,cs.CE,cs.CL

下载: http://arxiv.org/abs/2507.05816v1

Just Say Better or Worse: A Human-AI Collaborative Framework for Medical Image Segmentation Without Manual Annotations

Manual annotation of medical images is a labor-intensive and time-consuming process, posing a significant bottleneck in the development and deployment of robust medical imaging AI systems. This paper introduces a novel Human-AI collaborative framework for medical image segmentation that substantially reduces the annotation burden by eliminating the need for explicit manual pixel-level labeling. The core innovation lies in a preference learning paradigm, where human experts provide minimal, intuitive feedback -- simply indicating whether an AI-generated segmentation is better or worse than a previous version. The framework comprises four key components: (1) an adaptable foundation model (FM) for feature extraction, (2) label propagation based on feature similarity, (3) a clicking agent that learns from human better-or-worse feedback to decide where to click and with which label, and (4) a multi-round segmentation learning procedure that trains a state-of-the-art segmentation network using pseudo-labels generated by the clicking agent and FM-based label propagation. Experiments on three public datasets demonstrate that the proposed approach achieves competitive segmentation performance using only binary preference feedback, without requiring experts to directly manually annotate the images.

Updated: 2025-07-08 09:36:12

标题: 只说更好或更糟：一种人工智能协作框架，用于医学图像分割而无需手动标注

摘要: 医学图像的手动注释是一项费时费力的过程，这在开发和部署强大的医学影像人工智能系统中构成了一个重要瓶颈。本文介绍了一种新颖的人工智能协作框架，用于医学图像分割，通过消除明确的手动像素级标注需求，大大减轻了注释负担。核心创新在于偏好学习范式，即人类专家提供最少、直观的反馈 -- 只需指示AI生成的分割比以前的版本好还是更差。该框架包括四个关键组件：（1）用于特征提取的可调整的基础模型（FM），（2）基于特征相似性的标签传播，（3）一个点击代理，从人类的好坏反馈中学习决定何处点击以及使用哪个标签，以及（4）一个多轮分割学习过程，使用点击代理生成的伪标签和基于FM的标签传播来训练最先进的分割网络。对三个公共数据集的实验表明，所提出的方法仅使用二进制偏好反馈就能实现竞争性的分割性能，而无需专家直接手动注释图像。

更新时间: 2025-07-08 09:36:12

领域: eess.IV,cs.LG

下载: http://arxiv.org/abs/2507.05815v1

Immutability Does Not Guarantee Trust: A Formal and Logical Refutation

It is frequently claimed in blockchain discourse that immutability guarantees trust. This paper rigorously refutes that assertion. We define immutability as the cryptographic persistence of historical states in an append-only data structure and contrast it with trust, understood as a rational epistemic expectation under uncertainty. Employing predicate logic, automata-theoretic models, and epistemic game-theoretic analysis, we demonstrate that immutability neither entails nor implies correctness, fairness, or credibility. Through formal constructions and counterexamples--including predictive fraud schemes and the phenomenon of garbage permanence--we show that the belief conflates structural and epistemic domains. Immutability preserves all data equally, regardless of veracity. Therefore, the assertion that immutability guarantees trust collapses under the weight of formal scrutiny.

Updated: 2025-07-08 09:35:52

标题: 不变性并不保证信任：一个形式和逻辑上的反驳

摘要: 在区块链讨论中经常声称不可变性可以保证信任。本文严格驳斥了这一说法。我们将不可变性定义为追加数据结构中历史状态的加密持久性，并将其与信任对比，信任被理解为在不确定性下的理性认知期望。通过使用谓词逻辑、自动机理论模型和认知博弈论分析，我们证明了不可变性既不蕴含也不意味着正确性、公平性或可信度。通过形式构建和反例，包括预测性欺诈方案和垃圾永久性现象，我们展示了这种信念混淆了结构和认知领域。不可变性保留所有数据，无论真实性如何。因此，声称不可变性可以保证信任的说法在形式审查的重压下崩溃了。

更新时间: 2025-07-08 09:35:52

领域: cs.CR,cs.CC,03B70, 68M10, 91A80,F.4.1; D.4.6; C.2.2

下载: http://arxiv.org/abs/2507.08844v1

ETrace:Event-Driven Vulnerability Detection in Smart Contracts via LLM-Based Trace Analysis

With the advance application of blockchain technology in various fields, ensuring the security and stability of smart contracts has emerged as a critical challenge. Current security analysis methodologies in vulnerability detection can be categorized into static analysis and dynamic analysis methods.However, these existing traditional vulnerability detection methods predominantly rely on analyzing original contract code, not all smart contracts provide accessible code.We present ETrace, a novel event-driven vulnerability detection framework for smart contracts, which uniquely identifies potential vulnerabilities through LLM-powered trace analysis without requiring source code access. By extracting fine-grained event sequences from transaction logs, the framework leverages Large Language Models (LLMs) as adaptive semantic interpreters to reconstruct event analysis through chain-of-thought reasoning. ETrace implements pattern-matching to establish causal links between transaction behavior patterns and known attack behaviors. Furthermore, we validate the effectiveness of ETrace through preliminary experimental results.

Updated: 2025-07-08 09:31:28

标题: ETrace：基于LLM的事件驱动智能合约漏洞检测的跟踪分析

摘要: 随着区块链技术在各个领域的广泛应用，确保智能合约的安全性和稳定性已经成为一个关键挑战。当前的漏洞检测安全分析方法主要分为静态分析和动态分析方法。然而，这些现有的传统漏洞检测方法主要依赖于分析原始合约代码，而并非所有智能合约都提供可访问的代码。本文提出了一种新颖的基于事件驱动的智能合约漏洞检测框架ETrace，通过LLM技术进行追踪分析，独特地识别潜在的漏洞，而无需获取源代码访问权限。通过从交易日志中提取细粒度事件序列，该框架利用大型语言模型（LLMs）作为自适应语义解释器，通过思维链推理重建事件分析。ETrace实现了模式匹配，建立交易行为模式与已知攻击行为之间的因果关系。此外，我们通过初步实验结果验证了ETrace的有效性。

更新时间: 2025-07-08 09:31:28

领域: cs.CR,cs.SE,68N01,D.2.0

下载: http://arxiv.org/abs/2506.15790v2

Towards Solar Altitude Guided Scene Illumination

The development of safe and robust autonomous driving functions is heavily dependent on large-scale, high-quality sensor data. However, real-word data acquisition demands intensive human labor and is strongly limited by factors such as labeling cost, driver safety protocols and diverse scenario coverage. Thus, multiple lines of work focus on the conditional generation of synthetic camera sensor data. We identify a significant gap in research regarding daytime variation, presumably caused by the scarcity of available labels. Consequently, we present the solar altitude as global conditioning variable. It is readily computable from latitude-longitude coordinates and local time, eliminating the need for extensive manual labeling. Our work is complemented by a tailored normalization approach, targeting the sensitivity of daylight towards small numeric changes in altitude. We demonstrate its ability to accurately capture lighting characteristics and illumination-dependent image noise in the context of diffusion models.

Updated: 2025-07-08 09:31:16

标题: 朝向太阳高度引导的场景照明

摘要: 安全和稳健的自动驾驶功能的发展严重依赖于大规模、高质量的传感器数据。然而，真实世界数据采集要求大量人力，并受到标注成本、驾驶员安全协议和不同情景覆盖等因素的严格限制。因此，多个研究方向集中于合成摄像头传感器数据的有条件生成。我们发现关于白天变化的研究存在重大空白，可能是由于可用标签的稀缺性所致。因此，我们提出太阳高度作为全局条件变量。它可以从纬度-经度坐标和当地时间轻松计算得出，消除了对广泛手动标注的需求。我们的工作还采用了一种定制的标准化方法，针对白天光照对高度微小数字变化的敏感性。我们展示了它在扩散模型的背景下准确捕捉光照特性和与照明相关的图像噪声的能力。

更新时间: 2025-07-08 09:31:16

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.05812v1

Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs

While concept-based interpretability methods have traditionally focused on local explanations of neural network predictions, we propose a novel framework and interactive tool that extends these methods into the domain of mechanistic interpretability. Our approach enables a global dissection of model behavior by analyzing how high-level semantic attributes (referred to as concepts) emerge, interact, and propagate through internal model components. Unlike prior work that isolates individual neurons or predictions, our framework systematically quantifies how semantic concepts are represented across layers, revealing latent circuits and information flow that underlie model decision-making. A key innovation is our visualization platform that we named BAGEL (for Bias Analysis with a Graph for global Explanation Layers), which presents these insights in a structured knowledge graph, allowing users to explore concept-class relationships, identify spurious correlations, and enhance model trustworthiness. Our framework is model-agnostic, scalable, and contributes to a deeper understanding of how deep learning models generalize (or fail to) in the presence of dataset biases. The demonstration is available at https://knowledge-graph-ui-4a7cb5.gitlab.io/.

Updated: 2025-07-08 09:30:20

标题: 使用结构化知识图谱进行基于概念的机制可解释性解释

摘要: 尽管基于概念的可解释性方法传统上专注于对神经网络预测的局部解释，但我们提出了一种新颖的框架和交互式工具，将这些方法扩展到机械解释领域。我们的方法通过分析高级语义属性（称为概念）如何出现、相互作用和传播到内部模型组件，实现了对模型行为的全局解剖。与以往研究孤立个别神经元或预测的方法不同，我们的框架系统地量化了语义概念在各层之间的表示，揭示了潜在的电路和信息流，这些潜在的电路和信息流是模型决策的基础。一个关键创新是我们的可视化平台，我们将其命名为BAGEL（偏见分析与用于全局解释层的图形），该平台以结构化知识图的形式呈现这些见解，使用户能够探索概念-类别关系，识别虚假相关性，并增强模型的可信度。我们的框架是与模型无关的、可扩展的，并有助于更深入地理解深度学习模型在数据集偏差存在的情况下是如何泛化（或失败）的。演示可在https://knowledge-graph-ui-4a7cb5.gitlab.io/上找到。

更新时间: 2025-07-08 09:30:20

领域: cs.LG,cs.AI,cs.CV

下载: http://arxiv.org/abs/2507.05810v1

A Formal Refutation of the Blockchain Trilemma

The so-called blockchain trilemma asserts the impossibility of simultaneously achieving scalability, security, and decentralisation within a single blockchain protocol. In this paper, we formally refute that proposition. Employing predicate logic, formal automata theory, computational complexity analysis, and graph-theoretic measures of relay topology--specifically Baran's model of network path redundancy--we demonstrate that the trilemma constitutes a category error, conflates distinct analytical domains, and relies upon unproven causal assumptions. We further expose its reliance on composition fallacies drawn from flawed system implementations. A constructive counterexample is presented: a blockchain protocol exhibiting unbounded transaction throughput, cryptographic security under adversarial load, and multipath decentralised propagation. This example is not hypothetical but grounded in protocol design enabled by compact block relay, SPV verification, and IPv6 multicast. The trilemma is revealed not as a law of protocol architecture, but as a heuristic fallacy sustained by imprecision and design defeatism.

Updated: 2025-07-08 09:29:09

标题: 区块链三难题的正式驳斥

摘要: 所谓的区块链三难问题声称在单个区块链协议中同时实现可扩展性、安全性和去中心化是不可能的。本文正式驳斥了这一命题。通过使用谓词逻辑、形式自动机理论、计算复杂性分析和基于图论的中继拓扑度量-具体来说是Baran的网络路径冗余模型，我们展示了三难问题构成了一个范畴错误，混淆了不同的分析领域，并依赖于未经证明的因果假设。我们进一步揭示了它依赖于从有缺陷的系统实现中提取的组合谬误。我们提出了一个建设性的反例：一个展示无限交易吞吐量、在对抗负荷下的加密安全性和多路径去中心化传播的区块链协议。这个例子并非假设性的，而是基于由紧凑块中继、SPV验证和IPv6多播所实现的协议设计。三难问题被揭示为不是协议架构的规律，而是一种由不精确性和设计悲观主义维持的启发式谬误。

更新时间: 2025-07-08 09:29:09

领域: cs.CC,cs.CR,cs.DC,cs.DS,03B70, 68M10, 91A80,F.4.1; D.4.6; C.2.2

下载: http://arxiv.org/abs/2507.05809v1

Improving Robustness of Foundation Models in Domain Adaptation with Soup-Adapters

In this paper, we tackle two fundamental problems in few-shot domain adaptation of foundation models. First, hyperparameter tuning is often impractical due to the lack of large validation datasets. Second, model robustness under distribution shifts where test time data deviates slightly from training distributions, remains a concern. We show that by training multiple independent adapters and averaging their outputs, the new model has a higher performance and is more robust to distribution shifts compared to any individual adapter. This improvement holds even when the adapters are trained with diverse hyperparameters sampled from a wide range, resulting in varied individual performance. Consequently, our method addresses both of the problems described above. The ensemble is also significantly less sensitive to the residual ratio, a critical hyperparameter of CLIP-Adapter. Since the ensemble can be reparameterized to a single adapter again using a principled concatenation of the parameters, we refer to our method as Soup-Adapter. This is also the first study to explore CLIP adapter-style techniques for DINOv2 and to directly compare them with CLIP in this setting.

Updated: 2025-07-08 09:26:10

标题: 使用汤适配器提高领域适应中基础模型的稳健性

摘要: 本文解决了在基础模型的少样本领域适应中的两个基本问题。首先，由于缺乏大规模验证数据集，调整超参数通常是不切实际的。其次，在测试时间数据略微偏离训练分布的分布转移下，模型的稳健性仍然是一个问题。我们表明，通过训练多个独立的适配器并平均它们的输出，新模型具有更高的性能，并且相比于任何单独的适配器更能抵御分布转移。即使适配器是使用从广泛范围中抽样的多样超参数进行训练的，导致各种各样的个体性能，这种改进仍然成立。因此，我们的方法解决了上述两个问题。集成对于CLIP-Adapter的关键超参数残差比率也明显不敏感。由于集成可以通过参数的原则性连接重新参数化为单个适配器，我们将我们的方法称为Soup-Adapter。这也是第一项研究探讨DINOv2的CLIP适配器风格技术，并在此设置中直接将其与CLIP进行比较。

更新时间: 2025-07-08 09:26:10

领域: cs.LG

下载: http://arxiv.org/abs/2507.05807v1

Predicting Graph Structure via Adapted Flux Balance Analysis

Many dynamic processes such as telecommunication and transport networks can be described through discrete time series of graphs. Modelling the dynamics of such time series enables prediction of graph structure at future time steps, which can be used in applications such as detection of anomalies. Existing approaches for graph prediction have limitations such as assuming that the vertices do not to change between consecutive graphs. To address this, we propose to exploit time series prediction methods in combination with an adapted form of flux balance analysis (FBA), a linear programming method originating from biochemistry. FBA is adapted to incorporate various constraints applicable to the scenario of growing graphs. Empirical evaluations on synthetic datasets (constructed via Preferential Attachment model) and real datasets (UCI Message, HePH, Facebook, Bitcoin) demonstrate the efficacy of the proposed approach.

Updated: 2025-07-08 09:25:51

标题: 通过适应通量平衡分析预测图结构

摘要: 许多动态过程，如电信和交通网络，可以通过离散时间序列的图来描述。对这种时间序列的动态建模使得能够预测未来时间步骤的图结构，这可以在检测异常等应用中使用。现有的图预测方法存在一些局限性，比如假设顶点在连续图之间不会改变。为了解决这个问题，我们提出利用时间序列预测方法与一种改进的通量平衡分析（FBA）相结合，FBA是一种源自生物化学的线性规划方法。FBA被改进以纳入适用于不断增长图的各种约束。对合成数据集（通过优先连接模型构建）和真实数据集（UCI消息，HePH，Facebook，Bitcoin）的实证评估证明了所提方法的有效性。

更新时间: 2025-07-08 09:25:51

领域: cs.LG,stat.ML,37M10, 05C90, 68R10, 62M10, 62M20,G.2.2; G.3; I.2.6; E.1

下载: http://arxiv.org/abs/2507.05806v1

Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps

Diffusion policies, widely adopted in decision-making scenarios such as robotics, gaming and autonomous driving, are capable of learning diverse skills from demonstration data due to their high representation power. However, the sub-optimal and limited coverage of demonstration data could lead to diffusion policies that generate sub-optimal trajectories and even catastrophic failures. While reinforcement learning (RL)-based fine-tuning has emerged as a promising solution to address these limitations, existing approaches struggle to effectively adapt Proximal Policy Optimization (PPO) to diffusion models. This challenge stems from the computational intractability of action likelihood estimation during the denoising process, which leads to complicated optimization objectives. In our experiments starting from randomly initialized policies, we find that online tuning of Diffusion Policies demonstrates much lower sample efficiency compared to directly applying PPO on MLP policies (MLP+PPO). To address these challenges, we introduce NCDPO, a novel framework that reformulates Diffusion Policy as a noise-conditioned deterministic policy. By treating each denoising step as a differentiable transformation conditioned on pre-sampled noise, NCDPO enables tractable likelihood evaluation and gradient backpropagation through all diffusion timesteps. Our experiments demonstrate that NCDPO achieves sample efficiency comparable to MLP+PPO when training from scratch, outperforming existing methods in both sample efficiency and final performance across diverse benchmarks, including continuous robot control and multi-agent game scenarios. Furthermore, our experimental results show that our method is robust to the number denoising timesteps in the Diffusion Policy.

Updated: 2025-07-08 09:23:32

标题: 使用通过扩散时间步的反向传播微调扩散策略

摘要: 扩散策略在决策场景中被广泛采用，如机器人技术、游戏和自动驾驶，由于其高表示能力，能够从示范数据中学习多种技能。然而，示范数据的次优和有限覆盖可能导致生成次优轨迹甚至灾难性失败的扩散策略。虽然基于强化学习（RL）的微调已被证明是解决这些限制的一个有希望的解决方案，但现有方法在将近似策略优化（PPO）有效地适应扩散模型方面存在困难。这一挑战源于在去噪过程中计算行动概率估计的计算难题，这导致了复杂的优化目标。在我们的实验中，从随机初始化的策略开始，我们发现在线调整扩散策略的样本效率远远低于直接在MLP策略上应用PPO（MLP+PPO）。为了应对这些挑战，我们引入了NCDPO，这是一个新颖的框架，将扩散策略重新构造为一个噪声条件下的确定性策略。通过将每个去噪步骤视为一个在预采样噪声条件下的可微分变换，NCDPO实现了对所有扩散时间步的可追踪概率评估和梯度反向传播。我们的实验表明，当从头开始训练时，NCDPO实现了与MLP+PPO相媲美的样本效率，优于现有方法在样本效率和最终性能上的表现，包括连续机器人控制和多智能体游戏场景。此外，我们的实验结果表明，我们的方法对扩散策略中去噪时间步数具有鲁棒性。

更新时间: 2025-07-08 09:23:32

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2505.10482v3

An Ensemble Embedding Approach for Improving Semantic Caching Performance in LLM-based Systems

Semantic caching enhances the efficiency of large language model (LLM) systems by identifying semantically similar queries, storing responses once, and serving them for subsequent equivalent requests. However, existing semantic caching frameworks rely on single embedding models for query representation, which limits their ability to capture the diverse semantic relationships present in real-world query distributions. This paper presents an ensemble embedding approach that combines multiple embedding models through a trained meta-encoder to improve semantic similarity detection in LLM caching systems. We evaluate our method using the Quora Question Pairs (QQP) dataset, measuring cache hit ratios, cache miss ratios, token savings, and response times. Our ensemble approach achieves a 92\% cache hit ratio for semantically equivalent queries while maintaining an 85\% accuracy in correctly rejecting non-equivalent queries as cache misses. These results demonstrate that ensemble embedding methods significantly outperform single-model approaches in distinguishing between semantically similar and dissimilar queries, leading to more effective caching performance and reduced computational overhead in LLM-based systems.

Updated: 2025-07-08 09:20:12

标题: 一种用于提高基于LLM系统中语义缓存性能的集成嵌入方法

摘要: 语义缓存通过识别语义上相似的查询、一次存储响应并为后续等效请求提供服务，增强了大型语言模型（LLM）系统的效率。然而，现有的语义缓存框架依赖于单一嵌入模型进行查询表示，这限制了它们捕捉真实查询分布中存在的多样化语义关系的能力。本文提出了一种集成嵌入方法，通过训练的元编码器结合多个嵌入模型，以改进LLM缓存系统中的语义相似性检测。我们使用Quora问题配对（QQP）数据集评估我们的方法，测量缓存命中率、缓存未命中率、令牌节省和响应时间。我们的集成方法在语义上等效的查询中实现了92%的缓存命中率，同时保持了85%的正确拒绝非等效查询作为缓存未命中的准确率。这些结果表明，集成嵌入方法在区分语义相似和不相似的查询方面明显优于单一模型方法，从而提高了缓存性能并减少了LLM系统中的计算开销。

更新时间: 2025-07-08 09:20:12

领域: cs.LG,68T50,I.2.7; H.3.3; I.5.1

下载: http://arxiv.org/abs/2507.07061v1

Enhancing Generalization of Spiking Neural Networks Through Temporal Regularization

Spiking Neural Networks (SNNs) have received widespread attention due to their event-driven and low-power characteristics, making them particularly effective for processing event-based neuromorphic data. Recent studies have shown that directly trained SNNs suffer from severe overfitting issues due to the limited scale of neuromorphic datasets and the gradient mismatching problem, which fundamentally constrain their generalization performance. In this paper, we propose a temporal regularization training (TRT) method by introducing a time-dependent regularization mechanism to enforce stronger constraints on early timesteps. We compare the performance of TRT with other state-of-the-art methods performance on datasets including CIFAR10/100, ImageNet100, DVS-CIFAR10, and N-Caltech101. To validate the effectiveness of TRT, we conducted ablation studies and analyses including loss landscape visualization and learning curve analysis, demonstrating that TRT can effectively mitigate overfitting and flatten the training loss landscape, thereby enhancing generalizability. Furthermore, we establish a theoretical interpretation of TRT's temporal regularization mechanism based on the results of Fisher information analysis. We analyze the temporal information dynamics inside SNNs by tracking Fisher information during the TRT training process, revealing the Temporal Information Concentration (TIC) phenomenon, where Fisher information progressively concentrates in early timesteps. The time-decaying regularization mechanism implemented in TRT effectively guides the network to learn robust features in early timesteps with rich information, thereby leading to significant improvements in model generalization. Code is available at https://github.com/ZBX05/Temporal-Regularization-Training.

Updated: 2025-07-08 09:11:40

标题: 通过时间正则化增强尖峰神经网络的泛化

摘要: 尖峰神经网络（SNNs）由于其事件驱动和低功耗特性而受到广泛关注，使其特别适用于处理基于事件的神经形态数据。最近的研究表明，直接训练的SNNs存在严重的过拟合问题，这是由于神经形态数据集的规模有限以及梯度不匹配问题，这些问题从根本上限制了它们的泛化性能。在本文中，我们提出了一种时间正则化训练（TRT）方法，通过引入一个时间依赖的正则化机制来对早期时间步施加更强的约束。我们将TRT的性能与其他最先进的方法在包括CIFAR10/100、ImageNet100、DVS-CIFAR10和N-Caltech101在内的数据集上进行比较。为了验证TRT的有效性，我们进行了消融研究和分析，包括损失景观可视化和学习曲线分析，证明TRT可以有效减轻过拟合并使训练损失景观变得平坦，从而增强泛化能力。此外，我们根据Fisher信息分析的结果，建立了TRT的时间正则化机制的理论解释。我们通过跟踪TRT训练过程中的Fisher信息来分析SNNs内部的时间信息动态，揭示了时间信息集中（TIC）现象，即Fisher信息逐渐集中在早期时间步。TRT中实施的时间衰减正则化机制有效地引导网络在早期时间步学习具有丰富信息的稳健特征，从而显著提升模型的泛化能力。代码可在https://github.com/ZBX05/Temporal-Regularization-Training获得。

更新时间: 2025-07-08 09:11:40

领域: cs.NE,cs.AI

下载: http://arxiv.org/abs/2506.19256v2

Automated Reasoning for Vulnerability Management by Design

For securing systems, it is essential to manage their vulnerability posture and design appropriate security controls. Vulnerability management allows to proactively address vulnerabilities by incorporating pertinent security controls into systems designs. Current vulnerability management approaches do not support systematic reasoning about the vulnerability postures of systems designs. To effectively manage vulnerabilities and design security controls, we propose a formally grounded automated reasoning mechanism. We integrate the mechanism into an open-source security design tool and demonstrate its application through an illustrative example driven by real-world challenges. The automated reasoning mechanism allows system designers to identify vulnerabilities that are applicable to a specific system design, explicitly specify vulnerability mitigation options, declare selected controls, and thus systematically manage vulnerability postures.

Updated: 2025-07-08 08:56:14

标题: 自动化推理设计的漏洞管理

摘要: 为了保护系统，管理其漏洞状况并设计适当的安全控制是至关重要的。漏洞管理允许通过将相关安全控制纳入系统设计中来积极地解决漏洞。当前的漏洞管理方法不支持对系统设计的漏洞状况进行系统化推理。为了有效地管理漏洞并设计安全控制，我们提出了一个形式化基础的自动推理机制。我们将该机制集成到一个开源安全设计工具中，并通过一个以实际挑战为驱动的示例演示其应用。自动推理机制使系统设计者能够识别适用于特定系统设计的漏洞，明确指定漏洞缓解选项，声明所选的控制，并因此系统地管理漏洞状况。

更新时间: 2025-07-08 08:56:14

领域: cs.CR,cs.AI,cs.LO,cs.SY,eess.SY

下载: http://arxiv.org/abs/2507.05794v1

MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training

Mathematical formulas are a fundamental and widely used component in various scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel in processing and understanding natural language, they encounter challenges with mathematical notation, which involves a complex structure and diverse representations. This study focuses on the development of specialized training datasets to enhance the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework capable of generating equivalent and falsified versions of a given mathematical formula in LaTeX notation, effectively capturing the mathematical variety in notation of the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation. Experiments show that models trained on these datasets exhibit new SoTA performance on mathematical retrieval tasks. We publish our code, generated datasets, and pretrained mathematical models: https://github.com/aieng-lab/math-mutator.

Updated: 2025-07-08 08:54:52

标题: MAMUT：一个新颖的框架，用于修改数学公式以生成专门的数据集，用于语言模型训练

摘要: 数学公式是各种科学领域中的基本且广泛使用的组成部分，作为表达复杂概念和关系的通用语言。尽管现代的Transformer模型在处理和理解自然语言方面表现出色，但在处理数学符号时仍面临挑战，因为数学表示涉及复杂的结构和多样的表达方式。本研究专注于开发专门的训练数据集，以增强数学内容的编码。我们介绍了Math Mutator（MAMUT），这是一个能够生成给定数学公式的LaTeX符号的等价和伪造版本的框架，有效地捕捉了相同概念的数学符号多样性。基于MAMUT，我们生成了四个包含多样化符号的大型数学数据集。实验证明，使用这些数据集训练的模型在数学检索任务上表现出新的最先进性能。我们发布了我们的代码、生成的数据集和预训练的数学模型：https://github.com/aieng-lab/math-mutator.

更新时间: 2025-07-08 08:54:52

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2502.20855v2

Aria-UI: Visual Grounding for GUI Instructions

Digital agents for automating tasks across different platforms by directly manipulating the GUIs are increasingly important. For these agents, grounding from language instructions to target elements remains a significant challenge due to reliance on HTML or AXTree inputs. In this paper, we introduce Aria-UI, a large multimodal model specifically designed for GUI grounding. Aria-UI adopts a pure-vision approach, eschewing reliance on auxiliary inputs. To adapt to heterogeneous planning instructions, we propose a scalable data pipeline that synthesizes diverse and high-quality instruction samples for grounding. To handle dynamic contexts in task performing, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding. Aria-UI sets new state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines. We release all training data and model checkpoints to foster further research at https://ariaui.github.io.

Updated: 2025-07-08 08:49:17

标题: Aria-UI：GUI指令的视觉基础

摘要: 数字代理用于直接操作GUI在不同平台上自动化任务变得越来越重要。对于这些代理来说，从语言指令到目标元素的基础仍然是一个重要挑战，因为它依赖于HTML或AXTree输入。在本文中，我们介绍了Aria-UI，这是一个专门设计用于GUI基础的大型多模型。Aria-UI采用纯视觉方法，避免依赖辅助输入。为了适应异构规划指令，我们提出了一个可扩展的数据流水线，用于综合多样化和高质量的指令样本以进行基础。为了处理任务执行中的动态环境，Aria-UI结合了文本和文本-图像交织的行动历史，实现了基础的上下文感知推理。Aria-UI在离线和在线代理基准测试中取得了最新的最佳结果，优于仅视觉和依赖AXTree的基线。我们发布了所有训练数据和模型检查点，以促进进一步的研究。

更新时间: 2025-07-08 08:49:17

领域: cs.HC,cs.AI

下载: http://arxiv.org/abs/2412.16256v2

Copula Density Neural Estimation

Probability density estimation from observed data constitutes a central task in statistics. In this brief, we focus on the problem of estimating the copula density associated to any observed data, as it fully describes the dependence between random variables. We separate univariate marginal distributions from the joint dependence structure in the data, the copula itself, and we model the latter with a neural network-based method referred to as copula density neural estimation (CODINE). Results show that the novel learning approach is capable of modeling complex distributions and can be applied for mutual information estimation and data generation.

Updated: 2025-07-08 08:43:36

标题: Copula密度神经估计

摘要: 从观察数据中估计概率密度是统计学中的一个核心任务。在这篇简短的文章中，我们关注于估计与任何观察数据相关的copula密度，因为它完全描述了随机变量之间的依赖关系。我们将单变量边际分布与数据中的联合依赖结构，即copula本身，分开，并使用基于神经网络的方法来建模后者，称为copula密度神经估计（CODINE）。结果显示，这种新颖的学习方法能够建模复杂的分布，并可用于相互信息估计和数据生成。

更新时间: 2025-07-08 08:43:36

领域: cs.LG,eess.SP,stat.ML

下载: http://arxiv.org/abs/2211.15353v3

Robust Bandwidth Estimation for Real-Time Communication with Offline Reinforcement Learning

Accurate bandwidth estimation (BWE) is critical for real-time communication (RTC) systems. Traditional heuristic approaches offer limited adaptability under dynamic networks, while online reinforcement learning (RL) suffers from high exploration costs and potential service disruptions. Offline RL, which leverages high-quality data collected from real-world environments, offers a promising alternative. However, challenges such as out-of-distribution (OOD) actions, policy extraction from behaviorally diverse datasets, and reliable deployment in production systems remain unsolved. We propose RBWE, a robust bandwidth estimation framework based on offline RL that integrates Q-ensemble (an ensemble of Q-functions) with a Gaussian mixture policy to mitigate OOD risks and enhance policy learning. A fallback mechanism ensures deployment stability by switching to heuristic methods under high uncertainty. Experimental results show that RBWE reduces overestimation errors by 18% and improves the 10th percentile Quality of Experience (QoE) by 18.6%, demonstrating its practical effectiveness in real-world RTC applications.

Updated: 2025-07-08 08:43:29

标题: 实时通信的离线强化学习下的稳健带宽估计

摘要: 准确的带宽估计（BWE）对实时通信（RTC）系统至关重要。传统的启发式方法在动态网络下适应性有限，而在线强化学习（RL）存在高探索成本和潜在的服务中断问题。离线RL利用从真实环境中收集的高质量数据提供了一种有前途的替代方案。然而，诸如分布外（OOD）操作、从行为多样的数据集中提取策略以及在生产系统中可靠部署等挑战仍未解决。我们提出了RBWE，一个基于离线RL的稳健带宽估计框架，它将Q-ensemble（Q函数的集成）与高斯混合策略结合，以减轻OOD风险并增强策略学习。一个备用机制通过在高不确定性下切换到启发式方法来确保部署稳定性。实验结果表明，RBWE将过度估计误差降低了18％，并将体验质量（QoE）的第10百分位提高了18.6％，证明了其在真实RTC应用中的实际有效性。

更新时间: 2025-07-08 08:43:29

领域: eess.SY,cs.LG,cs.SY

下载: http://arxiv.org/abs/2507.05785v1

From Motion to Meaning: Biomechanics-Informed Neural Network for Explainable Cardiovascular Disease Identification

Cardiac diseases are among the leading causes of morbidity and mortality worldwide, which requires accurate and timely diagnostic strategies. In this study, we introduce an innovative approach that combines deep learning image registration with physics-informed regularization to predict the biomechanical properties of moving cardiac tissues and extract features for disease classification. We utilize the energy strain formulation of Neo-Hookean material to model cardiac tissue deformations, optimizing the deformation field while ensuring its physical and biomechanical coherence. This explainable approach not only improves image registration accuracy, but also provides insights into the underlying biomechanical processes of the cardiac tissues. Evaluation on the Automated Cardiac Diagnosis Challenge (ACDC) dataset achieved Dice scores of 0.945 for the left ventricular cavity, 0.908 for the right ventricular cavity, and 0.905 for the myocardium. Subsequently, we estimate the local strains within the moving heart and extract a detailed set of features used for cardiovascular disease classification. We evaluated five classification algorithms, Logistic Regression, Multi-Layer Perceptron, Support Vector Classifier, Random Forest, and Nearest Neighbour, and identified the most relevant features using a feature selection algorithm. The best performing classifier obtained a classification accuracy of 98% in the training set and 100% in the test set of the ACDC dataset. By integrating explainable artificial intelligence, this method empowers clinicians with a transparent understanding of the model's predictions based on cardiac mechanics, while also significantly improving the accuracy and reliability of cardiac disease diagnosis, paving the way for more personalized and effective patient care.

Updated: 2025-07-08 08:43:05

标题: 从动作到含义：基于生物力学的神经网络用于可解释的心血管疾病识别

摘要: 心脏疾病是全球致残和死亡的主要原因之一，需要准确和及时的诊断策略。在这项研究中，我们引入了一种创新方法，结合深度学习图像配准和物理学知识规范，预测移动心脏组织的生物力学特性并提取用于疾病分类的特征。我们利用Neo-Hookean材料的能量应变公式来模拟心脏组织的变形，优化变形场同时确保其物理和生物力学的一致性。这种可解释的方法不仅提高了图像配准的准确性，还能洞察心脏组织的基础生物力学过程。在自动心脏诊断挑战 (ACDC) 数据集上的评估结果显示，左心室腔的Dice分数为0.945，右心室腔为0.908，心肌为0.905。随后，我们估计了心脏内移动的局部应变，并提取了一组详细的特征用于心血管疾病分类。我们评估了五种分类算法，逻辑回归、多层感知器、支持向量分类器、随机森林和最近邻算法，并利用特征选择算法确定了最相关的特征。表现最佳的分类器在ACDC数据集的训练集中获得了98%的分类准确率，在测试集中获得了100%的准确率。通过整合可解释的人工智能，这种方法使临床医生能够透明理解基于心脏力学的模型预测，同时显著提高了心脏疾病诊断的准确性和可靠性，为更个性化和有效的患者护理铺平了道路。

更新时间: 2025-07-08 08:43:05

领域: cs.LG

下载: http://arxiv.org/abs/2507.05783v1

On the relation between trainability and dequantization of variational quantum learning models

The quest for successful variational quantum machine learning (QML) relies on the design of suitable parametrized quantum circuits (PQCs), as analogues to neural networks in classical machine learning. Successful QML models must fulfill the properties of trainability and non-dequantization, among others. Recent works have highlighted an intricate interplay between trainability and dequantization of such models, which is still unresolved. In this work we contribute to this debate from the perspective of machine learning, proving a number of results identifying, among others when trainability and non-dequantization are not mutually exclusive. We begin by providing a number of new somewhat broader definitions of the relevant concepts, compared to what is found in other literature, which are operationally motivated, and consistent with prior art. With these precise definitions given and motivated, we then study the relation between trainability and dequantization of variational QML. Next, we also discuss the degrees of "variationalness" of QML models, where we distinguish between models like the hardware efficient ansatz and quantum kernel methods. Finally, we introduce recipes for building PQC-based QML models which are both trainable and nondequantizable, and corresponding to different degrees of variationalness. We do not address the practical utility for such models. Our work however does point toward a way forward for finding more general constructions, for which finding applications may become feasible.

Updated: 2025-07-08 08:38:58

标题: 关于可训练性与变分量子学习模型去量化之间的关系

摘要: 对成功的变分量子机器学习（QML）的追求依赖于设计适合的参数化量子电路（PQCs），类似于经典机器学习中的神经网络。成功的QML模型必须具备可训练性和非解量化等性质。最近的研究强调了这些模型的可训练性和解量化之间错综复杂的相互作用，这仍未得到解决。在这项工作中，我们从机器学习的角度对这一争论做出贡献，证明了一些结果，其中包括识别在何种情况下可训练性和非解量化并不是互斥的。我们首先提供了一些相对于其他文献更为广泛的相关概念的新定义，这些定义具有操作上的动机，并与先前的研究一致。在给出并激励这些精确定义后，我们研究了变分QML的可训练性和解量化之间的关系。接下来，我们还讨论了QML模型的“变分性”程度，其中我们区分了硬件高效的ansatz和量子核方法等模型。最后，我们介绍了构建基于PQC的QML模型的方法，这些模型既可训练又不可解量化，并对应不同程度的变分性。我们并未讨论这些模型的实际实用性。然而，我们的工作指出了寻找更一般构造的前进方向，为其找到应用可能变得可行。

更新时间: 2025-07-08 08:38:58

领域: quant-ph,cs.LG,stat.ML

下载: http://arxiv.org/abs/2406.07072v3

NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag: (Normalized Weight and Activation Guided Compression), a unified framework for zero-shot shape preserving compression algorithms. We compressed Llama-2 7B/13B/70B and Llama-3 8/70BB models, using two popular forms of shape-preserving compression, vector quantization NoWag-VQ (NoWag for Vector Quantization), and unstructured/semi-structured pruning NoWag-P (NoWag for Pruning). We found that NoWag-VQ significantly outperforms state-of-the-art zero shot VQ, and that NoWag-P performs competitively against state-of-the-art methods. These results suggest commonalities between these compression paradigms that could inspire future work. Our code is available at https://github.com/LawrenceRLiu/NoWag

Updated: 2025-07-08 08:34:51

标题: NoWag：大语言模型形状保持压缩的统一框架

摘要: 大型语言模型（LLMs）在各种自然语言处理任务中表现出色，但受制于巨大的计算和内存需求，限制了它们在资源受限环境中的部署。为了解决这一挑战，我们提出了NoWag：（归一化权重和激活引导压缩），这是一个统一的零压缩算法框架，可保持形状不变。我们使用两种流行的形状保持压缩形式，向量量化NoWag-VQ（用于向量量化的NoWag）和非结构化/半结构化剪枝NoWag-P（用于剪枝的NoWag）对Llama-2 7B/13B/70B和Llama-3 8/70BB模型进行了压缩。我们发现NoWag-VQ明显优于最先进的零压缩VQ，而NoWag-P与最先进的方法竞争激烈。这些结果表明这些压缩范式之间的共同之处可能会激发未来的研究。我们的代码可以在https://github.com/LawrenceRLiu/NoWag找到。

更新时间: 2025-07-08 08:34:51

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2504.14569v2

Geological Everything Model 3D: A Promptable Foundation Model for Unified and Zero-Shot Subsurface Understanding

Understanding Earth's subsurface is critical for energy transition, natural hazard mitigation, and planetary science. Yet subsurface analysis remains fragmented, with separate models required for structural interpretation, stratigraphic analysis, geobody segmentation, and property modeling-each tightly coupled to specific data distributions and task formulations. We introduce the Geological Everything Model 3D (GEM), a unified generative architecture that reformulates all these tasks as prompt-conditioned inference along latent structural frameworks derived from subsurface imaging. This formulation moves beyond task-specific models by enabling a shared inference mechanism, where GEM propagates human-provided prompts-such as well logs, masks, or structural sketches-along inferred structural frameworks to produce geologically coherent outputs. Through this mechanism, GEM achieves zero-shot generalization across tasks with heterogeneous prompt types, without retraining for new tasks or data sources. This capability emerges from a two-stage training process that combines self-supervised representation learning on large-scale field seismic data with adversarial fine-tuning using mixed prompts and labels across diverse subsurface tasks. GEM demonstrates broad applicability across surveys and tasks, including Martian radar stratigraphy analysis, structural interpretation in subduction zones, full seismic stratigraphic interpretation, geobody segmentation, and property modeling. By bridging expert knowledge with generative reasoning in a structurally aware manner, GEM lays the foundation for scalable, human-in-the-loop geophysical AI-transitioning from fragmented pipelines to a vertically integrated, promptable reasoning system. Project page: https://douyimin.github.io/GEM

Updated: 2025-07-08 08:31:54

标题: 地质全息模型3D：统一和零射孔地下理解的可提示基础模型

摘要: 理解地球地下是能源转型、自然灾害减灾和行星科学的关键。然而，地下分析仍然是零散的，需要单独的模型进行结构解释、地层分析、地质体分割和属性建模-每个模型与特定的数据分布和任务公式紧密耦合。我们引入了地质万物模型3D（GEM），这是一个统一的生成架构，它将所有这些任务重新制定为基于地下成像得出的潜在结构框架上的提示条件推理。这种表述超越了特定任务的模型，通过启用共享推理机制，其中GEM沿着推断出的结构框架传播人类提供的提示-如井记录、掩模或结构草图-以产生地质连贯的输出。通过这种机制，GEM实现了跨异构提示类型的任务的零射击泛化，而无需为新任务或数据来源重新进行训练。这种能力来自一个两阶段的训练过程，结合了在大规模地震数据上的自监督表示学习和使用混合提示和标签进行对抗微调的方法。GEM展示了在各种勘测和任务中的广泛适用性，包括火星雷达地层学分析、俯冲带的结构解释、完整的地震地层解释、地质体分割和属性建模。通过在具有结构意识的方式中将专业知识与生成推理相结合，GEM为可扩展的、人为参与的地球物理AI奠定了基础-从零散的管道过渡到一个垂直集成的、可提示的推理系统。项目页面：https://douyimin.github.io/GEM

更新时间: 2025-07-08 08:31:54

领域: physics.geo-ph,cs.AI

下载: http://arxiv.org/abs/2507.00419v2

Efficient Risk-sensitive Planning via Entropic Risk Measures

Risk-sensitive planning aims to identify policies maximizing some tail-focused metrics in Markov Decision Processes (MDPs). Such an optimization task can be very costly for the most widely used and interpretable metrics such as threshold probabilities or (Conditional) Values at Risk. Indeed, previous work showed that only Entropic Risk Measures (EntRM) can be efficiently optimized through dynamic programming, leaving a hard-to-interpret parameter to choose. We show that the computation of the full set of optimal policies for EntRM across parameter values leads to tight approximations for the metrics of interest. We prove that this optimality front can be computed effectively thanks to a novel structural analysis and smoothness properties of entropic risks. Empirical results demonstrate that our approach achieves strong performance in a variety of decision-making scenarios.

Updated: 2025-07-08 08:20:49

标题: 通过熵风险度量实现高效的风险敏感规划

摘要: 风险敏感规划旨在在马尔可夫决策过程（MDP）中识别最大化一些尾部聚焦指标的策略。这样的优化任务对于最广泛使用和可解释的指标，如阈值概率或（条件）风险价值，可能非常昂贵。实际上，先前的工作表明，只有熵风险度量（EntRM）可以通过动态规划进行高效优化，而留下一个难以解释的参数选择。我们展示，通过在参数值上计算EntRM的全套最优策略集，可以为感兴趣的指标提供紧密的近似值。我们证明，由于新颖的结构分析和熵风险的平滑性质，这种最优性前缘可以有效地计算。经验结果表明，我们的方法在各种决策场景中取得了强大的性能。

更新时间: 2025-07-08 08:20:49

领域: stat.ML,cs.AI,cs.LG,math.OC,math.PR

下载: http://arxiv.org/abs/2502.20423v2

Mind the Cost of Scaffold! Benign Clients May Even Become Accomplices of Backdoor Attack

By using a control variate to calibrate the local gradient of each client, Scaffold has been widely known as a powerful solution to mitigate the impact of data heterogeneity in Federated Learning. Although Scaffold achieves significant performance improvements, we show that this superiority is at the cost of increased security vulnerabilities. Specifically, this paper presents BadSFL, the first backdoor attack targeting Scaffold, which turns benign clients into accomplices to amplify the attack effect. The core idea of BadSFL is to uniquely tamper with the control variate to subtly steer benign clients' local gradient updates towards the attacker's poisoned direction, effectively turning them into unwitting accomplices and significantly enhancing the backdoor persistence. Additionally, BadSFL leverages a GAN-enhanced poisoning strategy to enrich the attacker's dataset, maintaining high accuracy on both benign and backdoored samples while remaining stealthy. Extensive experiments demonstrate that BadSFL achieves superior attack durability, maintaining effectiveness for over 60 global rounds, lasting up to three times longer than existing baselines even after ceasing malicious model injections.

Updated: 2025-07-08 08:14:22

标题: 小心支架的成本！良性客户甚至可能成为后门攻击的同谋

摘要: 通过使用控制变量来校准每个客户端的本地梯度，支架已被广泛认为是减轻联邦学习中数据异质性影响的强大解决方案。尽管支架实现了显著的性能改进，我们表明这种优势是以增加安全漏洞为代价的。具体而言，本文提出了BadSFL，这是针对支架的首个后门攻击，将良性客户端转变为共犯，以放大攻击效果。BadSFL的核心思想是通过独特地篡改控制变量，微妙地引导良性客户端的本地梯度更新朝向攻击者的有毒方向，有效地将它们变成无意的共犯，并显著增强后门的持久性。此外，BadSFL利用了GAN增强的毒化策略来丰富攻击者的数据集，在保持对良性和带后门样本的高准确率的同时保持隐蔽性。大量实验证明，BadSFL实现了出色的攻击持久性，在超过60个全局轮次内保持有效，甚至在停止恶意模型注入后仍能持续时间长达三倍于现有基线。

更新时间: 2025-07-08 08:14:22

领域: cs.LG

下载: http://arxiv.org/abs/2411.16167v3

Real-time monitoring of the SoH of lithium-ion batteries

Real-time monitoring of the state of health (SoH) of batteries remains a major challenge, particularly in microgrids where operational constraints limit the use of traditional methods. As part of the 4BLife project, we propose an innovative method based on the analysis of a discharge pulse at the end of the charge phase. The parameters of the equivalent electrical model describing the voltage evolution across the battery terminals during this current pulse are then used to estimate the SoH. Based on the experimental data acquired so far, the initial results demonstrate the relevance of the proposed approach. After training using the parameters of two batteries with a capacity degradation of around 85%, we successfully predicted the degradation of two other batteries, cycled down to approximately 90% SoH, with a mean absolute error of around 1% in the worst case, and an explainability score of the estimator close to 0.9. If these performances are confirmed, this method can be easily integrated into battery management systems (BMS) and paves the way for optimized battery management under continuous operation.

Updated: 2025-07-08 08:08:53

标题: 锂离子电池健康状况的实时监测

摘要: 电池健康状态（SoH）的实时监测仍然是一个主要挑战，特别是在微电网中，操作约束限制了传统方法的使用。作为4BLife项目的一部分，我们提出了一种基于在充电阶段末端的放电脉冲分析的创新方法。描述电池端子上电压演变的等效电气模型参数随后被用来估计SoH。根据迄今为止获得的实验数据，初始结果显示了所提出方法的相关性。在使用两个容量降解约为85%的电池的参数进行训练后，我们成功预测了另外两个电池的降解，其SoH降至约90%，最差情况下的平均绝对误差约为1％，估计器的可解释性得分接近0.9。如果这些性能得到确认，这种方法可以轻松集成到电池管理系统（BMS）中，并为持续运行下的优化电池管理铺平道路。

更新时间: 2025-07-08 08:08:53

领域: cs.AI

下载: http://arxiv.org/abs/2507.05765v1

Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis

The evolution of large models has witnessed the emergence of In-Context Learning (ICL) capabilities. In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of ICL. Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. However, explorations of demonstration configuration for multimodal ICL remain preliminary. Additionally, the controllability of In-Context Examples (ICEs) provides an efficient and cost-effective means to observe and analyze the inference characteristics of LMMs under varying inputs. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task. Externally, we explore demonstration configuration strategies through three dimensions: shot number, image retrieval, and caption assignment. We employ multiple metrics to systematically and thoroughly evaluate and summarize key findings. Internally, we analyze typical LMM attention characteristics and develop attention-based metrics to quantify model behaviors. We also conduct auxiliary experiments to explore the feasibility of attention-driven model acceleration and compression. We further compare performance variations between LMMs with identical model design and pretraining strategies and explain the differences from the angles of pre-training data features. Our study reveals both how ICEs configuration strategies impact model performance through external experiments and characteristic typical patterns through internal inspection, providing dual perspectives for understanding multimodal ICL in LMMs. Our method of combining external and internal analysis to investigate large models, along with our newly proposed metrics, can be applied to broader research areas.

Updated: 2025-07-08 08:07:57

标题: 揭示图像标题的有效上下文配置：外部和内部分析

摘要: 大型模型的发展见证了In-Context Learning（ICL）能力的出现。在自然语言处理（NLP）领域，许多研究已经证明了ICL的有效性。受到大型语言模型（LLMs）成功的启发，研究人员开发了具有ICL能力的大型多模态模型（LMMs）。然而，对多模态ICL的演示配置的探索仍处于初步阶段。此外，In-Context Examples（ICEs）的可控性提供了一种高效且具有成本效益的方法，可观察和分析LMM在不同输入下的推理特征。本文对图像字幕任务上的多模态上下文学习进行了全面的外部和内部调查。在外部方面，我们通过三个维度探索演示配置策略：拍摄次数、图像检索和字幕分配。我们采用多种指标系统地和彻底地评估和总结关键发现。在内部方面，我们分析了典型的LMM注意力特征，并开发了基于注意力的指标来量化模型行为。我们还进行了辅助实验，探索基于注意力驱动的模型加速和压缩的可行性。我们进一步比较了具有相同模型设计和预训练策略的LMM之间的性能变化，并从预训练数据特征的角度解释了差异。我们的研究揭示了ICEs配置策略如何通过外部实验影响模型性能，并通过内部检查展示了典型模式，为理解LMM中的多模态ICL提供了双重视角。我们结合外部和内部分析方法来研究大型模型，以及我们新提出的指标，可以应用于更广泛的研究领域。

更新时间: 2025-07-08 08:07:57

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.08021v1

PSAT: Pediatric Segmentation Approaches via Adult Augmentations and Transfer Learning

Pediatric medical imaging presents unique challenges due to significant anatomical and developmental differences compared to adults. Direct application of segmentation models trained on adult data often yields suboptimal performance, particularly for small or rapidly evolving structures. To address these challenges, several strategies leveraging the nnU-Net framework have been proposed, differing along four key axes: (i) the fingerprint dataset (adult, pediatric, or a combination thereof) from which the Training Plan -including the network architecture-is derived; (ii) the Learning Set (adult, pediatric, or mixed), (iii) Data Augmentation parameters, and (iv) the Transfer learning method (finetuning versus continual learning). In this work, we introduce PSAT (Pediatric Segmentation Approaches via Adult Augmentations and Transfer learning), a systematic study that investigates the impact of these axes on segmentation performance. We benchmark the derived strategies on two pediatric CT datasets and compare them with state-of-theart methods, including a commercial radiotherapy solution. PSAT highlights key pitfalls and provides actionable insights for improving pediatric segmentation. Our experiments reveal that a training plan based on an adult fingerprint dataset is misaligned with pediatric anatomy-resulting in significant performance degradation, especially when segmenting fine structures-and that continual learning strategies mitigate institutional shifts, thus enhancing generalization across diverse pediatric datasets. The code is available at https://github.com/ICANS-Strasbourg/PSAT.

Updated: 2025-07-08 08:07:36

标题: PSAT：通过成人增强和迁移学习的儿科分割方法

摘要: 小儿医学影像学面临独特挑战，因为与成人相比存在显著的解剖和发育差异。直接应用在成人数据上训练的分割模型通常会产生次优的性能，特别是对于小型或快速演变的结构。为了解决这些挑战，提出了几种利用nnU-Net框架的策略，这些策略在四个关键轴上有所不同：（i）指纹数据集（成人、小儿或两者的组合）衍生出训练计划，包括网络架构；（ii）学习集（成人、小儿或混合）；（iii）数据增强参数；以及（iv）迁移学习方法（微调与持续学习）。在这项工作中，我们介绍了PSAT（通过成人增强和迁移学习的小儿分割方法），这是一项系统性研究，探讨了这些轴对分割性能的影响。我们在两个小儿CT数据集上对这些衍生策略进行了基准测试，并将它们与包括商业放射治疗解决方案在内的最新方法进行了比较。PSAT突显了关键缺陷，并提供了可操作的见解，以改善小儿分割。我们的实验表明，基于成人指纹数据集的训练计划与小儿解剖不一致，导致性能显著下降，特别是在分割细小结构时，并且持续学习策略可以减轻机构转变，从而增强跨多样化小儿数据集的泛化能力。该代码可在https://github.com/ICANS-Strasbourg/PSAT获得。

更新时间: 2025-07-08 08:07:36

领域: eess.IV,cs.LG

下载: http://arxiv.org/abs/2507.05764v1

Magneto-radiative modelling and artificial neural network optimization of biofluid flow in a stenosed arterial domain

The increasing complexity of cardiovascular diseases and limitations in traditional healing methods mandate the invention of new drug delivery systems that assure targeted, effective, and regulated treatments, contributing directly to UN SDGs 3 and 9, thereby encouraging the utilization of sustainable medical technologies in healthcare. This study investigates the flow of a Casson-Maxwell nanofluid through a stenosed arterial domain. The quantities, such as skin friction and heat transfer rate, are analysed in detail. The Casson-Maxwell fluid shows a lower velocity profile than the Casson fluids, which indicates the improved residence time for efficient drug delivery. The heat transfer rate shows an increase with higher volume fractions of copper and aluminium oxide nanoparticles and a decrease with higher volume fractions of silver nanoparticles. The skin friction coefficient decreases by 219% with a unit increase in the Maxwell parameter, whereas it increases by 66.1% with a unit rise in the Casson parameter. This work supports SDGs 4 and 17 by fostering interdisciplinary learning and collaboration in fluid dynamics and healthcare innovation. Additionally, the rate of heat flow was forecasted (with an overall R-value of 0.99457) using the Levenberg-Marquardt backpropagation training scheme under the influence of magneto-radiative, linear heat source and Casson-Maxwell parameters along with the tri-metallic nanoparticle volume fractions. It is also observed that the drag coefficient is most sensitive to the changes in the Maxwell parameter.

Updated: 2025-07-08 08:04:40

标题: 磁致辐射建模和人工神经网络优化在狭窄动脉领域生物流体流动中的应用

摘要: 心血管疾病的日益复杂和传统治疗方法的局限性要求发明新的药物输送系统，以确保治疗的靶向性、有效性和规范性，从而直接促进联合国可持续发展目标3和9的实现，鼓励在医疗保健中利用可持续医疗技术。本研究调查了卡松-麦克斯韦纳米流体通过狭窄动脉领域的流动。详细分析了皮肤摩擦力和传热速率等数量。卡松-麦克斯韦流体显示出比卡松流体更低的速度剖面，表明提高了药物输送的有效停留时间。传热速率随着铜和氧化铝纳米颗粒的体积分数增加而增加，随着银纳米颗粒的体积分数增加而减少。单位增加麦克斯韦参数，皮肤摩擦系数减少219％；单位增加卡松参数，皮肤摩擦系数增加66.1％。这项工作通过在流体动力学和医疗创新领域促进跨学科学习和合作来支持可持续发展目标4和17。此外，利用Levenberg-Marquardt反向传播训练方案预测了热流速率（总R值为0.99457），受到磁辐射、线性热源和卡松-麦克斯韦参数以及三金属纳米颗粒体积分数的影响。还观察到阻力系数对麦克斯韦参数的变化最为敏感。

更新时间: 2025-07-08 08:04:40

领域: physics.med-ph,cs.AI,cs.NA,math.NA,physics.bio-ph

下载: http://arxiv.org/abs/2507.06273v1

An autonomous agent for auditing and improving the reliability of clinical AI models

The deployment of AI models in clinical practice faces a critical challenge: models achieving expert-level performance on benchmarks can fail catastrophically when confronted with real-world variations in medical imaging. Minor shifts in scanner hardware, lighting or demographics can erode accuracy, but currently reliability auditing to identify such catastrophic failure cases before deployment is a bespoke and time-consuming process. Practitioners lack accessible and interpretable tools to expose and repair hidden failure modes. Here we introduce ModelAuditor, a self-reflective agent that converses with users, selects task-specific metrics, and simulates context-dependent, clinically relevant distribution shifts. ModelAuditor then generates interpretable reports explaining how much performance likely degrades during deployment, discussing specific likely failure modes and identifying root causes and mitigation strategies. Our comprehensive evaluation across three real-world clinical scenarios - inter-institutional variation in histopathology, demographic shifts in dermatology, and equipment heterogeneity in chest radiography - demonstrates that ModelAuditor is able correctly identify context-specific failure modes of state-of-the-art models such as the established SIIM-ISIC melanoma classifier. Its targeted recommendations recover 15-25% of performance lost under real-world distribution shift, substantially outperforming both baseline models and state-of-the-art augmentation methods. These improvements are achieved through a multi-agent architecture and execute on consumer hardware in under 10 minutes, costing less than US$0.50 per audit.

Updated: 2025-07-08 07:58:52

标题: 一个用于审计和提高临床人工智能模型可靠性的自主代理

摘要: 在临床实践中部署AI模型面临一个关键挑战：在基准测试中达到专家级性能的模型在面对医学影像中的真实世界变化时可能会发生灾难性失败。扫描仪硬件、光照或人口统计学的微小变化都可能降低准确性，但目前用于识别部署前此类灾难性失败案例的可靠性审计是一个定制且耗时的过程。从业者缺乏可访问和可解释的工具来暴露和修复隐藏的失败模式。在这里，我们介绍了ModelAuditor，这是一个自我反思的代理，与用户对话，选择任务特定的指标，并模拟依赖上下文、临床相关的分布偏移。ModelAuditor然后生成解释性报告，解释在部署过程中性能可能降低多少，讨论特定可能的失败模式并确定根本原因和缓解策略。我们在三个真实世界临床场景中进行了全面评估 - 组织病理学的跨机构变化、皮肤科的人口统计学变化以及胸部放射学的设备异质性 - 结果显示，ModelAuditor能够正确识别基于上下文的最先进模型（如已建立的SIIM-ISIC黑色素瘤分类器）的特定失败模式。其有针对性的建议可以在真实世界分布偏移下恢复15-25%的性能损失，明显优于基线模型和最先进的增强方法。这些改进是通过多代理体系结构实现的，并在不到10分钟内在消费者硬件上执行，每次审计成本不到0.50美元。

更新时间: 2025-07-08 07:58:52

领域: cs.AI

下载: http://arxiv.org/abs/2507.05755v1

LeAD: The LLM Enhanced Planning System Converged with End-to-end Autonomous Driving

A principal barrier to large-scale deployment of urban autonomous driving systems lies in the prevalence of complex scenarios and edge cases. Existing systems fail to effectively interpret semantic information within traffic contexts and discern intentions of other participants, consequently generating decisions misaligned with skilled drivers' reasoning patterns. We present LeAD, a dual-rate autonomous driving architecture integrating imitation learning-based end-to-end (E2E) frameworks with large language model (LLM) augmentation. The high-frequency E2E subsystem maintains real-time perception-planning-control cycles, while the low-frequency LLM module enhances scenario comprehension through multi-modal perception fusion with HD maps and derives optimal decisions via chain-of-thought (CoT) reasoning when baseline planners encounter capability limitations. Our experimental evaluation in the CARLA Simulator demonstrates LeAD's superior handling of unconventional scenarios, achieving 71 points on Leaderboard V1 benchmark, with a route completion of 93%.

Updated: 2025-07-08 07:58:29

标题: LeAD：LLM增强规划系统与端到端自动驾驶的融合

摘要: 城市自动驾驶系统大规模部署的一个主要障碍在于复杂情景和边缘案例的普遍存在。现有系统未能有效解释交通环境中的语义信息并识别其他参与者的意图，从而导致生成与熟练驾驶员推理模式不一致的决策。我们提出了LeAD，一个集成了基于模仿学习的端到端（E2E）框架和大型语言模型（LLM）增强的双速率自动驾驶架构。高频率的E2E子系统维持实时感知-规划-控制循环，而低频率的LLM模块通过与高清地图的多模态感知融合增强了场景理解，并在基线规划器遇到能力限制时通过思维链（CoT）推导出最佳决策。我们在CARLA模拟器中的实验评估表明LeAD在处理非常规情况方面表现优越，在Leaderboard V1基准测试中获得了71分，路线完成率为93%。

更新时间: 2025-07-08 07:58:29

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2507.05754v1

Jigsaw: Training Multi-Billion-Parameter AI Weather Models with Optimized Model Parallelism

AI-based methods have revolutionized atmospheric forecasting, with recent successes in medium-range forecasting spurring the development of climate foundation models. Accurate modeling of complex atmospheric dynamics at high spatial resolutions and longer lead times requires large neural networks and gigabyte-sized data samples, making accelerator memory and I/O-bandwidth the bottlenecks for model training. We introduce WeatherMixer, a multi-layer-perceptron-based architecture whose workload scales linearly with input size, allowing the model to learn global weather phenomena at accuracies similar to numerical weather prediction. To cope with the computational demand, we propose Jigsaw, a novel model parallelization scheme that employs both domain and tensor parallelism, eliminating memory redundancy. Jigsaw exceeds state-of-the-art performance in strong scaling in compute-communication-limited systems and achieves superscalar weak scaling in I/O-bandwidth-limited systems. We scale training to 256 GPUs, reaching peak performances of 9 and 11 PFLOPs, 23% and 28% of theoretical peaks, achieving 68% and 72% scaling efficiency versus 51% without model parallelism.

Updated: 2025-07-08 07:57:08

标题: 拼图：通过优化模型并行性训练拥有数十亿参数的人工智能天气模型

摘要: 基于人工智能的方法已经彻底改变了大气预报，在中期预报方面取得的最新成功推动了气候基础模型的发展。在高空间分辨率和较长前瞻时间精确建模复杂大气动力学需要大型神经网络和千兆字节级数据样本，使加速器内存和I/O带宽成为模型训练的瓶颈。我们引入了WeatherMixer，这是一种基于多层感知器的架构，其工作负荷随输入大小线性扩展，使模型能够以与数值天气预报相似的准确度学习全球气象现象。为了满足计算需求，我们提出了Jigsaw，这是一种新颖的模型并行化方案，采用了领域和张量并行性，消除了内存冗余。Jigsaw在计算-通信受限系统中的强扩展性方面超过了现有技术水平，并且在I/O带宽受限系统中实现了超标量弱扩展性。我们将训练扩展到256个GPU，达到了9和11 PFLOPs的峰值性能，分别为理论峰值的23%和28%，实现了68%和72%的扩展效率，而没有模型并行性的情况下仅为51%。

更新时间: 2025-07-08 07:57:08

领域: cs.LG

下载: http://arxiv.org/abs/2507.05753v1

Pretrained Reversible Generation as Unsupervised Visual Representation Learning

Recent generative models based on score matching and flow matching have significantly advanced generation tasks, but their potential in discriminative tasks remains underexplored. Previous approaches, such as generative classifiers, have not fully leveraged the capabilities of these models for discriminative tasks due to their intricate designs. We propose Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous generation model. PRG effectively reuses unsupervised generative models, leveraging their high capacity to serve as robust and generalizable feature extractors for downstream tasks. This framework enables the flexible selection of feature hierarchies tailored to specific downstream tasks. Our method consistently outperforms prior approaches across multiple benchmarks, achieving state-of-the-art performance among generative model based methods, including 78% top-1 accuracy on ImageNet at a resolution of 64*64. Extensive ablation studies, including out-of-distribution evaluations, further validate the effectiveness of our approach.PRG is available at https://github.com/opendilab/PRG.

Updated: 2025-07-08 07:53:20

标题: 预训练的可逆生成作为无监督视觉表示学习

摘要: 最近基于得分匹配和流匹配的生成模型在生成任务方面取得了显著进展，但它们在区分任务中的潜力尚未得到充分探索。以前的方法，如生成分类器，由于其复杂的设计，未能充分利用这些模型在区分任务中的能力。我们提出了预训练可逆生成（PRG）方法，通过反转预训练连续生成模型的生成过程来提取无监督表示。PRG有效地重用无监督生成模型，利用它们的高容量作为强大且可泛化的特征提取器，用于下游任务。该框架使得可以灵活选择针对特定下游任务定制的特征层次结构。我们的方法在多个基准测试中始终优于先前的方法，包括在分辨率为64*64的ImageNet上达到78%的top-1准确率，实现了基于生成模型的方法中的最先进性能。广泛的消融研究，包括超出分布的评估，进一步验证了我们方法的有效性。PRG可在https://github.com/opendilab/PRG 上获取。

更新时间: 2025-07-08 07:53:20

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2412.01787v5

Decomposition-Based Optimal Bounds for Privacy Amplification via Shuffling

Shuffling has been shown to amplify differential privacy guarantees, enabling a more favorable privacy-utility trade-off. To characterize and compute this amplification, two fundamental analytical frameworks have been proposed: the \emph{privacy blanket} by Balle et al. (CRYPTO 2019) and the \emph{clone paradigm}--including both the standard and stronger variants--by Feldman et al. (FOCS 2021, SODA 2023). These frameworks share a common foundation: decomposing local randomizers into structured components for analysis. In this work, we introduce a unified analytical framework--the general clone paradigm--which subsumes all possible decompositions, with the clone and blanket decompositions arising as special cases. Within this framework, we identify the optimal decomposition, which is precisely the one used by the privacy blanket. Moreover, we develop a simple and efficient algorithm based on the Fast Fourier Transform (FFT) to compute optimal privacy amplification bounds. Experimental results show that our computed upper bounds nearly match the empirical lower bounds, demonstrating the tightness of our method. Building on this method, we also derive optimal amplification bounds for both \emph{joint} and \emph{parallel} compositions of LDP mechanisms in the shuffle model.

Updated: 2025-07-08 07:52:03

标题: 基于分解的隐私放大通过洗牌的最优界限

摘要: 洗牌已被证明可以增强差分隐私保证，实现更有利的隐私-效用权衡。为了表征和计算这种增强效果，已经提出了两种基本的分析框架：Balle等人提出的“隐私毯”（CRYPTO 2019）和Feldman等人提出的“克隆范式”（包括标准和更强变体）（FOCS 2021，SODA 2023）。这些框架共享一个共同的基础：将局部随机器分解为结构化组件进行分析。在这项工作中，我们引入了一个统一的分析框架——通用克隆范式——它包含所有可能的分解，其中克隆和毯子分解是特例。在这个框架内，我们确定了最佳的分解方式，这恰好是隐私毯所使用的。此外，我们开发了一种基于快速傅里叶变换（FFT）的简单高效算法，用于计算最佳隐私增强界限。实验结果显示，我们计算的上界几乎与经验下界相匹配，证明了我们方法的紧密性。基于这个方法，我们还推导了在混洗模型中LDP机制的“联合”和“并行”组合的最佳增强界限。

更新时间: 2025-07-08 07:52:03

领域: cs.CR

下载: http://arxiv.org/abs/2504.07414v4

PVChat: Personalized Video Chat with One-Shot Learning

Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage, demonstrating its superiority in personalized feature understanding after learning from a single video, compared to state-of-the-art ViLLMs.

Updated: 2025-07-08 07:50:07

标题: PVChat：使用一次性学习的个性化视频聊天

摘要: 视频大型语言模型（ViLLMs）在一般视频理解方面表现优异，例如识别说话和进食等活动，但在身份感知理解方面，如“Wilson正在接受化疗”或“Tom正在与Sarah讨论”上遇到困难，限制了它们在智能医疗和智能家居环境中的适用性。为了解决这一限制，我们提出了一种一次性学习框架PVChat，这是第一个能够从每个主体的单个视频中进行主体感知问答（QA）的个性化ViLLM。我们的方法在一个合成增广视频问答数据集上优化了一个增强型Mixture-of-Heads（MoH）的ViLLM，利用了渐进式图像到视频学习策略。具体来说，我们引入了一个自动增广管道，合成保持身份的正样本，并从现有视频语料库中检索难以处理的负样本，生成了一个包含四种QA类型的多样化训练数据集：存在、外观、动作和位置查询。为了增强主体特定的学习，我们提出了一个ReLU Routing MoH注意机制，以及两个新颖的目标：（1）通过指数距离缩放进行渐进式学习的平滑接近正则化和（2）用于平衡关注路由的Head Activation Enhancement。最后，我们采用了一个两阶段训练策略，从图像预训练过渡到视频微调，使从静态属性到动态表示的学习过程逐渐进行。我们在涵盖医疗情景、电视剧、动画和真实世界影片的多样化数据集上对PVChat进行评估，展示了它在从单个视频学习后个性化特征理解方面的优越性，相较于最先进的ViLLMs。

更新时间: 2025-07-08 07:50:07

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.17069v3

Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering

Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or the latest knowledge. To mitigate this issue, retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs), referred to as KB-VQA, emerges as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often result in the loss of critical visual details. To address these challenges, this study presents two key innovations. First, we introduce fine-grained knowledge units that consist of multimodal data fragments (e.g. text fragments, entity images, and so on) in a structured manner. Rather than merely refining retrieval mechanisms, we prioritize the systematic organization and management of these knowledge units, ensuring that the structuring process itself enhances retrieval quality. Second, we propose a knowledge unit retrieval-augmented generation framework (KU-RAG) that seamlessly integrates fine-grained retrieval with MLLMs. Our KU-RAG framework not only ensures precise retrieval of relevant knowledge but also enhances reasoning capabilities through a knowledge correction chain. Experimental results demonstrate that our approach consistently outperforms existing KB-VQA methods across four benchmarks, achieving an average improvement of approximately 3% and up to 11% in the best case.

Updated: 2025-07-08 07:47:57

标题: 细粒度知识结构化和检索在视觉问答中的应用

摘要: 视觉问答（VQA）着重于利用图像信息回答自然语言问题。尽管前沿的多模态大语言模型（MLLMs）如GPT-4o在VQA任务上表现出色，但它们经常在访问特定领域或最新知识方面存在不足。为了缓解这一问题，利用外部知识库的检索增强生成（RAG），即KB-VQA，被提出作为一种有前途的方法。然而，传统的单模式检索技术将图像转化为文本描述，往往导致关键视觉细节的丢失。为了解决这些挑战，本研究提出了两个关键创新。首先，我们引入了由多模态数据片段（如文本片段、实体图像等）结构化组成的细粒度知识单元。我们不仅仅是完善检索机制，更注重对这些知识单元的系统组织和管理，确保结构化过程本身提升了检索质量。其次，我们提出了一个知识单元检索增强生成框架（KU-RAG），它将细粒度检索与MLLMs无缝集成。我们的KU-RAG框架不仅确保精确检索相关知识，还通过知识校正链增强推理能力。实验结果表明，我们的方法在四个基准测试中始终优于现有的KB-VQA方法，平均改进约为3%，在最佳情况下可达到11%。

更新时间: 2025-07-08 07:47:57

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2502.20964v3

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.

Updated: 2025-07-08 07:46:26

标题: LIRA：利用局部交错区域辅助在大型多模态模型中推断分割

摘要: 尽管大型多模型(LMMs)在分割和理解方面展示了令人期待的能力，但它们仍然面临两个限制：不精确的分割和虚构的理解。这些挑战主要源于对弱视觉理解的限制和缺乏细粒度感知。为了缓解这些限制，我们提出了LIRA，一个利用视觉理解和分割之间互补关系的框架，通过两个关键组件实现：(1) 语义增强特征提取器(SEFE)通过融合语义和像素级特征来改进对象属性推断，从而实现更准确的分割；(2) 交错局部视觉耦合(ILVC)在提取基于分割掩模的局部特征后，自回归地生成局部描述，提供细粒度监督以减轻虚构。此外，我们发现对象分割的精度与<seg>标记的潜在相关语义呈正相关关系。为了量化这种关系和模型的潜在语义推断能力，我们引入了属性评估(AttrEval)数据集。我们的实验表明，LIRA在分割和理解任务中实现了最先进的性能。代码将在https://github.com/echo840/LIRA上提供。

更新时间: 2025-07-08 07:46:26

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.06272v1

Policy Verification in Stochastic Dynamical Systems Using Logarithmic Neural Certificates

We consider the verification of neural network policies for discrete-time stochastic systems with respect to reach-avoid specifications. We use a learner-verifier procedure that learns a certificate for the specification, represented as a neural network. Verifying that this neural network certificate is a so-called reach-avoid supermartingale (RASM) proves the satisfaction of a reach-avoid specification. Existing approaches for such a verification task rely on computed Lipschitz constants of neural networks. These approaches struggle with large Lipschitz constants, especially for reach-avoid specifications with high threshold probabilities. We present two key contributions to obtain smaller Lipschitz constants than existing approaches. First, we introduce logarithmic RASMs (logRASMs), which take exponentially smaller values than RASMs and hence have lower theoretical Lipschitz constants. Second, we present a fast method to compute tighter upper bounds on Lipschitz constants based on weighted norms. Our empirical evaluation shows we can consistently verify the satisfaction of reach-avoid specifications with probabilities as high as 99.9999%.

Updated: 2025-07-08 07:42:57

标题: 使用对数神经证书在随机动力系统中的政策验证

摘要: 我们考虑针对离散时间随机系统的神经网络策略的验证，涉及到达-避免规范。我们使用一个学习者-验证者过程，为规范学习一个证书，表示为一个神经网络。验证这个神经网络证书是否是所谓的到达-避免超马氏（RASM）证明了达到-避免规范的满足。现有方法用于这种验证任务依赖于神经网络的计算Lipschitz常数。这些方法在处理大的Lipschitz常数时存在困难，尤其是对于具有高阈值概率的达到-避免规范。我们提出了两个关键贡献，以获得比现有方法更小的Lipschitz常数。首先，我们引入对数RASM（logRASMs），其值比RASMs小指数级，因此具有更低的理论Lipschitz常数。其次，我们提出了一种基于加权范数计算更紧密上界的Lipschitz常数的快速方法。我们的实证评估表明，我们可以持续验证达到-避免规范的满足，概率高达99.9999%。

更新时间: 2025-07-08 07:42:57

领域: cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2406.00826v3

When Transformers Meet Recommenders: Integrating Self-Attentive Sequential Recommendation with Fine-Tuned LLMs

Self-Attentive Sequential Recommendation (SASRec) effectively captures long-term user preferences by applying attention mechanisms to historical interactions. Concurrently, the rise of Large Language Models (LLMs) has motivated research into LLM-based recommendation, which leverages their powerful generalization and language understanding capabilities. However, LLMs often lack the domain-specific knowledge and collaborative signals essential for high-quality recommendations when relying solely on textual prompts. To address this limitation, this study proposes SASRecLLM, a novel framework that integrates SASRec as a collaborative encoder with an LLM fine-tuned using Low-Rank Adaptation (LoRA). The components are connected via a mapping layer to align their dimensional spaces, and three targeted training strategies are designed to optimize the hybrid architecture. Extensive experiments on multiple datasets demonstrate that SASRecLLM achieves robust and consistent improvements over strong baselines in both cold-start and warm-start scenarios. This work advances the field of LLM-based recommendation by presenting a modular and effective paradigm for fusing structured collaborative filtering with the semantic power of fine-tuned LLMs. The implementation is available on GitHub: https://github.com/kechenkristin/RecLLM

Updated: 2025-07-08 07:26:55

标题: 当变压器遇上推荐系统：将自注意力序列推荐与微调的LLMs集成起来

摘要: Self-Attentive Sequential Recommendation (SASRec) 通过将注意力机制应用于历史交互，有效地捕捉了长期用户偏好。同时，大型语言模型（LLMs）的兴起促使了基于LLM的推荐研究，利用了它们强大的泛化和语言理解能力。然而，当仅依赖于文本提示时，LLMs通常缺乏领域特定知识和协作信号，这对于高质量推荐是必不可少的。为了解决这一限制，本研究提出了SASRecLLM，这是一个新颖的框架，将SASRec作为协作编码器与使用低秩适应（LoRA）微调的LLM集成在一起。这些组件通过映射层连接以对齐它们的维度空间，并设计了三种针对性的训练策略来优化混合架构。在多个数据集上进行的广泛实验表明，SASRecLLM在冷启动和热启动情况下都实现了强有力基线的稳健和一致的改进。这项工作通过提出一种将结构化协作过滤与精调LLM的语义能力融合的模块化和有效范式，推动了基于LLM的推荐领域的发展。该实现可在GitHub上找到：https://github.com/kechenkristin/RecLLM

更新时间: 2025-07-08 07:26:55

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2507.05733v1

A Satellite-Ground Synergistic Large Vision-Language Model System for Earth Observation

Recently, large vision-language models (LVLMs) unleash powerful analysis capabilities for low Earth orbit (LEO) satellite Earth observation images in the data center. However, fast satellite motion, brief satellite-ground station (GS) contact windows, and large size of the images pose a data download challenge. To enable near real-time Earth observation applications (e.g., disaster and extreme weather monitoring), we should explore how to deploy LVLM in LEO satellite networks, and design SpaceVerse, an efficient satellite-ground synergistic LVLM inference system. To this end, firstly, we deploy compact LVLMs on satellites for lightweight tasks, whereas regular LVLMs operate on GSs to handle computationally intensive tasks. Then, we propose a computing and communication co-design framework comprised of a progressive confidence network and an attention-based multi-scale preprocessing, used to identify on-satellite inferring data, and reduce data redundancy before satellite-GS transmission, separately. We implement and evaluate SpaceVerse on real-world LEO satellite constellations and datasets, achieving a 31.2% average gain in accuracy and a 51.2% reduction in latency compared to state-of-the-art baselines.

Updated: 2025-07-08 07:24:34

标题: 一个卫星-地面协同的大视觉-语言模型系统用于地球观测

摘要: 最近，大型视觉语言模型（LVLMs）在数据中心为低地球轨道（LEO）卫星地球观测图像释放出强大的分析能力。然而，快速卫星运动、短暂的卫星-地面站（GS）接触窗口以及图像的大尺寸给数据下载带来了挑战。为了实现近实时的地球观测应用（例如灾害和极端天气监测），我们应该探索如何在LEO卫星网络中部署LVLM，并设计SpaceVerse，一个高效的卫星-地面协同LVLM推断系统。为此，首先，我们在卫星上部署紧凑的LVLM用于轻量级任务，而常规的LVLM在地面站上运行以处理计算密集型任务。然后，我们提出了一个由渐进置信网络和基于注意力的多尺度预处理组成的计算和通信共同设计框架，用于识别在卫星上推断数据，并在卫星-GS传输之前分别减少数据冗余。我们在真实的LEO卫星星座和数据集上实施和评估SpaceVerse，与最先进的基线相比，实现了31.2%的平均准确度增益和51.2%的延迟降低。

更新时间: 2025-07-08 07:24:34

领域: cs.NI,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.05731v1

Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study

Hyperspectral images are high-dimensional datasets consisting of hundreds of contiguous spectral bands, enabling detailed material and surface analysis. Hyperspectral anomaly detection (HAD) refers to the technique of identifying and locating anomalous targets in such data without prior information about a hyperspectral scene or target spectrum. This technology has seen rapid advancements in recent years, with applications in agriculture, defence, military surveillance, and environmental monitoring. Despite this significant progress, existing HAD methods continue to face challenges such as high computational complexity, sensitivity to noise, and limited generalisation across diverse datasets. This study presents a comprehensive comparison of various HAD techniques, categorising them into statistical models, representation-based methods, classical machine learning approaches, and deep learning models. We evaluated these methods across 17 benchmarking datasets using different performance metrics, such as ROC, AUC, and separability map to analyse detection accuracy, computational efficiency, their strengths, limitations, and directions for future research.The research shows that deep learning models achieved the highest detection accuracy, while statistical models demonstrated exceptional speed across all datasets. This study aims to provide valuable insights for researchers and practitioners working to advance the field of hyperspectral anomaly detection methods.

Updated: 2025-07-08 07:23:24

标题: 高光谱异常检测方法：调查和比较研究

摘要: 高光谱图像是由数百个连续光谱波段组成的高维数据集，可以进行详细的材料和表面分析。高光谱异常检测（HAD）是指在没有关于高光谱场景或目标光谱的先验信息的情况下，识别和定位这类数据中的异常目标的技术。近年来，这项技术取得了快速发展，在农业、国防、军事监视和环境监测等领域得到了应用。尽管取得了显著进展，但现有的HAD方法仍面临诸如计算复杂性高、对噪声敏感和在各种数据集上的泛化能力有限等挑战。本研究对各种HAD技术进行了全面比较，将它们分为统计模型、基于表示的方法、经典机器学习方法和深度学习模型。我们使用不同的性能指标（如ROC、AUC和可分离性图）在17个基准数据集上评估了这些方法，分析了检测准确性、计算效率、它们的优势、局限性和未来研究方向。研究表明，深度学习模型在所有数据集上实现了最高的检测准确性，而统计模型在速度方面表现出色。本研究旨在为致力于推进高光谱异常检测方法领域的研究人员和从业者提供宝贵的见解。

更新时间: 2025-07-08 07:23:24

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.05730v1

Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling

Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness, particularly when processing queries exceeding their knowledge boundaries. While existing mitigation strategies employ uncertainty estimation or query rejection mechanisms, they suffer from computational efficiency and sacrificed helpfulness. To address these issues, we propose the Explicit Knowledge Boundary Modeling (EKBM) framework, integrating fast and slow reasoning systems to harmonize reliability and usability. The framework first employs a fast-thinking model to generate confidence-labeled responses, enabling immediate utilization of high-confidence outputs, whereas uncertain predictions trigger a slow refinement model for accuracy improvement. To align model behavior with our proposed object, we propose a hybrid training pipeline, enhancing self-awareness without degrading task performance. Evaluations on dialogue state tracking tasks demonstrate that EKBM achieves superior model reliability over uncertainty-based baselines. Further analysis reveals that refinement substantially boosts accuracy while maintaining low computational overhead. The framework establishes a scalable paradigm for deploying reliable LLMs in error-sensitive applications, effectively balancing accuracy and practical utility.

Updated: 2025-07-08 07:23:07

标题: 通过明确知识边界建模提高LLM可靠性

摘要: 大型语言模型（LLMs）容易出现幻觉，这源于自我意识不一致，特别是在处理超出其知识范围的查询时。虽然现有的缓解策略采用不确定性估计或查询拒绝机制，但它们在计算效率和帮助性方面存在问题。为了解决这些问题，我们提出了显式知识边界建模（EKBM）框架，将快速和慢速推理系统结合起来，以协调可靠性和可用性。该框架首先采用快速思考模型生成带有置信标签的响应，使高置信度的输出能够立即得到利用，而不确定预测会触发一个用于提高准确性的慢速细化模型。为了使模型行为与我们提出的对象保持一致，我们提出了一种混合训练流程，增强自我意识而不降低任务性能。对对话状态跟踪任务的评估表明，EKBM在可靠性方面优于基于不确定性的基线模型。进一步分析表明，细化显著提高了准确性，同时保持低计算开销。该框架为在错误敏感应用中部署可靠的LLMs建立了可扩展的范式，有效平衡了准确性和实用性。

更新时间: 2025-07-08 07:23:07

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.02233v3

Asynchronous Event Error-Minimizing Noise for Safeguarding Event Dataset

With more event datasets being released online, safeguarding the event dataset against unauthorized usage has become a serious concern for data owners. Unlearnable Examples are proposed to prevent the unauthorized exploitation of image datasets. However, it's unclear how to create unlearnable asynchronous event streams to prevent event misuse. In this work, we propose the first unlearnable event stream generation method to prevent unauthorized training from event datasets. A new form of asynchronous event error-minimizing noise is proposed to perturb event streams, tricking the unauthorized model into learning embedded noise instead of realistic features. To be compatible with the sparse event, a projection strategy is presented to sparsify the noise to render our unlearnable event streams (UEvs). Extensive experiments demonstrate that our method effectively protects event data from unauthorized exploitation, while preserving their utility for legitimate use. We hope our UEvs contribute to the advancement of secure and trustworthy event dataset sharing. Code is available at: https://github.com/rfww/uevs.

Updated: 2025-07-08 07:21:59

标题: 异步事件误差最小化噪声以保障事件数据集

摘要: 随着更多事件数据集在线发布，保护事件数据集免受未经授权的使用已经成为数据所有者的严重关注。提出了不可学习示例来防止图像数据集的未经授权利用。然而，如何创建不可学习的异步事件流以防止事件被滥用尚不清楚。在这项工作中，我们提出了第一个不可学习的事件流生成方法，以防止未经授权的训练来自事件数据集。提出了一种新形式的异步事件误差最小化噪声，用于扰乱事件流，欺骗未经授权的模型学习嵌入的噪声而不是现实特征。为了与稀疏事件兼容，提出了一种投影策略来稀疏噪声，使我们的不可学习事件流（UEvs）变得更加难以学习。大量实验证明，我们的方法有效保护事件数据免受未经授权的利用，同时保留其对合法使用的效用。我们希望我们的UEvs有助于推动安全可信的事件数据集共享的进步。代码可在以下链接找到：https://github.com/rfww/uevs。

更新时间: 2025-07-08 07:21:59

领域: cs.CR

下载: http://arxiv.org/abs/2507.05728v1

Multi-Channel Hypergraph Contrastive Learning for Matrix Completion

Rating is a typical user explicit feedback that visually reflects how much a user likes a related item. The (rating) matrix completion is essentially a rating prediction process, which is also a significant problem in recommender systems. Recently, graph neural networks (GNNs) have been widely used in matrix completion, which captures users' preferences over items by formulating a rating matrix as a bipartite graph. However, existing methods are susceptible due to data sparsity and long-tail distribution in real-world scenarios. Moreover, the messaging mechanism of GNNs makes it difficult to capture high-order correlations and constraints between nodes, which are essentially useful in recommendation tasks. To tackle these challenges, we propose a Multi-Channel Hypergraph Contrastive Learning framework for matrix completion, named MHCL. Specifically, MHCL adaptively learns hypergraph structures to capture high-order correlations between nodes and jointly captures local and global collaborative relationships through attention-based cross-view aggregation. Additionally, to consider the magnitude and order information of ratings, we treat different rating subgraphs as different channels, encourage alignment between adjacent ratings, and further achieve the mutual enhancement between different ratings through multi-channel cross-rating contrastive learning. Extensive experiments on five public datasets demonstrate that the proposed method significantly outperforms the current state-of-the-art approaches.

Updated: 2025-07-08 07:19:20

标题: 多通道超图对比学习用于矩阵补全

摘要: 评分是一种典型的用户明确反馈，可以直观反映用户对相关物品的喜好程度。评分矩阵填充实质上是一个评分预测过程，在推荐系统中也是一个重要问题。最近，图神经网络（GNNs）在矩阵填充中被广泛应用，通过将评分矩阵构建为二部图来捕获用户对物品的偏好。然而，在现实场景中，现有方法容易受到数据稀疏和长尾分布的影响。此外，GNNs的消息传递机制使其难以捕捉节点之间的高阶相关性和约束，而这在推荐任务中实质上是有用的。为了应对这些挑战，我们提出了一个用于矩阵填充的多通道超图对比学习框架，命名为MHCL。具体来说，MHCL自适应地学习超图结构，捕捉节点之间的高阶相关性，并通过基于注意力的跨视图聚合同时捕捉局部和全局的协作关系。此外，为了考虑评分的大小和顺序信息，我们将不同评分子图视为不同通道，鼓励相邻评分之间的对齐，并通过多通道交叉评分对比学习进一步实现不同评分之间的互相增强。对五个公共数据集的广泛实验表明，所提出的方法明显优于当前最先进的方法。

更新时间: 2025-07-08 07:19:20

领域: cs.LG,cs.IR

下载: http://arxiv.org/abs/2411.01376v2

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model \emph{Omni-router Transformer}. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.

Updated: 2025-07-08 07:18:33

标题: 全路由器：在稀疏专家混合中共享语音识别的路由决策

摘要: 混合专家（MoE）架构已经从语言建模扩展到自动语音识别（ASR）。传统的MoE方法，例如Switch Transformer，在每一层内独立路由专家。我们的分析表明，大多数层中的路由器所做的专家选择与其他层中的路由器的选择并不强相关。为了增加不同层中专家之间的合作，并鼓励更大的专业化，我们在不同的MoE层之间使用共享路由器。我们将这个模型称为“全路由器变压器”。对一个大规模伪标记数据集进行了大量实验，并在10个不同领域的ASR基准测试中进行了评估，结果表明全路由器变压器能够实现更低的训练损失，并始终优于密集和Switch Transformer模型，分别将平均词错误率降低了11.2%和8.2%，同时提供结构化的专家使用和对多样化数据的改进鲁棒性。

更新时间: 2025-07-08 07:18:33

领域: cs.CL,cs.AI,cs.LG,cs.SD,eess.AS

下载: http://arxiv.org/abs/2507.05724v1

Empirical Validation of the Independent Chip Model

The independent chip model (ICM) forms a cornerstone of all modern poker tournament strategy. However, despite its prominence, the ICM's performance in the real world has not been sufficiently scrutinized, especially at a large scale. In this paper, we introduce our new dataset of poker tournaments, consisting of results of over ten thousand events. Then, using this dataset, we perform two experiments as part of a large-scale empirical validation of the ICM. First, we verify that the ICM performs more accurately than a baseline we propose. Second, we obtain empirical evidence of the ICM underestimating the performances of players with larger stacks while overestimating those who are short-stacked. Our contributions may be useful to future researchers developing new algorithms for estimating a player's value in poker tournaments.

Updated: 2025-07-08 07:17:16

标题: 实证验证独立筹码模型

摘要: 独立筹码模型（ICM）是所有现代扑克锦标赛策略的基石。然而，尽管它很重要，ICM在现实世界中的表现还没有得到足够的审查，特别是在大规模的情况下。在这篇论文中，我们介绍了我们的新扑克锦标赛数据集，包括一万多个事件的结果。然后，利用这个数据集，我们进行了两个实验，作为对ICM的大规模实证验证的一部分。首先，我们验证ICM的表现比我们提出的基准更准确。其次，我们获得了实证证据，表明ICM低估了拥有更大筹码的玩家的表现，同时高估了筹码较少的玩家。我们的研究成果可能对未来开发新算法来估算扑克锦标赛中玩家价值的研究人员有用。

更新时间: 2025-07-08 07:17:16

领域: cs.GT,cs.LG

下载: http://arxiv.org/abs/2506.00180v2

Hierarchical Task Offloading for UAV-Assisted Vehicular Edge Computing via Deep Reinforcement Learning

With the emergence of compute-intensive and delay-sensitive applications in vehicular networks, unmanned aerial vehicles (UAVs) have emerged as a promising complement for vehicular edge computing due to the high mobility and flexible deployment. However, the existing UAV-assisted offloading strategies are insufficient in coordinating heterogeneous computing resources and adapting to dynamic network conditions. Hence, this paper proposes a dual-layer UAV-assisted edge computing architecture based on partial offloading, composed of the relay capability of high-altitude UAVs and the computing support of low-altitude UAVs. The proposed architecture enables efficient integration and coordination of heterogeneous resources. A joint optimization problem is formulated to minimize the system delay and energy consumption while ensuring the task completion rate. To solve the high-dimensional decision problem, we reformulate the problem as a Markov decision process and propose a hierarchical offloading scheme based on the soft actor-critic algorithm. The method decouples global and local decisions, where the global decisions integrate offloading ratios and trajectory planning into continuous actions, while the local scheduling is handled via designing a priority-based mechanism. Simulations are conducted and demonstrate that the proposed approach outperforms several baselines in task completion rate, system efficiency, and convergence speed, showing strong robustness and applicability in dynamic vehicular environments.

Updated: 2025-07-08 07:10:52

标题: 无人机辅助车辆边缘计算的分层任务卸载：基于深度强化学习

摘要: 随着计算密集型和延迟敏感的应用在车载网络中的出现，无人机（UAVs）因其高度移动性和灵活部署而成为车辆边缘计算的有前途的补充。然而，现有的UAV辅助卸载策略在协调异构计算资源和适应动态网络条件方面仍有不足。因此，本文提出了一种基于部分卸载的双层UAV辅助边缘计算架构，由高空UAV的中继能力和低空UAV的计算支持组成。所提出的架构实现了异构资源的有效整合和协调。提出了一个联合优化问题，旨在最小化系统延迟和能耗，同时确保任务完成率。为了解决高维决策问题，我们将问题重新制定为马尔可夫决策过程，并提出了基于软演员-评论家算法的分层卸载方案。该方法解耦了全局和局部决策，其中全局决策将卸载比例和轨迹规划整合到连续动作中，而本地调度则通过设计基于优先级的机制来处理。进行了仿真实验，并表明所提出的方法在任务完成率、系统效率和收敛速度方面优于几种基线方法，在动态车载环境中表现出较强的鲁棒性和适用性。

更新时间: 2025-07-08 07:10:52

领域: cs.LG

下载: http://arxiv.org/abs/2507.05722v1

MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment

Recently, there has been a surge of vision-based GUI agents designed to automate everyday mobile and web tasks. These agents interpret raw GUI screenshots and autonomously decide where to click, scroll, or type, which bypasses handcrafted rules and app-specific APIs. However, most existing methods trained GUI agent in the offline environment using pre-collected trajectories. This approach limits scalability, causes overfitting to specific UI templates, and leads to brittle policies when faced with unseen environment. We present MobileGUI-RL, a scalable framework that trains GUI agent in online environment. MobileGUI-RL contains two key components. It (i) synthesizes a curriculum of learnable tasks through self-exploration and filtering, and (ii) adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards that balance task success and execution efficiency. Experiments on three online mobile-agent benchmarks show consistent gains, validating the effectiveness of our approach.

Updated: 2025-07-08 07:07:53

标题: MobileGUI-RL：通过在线环境中的强化学习推进移动GUI代理

摘要: 最近，出现了一大批基于视觉的GUI代理，旨在自动化日常移动和Web任务。这些代理解释原始GUI截图，并自主决定何时点击、滚动或输入，从而绕过手工制定的规则和特定应用程序的API。然而，大多数现有方法在离线环境中训练GUI代理，使用预先收集的轨迹。这种方法限制了可扩展性，导致对特定UI模板的过拟合，并在面对未知环境时导致脆弱的策略。我们提出了MobileGUI-RL，这是一个可扩展的框架，可以在在线环境中训练GUI代理。MobileGUI-RL包含两个关键组件。它（i）通过自我探索和过滤合成可学习任务的课程，并且（ii）通过适应GRPO来进行GUI导航，具有轨迹感知优势和平衡任务成功和执行效率的复合奖励。在三个在线移动代理基准测试中进行的实验显示出一致的收益，验证了我们方法的有效性。

更新时间: 2025-07-08 07:07:53

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2507.05720v1

Common Data Format (CDF): A Standardized Format for Match-Data in Football (Soccer)

During football matches, a variety of different parties (e.g., companies) each collect (possibly overlapping) data about the match ranging from basic information (e.g., starting players) to detailed positional data. This data is provided to clubs, federations, and other organizations who are increasingly interested in leveraging this data to inform their decision making. Unfortunately, analyzing such data pose significant barriers because each provider may (1) collect different data, (2) use different specifications even within the same category of data, (3) represent the data differently, and (4) delivers the data in a different manner (e.g., file format, protocol). Consequently, working with these data requires a significant investment of time and money. The goal of this work is to propose a uniform and standardized format for football data called the Common Data Format (CDF). The CDF specifies a minimal schema for five types of match data: match sheet data, video footage, event data, tracking data, and match meta data. It aims to ensure that the provided data is clear, sufficiently contextualized (e.g., its provenance is clear), and complete such that it enables common downstream analysis tasks. Concretely, this paper will detail the technical specifications of the CDF, the representational choices that were made to help ensure the clarity of the provided data, and a concrete approach for delivering data in the CDF. This represents Version 1.0.0 of the CDF.

Updated: 2025-07-08 07:05:17

标题: 共同数据格式（CDF）：足球比赛数据的标准化格式

摘要: 在足球比赛期间，各种不同的利益相关方（例如公司）收集各种关于比赛的数据，从基本信息（例如首发球员）到详细的位置数据。这些数据提供给俱乐部、联合会和其他组织，这些组织越来越有兴趣利用这些数据来指导他们的决策。不幸的是，分析这些数据存在重大障碍，因为每个提供者可能会（1）收集不同的数据，（2）即使在同一类别的数据中使用不同的规范，（3）以不同的方式表示数据，以及（4）以不同的方式提供数据（例如文件格式、协议）。因此，处理这些数据需要大量的时间和金钱投资。本文的目标是提出一个统一和标准化的足球数据格式，称为通用数据格式（CDF）。CDF为五种类型的比赛数据（比赛表数据、视频镜头、事件数据、跟踪数据和比赛元数据）指定了最小模式。它旨在确保提供的数据清晰、充分上下文化（例如其出处清晰）和完整，以便使其能够启用常见的下游分析任务。具体来说，本文将详细介绍CDF的技术规范、为确保提供数据清晰而做出的表达选择，以及在CDF中传递数据的具体方法。这代表了CDF的1.0.0版本。

更新时间: 2025-07-08 07:05:17

领域: cs.DB,cs.AI

下载: http://arxiv.org/abs/2505.15820v3

Divergent Realities: A Comparative Analysis of Human Expert vs. Artificial Intelligence Based Generation and Evaluation of Treatment Plans in Dermatology

Background: Evaluating AI-generated treatment plans is a key challenge as AI expands beyond diagnostics, especially with new reasoning models. This study compares plans from human experts and two AI models (a generalist and a reasoner), assessed by both human peers and a superior AI judge. Methods: Ten dermatologists, a generalist AI (GPT-4o), and a reasoning AI (o3) generated treatment plans for five complex dermatology cases. The anonymized, normalized plans were scored in two phases: 1) by the ten human experts, and 2) by a superior AI judge (Gemini 2.5 Pro) using an identical rubric. Results: A profound 'evaluator effect' was observed. Human experts scored peer-generated plans significantly higher than AI plans (mean 7.62 vs. 7.16; p=0.0313), ranking GPT-4o 6th (mean 7.38) and the reasoning model, o3, 11th (mean 6.97). Conversely, the AI judge produced a complete inversion, scoring AI plans significantly higher than human plans (mean 7.75 vs. 6.79; p=0.0313). It ranked o3 1st (mean 8.20) and GPT-4o 2nd, placing all human experts lower. Conclusions: The perceived quality of a clinical plan is fundamentally dependent on the evaluator's nature. An advanced reasoning AI, ranked poorly by human experts, was judged as superior by a sophisticated AI, revealing a deep gap between experience-based clinical heuristics and data-driven algorithmic logic. This paradox presents a critical challenge for AI integration, suggesting the future requires synergistic, explainable human-AI systems that bridge this reasoning gap to augment clinical care.

Updated: 2025-07-08 06:59:58

标题: 分歧现实：皮肤病治疗计划的人类专家与人工智能生成和评估的比较分析

摘要: 背景：评估人工智能生成的治疗方案是一个关键挑战，尤其是在人工智能超越诊断领域并采用新的推理模型的情况下。本研究比较了人类专家和两个人工智能模型（一个通用模型和一个推理模型）为五个复杂皮肤病例生成的治疗方案，这些方案由人类同行和一个更高级的人工智能评判者进行评估。方法：十名皮肤科医生、一个通用人工智能模型（GPT-4o）和一个推理人工智能模型（o3）为五个复杂皮肤病例生成了治疗方案。匿名化、标准化的方案通过两个阶段进行评分：1）由十名人类专家评分，2）由一个更高级的人工智能评判者（Gemini 2.5 Pro）使用相同的评分标准进行评分。结果：观察到了显著的“评估者效应”。人类专家评分同行生成的方案显著高于人工智能生成的方案（平均分别为7.62和7.16；p=0.0313），将GPT-4o排名第六（平均分7.38），将推理模型o3排名第十一（平均分6.97）。相反，人工智能评判者产生了完全颠倒的结果，评分显示人工智能生成的方案显著高于人类生成的方案（平均分别为7.75和6.79；p=0.0313），将o3排名第一（平均分8.20），将GPT-4o排名第二，将所有人类专家排名较低。结论：临床方案的感知质量在很大程度上取决于评估者的性质。一个被人类专家评为较低的高级推理人工智能被一个复杂的人工智能评判者评为更优秀，揭示了基于经验的临床启发式和基于数据的算法逻辑之间存在的深刻差距。这种悖论提出了对人工智能整合的重要挑战，暗示未来需要协同、可解释的人工智能系统来弥合这种推理差距，以增强临床护理。

更新时间: 2025-07-08 06:59:58

领域: cs.AI

下载: http://arxiv.org/abs/2507.05716v1

HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often \textit{lack a granular focus on RAG task} or \textit{a deeper utilization of chain-of-thought processes}. To address this, we propose that RAG models should possess three progressively hierarchical abilities (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG) incorporates a "think before answering" strategy. This method enhances the model's open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model's performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.

Updated: 2025-07-08 06:53:28

标题: HIRAG: 分层思维指导-调整检索增强生成

摘要: 检索增强生成（RAG）已成为解决大型语言模型在处理实时信息和特定领域问题时面临的挑战的基本范例。传统的RAG系统主要依赖于大型语言模型本身的上下文学习（ICL）能力。然而，对RAG生成模型所需的具体能力的深入研究不足，导致文档质量不一致和检索系统缺陷等挑战。即使有限的研究对RAG生成模型进行微调，也往往缺乏对RAG任务的细粒度关注或更深入地利用思维链过程。为了解决这个问题，我们提出RAG模型应具备三个逐渐递进的层次能力：（1）过滤：选择相关信息的能力；（2）组合：跨段落组合语义信息的能力；和（3）RAG特定推理：使用内部知识进一步处理外部知识的能力。因此，我们介绍了我们的新的RAG指令微调方法，分层思维指令调整检索增强生成（HIRAG）融入了“思考后回答”策略。该方法通过利用多层渐进的思维链，增强了模型的开放书本考试能力。实验表明，HIRAG训练策略显著提高了模型在RGB、PopQA、MuSiQue、HotpotQA和PubmedQA等数据集上的性能。

更新时间: 2025-07-08 06:53:28

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.05714v1

DRAGON: Dynamic RAG Benchmark On News

Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. Although there exist multiple RAG benchmarks for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments. In this work, we present DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian on a changing news corpora. DRAGON is built upon a regularly updated corpus of Russian news and public documents and supports comprehensive evaluation of both the retriever and generator components. Question generation is performed automatically with the use of Knowledge Graph constructed from the corpus and enables the extraction of four core question types aligned with distinct subgraph patterns. We release a complete evaluation framework comprising the pipeline for automatic question generation, evaluation scripts, which are potentially reusable for other languages and multilingual settings, and benchmark data. We also launch a public leaderboard to encourage community participation and comparison.

Updated: 2025-07-08 06:52:43

标题: 龙：新闻动态RAG基准测试

摘要: 检索增强生成（RAG）是一种广泛采用的方法，通过在推断时合并外部知识，以改善大语言模型（LLMs）的真实性。尽管存在多个用于英语的RAG基准，但用于其他语言（包括俄语）的评估资源仍然稀缺且静态，无法捕捉现实世界部署的动态特性。在这项工作中，我们介绍了DRAGON（Dynamic RAG Benchmark On News），这是第一个评估俄语RAG系统的动态基准，基于不断更新的俄语新闻和公共文件语料库构建而成，并支持对检索器和生成器组件的全面评估。问题生成是通过使用从语料库构建的知识图自动执行的，可以提取与不同子图模式对齐的四种核心问题类型。我们发布了一个完整的评估框架，包括自动问题生成的流程、评估脚本，这些脚本可能可用于其他语言和多语言环境，并提供基准数据。我们还启动了一个公开排行榜，以鼓励社区参与和比较。

更新时间: 2025-07-08 06:52:43

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.05713v1

From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis

EEG signals capture brain activity with high temporal and low spatial resolution, supporting applications such as neurological diagnosis, cognitive monitoring, and brain-computer interfaces. However, effective analysis is hindered by limited labeled data, high dimensionality, and the absence of scalable models that fully capture spatiotemporal dependencies. Existing self-supervised learning (SSL) methods often focus on either spatial or temporal features, leading to suboptimal representations. To this end, we propose EEG-VJEPA, a novel adaptation of the Video Joint Embedding Predictive Architecture (V-JEPA) for EEG classification. By treating EEG as video-like sequences, EEG-VJEPA learns semantically meaningful spatiotemporal representations using joint embeddings and adaptive masking. To our knowledge, this is the first work that exploits V-JEPA for EEG classification and explores the visual concepts learned by the model. Evaluations on the publicly available Temple University Hospital (TUH) Abnormal EEG dataset show that EEG-VJEPA outperforms existing state-of-the-art models in classification accuracy.Beyond classification accuracy, EEG-VJEPA captures physiologically relevant spatial and temporal signal patterns, offering interpretable embeddings that may support human-AI collaboration in diagnostic workflows. These findings position EEG-VJEPA as a promising framework for scalable, trustworthy EEG analysis in real-world clinical settings.

Updated: 2025-07-08 06:52:14

标题: 从视频到脑电图：将联合嵌入预测架构调整为揭示脑信号分析中的视觉概念

摘要: 脑电图（EEG）信号捕捉大脑活动，具有高时间分辨率和低空间分辨率，支持神经诊断、认知监测和脑-计算机界面等应用。然而，有效分析受到有限的标记数据、高维度和缺乏完全捕捉时空依赖关系的可扩展模型的阻碍。现有的自监督学习（SSL）方法通常专注于空间或时间特征，导致表示不佳。因此，我们提出了EEG-VJEPA，这是对视频联合嵌入预测体系结构（V-JEPA）的新颖改编，用于脑电图分类。通过将EEG视为类似视频的序列，EEG-VJEPA使用联合嵌入和自适应掩模学习语义上有意义的时空表示。据我们所知，这是首次利用V-JEPA进行脑电图分类并探索模型学习的视觉概念的研究。对公开可用的Temple大学医院（TUH）异常脑电图数据集的评估显示，EEG-VJEPA在分类准确性方面优于现有的最先进模型。除了分类准确性外，EEG-VJEPA捕捉到生理相关的空间和时间信号模式，提供可解释的嵌入，可能支持诊断工作流中人工智能的协作。这些发现将EEG-VJEPA定位为在真实世界临床环境中进行可扩展、可信任的脑电图分析的有前途的框架。

更新时间: 2025-07-08 06:52:14

领域: cs.CV,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.03633v2

Argumentative Characterizations of (Extended) Disjunctive Logic Programs

This paper continues an established line of research about the relations between argumentation theory, particularly assumption-based argumentation, and different kinds of logic programs. In particular, we extend known result of Caminada, Schultz and Toni by showing that assumption-based argumentation can represent not only normal logic programs, but also disjunctive logic programs and their extensions. For this, we consider some inference rules for disjunction that the core logic of the argumentation frameworks should respect, and show the correspondence to the handling of disjunctions in the heads of the logic programs' rules. Under consideration in Theory and Practice of Logic Programming (TPLP).

Updated: 2025-07-08 06:50:02

标题: (扩展)分离逻辑程序的论证性描述Characterizations

摘要: 本文延续了一条关于论证理论，尤其是基于假设的论证与不同类型逻辑程序之间关系的研究线。我们通过扩展Caminada、Schultz和Toni已知结果，表明基于假设的论证不仅可以表示普通逻辑程序，还可以表示析取逻辑程序及其扩展。为此，我们考虑了论证框架核心逻辑应遵循的一些关于析取的推理规则，并展示了与逻辑程序规则头部中的析取处理对应的关系。该研究已被《逻辑编程的理论与实践》（Theory and Practice of Logic Programming，TPLP）考虑。

更新时间: 2025-07-08 06:50:02

领域: cs.AI

下载: http://arxiv.org/abs/2306.07126v2

RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models' search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model's capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.

Updated: 2025-07-08 06:38:26

标题: RAG-R1: 通过多查询并行性激励LLMs的搜索和推理能力

摘要: 大型语言模型(LLMs)在各种任务中展现出了显著的能力，但由于其静态内部知识，它们仍然容易生成幻觉或过时的响应。最近在检索增强生成(RAG)方法方面取得了进展，通过强化学习(RL)探索增强模型的搜索和推理能力。尽管这些方法表现出有希望的结果，但它们面临着训练稳定性的挑战，遇到了诸如大量推理时间和受限能力等问题，这是由于单一查询模式导致的。在本文中，我们提出了RAG-R1，这是一个新颖的训练框架，旨在使LLMs在推理过程中能够自适应地利用内部和外部知识。我们进一步将框架中的生成和检索过程从单一查询模式扩展到多查询并行，旨在减少推理时间并增强模型的能力。对七个问答基准进行的大量实验表明，我们的方法比最强基线表现提高了最多13.2%，并且推理时间减少了11.1%。

更新时间: 2025-07-08 06:38:26

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.02962v2

Agentic-R1: Distilled Dual-Strategy Reasoning

Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at https://github.com/StigLidu/DualDistill

Updated: 2025-07-08 06:35:16

标题: Agentic-R1：精炼的双策略推理

摘要: 目前的长思维链模型在数学推理方面表现出色，但依赖于缓慢且容易出错的自然语言追踪。工具增强型代理通过代码执行解决算术问题，但在复杂逻辑任务上通常表现不佳。我们引入了一个精细调整框架DualDistill，从多个教师中提炼出互补的推理策略，融合到一个统一的学生模型中。使用这种方法，我们训练了Agentic-R1，它动态选择每个查询的最佳策略，调用工具解决算术和算法问题，并对抽象问题使用基于文本的推理。我们的方法提高了在一系列任务中的准确性，包括计算密集型和标准基准测试，展示了多策略提炼在实现稳健和高效推理方面的有效性。我们的项目可在https://github.com/StigLidu/DualDistill找到。

更新时间: 2025-07-08 06:35:16

领域: cs.CL,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.05707v1

MPX: Mixed Precision Training for JAX

Mixed-precision training has emerged as an indispensable tool for enhancing the efficiency of neural network training in recent years. Concurrently, JAX has grown in popularity as a versatile machine learning toolbox. However, it currently lacks robust support for mixed-precision training. We propose MPX, a mixed-precision training toolbox for JAX that simplifies and accelerates the training of large-scale neural networks while preserving model accuracy. MPX seamlessly integrates with popular toolboxes such as Equinox and Flax, allowing users to convert full-precision pipelines to mixed-precision versions with minimal modifications. By casting both inputs and outputs to half precision, and introducing a dynamic loss-scaling mechanism, MPX alleviates issues like gradient underflow and overflow that commonly arise in half precision computations. Its design inherits critical features from JAX's type-promotion behavior, ensuring that operations take place in the correct precision and allowing for selective enforcement of full precision where needed (e.g., sums, means, or softmax). MPX further provides wrappers for automatic creation and management of mixed-precision gradients and optimizers, enabling straightforward integration into existing JAX training pipelines. MPX's source code, documentation, and usage examples are available at github.com/Data-Science-in-Mechanical-Engineering/mixed_precision_for_JAX .

Updated: 2025-07-08 06:28:22

标题: MPX：JAX的混合精度训练

摘要: 混合精度训练近年来已经成为提高神经网络训练效率的不可或缺的工具。与此同时，JAX作为一个多才多艺的机器学习工具包逐渐受到欢迎。然而，目前它对混合精度训练的支持还不够强大。我们提出了MPX，一个专为JAX设计的混合精度训练工具包，简化并加速了大规模神经网络的训练，同时保持模型精度。MPX与流行的工具包（如Equinox和Flax）无缝集成，允许用户将全精度管道转换为混合精度版本，只需进行最少的修改。通过将输入和输出都转换为半精度，并引入动态损失缩放机制，MPX缓解了半精度计算中常见的梯度下溢和上溢等问题。其设计继承了JAX的类型提升行为的关键特性，确保操作在正确的精度下进行，并允许在需要时选择强制全精度（例如，求和、均值或softmax）。MPX还提供了用于自动创建和管理混合精度梯度和优化器的包装器，使其能够轻松集成到现有的JAX训练管道中。MPX的源代码、文档和使用示例可在github.com/Data-Science-in-Mechanical-Engineering/mixed_precision_for_JAX上找到。

更新时间: 2025-07-08 06:28:22

领域: cs.LG

下载: http://arxiv.org/abs/2507.03312v2

A Survey on Transformer Context Extension: Approaches and Evaluation

Large language models (LLMs) based on Transformer have been widely applied in the filed of natural language processing (NLP), demonstrating strong performance, particularly in handling short text tasks. However, when it comes to long context scenarios, the performance of LLMs degrades due to some challenges. To alleviate this phenomenon, there is a number of work proposed recently. In this survey, we first list the challenges of applying pre-trained LLMs to process long contexts. Then systematically review the approaches related to long context and propose our taxonomy categorizing them into four main types: positional encoding, context compression, retrieval augmented, and attention pattern. In addition to the approaches, we focus on the evaluation of long context, organizing relevant data, tasks, and metrics based on existing long context benchmarks. Finally, we summarize unresolved issues in the long context domain and put forward our views on future developments.

Updated: 2025-07-08 06:24:53

标题: 关于Transformer上下文扩展的调查：方法和评估

摘要: 基于Transformer的大型语言模型（LLMs）已被广泛应用于自然语言处理（NLP）领域，表现出强大的性能，特别是在处理短文本任务方面。然而，在长文本情境下，LLMs的性能会因为一些挑战而下降。为了缓解这种现象，最近提出了许多工作。在本调查中，我们首先列出了应用预训练LLMs处理长文本时面临的挑战。然后系统地审查了与长文本相关的方法，并提出了将其分类为四种主要类型的分类法：位置编码、上下文压缩、检索增强和注意力模式。除了方法之外，我们还关注长文本的评估，根据现有的长文本基准组织相关数据、任务和度量标准。最后，我们总结了长文本领域尚未解决的问题，并提出了我们对未来发展的看法。

更新时间: 2025-07-08 06:24:53

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2503.13299v2

Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria

This study provides the first comprehensive evaluation of large language model (LLM) performance across three counseling roles in Japanese-language therapeutic contexts. We simultaneously assessed counselor artificial intelligence (AI) systems (GPT-4-turbo with zeroshot prompting or Structured Multi-step Dialogue Prompts (SMDP), Claude-3-Opus-SMDP), client AI simulations, and evaluation AI systems (o3, Claude-3.7-Sonnet, Gemini-2.5-pro). Human experts (n = 15) with extensive counseling experience evaluated AI-generated dialogues using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Notably, SMDP implementation significantly enhanced counselor AI performance across all MITI global ratings compared with zeroshot prompting, with no significant differences between GPT-SMDP and Opus-SMDP. Evaluation AIs showed comparable performance to human raters for Cultivating Change Talk but systematically overestimated Softening Sustain Talk and the overall quality metrics. Model-specific biases emerged: Gemini emphasized power-sharing, o3 focused on technical proficiency, and Sonnet prioritized emotional expression. Client AI simulations exhibited a limited emotional range and unnaturally high compliance, indicating the need for enhanced realism. These findings establish benchmarks for AI-assisted counseling in non-English contexts and identify critical areas for improvement through advanced prompt engineering, retrieval-augmented generation, and targeted fine-tuning, with important implications for developing culturally sensitive AI mental health tools.

Updated: 2025-07-08 06:16:17

标题: 评估日本的人工智能辅导：通过激励性面谈标准评估辅导员、客户和评估者的角色

摘要: 这项研究首次全面评估了大型语言模型（LLM）在日语语境中的三种辅导角色中的表现。我们同时评估了辅导员人工智能（AI）系统（具有零提示或结构化多步对话提示（SMDP）的GPT-4-turbo，Claude-3-Opus-SMDP）、客户AI模拟和评估AI系统（o3，Claude-3.7-Sonnet，Gemini-2.5-pro）。具有丰富辅导经验的人类专家（n = 15）使用《激励性面谈治疗完整性（MITI）编码手册4.2.1》评估AI生成的对话。值得注意的是，与零提示相比，SMDP的实施显著提高了辅导员AI在所有MITI全局评分中的表现，GPT-SMDP和Opus-SMDP之间没有显著差异。评估AI与人类评分者在培养变革性对话方面表现相当，但在软化维持对话和整体质量指标方面系统性地高估了。出现了特定于模型的偏见：Gemini强调权力共享，o3注重技术熟练度，Sonnet优先考虑情感表达。客户AI模拟展示了有限的情感范围和不自然高的顺从度，表明需要增强现实感。这些发现为非英语环境中的AI辅助辅导建立了基准，并通过先进的提示工程、检索增强生成和有针对性的微调确定了需要改进的关键领域，这对于开发具有重要文化敏感性的AI心理健康工具具有重要意义。

更新时间: 2025-07-08 06:16:17

领域: cs.CL,cs.AI,cs.HC,68T50,I.2.7; H.5.2; J.4

下载: http://arxiv.org/abs/2507.02950v2

A COMPASS to Model Comparison and Simulation-Based Inference in Galactic Chemical Evolution

We present COMPASS, a novel simulation-based inference framework that combines score-based diffusion models with transformer architectures to jointly perform parameter estimation and Bayesian model comparison across competing Galactic Chemical Evolution (GCE) models. COMPASS handles high-dimensional, incomplete, and variable-size stellar abundance datasets. Applied to high-precision elemental abundance measurements, COMPASS evaluates 40 combinations of nucleosynthetic yield tables. The model strongly favours Asymptotic Giant Branch yields from NuGrid and core-collapse SN yields used in the IllustrisTNG simulation, achieving near-unity cumulative posterior probability. Using the preferred model, we infer a steep high-mass IMF slope and an elevated Supernova Ia normalization, consistent with prior solar neighbourhood studies but now derived from fully amortized Bayesian inference. Our results demonstrate that modern SBI methods can robustly constrain uncertain physics in astrophysical simulators and enable principled model selection when analysing complex, simulation-based data.

Updated: 2025-07-08 06:11:39

标题: 一个指南：在银河化学演化中进行模型比较和基于模拟的推断

摘要: 我们提出了COMPASS，这是一个新颖的基于模拟的推断框架，它将基于分数的扩散模型与变压器架构结合起来，共同进行参数估计和在竞争的银河化学演化（GCE）模型之间进行贝叶斯模型比较。COMPASS处理高维、不完整和可变大小的恒星丰度数据集。应用于高精度元素丰度测量，COMPASS评估40种核合成产额表的组合。该模型强烈偏好NuGrid的渐近巨星分支产额和IllustrisTNG模拟中使用的核心坍缩SN产额，实现了接近单位的后验概率。使用首选模型，我们推断出一个陡峭的高质量IMF斜率和一个提升的超新星Ia标准化，与先前太阳邻域研究一致，但现在是从完全摊销的贝叶斯推断中推导出来的。我们的结果表明，现代的SBI方法可以稳健地约束天体物理模拟器中的不确定物理，并在分析复杂的基于模拟的数据时实现原则性模型选择。

更新时间: 2025-07-08 06:11:39

领域: astro-ph.GA,astro-ph.IM,cs.LG,physics.comp-ph,physics.data-an

下载: http://arxiv.org/abs/2507.05060v2

Horus: A Protocol for Trustless Delegation Under Uncertainty

Correctness is an emergent property of systems where exposing error is cheaper than committing it. In dynamic, low-trust environments, autonomous AI agents benefit from delegating work to sub-agents, yet correctness cannot be assured through upfront specification or centralized oversight. We propose a protocol that enforces correctness through collateralized claims in a recursive verification game. Tasks are published as intents, and solvers compete to fulfill them. Selected solvers carry out tasks under risk, with correctness checked post hoc by verifiers. Any challenger can challenge a result by staking against it to trigger the verification process. Incorrect agents are slashed and correct opposition is rewarded, with an escalation path that penalizes erroneous verifiers themselves. When incentives are aligned across solvers, challengers, and verifiers, falsification conditions make correctness the Nash equilibrium.

Updated: 2025-07-08 06:04:17

标题: 荷鲁斯：一种在不确定性下无需信任的委托协议

摘要: 正确性是系统的一种新兴属性，其中暴露错误比犯错更便宜。在动态、低信任环境中，自主人工智能代理从将工作委托给子代理中受益，然而正确性无法通过事先规范或集中监督来保证。我们提出了一种通过递归验证游戏中的抵押索赔来强制执行正确性的协议。任务被发布为意图，求解者竞争来完成它们。被选中的求解者在风险下执行任务，由验证者事后检查正确性。任何挑战者都可以挑战结果，通过抵押来触发验证过程。不正确的代理被减持，正确的对立被奖励，具有惩罚错误验证者本身的升级路径。当求解者、挑战者和验证者之间的激励一致时，虚假化条件使正确性成为纳什均衡。

更新时间: 2025-07-08 06:04:17

领域: cs.GT,cs.AI,cs.MA,I.2.11; F.2.2

下载: http://arxiv.org/abs/2507.00631v5

AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs

Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific languages like Triton simplify GPU programming by abstracting low-level details, developers must still manually tune critical parameters such as tile sizes and memory access patterns through iterative experimentation, creating substantial barriers to optimal performance and wider adoption. In this work, we introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton performs supervised fine-tuning (SFT) to be equipped with essential Triton programming expertise using a high-quality data gathering pipeline, and conducts RL with Group Relative Policy Optimization (GRPO) algorithm, combining a rule-based reward and an execution-based reward to further improve Triton programming ability, sequentially. Experiments across five evaluation channels of TritonBench and KernelBench illustrate that our 8B model AutoTriton achieves performance comparable to mainstream large models, including Claude-4-Sonnet and DeepSeek-R1-0528. Further experimental analysis demonstrates the crucial role of each module within AutoTriton, including the SFT stage, the RL stage, and the reward design strategy. These findings underscore the promise of RL for automatically generating high-performance kernels, and since high-performance kernels are core components of AI systems, this breakthrough establishes an important foundation for building more efficient AI systems. The model and code will be available at https://github.com/AI9Stars/AutoTriton.

Updated: 2025-07-08 05:38:24

标题: AutoTriton：在LLMs中利用强化学习实现自动Triton编程

摘要: 在深度学习中的内核开发需要在硬件上优化计算单元，同时平衡内存管理、并行性和硬件特定的优化，通过广泛的经验调整。尽管像 Triton 这样的领域特定语言通过抽象低级细节简化了 GPU 编程，开发人员仍然必须通过迭代实验手动调整关键参数，如瓷砖大小和内存访问模式，从而为最佳性能和更广泛的采用创建了实质性障碍。在这项工作中，我们介绍了 AutoTriton，这是第一个由强化学习（RL）驱动的 Triton 编程专用模型。AutoTriton执行监督微调（SFT），通过高质量的数据收集管道获得基本的 Triton 编程专业知识，并使用 Group Relative Policy Optimization（GRPO）算法进行 RL，结合基于规则的奖励和基于执行的奖励，进一步提高 Triton 编程能力，依次进行。TritonBench 和 KernelBench 的五个评估通道的实验表明，我们的 8B 模型 AutoTriton 实现了与主流大型模型（包括 Claude-4-Sonnet 和 DeepSeek-R1-0528）可比的性能。进一步的实验分析显示了 AutoTriton 中每个模块的关键作用，包括 SFT 阶段、RL 阶段和奖励设计策略。这些发现强调了 RL 为自动生成高性能内核提供了希望，而高性能内核是 AI 系统的核心组件，这一突破为构建更高效的 AI 系统奠定了重要基础。该模型和代码将在 https://github.com/AI9Stars/AutoTriton 上提供。

更新时间: 2025-07-08 05:38:24

领域: cs.LG,cs.CL

下载: http://arxiv.org/abs/2507.05687v1

Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG

Effectively utilizing extensive unlabeled high-density EEG data to improve performance in scenarios with limited labeled low-density EEG data presents a significant challenge. In this paper, we address this challenge by formulating it as a graph transfer learning and knowledge distillation problem. We propose a Unified Pre-trained Graph Contrastive Masked Autoencoder Distiller, named EEG-DisGCMAE, to bridge the gap between unlabeled and labeled as well as high- and low-density EEG data. Our approach introduces a novel unified graph self-supervised pre-training paradigm, which seamlessly integrates the graph contrastive pre-training with the graph masked autoencoder pre-training. Furthermore, we propose a graph topology distillation loss function, allowing a lightweight student model trained on low-density data to learn from a teacher model trained on high-density data during pre-training and fine-tuning. This method effectively handles missing electrodes through contrastive distillation. We validate the effectiveness of EEG-DisGCMAE across four classification tasks using two clinical EEG datasets with abundant data. The source code is available at https://github.com/weixinxu666/EEG_DisGCMAE.

Updated: 2025-07-08 05:37:50

标题: 预训练图对比掩码自编码器是 EEG 的强大蒸馏器

摘要: 有效利用大量未标记的高密度脑电图数据，以提高在只有有限标记的低密度脑电图数据场景中的性能，是一个重要的挑战。本文将这一挑战作为一个图传递学习和知识蒸馏问题来解决。我们提出了一种统一的预训练图对比掩码自编码器蒸馏器，命名为EEG-DisGCMAE，以弥合未标记和标记以及高密度和低密度脑电图数据之间的差距。我们的方法引入了一种新颖的统一图自监督预训练范式，无缝地将图对比预训练与图掩码自编码器预训练相结合。此外，我们提出了一个图拓扑知识蒸馏损失函数，允许在低密度数据上训练的轻量级学生模型在预训练和微调过程中从在高密度数据上训练的教师模型中学习。这种方法通过对比蒸馏有效处理了缺失电极。我们通过使用两个丰富数据的临床脑电图数据集上的四个分类任务验证了EEG-DisGCMAE的有效性。源代码可在https://github.com/weixinxu666/EEG_DisGCMAE 获取。

更新时间: 2025-07-08 05:37:50

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2411.19230v2

Training Set Reconstruction from Differentially Private Forests: How Effective is DP?

Recent research has shown that machine learning models are vulnerable to privacy attacks targeting their training data. To mitigate these risks, differential privacy (DP) has become a widely adopted countermeasure, as it offers rigorous privacy protection. In this paper, we introduce a reconstruction attack targeting state-of-the-art $\varepsilon$-DP random forests. By leveraging a constraint programming model that incorporates knowledge of the forest's structure and DP mechanism characteristics, our approach formally reconstructs the most likely dataset that could have produced a given forest. Through extensive computational experiments, we examine the interplay between model utility, privacy guarantees and reconstruction accuracy across various configurations. Our results reveal that random forests trained with meaningful DP guarantees can still leak portions of their training data. Specifically, while DP reduces the success of reconstruction attacks, the only forests fully robust to our attack exhibit predictive performance no better than a constant classifier. Building on these insights, we provide practical recommendations for the construction of DP random forests that are more resilient to reconstruction attacks and maintain non-trivial predictive performance.

Updated: 2025-07-08 05:32:30

标题: 隐私森林中的训练集重建：差分隐私的有效性如何？

摘要: 最近的研究表明，机器学习模型容易受到针对其训练数据的隐私攻击。为了减轻这些风险，差分隐私（DP）已经成为一个广泛采用的对策，因为它提供了严格的隐私保护。在本文中，我们介绍了一种针对最先进的$\varepsilon$-DP随机森林的重构攻击。通过利用一个融合了森林结构和DP机制特征知识的约束规划模型，我们的方法正式重建了可能生成给定森林的数据集。通过广泛的计算实验，我们研究了模型效用、隐私保证和重构准确性在各种配置下的相互作用。我们的结果表明，训练具有有意义DP保证的随机森林仍然可能泄漏部分训练数据。具体而言，虽然DP降低了重构攻击的成功率，但唯一完全抵抗我们攻击的森林显示的预测表现不比一个常数分类器好。基于这些见解，我们提供了建立更具抗重构攻击能力并保持非平凡预测性能的DP随机森林的实用建议。

更新时间: 2025-07-08 05:32:30

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2502.05307v2

Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach

The integration of Federated Learning (FL) and Mixture-of-Experts (MoE) presents a compelling pathway for training more powerful, large-scale artificial intelligence models (LAMs) on decentralized data while preserving privacy. However, efficient federated training of these complex MoE-structured LAMs is hindered by significant system-level challenges, particularly in managing the interplay between heterogeneous client resources and the sophisticated coordination required for numerous specialized experts. This article highlights a critical, yet underexplored concept: the absence of robust quantitative strategies for dynamic client-expert alignment that holistically considers varying client capacities and the imperative for system-wise load balancing. Specifically, we propose a conceptual system design for intelligent client-expert alignment that incorporates dynamic fitness scoring, global expert load monitoring, and client capacity profiling. By tackling these systemic issues, we can unlock more scalable, efficient, and robust training mechanisms {with fewer communication rounds for convergence}, paving the way for the widespread deployment of large-scale federated MoE-structured LAMs in edge computing with ultra-high communication efficiency.

Updated: 2025-07-08 05:30:37

标题: 大规模AI模型的高效训练通过联邦专家混合：一种系统级方法

摘要: 联邦学习（FL）和专家混合（MoE）的整合为在分散数据上训练更强大、大规模人工智能模型（LAMs）提供了一条引人注目的途径，同时保护隐私。然而，有效地联合训练这些复杂的MoE结构LAMs受到重大系统级挑战的阻碍，特别是在管理异构客户资源之间的相互作用以及为众多专业专家所需的复杂协调方面。本文强调了一个关键但未被充分探讨的概念：缺乏针对动态客户-专家对齐的稳健定量策略，全面考虑不同客户容量和系统范围负载平衡的必要性。具体而言，我们提出了一个智能客户-专家对齐的概念系统设计，包括动态适应性评分、全局专家负载监控和客户容量分析。通过解决这些系统性问题，我们可以释放出更具可扩展性、高效性和鲁棒性的训练机制，为在边缘计算中具有超高通信效率的大规模联合MoE结构LAMs的广泛部署铺平道路。

更新时间: 2025-07-08 05:30:37

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.05685v1

Polyadic encryption

A novel original procedure of encryption/decryption based on the polyadic algebraic structures and on signal processing methods is proposed. First, we use signals with integer amplitudes to send information. Then we use polyadic techniques to transfer the plaintext into series of special integers. The receiver restores the plaintext using special rules and systems of equations.

Updated: 2025-07-08 05:26:24

标题: 多元加密

摘要: 提出了一种基于多项代数结构和信号处理方法的加密/解密的新颖原始程序。首先，我们使用具有整数幅度的信号来发送信息。然后我们使用多项技术将明文转换为一系列特殊整数。接收者使用特殊规则和方程组恢复明文。

更新时间: 2025-07-08 05:26:24

领域: cs.CR,cs.IT,math.IT,math.RA

下载: http://arxiv.org/abs/2507.05683v1

LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization

Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA(Low-Resources Subnet Integration Adaptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about $27\%$ compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training.

Updated: 2025-07-08 05:22:57

标题: LoSiA: 通过子网络定位和优化实现高效的高秩微调

摘要: Parameter-Efficient Fine-Tuning (PEFT)方法，例如LoRA，通过引入低秩分解矩阵，显著减少可训练参数的数量。然而，现有方法在领域专业化任务中执行大量矩阵乘法运算，导致计算效率低下和次优的微调性能。因此，我们提出了LoSiA（Low-Resources Subnet Integration Adaptation），这是一种创新方法，可以在训练过程中动态定位和优化关键参数。具体来说，它通过梯度稀疏性分析识别一个子网络，并将其优化为可训练目标。这种设计通过仅更新子网络参数来实现有效的高秩适应性，减少了额外的矩阵乘法。我们还提出了LoSiA-Pro，这是LoSiA的更快实现，与LoRA相比，可以将训练延迟降低约27％。广泛的评估表明，我们的方法在领域专业化和常识推理任务中实现了与完全微调相比的最小性能下降，同时需要最少的训练时间。进一步的分析表明，LoSiA还减少了持续训练期间的遗忘。

更新时间: 2025-07-08 05:22:57

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.04487v2

GATMesh: Clock Mesh Timing Analysis using Graph Neural Networks

Clock meshes are essential in high-performance VLSI systems for minimizing skew and handling PVT variations, but analyzing them is difficult due to reconvergent paths, multi-source driving, and input mesh buffer skew. SPICE simulations are accurate but slow; yet simplified models miss key effects like slew and input skew. We propose GATMesh, a Graph Neural Network (GNN)-based framework that models the clock mesh as a graph with augmented structural and physical features. Trained on SPICE data, GATMesh achieves high accuracy with average delay error of 5.27ps on unseen benchmarks, while achieving speed-ups of 47146x over multi-threaded SPICE simulation.

Updated: 2025-07-08 05:18:42

标题: GATMesh: 使用图神经网络进行时钟网格定时分析

摘要: 时钟网格在高性能VLSI系统中是必不可少的，用于最小化偏移和处理PVT变化，但由于再聚合路径、多源驱动和输入网格缓冲器偏移，分析它们是困难的。SPICE模拟精确但速度慢；然而简化模型会忽略关键影响，如过渡和输入偏移。我们提出了GATMesh，一个基于图神经网络（GNN）的框架，将时钟网格建模为一个具有增强结构和物理特征的图。在SPICE数据上训练后，GATMesh在未知基准测试中达到了高准确性，平均延迟误差为5.27ps，同时在多线程SPICE模拟上实现了47146倍的加速。

更新时间: 2025-07-08 05:18:42

领域: cs.AR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.05681v1

Speeding up Speculative Decoding via Sequential Approximate Verification

Speculative Decoding (SD) is a recently proposed technique for faster inference using Large Language Models (LLMs). SD operates by using a smaller draft LLM for autoregressively generating a sequence of tokens and a larger target LLM for parallel verification to ensure statistical consistency. However, periodic parallel calls to the target LLM for verification prevent SD from achieving even lower latencies. We propose SPRINTER, which utilizes a low-complexity verifier trained to predict if tokens generated from a draft LLM would be accepted by the target LLM. By performing sequential approximate verification, SPRINTER does not require verification by the target LLM and is only invoked when a token is deemed unacceptable. This reduces the number of calls to the larger LLM, achieving further speedups and lower computation cost. We present a theoretical analysis of SPRINTER, examining the statistical properties of the generated tokens, as well as the expected reduction in latency as a function of the verifier. We evaluate SPRINTER on several datasets and model pairs, demonstrating that approximate verification can still maintain high quality generation while further reducing latency.

Updated: 2025-07-08 05:05:28

标题: 通过顺序近似验证加速推测解码

摘要: Speculative Decoding (SD) 是一种最近提出的技术，用于利用大型语言模型（LLMs）进行更快的推断。SD通过使用一个较小的草稿LLM自回归地生成一系列标记，并使用一个较大的目标LLM进行并行验证，以确保统计一致性。然而，定期并行调用目标LLM进行验证阻止了SD实现更低的延迟。我们提出了SPRINTER，它利用一个训练有素的低复杂度验证器，用于预测从草稿LLM生成的标记是否会被目标LLM接受。通过执行顺序近似验证，SPRINTER不需要目标LLM进行验证，并且仅在标记被视为不可接受时调用。这减少了对较大LLM的调用次数，实现了进一步的加速和降低计算成本。我们对SPRINTER进行了理论分析，考察了生成标记的统计特性，以及预期的延迟减少作为验证器的函数。我们在几个数据集和模型对上评估了SPRINTER，并展示了近似验证仍然可以保持高质量的生成，同时进一步降低延迟。

更新时间: 2025-07-08 05:05:28

领域: cs.LG,cs.IT,math.IT

下载: http://arxiv.org/abs/2502.04557v3

BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, spanning diverse formats-multiple-choice, fill-in-the-blank, and open-ended QA-and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval that comprises 20,458 high-quality instances to comprehensively assess LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train that contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.

Updated: 2025-07-08 05:05:04

标题: BMMR：一个大规模的双语多模态多学科推理数据集

摘要: 在这篇论文中，我们介绍了BMMR，这是一个大规模的双语、多模态、多学科推理数据集，供社区开发和评估大型多模态模型（LMMs）使用。BMMR包括11万个涵盖300个联合国教科文组织定义学科的大学水平问题，涵盖多种格式-选择题、填空题和开放式问答-并且数据来源于印刷和数字媒体，如书籍、考试和测验。所有数据都经过人工筛选和过滤，每个实例都配有高质量的推理路径。数据集分为两部分：BMMR-Eval包括20,458个高质量实例，全面评估LMMs在中英文多学科知识和推理能力；BMMR-Train包含88,991个实例，支持进一步研究和发展，将当前对数学推理的关注扩展到各种学科和领域。此外，我们提出了基于过程的多学科验证器（即BMMR-Verifier），用于准确和细致地评估推理路径。对24个模型进行广泛实验表明：（i）即使是SOTA模型（如o3和Gemini-2.5-Pro）在BMMR-Eval上还存在很大的改进空间；（ii）推理模型表现出学科偏见，并且仅在特定学科上表现优于LMMs；（iii）开源模型仍落后于专有模型；（iv）在BMMR-Train上微调可以缩小这一差距。此外，我们使用BMMR-Verifier和其他深入研究进行推理链分析，揭示了当前多学科推理面临的挑战。我们将发布数据，希望我们的工作可以为社区提供见解和贡献。

更新时间: 2025-07-08 05:05:04

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.03483v2

From Counterfactuals to Trees: Competitive Analysis of Model Extraction Attacks

The advent of Machine Learning as a Service (MLaaS) has heightened the trade-off between model explainability and security. In particular, explainability techniques, such as counterfactual explanations, inadvertently increase the risk of model extraction attacks, enabling unauthorized replication of proprietary models. In this paper, we formalize and characterize the risks and inherent complexity of model reconstruction, focusing on the "oracle'' queries required for faithfully inferring the underlying prediction function. We present the first formal analysis of model extraction attacks through the lens of competitive analysis, establishing a foundational framework to evaluate their efficiency. Focusing on models based on additive decision trees (e.g., decision trees, gradient boosting, and random forests), we introduce novel reconstruction algorithms that achieve provably perfect fidelity while demonstrating strong anytime performance. Our framework provides theoretical bounds on the query complexity for extracting tree-based model, offering new insights into the security vulnerabilities of their deployment.

Updated: 2025-07-08 05:03:12

标题: 从反事实到树：模型提取攻击的竞争性分析

摘要: 机器学习即服务（MLaaS）的出现加剧了模型可解释性和安全性之间的权衡。特别是，诸如对抗性解释等可解释性技术无意中增加了模型提取攻击的风险，使未经授权的复制专有模型成为可能。在本文中，我们形式化和表征了模型重建的风险和固有复杂性，重点关注“神谕”查询，这些查询需要忠实地推断出底层预测函数。我们通过竞争分析的视角首次对模型提取攻击进行了形式化分析，建立了一个评估其效率的基础框架。我们重点关注基于加法决策树的模型（例如决策树、梯度提升和随机森林），引入了新颖的重建算法，能够在保证完美忠实性的同时展现出强大的实时表现。我们的框架为提取基于树的模型的查询复杂度提供了理论界限，为它们的部署安全漏洞提供了新的洞见。

更新时间: 2025-07-08 05:03:12

领域: cs.LG,cs.CR

下载: http://arxiv.org/abs/2502.05325v2

MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos

Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data is available at https://github.com/FreedomIntelligence/MedGen

Updated: 2025-07-08 04:58:36

标题: MedGen：通过精细标注的医学视频进行扩展解锁医学视频生成

摘要: 近年来，视频生成方面取得了显著进展，尤其是在开放领域，然而医学视频生成仍然很大程度上未被探索。医学视频对于临床培训、教育和模拟等应用至关重要，不仅需要高视觉保真度，还需要严格的医学准确性。然而，当前的模型在应用于医学提示时经常产生不真实或错误的内容，这主要是由于缺乏针对医学领域定制的大规模、高质量的数据集。为了填补这一空白，我们介绍了MedVideoCap-55K，这是第一个用于医学视频生成的大规模、多样化和充满标题的数据集。它包括超过55,000个经筛选的剪辑，涵盖了真实世界的医学场景，为训练通用医学视频生成模型提供了坚实的基础。基于这个数据集，我们开发了MedGen，它在开源模型中取得了领先的性能，在视觉质量和医学准确性方面与商业系统在多个基准测试中相媲美。我们希望我们的数据集和模型可以成为有价值的资源，并促进医学视频生成领域的进一步研究。我们的代码和数据可在https://github.com/FreedomIntelligence/MedGen 上获得。

更新时间: 2025-07-08 04:58:36

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.05675v1

Canine Clinical Gait Analysis for Orthopedic and Neurological Disorders: An Inertial Deep-Learning Approach

Canine gait analysis using wearable inertial sensors is gaining attention in veterinary clinical settings, as it provides valuable insights into a range of mobility impairments. Neurological and orthopedic conditions cannot always be easily distinguished even by experienced clinicians. The current study explored and developed a deep learning approach using inertial sensor readings to assess whether neurological and orthopedic gait could facilitate gait analysis. Our investigation focused on optimizing both performance and generalizability in distinguishing between these gait abnormalities. Variations in sensor configurations, assessment protocols, and enhancements to deep learning model architectures were further suggested. Using a dataset of 29 dogs, our proposed approach achieved 96% accuracy in the multiclass classification task (healthy/orthopedic/neurological) and 82% accuracy in the binary classification task (healthy/non-healthy) when generalizing to unseen dogs. Our results demonstrate the potential of inertial-based deep learning models to serve as a practical and objective diagnostic and clinical aid to differentiate gait assessment in orthopedic and neurological conditions.

Updated: 2025-07-08 04:54:16

标题: 《用于骨科和神经疾病的犬类临床步态分析：一种惯性深度学习方法》

摘要: 穿戴惯性传感器进行犬的步态分析在兽医临床设置中越来越受到关注，因为它为一系列行动障碍提供了宝贵的见解。即使经验丰富的临床医生也不能总是轻松区分神经和骨科疾病。本研究探讨并开发了一种利用惯性传感器读数进行深度学习的方法，以评估神经和骨科步态是否有助于步态分析。我们的研究重点是优化性能和泛化能力，以区分这些步态异常。进一步建议变化传感器配置、评估协议和深度学习模型结构的增强。利用29只狗的数据集，我们提出的方法在多类别分类任务（健康/骨科/神经）中实现了96%的准确率，在二元分类任务（健康/非健康）中实现了82%的准确率，泛化至未见过的狗。我们的结果表明，基于惯性的深度学习模型有潜力成为一种实用和客观的诊断和临床辅助工具，用于区分骨科和神经疾病的步态评估。

更新时间: 2025-07-08 04:54:16

领域: cs.LG

下载: http://arxiv.org/abs/2507.05671v1

A Novel APVD Steganography Technique Incorporating Pseudorandom Pixel Selection for Robust Image Security

Steganography is the process of embedding secret information discreetly within a carrier, ensuring secure exchange of confidential data. The Adaptive Pixel Value Differencing (APVD) steganography method, while effective, encounters certain challenges like the "unused blocks" issue. This problem can cause a decrease in security, compromise the embedding capacity, and lead to lower visual quality. This research presents a novel steganographic strategy that integrates APVD with pseudorandom pixel selection to effectively mitigate these issues. The results indicate that the new method outperforms existing techniques in aspects of security, data hiding capacity, and the preservation of image quality. Empirical results reveal that the combination of APVD with pseudorandom pixel selection significantly enhances key image quality metrics such as Peak Signal-to-Noise Ratio (PSNR), Universal Image Quality Index (UIQ), and Structural Similarity Index (SSIM), surpassing other contemporary methods in performance. The newly proposed method is versatile, able to handle a variety of cover and secret images in both color and grayscale, thereby ensuring secure data transmission without compromising the aesthetic quality of the image.

Updated: 2025-07-08 04:54:06

标题: 一种新颖的APVD隐写术技术，结合伪随机像素选择，用于强健的图像安全

摘要: 隐写术是将秘密信息隐蔽地嵌入到载体中，确保机密数据的安全交换的过程。自适应像素值差分（APVD）隐写术方法虽然有效，但面临着诸如“未使用块”问题等挑战。这个问题可能导致安全性降低，嵌入容量受损，以及视觉质量降低。本研究提出了一种新颖的隐写策略，将APVD与伪随机像素选择相结合，有效缓解了这些问题。结果表明，这种新方法在安全性、数据隐藏容量和图像质量保护等方面优于现有技术。经验结果显示，APVD与伪随机像素选择的组合显著提升了关键的图像质量指标，如峰值信噪比（PSNR）、通用图像质量指数（UIQ）和结构相似性指数（SSIM），在性能上超越了其他当代方法。这种新提出的方法多功能性强，能够处理各种彩色和灰度的封面和秘密图像，从而确保在不损害图像审美质量的情况下进行安全的数据传输。

更新时间: 2025-07-08 04:54:06

领域: cs.CR,cs.CV,cs.MM,eess.IV,68Q80,I.4.2

下载: http://arxiv.org/abs/2507.13367v1

TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data

Recent advances in foundation models, such as LLMs, have revolutionized conversational AI. Chatbots are increasingly being developed by customizing LLMs on specific conversational datasets. However, mitigating toxicity during this customization, especially when dealing with untrusted training data, remains a significant challenge. To address this, we introduce TuneShield, a defense framework designed to mitigate toxicity during chatbot fine-tuning while preserving conversational quality. TuneShield leverages LLM-based toxicity classification, utilizing the instruction-following capabilities and safety alignment of LLMs to effectively identify toxic samples, outperforming industry API services. TuneShield generates synthetic conversation samples, termed 'healing data', based on the identified toxic samples, using them to mitigate toxicity while reinforcing desirable behavior during fine-tuning. It performs an alignment process to further nudge the chatbot towards producing desired responses. Our findings show that TuneShield effectively mitigates toxicity injection attacks while preserving conversational quality, even when the toxicity classifiers are imperfect or biased. TuneShield proves to be resilient against adaptive adversarial and jailbreak attacks. Additionally, TuneShield demonstrates effectiveness in mitigating adaptive toxicity injection attacks during dialog-based learning (DBL).

Updated: 2025-07-08 04:40:09

标题: TuneShield：在不受信任的数据上微调对话AI时减轻毒性

摘要: 最近基础模型的进展，如LLMs，已经彻底改变了对话式人工智能。聊天机器人越来越多地通过定制LLMs来开发，根据特定的对话数据集。然而，在这种定制过程中减轻毒性，特别是在处理不受信任的训练数据时，仍然是一个重大挑战。为了解决这个问题，我们引入了TuneShield，一个旨在在聊天机器人微调过程中减轻毒性的防御框架，同时保持对话质量。 TuneShield利用基于LLM的毒性分类，利用LLMs的遵循指令能力和安全对齐，有效地识别有毒样本，优于行业API服务。 TuneShield生成合成对话样本，称为“治疗数据”，基于识别的有毒样本，利用它们在微调过程中减轻毒性同时强化可取行为。它执行对齐过程，进一步推动聊天机器人产生期望的响应。我们的研究结果表明，TuneShield在保留对话质量的同时有效地减轻了毒性注入攻击，即使毒性分类器不完美或存在偏见。TuneShield证明了对抗性适应和越狱攻击的弹性。此外，TuneShield在缓解基于对话的学习(DBL)期间的自适应毒性注入攻击方面表现出有效性。

更新时间: 2025-07-08 04:40:09

领域: cs.CR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2507.05660v1

HRRRCast: a data-driven emulator for regional weather forecasting at convection allowing scales

The High-Resolution Rapid Refresh (HRRR) model is a convection-allowing model used in operational weather forecasting across the contiguous United States (CONUS). To provide a computationally efficient alternative, we introduce HRRRCast, a data-driven emulator built with advanced machine learning techniques. HRRRCast includes two architectures: a ResNet-based model (ResHRRR) and a Graph Neural Network-based model (GraphHRRR). ResHRRR uses convolutional neural networks enhanced with squeeze-and-excitation blocks and Feature-wise Linear Modulation, and supports probabilistic forecasting via the Denoising Diffusion Implicit Model (DDIM). To better handle longer lead times, we train a single model to predict multiple lead times (1h, 3h, and 6h), then use a greedy rollout strategy during inference. When evaluated on composite reflectivity over the full CONUS domain using ensembles of 3 to 10 members, ResHRRR outperforms HRRR forecast at light rainfall threshold (20 dBZ) and achieves competitive performance at moderate thresholds (30 dBZ). Our work advances the StormCast model of Pathak et al. [21] by: a) training on the full CONUS domain, b) using multiple lead times to improve long-range skill, c) training on analysis data instead of the +1h post-analysis data inadvertently used in StormCast, and d) incorporating future GFS states as inputs, enabling downscaling that improves long-lead accuracy. Grid-, neighborhood-, and object-based metrics confirm better storm placement, lower frequency bias, and higher success ratios than HRRR. HRRRCast ensemble forecasts also maintain sharper spatial detail, with power spectra more closely matching HRRR analysis. While GraphHRRR underperforms in its current form, it lays groundwork for future graph-based forecasting. HRRRCast represents a step toward efficient, data-driven regional weather prediction with competitive accuracy and ensemble capability.

Updated: 2025-07-08 04:26:47

标题: HRRRCast：一个基于数据驱动的模拟器，用于区域天气预报，覆盖对流允许尺度

摘要: 高分辨率快速刷新（HRRR）模型是一种在美国连续48个州（CONUS）范围内用于操作性气象预报的对流允许模型。为提供计算效率高的替代方案，我们引入了HRRRCast，这是一个采用先进机器学习技术构建的数据驱动模拟器。HRRRCast包括两种架构：基于ResNet的模型（ResHRRR）和基于图神经网络的模型（GraphHRRR）。ResHRRR采用卷积神经网络，增强了压缩和激励块以及特征逐层线性调制，通过去噪扩散隐式模型（DDIM）支持概率预测。为了更好地处理更长的提前时间，我们训练一个单一模型来预测多个提前时间（1小时，3小时和6小时），然后在推理过程中使用贪婪展开策略。在使用3到10个成员的合成反射率对全CONUS范围进行评估时，ResHRRR在轻雨阈值（20 dBZ）上优于HRRR预报，并在中等阈值（30 dBZ）上实现竞争性表现。我们的工作推进了Pathak等人的StormCast模型[21]，具体体现在：a）在全CONUS范围内进行训练，b）使用多个提前时间以提高长期技能，c）在分析数据上进行训练，而不是意外使用StormCast中的+1h后分析数据，d）将未来GFS状态作为输入，实现使长期准确性提高的下采样。基于网格、邻域和对象的指标证实了比HRRR更好的风暴位置、更低的频率偏差和更高的成功率。HRRRCast集合预测还保持更锐利的空间细节，功率谱与HRRR分析更为接近。虽然GraphHRRR在当前形式下表现不佳，但它为未来基于图的预测奠定了基础。HRRRCast代表了朝着高效、数据驱动的区域天气预测，具有竞争性准确性和集合能力的一步。

更新时间: 2025-07-08 04:26:47

领域: physics.ao-ph,cs.LG

下载: http://arxiv.org/abs/2507.05658v1

KAN-AD: Time Series Anomaly Detection with Kolmogorov-Arnold Networks

Time series anomaly detection (TSAD) underpins real-time monitoring in cloud services and web systems, allowing rapid identification of anomalies to prevent costly failures. Most TSAD methods driven by forecasting models tend to overfit by emphasizing minor fluctuations. Our analysis reveals that effective TSAD should focus on modeling "normal" behavior through smooth local patterns. To achieve this, we reformulate time series modeling as approximating the series with smooth univariate functions. The local smoothness of each univariate function ensures that the fitted time series remains resilient against local disturbances. However, a direct KAN implementation proves susceptible to these disturbances due to the inherently localized characteristics of B-spline functions. We thus propose KAN-AD, replacing B-splines with truncated Fourier expansions and introducing a novel lightweight learning mechanism that emphasizes global patterns while staying robust to local disturbances. On four popular TSAD benchmarks, KAN-AD achieves an average 15% improvement in detection accuracy (with peaks exceeding 27%) over state-of-the-art baselines. Remarkably, it requires fewer than 1,000 trainable parameters, resulting in a 50% faster inference speed compared to the original KAN, demonstrating the approach's efficiency and practical viability.

Updated: 2025-07-08 04:25:33

标题: KAN-AD：使用Kolmogorov-Arnold网络进行时间序列异常检测

摘要: 时间序列异常检测（TSAD）是云服务和网络系统实时监测的基础，允许快速识别异常以防止昂贵的故障。大多数由预测模型驱动的TSAD方法往往过度拟合，强调较小的波动。我们的分析表明，有效的TSAD应该专注于通过平滑的局部模式建模“正常”行为。为了实现这一点，我们重新构建了时间序列建模，将序列近似为平滑的一元函数。每个一元函数的局部平滑性确保拟合的时间序列对局部干扰保持弹性。然而，直接的KAN实现由于B样条函数固有的局部特性而容易受到这些干扰的影响。因此，我们提出了KAN-AD，将B样条替换为截断的傅立叶展开，并引入了一种强调全局模式同时对局部干扰保持稳健的新颖轻量级学习机制。在四个流行的TSAD基准测试中，KAN-AD相对于最先进的基准线实现了平均15%的检测准确度提高（峰值超过27%）。值得注意的是，它只需要少于1,000个可训练参数，与原始的KAN相比，推断速度快了50%，展示了这种方法的效率和实际可行性。

更新时间: 2025-07-08 04:25:33

领域: cs.LG

下载: http://arxiv.org/abs/2411.00278v3

AgentSafe: Safeguarding Large Language Model-based Multi-agent Systems via Hierarchical Data Management

Large Language Model based multi-agent systems are revolutionizing autonomous communication and collaboration, yet they remain vulnerable to security threats like unauthorized access and data breaches. To address this, we introduce AgentSafe, a novel framework that enhances MAS security through hierarchical information management and memory protection. AgentSafe classifies information by security levels, restricting sensitive data access to authorized agents. AgentSafe incorporates two components: ThreatSieve, which secures communication by verifying information authority and preventing impersonation, and HierarCache, an adaptive memory management system that defends against unauthorized access and malicious poisoning, representing the first systematic defense for agent memory. Experiments across various LLMs show that AgentSafe significantly boosts system resilience, achieving defense success rates above 80% under adversarial conditions. Additionally, AgentSafe demonstrates scalability, maintaining robust performance as agent numbers and information complexity grow. Results underscore effectiveness of AgentSafe in securing MAS and its potential for real-world application.

Updated: 2025-07-08 04:14:01

标题: AgentSafe：通过分层数据管理保护基于大型语言模型的多代理系统

摘要: 基于大型语言模型的多智能体系统正在彻底改变自主通信和协作，然而它们仍然容易受到安全威胁，如未经授权访问和数据泄露。为了解决这个问题，我们引入了AgentSafe，这是一个通过分层信息管理和内存保护增强MAS安全性的新框架。AgentSafe通过安全级别对信息进行分类，限制敏感数据的访问权限。AgentSafe包括两个组件：ThreatSieve，通过验证信息的权威性和防止冒充来保护通信；以及HierarCache，一种适应性内存管理系统，可以防止未经授权的访问和恶意污染，代表了对智能体内存的第一个系统性防御。在各种LLM上的实验表明，AgentSafe显著提高了系统的韧性，在对抗性条件下实现了超过80%的防御成功率。此外，AgentSafe展示了可扩展性，随着智能体数量和信息复杂性的增加，保持了强大的性能。结果强调了AgentSafe在保护MAS安全性方面的有效性，以及其在现实世界应用中的潜力。

更新时间: 2025-07-08 04:14:01

领域: cs.AI

下载: http://arxiv.org/abs/2503.04392v2

City-Level Foreign Direct Investment Prediction with Tabular Learning on Judicial Data

To advance the United Nations Sustainable Development Goal on promoting sustained, inclusive, and sustainable economic growth, foreign direct investment (FDI) plays a crucial role in catalyzing economic expansion and fostering innovation. Precise city-level FDI prediction is quite important for local government and is commonly studied based on economic data (e.g., GDP). However, such economic data could be prone to manipulation, making predictions less reliable. To address this issue, we try to leverage large-scale judicial data which reflects judicial performance influencing local investment security and returns, for city-level FDI prediction. Based on this, we first build an index system for the evaluation of judicial performance over twelve million publicly available adjudication documents according to which a tabular dataset is reformulated. We then propose a new Tabular Learning method on Judicial Data (TLJD) for city-level FDI prediction. TLJD integrates row data and column data in our built tabular dataset for judicial performance indicator encoding, and utilizes a mixture of experts model to adjust the weights of different indicators considering regional variations. To validate the effectiveness of TLJD, we design cross-city and cross-time tasks for city-level FDI predictions. Extensive experiments on both tasks demonstrate the superiority of TLJD (reach to at least 0.92 R2) over the other ten state-of-the-art baselines in different evaluation metrics.

Updated: 2025-07-08 04:10:25

标题: 基于司法数据的表格学习在城市级外国直接投资预测中的应用

摘要: 为推动联合国可持续发展目标，促进持续、包容和可持续的经济增长，外国直接投资（FDI）在催化经济扩张和促进创新方面发挥着至关重要的作用。精确预测城市级别的FDI对于地方政府非常重要，通常是基于经济数据（如GDP）进行研究。然而，这些经济数据可能容易受到操纵，使预测结果不够可靠。为解决这个问题，我们尝试利用反映司法绩效影响当地投资安全和回报的大规模司法数据，用于城市级别的FDI预测。基于此，我们首先根据一千二百多万公开可获得的裁决文书建立了一个司法绩效评估指标体系，重新构建了一个表格数据集。然后，我们提出了一种新的基于司法数据的表格学习方法（TLJD）用于城市级别的FDI预测。TLJD将我们构建的表格数据集中的行数据和列数据进行司法绩效指标编码，并利用专家混合模型调整不同指标的权重，考虑区域变化。为验证TLJD的有效性，我们设计了跨城市和跨时间任务进行城市级别的FDI预测。在这两个任务上的大量实验证明了TLJD的优越性（至少达到0.92的R2值），在不同的评估指标上优于其他十种最先进的基准方法。

更新时间: 2025-07-08 04:10:25

领域: cs.AI

下载: http://arxiv.org/abs/2507.05651v1

MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation

Diffusion models are successful for synthesizing high-quality videos but are limited to generating short clips (e.g., 2-10 seconds). Synthesizing sustained footage (e.g. over minutes) still remains an open research question. In this paper, we propose MALT Diffusion (using Memory-Augmented Latent Transformers), a new diffusion model specialized for long video generation. MALT Diffusion (or just MALT) handles long videos by subdividing them into short segments and doing segment-level autoregressive generation. To achieve this, we first propose recurrent attention layers that encode multiple segments into a compact memory latent vector; by maintaining this memory vector over time, MALT is able to condition on it and continuously generate new footage based on a long temporal context. We also present several training techniques that enable the model to generate frames over a long horizon with consistent quality and minimal degradation. We validate the effectiveness of MALT through experiments on long video benchmarks. We first perform extensive analysis of MALT in long-contextual understanding capability and stability using popular long video benchmarks. For example, MALT achieves an FVD score of 220.4 on 128-frame video generation on UCF-101, outperforming the previous state-of-the-art of 648.4. Finally, we explore MALT's capabilities in a text-to-video generation setting and show that it can produce long videos compared with recent techniques for long text-to-video generation.

Updated: 2025-07-08 04:03:03

标题: MALT扩散：用于任意长度视频生成的记忆增强潜变换器

摘要: 扩散模型在合成高质量视频方面取得了成功，但仅限于生成短视频片段（例如2-10秒）。合成持续时间较长的视频素材（例如数分钟）仍然是一个尚未解决的研究问题。在本文中，我们提出了MALT扩散（使用记忆增强潜在变换器），这是一种专门用于长视频生成的新型扩散模型。MALT扩散（简称MALT）通过将长视频细分为短片段并进行段级自回归生成来处理长视频。为了实现这一点，我们首先提出了循环注意力层，将多个片段编码为紧凑的记忆潜在向量；通过随时间维持这个记忆向量，MALT能够对其进行条件化，并基于长时序上下文持续生成新的素材。我们还提出了几种训练技术，使模型能够在保持一致质量和最小降级的情况下生成长期视角内的帧。我们通过在长视频基准测试上的实验验证了MALT的有效性。首先，在使用流行的长视频基准测试对MALT的长上下文理解能力和稳定性进行了广泛分析。例如，在UCF-101上进行128帧视频生成时，MALT实现了220.4的FVD得分，超过了之前的648.4的最新技术。最后，我们探索了MALT在文本到视频生成环境中的能力，并展示了与最近的长文本到视频生成技术相比，它能够生成长视频。

更新时间: 2025-07-08 04:03:03

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2502.12632v3

DESIGN: Encrypted GNN Inference via Server-Side Input Graph Pruning

Graph Neural Networks (GNNs) have achieved state-of-the-art performance in various graph-based learning tasks. However, enabling privacy-preserving GNNs in encrypted domains, such as under Fully Homomorphic Encryption (FHE), typically incurs substantial computational overhead, rendering real-time and privacy-preserving inference impractical. In this work, we propose DESIGN (EncrypteD GNN Inference via sErver-Side Input Graph pruNing), a novel framework for efficient encrypted GNN inference. DESIGN tackles the critical efficiency limitations of existing FHE GNN approaches, which often overlook input data redundancy and apply uniform computational strategies. Our framework achieves significant performance gains through a hierarchical optimization strategy executed entirely on the server: first, FHE-compatible node importance scores (based on encrypted degree statistics) are computed from the encrypted graph. These scores then guide a homomorphic partitioning process, generating multi-level importance masks directly under FHE. This dynamically generated mask facilitates both input graph pruning (by logically removing unimportant elements) and a novel adaptive polynomial activation scheme, where activation complexity is tailored to node importance levels. Empirical evaluations demonstrate that DESIGN substantially accelerates FHE GNN inference compared to state-of-the-art methods while maintaining competitive model accuracy, presenting a robust solution for secure graph analytics.

Updated: 2025-07-08 04:01:53

标题: 设计：通过服务器端输入图剪枝实现加密的GNN推断

摘要: 图神经网络（GNNs）在各种基于图的学习任务中取得了最先进的性能。然而，在加密领域（如完全同态加密（FHE））中实现保护隐私的GNNs通常会产生大量的计算开销，使得实时和保护隐私的推断变得不切实际。在这项工作中，我们提出了DESIGN（通过服务器端输入图剪枝实现加密GNN推断），这是一个高效的加密GNN推断的新框架。DESIGN解决了现有FHE GNN方法的关键效率限制，这些方法通常忽视输入数据的冗余并应用统一的计算策略。我们的框架通过完全在服务器上执行的分层优化策略实现了显著的性能提升：首先，从加密图中计算出基于加密度统计的适用于FHE的节点重要性分数。然后，这些分数指导同态分区过程，直接在FHE下生成多级重要性掩码。这种动态生成的掩码既促进了输入图的剪枝（通过逻辑地删除不重要的元素），又实现了一种新颖的自适应多项式激活方案，其中激活复杂度根据节点重要性水平进行了调整。实证评估表明，与最先进的方法相比，DESIGN显著加速了FHE GNN推断，同时保持了竞争性的模型准确性，为安全图分析提供了一种强大的解决方案。

更新时间: 2025-07-08 04:01:53

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.05649v1

Challenges and Trends in Egocentric Vision: A Survey

With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of the research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore in detail the sub-tasks within each category. We also summarize the main challenges and trends currently existing in the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on the latest developments in the field.

Updated: 2025-07-08 03:59:12

标题: 挑战和趋势在自我中心视觉中：一项调查

摘要: 随着人工智能技术和可穿戴设备的快速发展，以个人为中心的视觉理解已成为一项新的且具有挑战性的研究方向，逐渐引起学术界和工业界的广泛关注。以个人为中心的视觉通过在人体上佩戴的摄像头或传感器捕捉视觉和多模态数据，提供了一种模拟人类视觉体验的独特视角。本文对以个人为中心的视觉理解研究进行了全面调查，系统分析了以个人为中心的场景的组成部分，并将任务分类为主体理解、物体理解、环境理解和混合理解四个主要领域。我们详细探讨了每个类别中的子任务。我们还总结了该领域目前存在的主要挑战和趋势。此外，本文介绍了高质量的以个人为中心的视觉数据集的概述，为未来研究提供了宝贵资源。通过总结最新的进展，我们预期以个人为中心的视觉技术在增强现实、虚拟现实和实体智能等领域具有广泛的应用，并根据该领域的最新发展提出未来的研究方向。

更新时间: 2025-07-08 03:59:12

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2503.15275v3

Curvature-Aligned Federated Learning (CAFe): Harmonizing Loss Landscapes for Fairness Without Demographics

Federated Learning (FL) enables privacy-preserving collaborative training, making it well-suited for decentralized human-sensing applications. Ensuring fairness in FL is challenging, as current methods rely on sensitive attribute knowledge, which conflicts with FL's privacy principles. Additionally, sensitive attributes in human-sensing data may be unknown or latent. To address this, we introduce Curvature-Aligned Federated Learning (CAFe), a theoretically grounded approach that achieves fairness in FL without requiring sensitive attribute knowledge, a concept termed "Fairness without Demographics" (FWD). CAFe introduces loss-landscape curvature regularization during local training and clients' loss-landscape sharpness-aware aggregation to align curvature both within and across clients, enabling a strong balance between higher fairness and performance. CAFe is especially suitable for real-world human-sensing FL scenarios involving single or multi-user edge devices with unknown or multiple bias factors. We validated CAFe through theoretical and empirical justifications, and comprehensive evaluations using three real-world datasets and a live real-world FL deployment with a heterogeneous testbed of resource-constrained devices. Additionally, we conduct sensitivity analyses on local training data volume, client sampling, communication overhead, resource costs, and runtime performance to demonstrate its feasibility for practical FL edge device deployment.

Updated: 2025-07-08 03:57:12

标题: 曲率对齐的联邦学习（CAFe）：在不考虑人口统计数据的情况下协调损失景观以实现公平

摘要: 联邦学习（FL）实现了隐私保护的协作训练，使其非常适用于分散式人体感应应用。在FL中确保公平性是具有挑战性的，因为当前的方法依赖于敏感属性知识，这与FL的隐私原则相冲突。此外，在人体感应数据中的敏感属性可能是未知或潜在的。为了解决这个问题，我们引入了一种名为曲率对齐联邦学习（CAFe）的理论基础方法，它在不需要敏感属性知识的情况下实现了FL中的公平性，这个概念被称为“无人口统计数据的公平性”（FWD）。CAFe在本地训练期间引入了损失地景曲率正则化，并且客户端的损失地景锐度感知聚合以在客户端内部和客户端之间对齐曲率，从而实现更高的公平性和性能之间的强平衡。CAFe特别适用于涉及未知或多个偏倚因素的单用户或多用户边缘设备的现实世界人体感应FL场景。我们通过理论和实证证明以及使用三个真实世界数据集和一个具有资源受限设备异构测试平台的现实世界FL部署进行了综合评估，验证了CAFe。此外，我们对本地训练数据量、客户端采样、通信开销、资源成本和运行性能进行了敏感性分析，以证明其在实际FL边缘设备部署中的可行性。

更新时间: 2025-07-08 03:57:12

领域: cs.LG,cs.AI,cs.DC

下载: http://arxiv.org/abs/2404.19725v5

FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images

The rapid advancement of diffusion models, particularly Stable Diffusion 3.5, has enabled the generation of highly photorealistic synthetic images that pose significant challenges to existing detection methods. This paper presents FreqCross, a novel multi-modal fusion network that combines spatial RGB features, frequency domain artifacts, and radial energy distribution patterns to achieve robust detection of AI-generated images. Our approach leverages a three-branch architecture: (1) a ResNet-18 backbone for spatial feature extraction, (2) a lightweight CNN for processing 2D FFT magnitude spectra, and (3) a multi-layer perceptron for analyzing radial energy profiles. We introduce a novel radial energy distribution analysis that captures characteristic frequency artifacts inherent in diffusion-generated images, and fuse it with spatial and spectral cues via simple feature concatenation followed by a compact classification head. Extensive experiments on a dataset of 10,000 paired real (MS-COCO) and synthetic (Stable Diffusion 3.5) images demonstrate that FreqCross achieves 97.8\% accuracy, outperforming state-of-the-art baselines by 5.2\%. The frequency analysis further reveals that synthetic images exhibit distinct spectral signatures in the 0.1--0.4 normalised frequency range, providing theoretical foundation for our approach. Code and pre-trained models are publicly available to facilitate reproducible research.

Updated: 2025-07-08 03:56:18

标题: FreqCross：一种多模式频率-空间融合网络，用于稳定扩散3.5生成图像的鲁棒检测

摘要: 随着扩散模型的快速发展，特别是稳定扩散3.5，已经实现了高度逼真的合成图像的生成，这对现有的检测方法提出了重大挑战。本文介绍了FreqCross，一种新颖的多模态融合网络，结合了空间RGB特征、频率域伪影和径向能量分布模式，实现了对人工智能生成图像的稳健检测。我们的方法利用了三分支架构：(1) 用于空间特征提取的ResNet-18主干，(2) 用于处理2D FFT幅度谱的轻量级CNN，以及(3) 用于分析径向能量分布的多层感知器。我们引入了一种捕获扩散生成图像固有特征频率伪影的新颖径向能量分布分析，并通过简单的特征串联将其与空间和光谱线索融合，然后是一个紧凑的分类头。对一个包含10,000对真实（MS-COCO）和合成（稳定扩散3.5）图像的数据集进行了大量实验，结果表明FreqCross实现了97.8%的准确率，优于现有技术基线5.2%。频率分析进一步揭示，合成图像在0.1-0.4归一化频率范围内呈现出明显的光谱特征，为我们的方法提供了理论基础。代码和预训练模型已公开提供，以促进可复现研究。

更新时间: 2025-07-08 03:56:18

领域: cs.CV,cs.CR

下载: http://arxiv.org/abs/2507.02995v2

Enhancing Satellite Object Localization with Dilated Convolutions and Attention-aided Spatial Pooling

Object localization in satellite imagery is particularly challenging due to the high variability of objects, low spatial resolution, and interference from noise and dominant features such as clouds and city lights. In this research, we focus on three satellite datasets: upper atmospheric Gravity Waves (GW), mesospheric Bores (Bore), and Ocean Eddies (OE), each presenting its own unique challenges. These challenges include the variability in the scale and appearance of the main object patterns, where the size, shape, and feature extent of objects of interest can differ significantly. To address these challenges, we introduce YOLO-DCAP, a novel enhanced version of YOLOv5 designed to improve object localization in these complex scenarios. YOLO-DCAP incorporates a Multi-scale Dilated Residual Convolution (MDRC) block to capture multi-scale features at scale with varying dilation rates, and an Attention-aided Spatial Pooling (AaSP) module to focus on the global relevant spatial regions, enhancing feature selection. These structural improvements help to better localize objects in satellite imagery. Experimental results demonstrate that YOLO-DCAP significantly outperforms both the YOLO base model and state-of-the-art approaches, achieving an average improvement of 20.95% in mAP50 and 32.23% in IoU over the base model, and 7.35% and 9.84% respectively over state-of-the-art alternatives, consistently across all three satellite datasets. These consistent gains across all three satellite datasets highlight the robustness and generalizability of the proposed approach. Our code is open sourced at https://github.com/AI-4-atmosphere-remote-sensing/satellite-object-localization.

Updated: 2025-07-08 03:53:32

标题: 使用扩张卷积和注意力辅助空间池化增强卫星目标定位

摘要: 卫星图像中的目标定位特别具有挑战性，因为对象的变异性高，空间分辨率低，同时受到噪声和云层、城市灯光等主要特征的干扰。在这项研究中，我们专注于三个卫星数据集：上层大气重力波（GW）、中间层波（Bore）和海洋涡旋（OE），每个数据集都提出了自己独特的挑战。这些挑战包括主要对象模式的尺度和外观的变异性，感兴趣对象的大小、形状和特征范围可能存在显著差异。为了解决这些挑战，我们引入了YOLO-DCAP，这是YOLOv5的一种新的增强版本，旨在改善这些复杂情境下的目标定位。YOLO-DCAP采用了多尺度膨胀残差卷积（MDRC）块来捕捉具有不同膨胀率的尺度的多尺度特征，并引入了辅助空间池化（AaSP）模块，以聚焦于全局相关的空间区域，增强特征选择。这些结构改进有助于更好地定位卫星图像中的对象。实验结果表明，YOLO-DCAP在mAP50和IoU方面显著优于YOLO基础模型和最先进的方法，分别在基础模型上平均提高了20.95%和32.23%，在最先进的替代方案上分别提高了7.35%和9.84%，且在所有三个卫星数据集上保持一致。这种在所有三个卫星数据集上的一致增益突显了所提出方法的鲁棒性和普适性。我们的代码开源在https://github.com/AI-4-atmosphere-remote-sensing/satellite-object-localization。

更新时间: 2025-07-08 03:53:32

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2505.05599v3

FACT: the Features At Convergence Theorem for neural networks

A central challenge in deep learning theory is to understand how neural networks learn and represent features. To this end, we prove the Features at Convergence Theorem (FACT), which gives a self-consistency equation that neural network weights satisfy at convergence when trained with nonzero weight decay. For each weight matrix $W$, this equation relates the "feature matrix" $W^\top W$ to the set of input vectors passed into the matrix during forward propagation and the loss gradients passed through it during backpropagation. We validate this relation empirically, showing that neural features indeed satisfy the FACT at convergence. Furthermore, by modifying the "Recursive Feature Machines" of Radhakrishnan et al. 2024 so that they obey the FACT, we arrive at a new learning algorithm, FACT-RFM. FACT-RFM achieves high performance on tabular data and captures various feature learning behaviors that occur in neural network training, including grokking in modular arithmetic and phase transitions in learning sparse parities.

Updated: 2025-07-08 03:52:48

标题: 事实：神经网络收敛定理的特征

摘要: 深度学习理论中的一个核心挑战是理解神经网络是如何学习和表示特征的。为此，我们证明了收敛时神经网络权重满足的自洽方程，即Features at Convergence Theorem (FACT)。对于每个权重矩阵$W$，这个方程将“特征矩阵”$W^\top W$与在前向传播过程中传递到该矩阵中的输入向量集以及在反向传播过程中传递的损失梯度相关联。我们通过经验证明这种关系，表明神经特征确实在收敛时满足FACT。此外，通过修改Radhakrishnan等人2014年提出的“递归特征机”使其符合FACT，我们得到了一种新的学习算法，即FACT-RFM。FACT-RFM在表格数据上表现出色，并捕捉了在神经网络训练中发生的各种特征学习行为，包括在模块化算术中的理解和学习稀疏奇偶数的相变。

更新时间: 2025-07-08 03:52:48

领域: cs.LG,cs.AI,stat.ML

下载: http://arxiv.org/abs/2507.05644v1

An empirical study of task and feature correlations in the reuse of pre-trained models

Pre-trained neural networks are commonly used and reused in the machine learning community. Alice trains a model for a particular task, and a part of her neural network is reused by Bob for a different task, often to great effect. To what can we ascribe Bob's success? This paper introduces an experimental setup through which factors contributing to Bob's empirical success could be studied in silico. As a result, we demonstrate that Bob might just be lucky: his task accuracy increases monotonically with the correlation between his task and Alice's. Even when Bob has provably uncorrelated tasks and input features from Alice's pre-trained network, he can achieve significantly better than random performance due to Alice's choice of network and optimizer. When there is little correlation between tasks, only reusing lower pre-trained layers is preferable, and we hypothesize the converse: that the optimal number of retrained layers is indicative of task and feature correlation. Finally, we show in controlled real-world scenarios that Bob can effectively reuse Alice's pre-trained network if there are semantic correlations between his and Alice's task.

Updated: 2025-07-08 03:46:46

标题: 一个关于在重复使用预训练模型中任务和特征相关性的实证研究

摘要: 预训练的神经网络在机器学习社区中被广泛使用和重复利用。爱丽丝为特定任务训练了一个模型，鲍勃则重复使用她神经网络的一部分进行不同的任务，通常效果显著。我们可以归因于鲍勃的成功吗？本文介绍了一个实验设置，通过该设置可以研究对鲍勃实证成功的因素。结果表明，鲍勃可能只是幸运的：他的任务准确度随着他的任务和爱丽丝的相关性单调增加。即使鲍勃的任务和输入特征与爱丽丝的预训练网络据称不相关，由于爱丽丝选择的网络和优化器，他也可以实现明显优于随机表现。当任务之间的相关性较小时，只重新使用较低的预训练层是可取的，并且我们假设相反：重新训练层的最佳数量是任务和特征相关性的指示。最后，我们在受控的现实场景中展示，如果鲍勃的任务和爱丽丝的任务之间存在语义相关性，他可以有效地重复使用爱丽丝的预训练网络。

更新时间: 2025-07-08 03:46:46

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2506.01975v2

A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs

Transformer neural networks (TNN) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist with no runtime adaptability, and they often rely on sparse matrices to reduce latency. However, hardware designs become more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR enhances the utilization of processing elements and on-chip memory, enhancing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like VC707 and ZCU102 show that our design is 1.2$\times$ and 2.87$\times$ more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU respectively. Additionally, it achieves a speedup of 1.7 to 2.25$\times$ compared to some state-of-the-art FPGA-based accelerators.

Updated: 2025-07-08 03:42:27

标题: 在FPGAs上的运行时自适应Transformer神经网络加速器

摘要: Transformer神经网络（TNN）在自然语言处理（NLP）、机器翻译和计算机视觉（CV）方面表现出色，而无需依赖循环或卷积层。然而，它们在计算和内存需求方面较高，特别是在像FPGA这样的资源受限设备上。此外，transformer模型在不同应用中的处理时间各不相同，需要具有特定参数的自定义模型。为每个模型设计自定义加速器是复杂且耗时的。一些自定义加速器存在，但没有运行时适应性，并且它们通常依赖于稀疏矩阵以减少延迟。然而，由于需要特定于应用程序的稀疏模式，硬件设计变得更具挑战性。本文介绍了ADAPTOR，这是一种针对FPGA上transformer编码器和解码器中的稠密矩阵计算的运行时自适应加速器。ADAPTOR增强了处理元素和芯片内存的利用率，增加了并行性并减少了延迟。它结合了高效的矩阵切片技术，将资源分配到FPGA平台上，并且完全量化以实现计算效率和可移植性。在Xilinx Alveo U55C数据中心卡和嵌入式平台（如VC707和ZCU102）上的评估显示，我们的设计比NVIDIA K80 GPU和i7-8700K CPU分别更节能1.2倍和2.87倍。此外，与一些最先进的基于FPGA的加速器相比，它实现了1.7到2.25倍的加速。

更新时间: 2025-07-08 03:42:27

领域: cs.AR,cs.LG,cs.SY,eess.SY

下载: http://arxiv.org/abs/2411.18148v3

Rule Learning for Knowledge Graph Reasoning under Agnostic Distribution Shift

Knowledge graph (KG) reasoning remains a critical research area focused on inferring missing knowledge by analyzing relationships among observed facts. Despite its success, a key limitation of existing KG reasoning methods is their dependence on the I.I.D assumption. This assumption can easily be violated due to unknown sample selection bias during training or agnostic distribution shifts during testing, significantly compromising model performance and reliability. To facilitate the deployment of KG reasoning in wild environments, this study investigates learning logical rules from KGs affected by unknown selection bias. Additionally, we address test sets with agnostic distribution shifts, formally defining this challenge as out-of-distribution (OOD) KG reasoning-a previously underexplored problem. To solve the issue, we propose the Stable Rule Learning (StableRule) framework, an end-to-end methodology that integrates feature decorrelation with rule learning network, to enhance OOD generalization performance. By leveraging feature decorrelation, the StableRule framework mitigates the adverse effects of covariate shifts arising in OOD scenarios, thereby improving the robustness of the rule learning component in effectively deriving logical rules. Extensive experiments on seven benchmark KGs demonstrate the framework's superior effectiveness and stability across diverse heterogeneous environments, underscoring its practical significance for real-world applications.

Updated: 2025-07-08 03:40:07

标题: 对于不可知分布转移下的知识图谱推理规则学习

摘要: 知识图谱（KG）推理仍然是一个关注推断观察到的事实之间关系以填补缺失知识的关键研究领域。尽管取得了成功，现有KG推理方法的一个关键局限是它们依赖于I.I.D假设。这种假设很容易在训练期间由于未知的样本选择偏差或在测试期间由于不可知的分布变化而被违反，从而显著损害模型性能和可靠性。为了促进KG推理在野外环境中的部署，本研究调查了受未知选择偏差影响的KG中学习逻辑规则的方法。此外，我们解决了具有不可知分布变化的测试集，正式将这一挑战定义为超出分布（OOD）KG推理-一个以前未被充分探讨的问题。为了解决这个问题，我们提出了稳定规则学习（StableRule）框架，这是一种端到端的方法论，将特征去相关性与规则学习网络相结合，以增强OOD泛化性能。通过利用特征去相关性，稳定规则学习框架减轻了OOD场景中出现的协变量变化的不良影响，从而提高了规则学习组件有效推导逻辑规则的稳健性。对七个基准KG的大量实验表明了该框架在不同异构环境中的卓越有效性和稳定性，突显了其对实际应用的重要意义。

更新时间: 2025-07-08 03:40:07

领域: cs.AI

下载: http://arxiv.org/abs/2507.05110v2

Variational OOD State Correction for Offline Reinforcement Learning

The performance of Offline reinforcement learning is significantly impacted by the issue of state distributional shift, and out-of-distribution (OOD) state correction is a popular approach to address this problem. In this paper, we propose a novel method named Density-Aware Safety Perception (DASP) for OOD state correction. Specifically, our method encourages the agent to prioritize actions that lead to outcomes with higher data density, thereby promoting its operation within or the return to in-distribution (safe) regions. To achieve this, we optimize the objective within a variational framework that concurrently considers both the potential outcomes of decision-making and their density, thus providing crucial contextual information for safe decision-making. Finally, we validate the effectiveness and feasibility of our proposed method through extensive experimental evaluations on the offline MuJoCo and AntMaze suites.

Updated: 2025-07-08 03:38:40

标题: 离线强化学习中的变分OOD状态校正

摘要: 离线强化学习的性能受到状态分布转移问题的显著影响，而超出分布（OOD）状态校正是解决此问题的一种流行方法。在本文中，我们提出了一种名为密度感知安全感知（DASP）的新方法，用于OOD状态校正。具体而言，我们的方法鼓励代理人优先考虑导致数据密度较高结果的行动，从而促进其在分布内或返回到分布内（安全）区域的操作。为实现此目标，我们在一个变分框架中优化目标，同时考虑决策的潜在结果和它们的密度，从而为安全决策提供关键的上下文信息。最后，我们通过对离线MuJoCo和AntMaze套件进行广泛的实验评估，验证了我们提出的方法的有效性和可行性。

更新时间: 2025-07-08 03:38:40

领域: cs.LG,cs.AI,cs.RO

下载: http://arxiv.org/abs/2505.00503v3

Learnable quantum spectral filters for hybrid graph neural networks

In this paper, we describe a parameterized quantum circuit that can be considered as convolutional and pooling layers for graph neural networks. The circuit incorporates the parameterized quantum Fourier circuit where the qubit connections for the controlled gates derived from the Laplacian operator. Specifically, we show that the eigenspace of the Laplacian operator of a graph can be approximated by using QFT based circuit whose connections are determined from the adjacency matrix. For an $N\times N$ Laplacian, this approach yields an approximate polynomial-depth circuit requiring only $n=log(N)$ qubits. These types of circuits can eliminate the expensive classical computations for approximating the learnable functions of the Laplacian through Chebyshev polynomial or Taylor expansions. Using this circuit as a convolutional layer provides an $n-$ dimensional probability vector that can be considered as the filtered and compressed graph signal. Therefore, the circuit along with the measurement can be considered a very efficient convolution plus pooling layer that transforms an $N$-dimensional signal input into $n-$dimensional signal with an exponential compression. We then apply a classical neural network prediction head to the output of the circuit to construct a complete graph neural network. Since the circuit incorporates geometric structure through its graph connection-based approach, we present graph classification results for the benchmark datasets listed in TUDataset library. Using only [1-100] learnable parameters for the quantum circuit and minimal classical layers (1000-5000 parameters) in a generic setting, the obtained results are comparable to and in some cases better than many of the baseline results, particularly for the cases when geometric structure plays a significant role.

Updated: 2025-07-08 03:36:40

标题: 可学习的量子谱滤波器用于混合图神经网络

摘要: 在本文中，我们描述了一个可以被视为图神经网络的卷积和池化层的参数化量子电路。该电路包含参数化量子傅立叶电路，其中受控门的量子比特连接源自拉普拉斯算子。具体来说，我们展示了图的拉普拉斯算子的特征空间可以通过使用基于QFT的电路来近似，其连接由邻接矩阵确定。对于一个$N\times N$的拉普拉斯矩阵，这种方法产生了一个仅需$n=log(N)$量子比特的近似多项式深度电路。这种类型的电路可以消除用于逼近拉普拉斯可学习函数的昂贵经典计算，通过切比雪夫多项式或泰勒展开来实现。将这个电路作为一个卷积层使用，提供了一个$n-$维概率向量，可以被视为经过滤波和压缩的图信号。因此，该电路连同测量可以被视为一个非常高效的卷积加池化层，将一个$N-$维信号输入转换为一个$n-$维信号，并实现指数压缩。然后，我们将经过电路输出的结果应用于经典神经网络预测头部，构建一个完整的图神经网络。由于电路通过其基于图连接的方法融入了几何结构，我们在TUDataset库中列出的基准数据集上展示了图分类结果。在通用设置中仅使用[1-100]可学习参数的量子电路和最小化的经典层（1000-5000参数），所得结果与许多基线结果相当，甚至在某些情况下更好，特别是当几何结构发挥重要作用时。

更新时间: 2025-07-08 03:36:40

领域: quant-ph,cs.LG

下载: http://arxiv.org/abs/2507.05640v1

Learning Federated Neural Graph Databases for Answering Complex Queries from Distributed Knowledge Graphs

The increasing demand for deep learning-based foundation models has highlighted the importance of efficient data retrieval mechanisms. Neural graph databases (NGDBs) offer a compelling solution, leveraging neural spaces to store and query graph-structured data, thereby enabling LLMs to access precise and contextually relevant information. However, current NGDBs are constrained to single-graph operation, limiting their capacity to reason across multiple, distributed graphs. Furthermore, the lack of support for multi-source graph data in existing NGDBs hinders their ability to capture the complexity and diversity of real-world data. In many applications, data is distributed across multiple sources, and the ability to reason across these sources is crucial for making informed decisions. This limitation is particularly problematic when dealing with sensitive graph data, as directly sharing and aggregating such data poses significant privacy risks. As a result, many applications that rely on NGDBs are forced to choose between compromising data privacy or sacrificing the ability to reason across multiple graphs. To address these limitations, we propose to learn Federated Neural Graph DataBase (FedNGDB), a pioneering systematic framework that empowers privacy-preserving reasoning over multi-source graph data. FedNGDB leverages federated learning to collaboratively learn graph representations across multiple sources, enriching relationships between entities, and improving the overall quality of graph data.

Updated: 2025-07-08 03:35:45

标题: 学习联邦神经图数据库，以从分布式知识图中回答复杂查询

摘要: 对于基于深度学习的基础模型的需求不断增加，突显了高效数据检索机制的重要性。神经图数据库（NGDBs）提供了一个引人注目的解决方案，利用神经空间来存储和查询图结构化数据，从而使LLMs能够访问精确和具有上下文相关性的信息。然而，目前的NGDBs受限于单图操作，限制了它们跨多个分布图进行推理的能力。此外，现有NGDBs对多源图数据的支持不足，阻碍了它们捕捉现实世界数据的复杂性和多样性。在许多应用中，数据分布在多个来源，跨这些来源进行推理对于做出明智决策至关重要。当处理敏感图数据时，直接共享和聚合此类数据会带来重大的隐私风险，因此，许多依赖于NGDBs的应用程序被迫在损害数据隐私和放弃跨多个图进行推理之间做出选择。为了解决这些限制，我们提出了学习联合神经图数据库（FedNGDB），这是一个开创性的系统框架，可以在多源图数据上实现保护隐私的推理。FedNGDB利用联邦学习来协作学习跨多个来源的图表示，丰富实体之间的关系，并提高图数据的整体质量。

更新时间: 2025-07-08 03:35:45

领域: cs.LG,cs.AI,cs.CR,cs.DB

下载: http://arxiv.org/abs/2402.14609v4

Latent Acoustic Mapping for Direction of Arrival Estimation: A Self-Supervised Approach

Acoustic mapping techniques have long been used in spatial audio processing for direction of arrival estimation (DoAE). Traditional beamforming methods for acoustic mapping, while interpretable, often rely on iterative solvers that can be computationally intensive and sensitive to acoustic variability. On the other hand, recent supervised deep learning approaches offer feedforward speed and robustness but require large labeled datasets and lack interpretability. Despite their strengths, both methods struggle to consistently generalize across diverse acoustic setups and array configurations, limiting their broader applicability. We introduce the Latent Acoustic Mapping (LAM) model, a self-supervised framework that bridges the interpretability of traditional methods with the adaptability and efficiency of deep learning methods. LAM generates high-resolution acoustic maps, adapts to varying acoustic conditions, and operates efficiently across different microphone arrays. We assess its robustness on DoAE using the LOCATA and STARSS benchmarks. LAM achieves comparable or superior localization performance to existing supervised methods. Additionally, we show that LAM's acoustic maps can serve as effective features for supervised models, further enhancing DoAE accuracy and underscoring its potential to advance adaptive, high-performance sound localization systems.

Updated: 2025-07-08 03:35:00

标题: 隐性声学制图用于到达方向估计：一种自监督方法

摘要: 声学映射技术长期以来一直被用于空间音频处理中的到达方向估计（DoAE）。传统的声学映射波束成形方法虽然可解释，但通常依赖于可能计算密集且对声学变化敏感的迭代求解器。另一方面，最近的监督深度学习方法提供了前馈速度和稳健性，但需要大量标记数据集并且缺乏可解释性。尽管它们各有优势，但这两种方法在不同声学设置和阵列配置中很难一致地推广，限制了它们更广泛的适用性。我们引入了潜在声学映射（LAM）模型，这是一个自监督框架，它将传统方法的可解释性与深度学习方法的适应性和效率相结合。LAM生成高分辨率的声学地图，适应不同声学条件，并在不同麦克风阵列中高效运行。我们使用LOCATA和STARSS基准对其在DoAE上的稳健性进行评估。LAM实现了与现有监督方法相当或更优越的定位性能。此外，我们展示了LAM的声学地图可以作为监督模型的有效特征，进一步提高DoAE精度，并强调其推进自适应、高性能声音定位系统的潜力。

更新时间: 2025-07-08 03:35:00

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2507.07066v1

LLMs are Introvert

The exponential growth of social media and generative AI has transformed information dissemination, fostering connectivity but also accelerating the spread of misinformation. Understanding information propagation dynamics and developing effective control strategies is essential to mitigate harmful content. Traditional models, such as SIR, provide basic insights but inadequately capture the complexities of online interactions. Advanced methods, including attention mechanisms and graph neural networks, enhance accuracy but typically overlook user psychology and behavioral dynamics. Large language models (LLMs), with their human-like reasoning, offer new potential for simulating psychological aspects of information spread. We introduce an LLM-based simulation environment capturing agents' evolving attitudes, emotions, and responses. Initial experiments, however, revealed significant gaps between LLM-generated behaviors and authentic human dynamics, especially in stance detection and psychological realism. A detailed evaluation through Social Information Processing Theory identified major discrepancies in goal-setting and feedback evaluation, stemming from the lack of emotional processing in standard LLM training. To address these issues, we propose the Social Information Processing-based Chain of Thought (SIP-CoT) mechanism enhanced by emotion-guided memory. This method improves the interpretation of social cues, personalization of goals, and evaluation of feedback. Experimental results confirm that SIP-CoT-enhanced LLM agents more effectively process social information, demonstrating behaviors, attitudes, and emotions closer to real human interactions. In summary, this research highlights critical limitations in current LLM-based propagation simulations and demonstrates how integrating SIP-CoT and emotional memory significantly enhances the social intelligence and realism of LLM agents.

Updated: 2025-07-08 03:32:38

标题: LLMs是内向的

摘要: 社交媒体和生成式人工智能的指数增长改变了信息传播方式，促进了连接，但也加速了错误信息的传播。了解信息传播动态并开发有效的控制策略对减少有害内容至关重要。传统模型，如SIR，提供了基本见解，但未能捕捉在线互动的复杂性。先进的方法，包括注意力机制和图神经网络，提高了准确性，但通常忽视了用户心理和行为动态。大型语言模型（LLMs）以其类似人类推理的特点，为模拟信息传播的心理方面提供了新的潜力。我们介绍了一个基于LLM的模拟环境，捕捉了代理的演变态度、情绪和反应。然而，初步实验揭示了LLM生成的行为与真实人类动态之间存在显著差距，尤其是在立场检测和心理现实主义方面。通过社交信息处理理论的详细评估，发现了在目标设定和反馈评估方面的主要差异，这源于标准LLM训练中缺乏情感处理。为了解决这些问题，我们提出了基于社交信息处理的思维链（SIP-CoT）机制，通过情感引导的记忆来增强。这种方法改善了对社交线索的解释、目标的个性化以及反馈的评估。实验结果证实，增强了SIP-CoT的LLM代理更有效地处理社交信息，展示出与真实人类互动更接近的行为、态度和情绪。总之，这项研究突显了当前基于LLM的传播模拟中的关键局限性，并展示了如何通过整合SIP-CoT和情感记忆显著增强LLM代理的社交智能和现实主义。

更新时间: 2025-07-08 03:32:38

领域: cs.AI,cs.SI

下载: http://arxiv.org/abs/2507.05638v1

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

Jailbreak attack can be used to access the vulnerabilities of Large Language Models (LLMs) by inducing LLMs to generate the harmful content. And the most common method of the attack is to construct semantically ambiguous prompts to confuse and mislead the LLMs. To access the security and reveal the intrinsic relation between the input prompt and the output for LLMs, the distribution of attention weight is introduced to analyze the underlying reasons. By using statistical analysis methods, some novel metrics are defined to better describe the distribution of attention weight, such as the Attention Intensity on Sensitive Words (Attn_SensWords), the Attention-based Contextual Dependency Score (Attn_DepScore) and Attention Dispersion Entropy (Attn_Entropy). By leveraging the distinct characteristics of these metrics, the beam search algorithm and inspired by the military strategy "Feint and Attack", an effective jailbreak attack strategy named as Attention-Based Attack (ABA) is proposed. In the ABA, nested attack prompts are employed to divert the attention distribution of the LLMs. In this manner, more harmless parts of the input can be used to attract the attention of the LLMs. In addition, motivated by ABA, an effective defense strategy called as Attention-Based Defense (ABD) is also put forward. Compared with ABA, the ABD can be used to enhance the robustness of LLMs by calibrating the attention distribution of the input prompt. Some comparative experiments have been given to demonstrate the effectiveness of ABA and ABD. Therefore, both ABA and ABD can be used to access the security of the LLMs. The comparative experiment results also give a logical explanation that the distribution of attention weight can bring great influence on the output for LLMs.

Updated: 2025-07-08 03:32:27

标题: 佯攻与实战：基于注意力的策略用于越狱和保护LLMs

摘要: 越狱攻击可以利用大型语言模型（LLMs）的漏洞，通过诱导LLMs生成有害内容来访问这些漏洞。攻击的最常见方法是构建语义模糊的提示，以混淆和误导LLMs。为了访问安全性并揭示输入提示与LLMs输出之间的内在关系，引入了注意力权重分布来分析潜在原因。通过使用统计分析方法，定义了一些新颖的度量标准来更好地描述注意力权重的分布，例如对敏感词的注意力强度（Attn_SensWords）、基于注意力的上下文依赖得分（Attn_DepScore）和注意力分散熵（Attn_Entropy）。通过利用这些度量标准的独特特征，结合光束搜索算法并受到军事策略“假动作和攻击”的启发，提出了一种有效的越狱攻击策略，名为基于注意力的攻击（ABA）。在ABA中，采用嵌套攻击提示来转移LLMs的注意力分布。通过这种方式，可以利用输入的更无害部分来吸引LLMs的注意力。此外，受到ABA的启发，还提出了一种有效的防御策略，称为基于注意力的防御（ABD）。与ABA相比，ABD可以通过校准输入提示的注意力分布来增强LLMs的鲁棒性。进行了一些比较实验来证明ABA和ABD的有效性。因此，ABA和ABD都可以用于访问LLMs的安全性。比较实验结果也给出了一个合乎逻辑的解释，即注意力权重的分布对LLMs的输出产生重大影响。

更新时间: 2025-07-08 03:32:27

领域: cs.CR,cs.AI,cs.CL

下载: http://arxiv.org/abs/2410.16327v2

Graph Learning

Graph learning has rapidly evolved into a critical subfield of machine learning and artificial intelligence (AI). Its development began with early graph-theoretic methods, gaining significant momentum with the advent of graph neural networks (GNNs). Over the past decade, progress in scalable architectures, dynamic graph modeling, multimodal learning, generative AI, explainable AI (XAI), and responsible AI has broadened the applicability of graph learning to various challenging environments. Graph learning is significant due to its ability to model complex, non-Euclidean relationships that traditional machine learning struggles to capture, thus better supporting real-world applications ranging from drug discovery and fraud detection to recommender systems and scientific reasoning. However, challenges like scalability, generalization, heterogeneity, interpretability, and trustworthiness must be addressed to unlock its full potential. This survey provides a comprehensive introduction to graph learning, focusing on key dimensions including scalable, temporal, multimodal, generative, explainable, and responsible graph learning. We review state-of-the-art techniques for efficiently handling large-scale graphs, capturing dynamic temporal dependencies, integrating heterogeneous data modalities, generating novel graph samples, and enhancing interpretability to foster trust and transparency. We also explore ethical considerations, such as privacy and fairness, to ensure responsible deployment of graph learning models. Additionally, we identify and discuss emerging topics, highlighting recent integration of graph learning and other AI paradigms and offering insights into future directions. This survey serves as a valuable resource for researchers and practitioners seeking to navigate the rapidly evolving landscape of graph learning.

Updated: 2025-07-08 03:29:27

标题: 图学习

摘要: 图学习迅速发展成为机器学习和人工智能（AI）的一个关键子领域。它的发展始于早期的图论方法，随着图神经网络（GNNs）的出现获得了显著的动力。在过去的十年中，可扩展架构、动态图建模、多模态学习、生成式AI、可解释AI（XAI）和负责任的AI的进展扩大了图学习对各种具有挑战性的环境的适用性。图学习的重要性在于其能够建模复杂的、非欧几里得关系，传统机器学习很难捕捉到这些关系，因此更好地支持从药物发现和欺诈检测到推荐系统和科学推理等实际应用。然而，必须解决可扩展性、泛化性、异质性、可解释性和可信度等挑战，以释放其全部潜力。本调查提供了图学习的全面介绍，重点关注可扩展、时间、多模态、生成、可解释和负责任的图学习等关键维度。我们审查了处理大规模图、捕捉动态时间依赖性、整合异构数据模态、生成新颖图样本以及增强可解释性以促进信任和透明度的最新技术。我们还探讨了伦理考虑，如隐私和公平性，以确保图学习模型的负责任部署。此外，我们确定并讨论新兴主题，突出了最近图学习与其他AI范式的整合，并提供对未来方向的见解。这项调查为寻求了解图学习快速发展格局的研究人员和实践者提供了宝贵的资源。

更新时间: 2025-07-08 03:29:27

领域: cs.LG,cs.AI,68T09, 68R10,I.2.6; G.2.2; E.1

下载: http://arxiv.org/abs/2507.05636v1

SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression

Retrieval-augmented Generation (RAG) extends large language models (LLMs) with external knowledge but faces key challenges: restricted effective context length and redundancy in retrieved documents. Pure compression-based approaches reduce input size but often discard fine-grained details essential for factual accuracy. We propose SARA, a unified RAG framework that balances local precision and global knowledge coverage under tight context budgets. SARA combines natural-language text snippets with semantic compression vectors to jointly enhance context efficiency and answer correctness. It represents contexts at two complementary levels: 1) fine-grained natural-language spans that preserve critical entities and numerical values, and 2) compact, interpretable vectors that summarize high-level semantics. An iterative evidence-selection module employs the compression vectors for dynamic reranking of contexts. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.

Updated: 2025-07-08 03:29:09

标题: SARA：具有上下文压缩的选择性和自适应检索增强生成

摘要: 检索增强生成（RAG）扩展大型语言模型（LLMs）与外部知识但面临关键挑战：受限的有效上下文长度和检索文档中的冗余。纯压缩型方法减小输入大小但通常丢弃对事实准确性至关重要的细节。我们提出了SARA，一个统一的RAG框架，可以在紧凑的上下文预算下平衡局部精度和全局知识覆盖。SARA将自然语言文本片段与语义压缩向量结合起来，共同增强上下文效率和答案正确性。它以两个互补的层次表示上下文：1）保留关键实体和数字值的细粒度自然语言跨度，2）总结高级语义的紹概、可解释向量。一个迭代的证据选择模块利用压缩向量对上下文进行动态重新排序。在跨越3个模型系列（Mistral、Llama和Gemma）的9个数据集和5个开源LLMs中，SARA始终提高了答案相关性（+17.71）、答案正确性（+13.72）和语义相似性（+15.53），证明了将文本和压缩表示整合到强大、上下文高效的RAG中的重要性。

更新时间: 2025-07-08 03:29:09

领域: cs.CL,cs.AI,cs.IR

下载: http://arxiv.org/abs/2507.05633v1

DeepCell: Self-Supervised Multiview Fusion for Circuit Representation Learning

We introduce DeepCell, a novel circuit representation learning framework that effectively integrates multiview information from both And-Inverter Graphs (AIGs) and Post-Mapping (PM) netlists. At its core, DeepCell employs a self-supervised Mask Circuit Modeling (MCM) strategy, inspired by masked language modeling, to fuse complementary circuit representations from different design stages into unified and rich embeddings. To our knowledge, DeepCell is the first framework explicitly designed for PM netlist representation learning, setting new benchmarks in both predictive accuracy and reconstruction quality. We demonstrate the practical efficacy of DeepCell by applying it to critical EDA tasks such as functional Engineering Change Orders (ECO) and technology mapping. Extensive experimental results show that DeepCell significantly surpasses state-of-the-art open-source EDA tools in efficiency and performance.

Updated: 2025-07-08 03:25:12

标题: DeepCell：用于电路表示学习的自监督多视图融合

摘要: 我们介绍了DeepCell，这是一个新颖的电路表示学习框架，有效地整合了来自And-Inverter Graphs（AIGs）和Post-Mapping（PM）网表的多视图信息。在其核心，DeepCell采用了一种自监督的Mask Circuit Modeling（MCM）策略，灵感来自于遮罩语言建模，将不同设计阶段的互补电路表示融合为统一丰富的嵌入。据我们所知，DeepCell是第一个专门设计用于PM网表表示学习的框架，在预测准确性和重建质量方面设定了新的基准。我们通过将DeepCell应用于关键的EDA任务，如功能性工程变更订单（ECO）和技术映射，展示了DeepCell的实际有效性。大量实验结果表明，DeepCell在效率和性能方面明显优于最先进的开源EDA工具。

更新时间: 2025-07-08 03:25:12

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2502.06816v2

How Not to Detect Prompt Injections with an LLM

LLM-integrated applications and agents are vulnerable to prompt injection attacks, in which adversaries embed malicious instructions within seemingly benign user inputs to manipulate the LLM's intended behavior. Recent defenses based on $\textit{known-answer detection}$ (KAD) have achieved near-perfect performance by using an LLM to classify inputs as clean or contaminated. In this work, we formally characterize the KAD framework and uncover a structural vulnerability in its design that invalidates its core security premise. We design a methodical adaptive attack, $\textit{DataFlip}$, to exploit this fundamental weakness. It consistently evades KAD defenses with detection rates as low as $1.5\%$ while reliably inducing malicious behavior with success rates of up to $88\%$, without needing white-box access to the LLM or any optimization procedures.

Updated: 2025-07-08 03:24:56

标题: 如何不使用LLM检测快速注入

摘要: LLM集成应用程序和代理易受到即时注入攻击的影响，即对手在看似良性用户输入中嵌入恶意指令，以操纵LLM的预期行为。基于已知答案检测（KAD）的最近防御措施通过使用LLM将输入分类为干净或受污染，实现了几乎完美的性能。在这项工作中，我们正式表征了KAD框架，并揭示了其设计中的结构性漏洞，使其核心安全前提无效。我们设计了一种系统的自适应攻击方法DataFlip，以利用这一基本弱点。它始终规避KAD防御，检测率仅为1.5％，同时可可靠地诱导恶意行为，成功率高达88％，而无需对LLM进行白盒访问或任何优化程序。

更新时间: 2025-07-08 03:24:56

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.05630v1

Enhancing Student Learning with LLM-Generated Retrieval Practice Questions: An Empirical Study in Data Science Courses

Retrieval practice is a well-established pedagogical technique known to significantly enhance student learning and knowledge retention. However, generating high-quality retrieval practice questions is often time-consuming and labor intensive for instructors, especially in rapidly evolving technical subjects. Large Language Models (LLMs) offer the potential to automate this process by generating questions in response to prompts, yet the effectiveness of LLM-generated retrieval practice on student learning remains to be established. In this study, we conducted an empirical study involving two college-level data science courses, with approximately 60 students. We compared learning outcomes during one week in which students received LLM-generated multiple-choice retrieval practice questions to those from a week in which no such questions were provided. Results indicate that students exposed to LLM-generated retrieval practice achieved significantly higher knowledge retention, with an average accuracy of 89%, compared to 73% in the week without such practice. These findings suggest that LLM-generated retrieval questions can effectively support student learning and may provide a scalable solution for integrating retrieval practice into real-time teaching. However, despite these encouraging outcomes and the potential time-saving benefits, cautions must be taken, as the quality of LLM-generated questions can vary. Instructors must still manually verify and revise the generated questions before releasing them to students.

Updated: 2025-07-08 03:23:19

标题: 使用LLM生成的检索实践问题增强学生学习：数据科学课程中的实证研究

摘要: 检索实践是一种被广泛认可的教学技术，已知它能显著提高学生学习和知识保留。然而，生成高质量的检索实践问题通常对教师来说耗时且劳动密集，特别是在迅速发展的技术学科中。大型语言模型（LLMs）提供了通过对提示生成问题来自动化这一过程的潜力，然而LLM生成的检索实践对学生学习的有效性尚待确定。在这项研究中，我们进行了一项实证研究，涉及两门大学级数据科学课程，约60名学生参与。我们比较了学生在一周内接受LLM生成的多项选择检索实践问题和未提供此类问题的一周内的学习成果。结果表明，接受LLM生成的检索实践的学生实现了显著更高的知识保留，平均准确率为89％，而未进行此类实践的一周为73％。这些发现表明，LLM生成的检索问题可以有效支持学生学习，并可能为将检索实践整合到实时教学中提供可扩展的解决方案。然而，尽管有这些令人鼓舞的结果和潜在的节省时间的好处，仍需谨慎，因为LLM生成的问题的质量可能有所不同。教师仍需在向学生发布问题之前手动验证和修订生成的问题。

更新时间: 2025-07-08 03:23:19

领域: cs.AI

下载: http://arxiv.org/abs/2507.05629v1

A Probabilistic Approach to Uncertainty Quantification Leveraging 3D Geometry

Quantifying uncertainty in neural implicit 3D representations, particularly those utilizing Signed Distance Functions (SDFs), remains a substantial challenge due to computational inefficiencies, scalability issues, and geometric inconsistencies. Existing methods typically neglect direct geometric integration, leading to poorly calibrated uncertainty maps. We introduce BayesSDF, a novel probabilistic framework for uncertainty quantification in neural implicit SDF models, motivated by scientific simulation applications with 3D environments (e.g., forests) such as modeling fluid flow through forests, where precise surface geometry and awareness of fidelity surface geometric uncertainty are essential. Unlike radiance-based models such as NeRF or 3D Gaussian splatting, which lack explicit surface formulations, SDFs define continuous and differentiable geometry, making them better suited for physical modeling and analysis. BayesSDF leverages a Laplace approximation to quantify local surface instability via Hessian-based metrics, enabling computationally efficient, surface-aware uncertainty estimation. Our method shows that uncertainty predictions correspond closely with poorly reconstructed geometry, providing actionable confidence measures for downstream use. Extensive evaluations on synthetic and real-world datasets demonstrate that BayesSDF outperforms existing methods in both calibration and geometric consistency, establishing a strong foundation for uncertainty-aware 3D scene reconstruction, simulation, and robotic decision-making.

Updated: 2025-07-08 03:21:12

标题: 一个利用3D几何的概率方法来量化不确定性的研究

摘要: 在神经隐式3D表示中量化不确定性，特别是那些利用有符号距离函数（SDFs）的表示，仍然是一个重大挑战，原因是计算效率低、可扩展性问题和几何不一致性。现有方法通常忽视直接几何集成，导致不良校准的不确定性地图。我们引入了BayesSDF，这是一个新颖的概率框架，用于神经隐式SDF模型中的不确定性量化，受到科学模拟应用的启发，例如在3D环境（如森林）中模拟流体流动，其中精确的表面几何和对准确性表面几何的认识至关重要。与基于辐射的模型（如NeRF或3D高斯点阵）不同，这些模型缺乏明确的表面公式，SDFs定义了连续且可微分的几何，使它们更适合于物理建模和分析。BayesSDF利用拉普拉斯近似来通过基于Hessian的度量来量化局部表面不稳定性，从而实现了计算效率高、面向表面的不确定性估计。我们的方法表明，不确定性预测与重建几何不佳密切对应，为下游使用提供了可操作的置信度度量。对合成和实际数据集的广泛评估表明，BayesSDF在校准和几何一致性方面优于现有方法，为不确定性感知的3D场景重建、模拟和机器人决策奠定了坚实基础。

更新时间: 2025-07-08 03:21:12

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.06269v1

One Surrogate to Fool Them All: Universal, Transferable, and Targeted Adversarial Attacks with CLIP

Deep Neural Networks (DNNs) have achieved widespread success yet remain prone to adversarial attacks. Typically, such attacks either involve frequent queries to the target model or rely on surrogate models closely mirroring the target model -- often trained with subsets of the target model's training data -- to achieve high attack success rates through transferability. However, in realistic scenarios where training data is inaccessible and excessive queries can raise alarms, crafting adversarial examples becomes more challenging. In this paper, we present UnivIntruder, a novel attack framework that relies solely on a single, publicly available CLIP model and publicly available datasets. By using textual concepts, UnivIntruder generates universal, transferable, and targeted adversarial perturbations that mislead DNNs into misclassifying inputs into adversary-specified classes defined by textual concepts. Our extensive experiments show that our approach achieves an Attack Success Rate (ASR) of up to 85% on ImageNet and over 99% on CIFAR-10, significantly outperforming existing transfer-based methods. Additionally, we reveal real-world vulnerabilities, showing that even without querying target models, UnivIntruder compromises image search engines like Google and Baidu with ASR rates up to 84%, and vision language models like GPT-4 and Claude-3.5 with ASR rates up to 80%. These findings underscore the practicality of our attack in scenarios where traditional avenues are blocked, highlighting the need to reevaluate security paradigms in AI applications.

Updated: 2025-07-08 03:14:54

标题: 一个欺骗所有人的替代品：具有CLIP的通用、可转移和针对性的对抗攻击

摘要: 深度神经网络（DNNs）取得了广泛的成功，但仍然容易受到对抗性攻击。通常，这类攻击要么涉及对目标模型的频繁查询，要么依赖于与目标模型密切相似的替代模型 - 通常是使用目标模型训练数据的子集进行训练 - 通过可传递性实现高攻击成功率。然而，在现实场景中，训练数据不可访问且过多的查询可能引起警报，制作对抗性示例变得更具挑战性。在本文中，我们提出了UnivIntruder，这是一个新颖的攻击框架，仅依赖于单个公开可用的CLIP模型和公开可用的数据集。通过使用文本概念，UnivIntruder生成通用的、可传递的和有针对性的对抗性扰动，误导DNNs将输入误分类为由文本概念定义的对手指定的类别。我们的广泛实验表明，我们的方法在ImageNet上达到高达85%的攻击成功率，在CIFAR-10上超过99%，明显优于现有的基于传递的方法。此外，我们揭示了现实世界的漏洞，显示即使不查询目标模型，UnivIntruder也可以以高达84%的攻击成功率危害像Google和百度这样的图像搜索引擎，以及像GPT-4和Claude-3.5这样的视觉语言模型，攻击成功率高达80%。这些发现强调了我们的攻击在传统途径受阻的场景中的实用性，突显了需要重新评估AI应用中的安全范式。

更新时间: 2025-07-08 03:14:54

领域: cs.CR,cs.LG,68T07,I.2.6

下载: http://arxiv.org/abs/2505.19840v2

StreamDiT: Real-Time Streaming Text-to-Video Generation

Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video. We provide video results and more examples in our project website: https://cumulo-autumn.github.io/StreamDiT/

Updated: 2025-07-08 03:10:13

标题: StreamDiT：实时流文本到视频生成

摘要: 最近，通过将基于transformer的扩散模型扩展到数十亿参数，文本到视频（T2V）生成取得了巨大进展，这可以生成高质量的视频。然而，现有模型通常只能离线生成短片段，限制了它们在交互和实时应用中的使用案例。本文通过提出StreamDiT，一个流式视频生成模型，来解决这些挑战。StreamDiT训练基于通过添加移动缓冲区进行流匹配。我们设计了混合训练，使用不同的缓冲帧分区方案来提升内容一致性和视觉质量。StreamDiT建模基于具有不同时间嵌入和窗口注意力的adaLN DiT。为了实践提出的方法，我们使用4B参数训练了一个StreamDiT模型。此外，我们提出了一种专为StreamDiT量身定制的多步蒸馏方法。在所选分区方案的每个段中执行采样蒸馏。蒸馏后，函数评估的总数（NFEs）减少到缓冲区中的块数。最后，我们的蒸馏模型在一个GPU上以16 FPS的实时性能生成视频流，可以生成512p分辨率的视频。我们通过定量指标和人类评估评估我们的方法。我们的模型支持实时应用，例如流式生成、交互生成和视频到视频。我们在我们的项目网站上提供视频结果和更多示例：https://cumulo-autumn.github.io/StreamDiT/

更新时间: 2025-07-08 03:10:13

领域: cs.CV,cs.AI,cs.LG,eess.IV

下载: http://arxiv.org/abs/2507.03745v2

ADMC: Attention-based Diffusion Model for Missing Modalities Feature Completion

Multimodal emotion and intent recognition is essential for automated human-computer interaction, It aims to analyze users' speech, text, and visual information to predict their emotions or intent. One of the significant challenges is that missing modalities due to sensor malfunctions or incomplete data. Traditional methods that attempt to reconstruct missing information often suffer from over-coupling and imprecise generation processes, leading to suboptimal outcomes. To address these issues, we introduce an Attention-based Diffusion model for Missing Modalities feature Completion (ADMC). Our framework independently trains feature extraction networks for each modality, preserving their unique characteristics and avoiding over-coupling. The Attention-based Diffusion Network (ADN) generates missing modality features that closely align with authentic multimodal distribution, enhancing performance across all missing-modality scenarios. Moreover, ADN's cross-modal generation offers improved recognition even in full-modality contexts. Our approach achieves state-of-the-art results on the IEMOCAP and MIntRec benchmarks, demonstrating its effectiveness in both missing and complete modality scenarios.

Updated: 2025-07-08 03:08:52

标题: ADMC：基于注意力的扩散模型用于缺失模态特征补全

摘要: 多模态情感和意图识别对于自动化人机交互至关重要，它旨在分析用户的语音、文本和视觉信息，以预测他们的情绪或意图。其中一个重要的挑战是由于传感器故障或数据不完整而导致的缺失模态。传统方法尝试重建缺失信息通常会受到过度耦合和不精确的生成过程的影响，导致结果不佳。为了解决这些问题，我们引入了基于注意力的缺失模态特征完成（ADMC）扩散模型。我们的框架独立训练每个模态的特征提取网络，保留它们独特的特征并避免过度耦合。基于注意力的扩散网络（ADN）生成与真实多模态分布密切一致的缺失模态特征，提高了在所有缺失模态场景下的性能。此外，ADN的跨模态生成在全模态情况下也提供了改进的识别。我们的方法在IEMOCAP和MIntRec基准测试中取得了最先进的结果，证明了它在缺失和完整模态场景中的有效性。

更新时间: 2025-07-08 03:08:52

领域: cs.AI

下载: http://arxiv.org/abs/2507.05624v1

A Collectivist, Economic Perspective on AI

Information technology is in the midst of a revolution in which omnipresent data collection and machine learning are impacting the human world as never before. The word "intelligence" is being used as a North Star for the development of this technology, with human cognition viewed as a baseline. This view neglects the fact that humans are social animals, and that much of our intelligence is social and cultural in origin. A related issue is that the current view treats the social consequences of technology as an afterthought. The path forward is not merely more data and compute, and not merely more attention paid to cognitive or symbolic representations, but a thorough blending of economic and social concepts with computational and inferential concepts, in the service of system-level designs in which social welfare is a first-class citizen, and with the aspiration that a new human-centric engineering field will emerge.

Updated: 2025-07-08 03:07:43

标题: 一个集体主义、经济学视角下的人工智能

摘要: 信息技术正处于一场革命之中，无处不在的数据收集和机器学习正在以前所未有的方式影响人类世界。人们将“智能”作为这项技术发展的北极星，将人类认知视为基线。然而，这种观点忽视了人类是社会动物这一事实，以及我们的大部分智慧是社会和文化的产物。另一个相关问题是，当前的观点将技术的社会后果视为事后才考虑的问题。前进的道路不仅仅是更多的数据和计算，也不仅仅是更多关注认知或符号表示，而是将经济和社会概念与计算和推理概念彻底融合，以服务于系统级设计，其中社会福利是一等公民，并且希望一个新的以人为中心的工程领域会涌现出来。

更新时间: 2025-07-08 03:07:43

领域: cs.CY,cs.AI,stat.ML

下载: http://arxiv.org/abs/2507.06268v1

DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective

The widespread application of Deep Learning across diverse domains hinges critically on the quality and composition of training datasets. However, the common lack of disclosure regarding their usage raises significant privacy and copyright concerns. Dataset auditing techniques, which aim to determine if a specific dataset was used to train a given suspicious model, provide promising solutions to addressing these transparency gaps. While prior work has developed various auditing methods, their resilience against dedicated adversarial attacks remains largely unexplored. To bridge the gap, this paper initiates a comprehensive study evaluating dataset auditing from an adversarial perspective. We start with introducing a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing). Subsequently, we formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset. Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery. These formulations and strategies lead to our new benchmark, DATABench, comprising 17 evasion attacks, 5 forgery attacks, and 9 representative auditing methods. Extensive evaluations using DATABench reveal that none of the evaluated auditing methods are sufficiently robust or distinctive under adversarial settings. These findings underscore the urgent need for developing a more secure and reliable dataset auditing method capable of withstanding sophisticated adversarial manipulation. Code is available at https://github.com/shaoshuo-ss/DATABench.

Updated: 2025-07-08 03:07:15

标题: DATABench：从对抗性角度评估深度学习中的数据集审计

摘要: 深度学习在各个领域的广泛应用，关键在于训练数据集的质量和组成。然而，关于它们使用情况的普遍缺乏披露引发了重大的隐私和版权问题。数据集审计技术旨在确定特定数据集是否被用于训练给定的可疑模型，提供了解决这些透明度缺口的有希望的解决方案。尽管先前的工作已经发展了各种审计方法，但它们对专门的对抗攻击的抵抗力仍然很少被探索。为了弥合这一差距，本文从对抗的角度开始进行了一项全面研究，评估数据集审计。我们首先介绍了一种新的分类法，根据其对内部特征（IF）（数据固有的）和外部特征（EF）（为审计而人为引入的）的依赖程度对现有方法进行分类。随后，我们制定了两种主要的攻击类型：规避攻击，旨在隐藏数据集的使用，和伪造攻击，旨在虚假地暗示未使用的数据集。基于对现有方法和攻击目标的理解，我们进一步提出了系统性的攻击策略：对规避的解耦、移除和检测；对伪造的基于对抗性示例的方法。这些公式和策略导致了我们的新基准DATABench，包括17种规避攻击、5种伪造攻击和9种代表性审计方法。使用DATABench进行广泛评估显示，在对抗环境下，没有任何评估的审计方法足够强大或独特。这些发现强调了迫切需要开发一种更安全可靠的数据集审计方法，能够抵御复杂的对抗操纵。代码可在https://github.com/shaoshuo-ss/DATABench找到。

更新时间: 2025-07-08 03:07:15

领域: cs.CR,cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.05622v1

Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, this openness also introduces significant security risks, particularly through embedding space poisoning, which is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves a high average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses.

Updated: 2025-07-08 03:01:00

标题: 通过嵌入空间毒性衰减规避大型语言模型的安全对齐

摘要: 大型语言模型（LLMs）在医疗保健、教育和网络安全等领域取得了显著成功。然而，这种开放性也带来了重大的安全风险，特别是通过嵌入空间毒化，这是一种微妙的攻击方式，敌对方利用内部语义表示来绕过安全对齐机制。尽管先前的研究已经调查了通用扰动方法，但在嵌入层面上的LLM安全对齐动态仍然不够理解。因此，更具针对性和准确的对抗扰动技术，对模型构成重大威胁，尚未得到充分研究。在这项工作中，我们提出了ETTA（嵌入变换毒性削弱），这是一个通过线性变换识别和减弱嵌入空间中的毒性敏感维度的新框架。ETTA绕过了模型的拒绝行为，同时保持语言连贯性，而不需要模型微调或访问训练数据。在使用AdvBench基准测试的五个代表性开源LLMs上评估，ETTA实现了高达88.61%的平均攻击成功率，超过了最佳基线11.34%，并且适用于安全增强模型（例如，在经过指导调整的防御上达到77.39% ASR）。这些结果突显了当前对齐策略中的关键漏洞，并强调了对嵌入感知防御的需求。

更新时间: 2025-07-08 03:01:00

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.08020v1

Generative Head-Mounted Camera Captures for Photorealistic Avatars

Enabling photorealistic avatar animations in virtual and augmented reality (VR/AR) has been challenging because of the difficulty of obtaining ground truth state of faces. It is physically impossible to obtain synchronized images from head-mounted cameras (HMC) sensing input, which has partial observations in infrared (IR), and an array of outside-in dome cameras, which have full observations that match avatars' appearance. Prior works relying on analysis-by-synthesis methods could generate accurate ground truth, but suffer from imperfect disentanglement between expression and style in their personalized training. The reliance of extensive paired captures (HMC and dome) for the same subject makes it operationally expensive to collect large-scale datasets, which cannot be reused for different HMC viewpoints and lighting. In this work, we propose a novel generative approach, Generative HMC (GenHMC), that leverages large unpaired HMC captures, which are much easier to collect, to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. We show that our method is able to properly disentangle the input conditioning signal that specifies facial expression and viewpoint, from facial appearance, leading to more accurate ground truth. Furthermore, our method can generalize to unseen identities, removing the reliance on the paired captures. We demonstrate these breakthroughs by both evaluating synthetic HMC images and universal face encoders trained from these new HMC-avatar correspondences, which achieve better data efficiency and state-of-the-art accuracy.

Updated: 2025-07-08 03:00:44

标题: 生成式头戴式摄像头捕获用于逼真化头像

摘要: 在虚拟和增强现实（VR/AR）中实现逼真的化身动画一直是一个挑战，因为获取面部真实状态的困难。通过头戴式摄像头（HMC）感知输入，从而获得同步图像在物理上是不可能的，这种输入具有红外部分观察结果，以及一系列外部-内部圆顶摄像头，这些摄像头具有与化身外观匹配的完整观察结果。先前依赖分析合成方法的工作可以生成准确的真实数据，但是在个性化训练中表达和风格之间的不完全分离。为同一主题依赖大量配对捕捉（HMC和圆顶）使得收集大规模数据集在操作上成本高昂，无法用于不同HMC视角和照明。在这项工作中，我们提出了一种新颖的生成方法，Generative HMC（GenHMC），利用大量未配对的HMC捕捉，这些捕捉更容易收集，以直接生成高质量的合成HMC图像，给定任何来自圆顶捕捉的条件化化身状态。我们展示了我们的方法能够正确分离指定面部表情和视角的输入条件信号，以及面部外观，从而导致更准确的真实数据。此外，我们的方法可以泛化到看不见的身份，消除了对配对捕捉的依赖。我们通过评估合成HMC图像和从这些新的HMC-化身对应关系训练的通用面部编码器来展示这些突破，这些编码器实现了更好的数据效率和最先进的准确性。

更新时间: 2025-07-08 03:00:44

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.05620v1

Detecting and Mitigating Reward Hacking in Reinforcement Learning Systems: A Comprehensive Empirical Study

Reward hacking in Reinforcement Learning (RL) systems poses a critical threat to the deployment of autonomous agents, where agents exploit flaws in reward functions to achieve high scores without fulfilling intended objectives. Despite growing awareness of this problem, systematic detection and mitigation approaches remain limited. This paper presents a large-scale empirical study of reward hacking across diverse RL environments and algorithms. We analyze 15,247 training episodes across 15 RL environments (Atari, MuJoCo, custom domains) and 5 algorithms (PPO, SAC, DQN, A3C, Rainbow), implementing automated detection algorithms for six categories of reward hacking: specification gaming, reward tampering, proxy optimization, objective misalignment, exploitation patterns, and wireheading. Our detection framework achieves 78.4% precision and 81.7% recall across environments, with computational overhead under 5%. Through controlled experiments varying reward function properties, we demonstrate that reward density and alignment with true objectives significantly impact hacking frequency ($p < 0.001$, Cohen's $d = 1.24$). We validate our approach through three simulated application studies representing recommendation systems, competitive gaming, and robotic control scenarios. Our mitigation techniques reduce hacking frequency by up to 54.6% in controlled scenarios, though we find these trade-offs are more challenging in practice due to concept drift, false positive costs, and adversarial adaptation. All detection algorithms, datasets, and experimental protocols are publicly available to support reproducible research in RL safety.

Updated: 2025-07-08 03:00:02

标题: 检测和缓解强化学习系统中的奖励欺骗：一项全面的实证研究

摘要: 在强化学习（RL）系统中的奖励欺骗对自主代理部署构成了重要威胁，代理利用奖励函数中的缺陷来获得高分数而不实现预期目标。尽管对这个问题的认识正在增加，但系统性的检测和缓解方法仍然有限。本文对不同RL环境和算法中奖励欺骗进行了大规模的实证研究。我们分析了15个RL环境（Atari、MuJoCo、自定义领域）和5种算法（PPO、SAC、DQN、A3C、Rainbow）中的15,247个训练回合，实施了六类奖励欺骗的自动检测算法：规范游戏、奖励篡改、代理优化、目标不对齐、利用模式和线路接头。我们的检测框架在各个环境中实现了78.4%的精度和81.7%的召回率，计算开销低于5%。通过改变奖励函数属性进行的控制实验，我们证明奖励密度和与真实目标的一致性显著影响欺骗频率（$p < 0.001$, Cohen's $d = 1.24$）。我们通过三个模拟应用研究验证了我们的方法，分别代表推荐系统、竞技游戏和机器人控制场景。我们的缓解技术在受控场景中将欺骗频率降低了高达54.6%，尽管我们发现这些权衡在实践中更具挑战性，因为概念漂移、误报成本和对抗性适应。所有检测算法、数据集和实验协议都可公开获取，以支持RL安全领域的可重复研究。

更新时间: 2025-07-08 03:00:02

领域: cs.LG,cs.SE

下载: http://arxiv.org/abs/2507.05619v1

A Theory for Conditional Generative Modeling on Multiple Data Sources

The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes the first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation based on the bracketing number. Our result shows that when source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and deep generative models including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that the number of sources and similarity among source distributions improve the advantage of multi-source training. Simulations and real-world experiments are conducted to validate the theory, with code available at: https://github.com/ML-GSAI/Multi-Source-GM.

Updated: 2025-07-08 02:55:32

标题: 多数据源条件生成建模的理论

摘要: 大型生成模型的成功推动了一种范式转变，利用大规模多源数据来增强模型能力。然而，这些数据源之间的交互在理论上仍未充分探讨。本文首次对条件生成建模中的多源训练进行严格分析，其中每个条件代表一个不同的数据源。具体来说，我们基于括号数建立了条件最大似然估计的平均总变差距离的一般分布估计误差界。我们的结果表明，当源分布具有某些相似性并且模型足够表达时，多源训练能够比单一源训练提供更加严格的界限。我们进一步将这一通用理论实例化到条件高斯估计和深度生成模型，包括自回归和灵活的基于能量的模型，通过表征它们的括号数。结果表明，源数量和源分布之间的相似性提高了多源训练的优势。我们进行了模拟和真实世界实验来验证理论，代码可在以下链接找到：https://github.com/ML-GSAI/Multi-Source-GM。

更新时间: 2025-07-08 02:55:32

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2502.14583v2

Activation Steering for Chain-of-Thought Compression

Large language models (LLMs) excel at complex reasoning when they include intermediate steps, known as "chains of thought" (CoTs). However, these rationales are often overly verbose, even for simple problems, leading to wasted context, increased latency, and higher energy consumption. We observe that verbose, English-heavy CoTs and concise, math-centric CoTs occupy distinct regions in the model's residual-stream activation space. By extracting and injecting a "steering vector" to transition between these modes, we can reliably shift generation toward more concise reasoning, effectively compressing CoTs without retraining. We formalize this approach as Activation-Steered Compression (ASC), an inference-time technique that shortens reasoning traces by directly modifying hidden representations. In addition, we provide a theoretical analysis of the impact of ASC on the output distribution, derived from a closed-form KL-divergence-bounded constraint to regulate steering strength. Using only 100 paired verbose and concise examples, ASC achieves up to 67.43% reduction in CoT length on MATH500 and GSM8K datasets, while maintaining accuracy across 7B, 8B, and 32B parameter models. As a training-free method, ASC introduces negligible runtime overhead and, on MATH500, delivers an average 2.73x speedup in end-to-end reasoning wall-clock time on an 8B model. This makes ASC a practical and efficient tool for streamlining the deployment of reasoning-capable LLMs in latency- or cost-sensitive settings. The code is available at: https://github.com/ArminAzizi98/ASC

Updated: 2025-07-08 02:54:20

标题: 思维链压缩的激活导向

摘要: 大型语言模型（LLMs）在包含中间步骤，即“思维链”（CoTs）时，擅长复杂推理。然而，这些原因通常过于冗长，即使对于简单问题也是如此，导致浪费上下文，增加延迟和能源消耗。我们观察到，冗长的、英语为主的 CoTs 和简洁的、以数学为中心的 CoTs 占据了模型残差流激活空间中的不同区域。通过提取和注入一个“转向向量”来在这些模式之间进行转换，我们可以可靠地将生成转向更简洁的推理，有效地压缩 CoTs 而无需重新训练。我们将这种方法形式化为“激活引导压缩”（ASC），这是一种在推理时间缩短推理轨迹的技术，通过直接修改隐藏表示来实现。此外，我们提供了对 ASC 对输出分布的影响的理论分析，该分析源自一个闭合形式的 KL 散度约束，以调节转向强度。仅使用 100 个冗长和简洁示例，ASC 在 MATH500 和 GSM8K 数据集上实现了高达 67.43% 的 CoT 长度缩短，同时在 7B、8B 和 32B 参数模型上保持准确性。作为一种无需训练的方法，ASC 引入了可忽略的运行时开销，并在 MATH500 上，对 8B 模型的端到端推理墙钟时间提供了平均 2.73 倍的加速。这使 ASC 成为在延迟或成本敏感的环境中简化推理能力 LLM 部署的实用且高效工具。该代码可在以下链接找到：https://github.com/ArminAzizi98/ASC.

更新时间: 2025-07-08 02:54:20

领域: cs.AI,cs.LG

下载: http://arxiv.org/abs/2507.04742v2

MOD-X: A Modular Open Decentralized eXchange Framework proposal for Heterogeneous Interoperable Artificial Intelligence Agents

As Artificial Intelligence systems evolve from monolithic models to ecosystems of specialized agents, the need for standardized communication protocols becomes increasingly critical. This paper introduces MOD-X (Modular Open Decentralized eXchange), a novel architectural framework proposal for agent interoperability that addresses key limitations of existing protocols. Unlike current approaches, MOD-X proposes a layered architecture with a Universal Message Bus, thorough state management, translation capabilities, and blockchain-based security mechanisms. We present MOD-X's architecture, compare it with existing protocols, and demonstrate its application through a worked example how it enables integration between heterogeneous specialist agents (agents with different architectures, vendors, capabilities, and knowledge representations--including rule-based systems, neural networks, symbolic reasoning engines, and legacy software with agent wrappers). MOD-X's key innovations include a publish-subscribe communication model, semantic capability discovery, and dynamic workflow orchestration--providing a framework that bridges theoretical formalism with practical implementation. This architecture addresses the growing need for truly decentralized, interoperable agent ecosystems that can scale effectively without the need for central coordination.

Updated: 2025-07-08 02:48:45

标题: MOD-X：用于异构可互操作人工智能代理的模块化开放式去中心化交换框架提案

摘要: 随着人工智能系统从单一模型演变为专门代理的生态系统，标准化通信协议的需求变得日益关键。本文介绍了MOD-X（模块化开放式分散式交换），这是一个为代理互操作性提出的新颖的架构框架提案，解决了现有协议的关键局限性。与当前方法不同，MOD-X提出了一个分层架构，具有通用消息总线、全面的状态管理、翻译能力和基于区块链的安全机制。我们介绍了MOD-X的架构，将其与现有协议进行了比较，并通过一个示例展示了它的应用，说明了它如何实现异构专家代理之间的集成（具有不同架构、供应商、能力和知识表示的代理，包括基于规则的系统、神经网络、符号推理引擎和带有代理包装器的传统软件）。MOD-X的关键创新包括发布-订阅通信模型、语义能力发现以及动态工作流编排，提供了一个框架，将理论形式主义与实际实现联系起来。这种架构解决了日益增长的对真正分散、互操作的代理生态系统的需求，这些系统可以在没有中央协调的情况下有效扩展。

更新时间: 2025-07-08 02:48:45

领域: cs.AI,cs.DC,cs.MA,cs.NI

下载: http://arxiv.org/abs/2507.04376v2

IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities, but are found struggling to tackle multi-shot scenarios,e.g., video clips with varying camera angles or scene changes. This challenge can render failures such as instance identity forgetting and key frame negligence. In this work, we first attribute the challenge to the lack of multi-shot annotations among existing datasets and therefore we introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts the multi-shot performance, while the testing benchmark provides a reliable measure of the model capability in multi-shot scenarios. By further analyzing and discovering that current models only encode instance features in a discrete or lossy manner, at the risk of missing identity information, we then contribute a new model IPFormer-VideoLLM. Its key idea is the injection of instance-level features as instance prompts through an efficient attention-based connector. This allows for the aggregation of instance-specific information across scenes. Experiments demonstrate that our proposed dataset and model not only enhance the multi-scene video understanding significantly, but also offer distinct advantages across various video benchmarks.

Updated: 2025-07-08 02:46:17

标题: IPFormer-VideoLLM：增强多模态视频理解以适应多镜头场景

摘要: 视频大型语言模型（VideoLLMs）表现出了非凡的理解能力，但发现在处理多镜头场景（例如，具有不同摄像机角度或场景变化的视频片段）时遇到困难。这种挑战可能导致失败，例如实例身份遗忘和关键帧忽视。在这项工作中，我们首先将这一挑战归因于现有数据集中缺乏多镜头注释，因此我们引入了一个新的数据集，名为MultiClip-Bench，其中包含密集描述和针对多镜头场景定制的基于指令的问答对。我们经验性地发现，训练集显著提升了多镜头性能，而测试基准提供了一种可靠的衡量模型在多镜头场景中能力的方法。通过进一步分析和发现当前模型只以离散或有损的方式编码实例特征，存在错过身份信息的风险，我们贡献了一个新的模型IPFormer-VideoLLM。其关键思想是通过高效的基于注意力的连接器将实例级特征注入为实例提示。这允许跨场景聚合实例特定信息。实验证明，我们提出的数据集和模型不仅显著增强了对多场景视频的理解，而且在各种视频基准测试中提供了明显的优势。

更新时间: 2025-07-08 02:46:17

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2506.21116v2

Domain adaptation of large language models for geotechnical applications

Recent developments in large language models (LLMs) are opening up new opportunities in geotechnical engineering and engineering geology. While general-purpose LLMs possess broad capabilities, effective application in geotechnics often requires domain-specific adaptation. Such tailored LLMs are increasingly employed to streamline geotechnical workflows. This paper presents the first survey of the adaptation and application of LLMs in geotechnical engineering. It outlines key methodologies for adaptation to geotechnical domain, including prompt engineering, retrieval-augmented generation, domain-adaptive pretraining, and fine-tuning. The survey examines the state-of-the-art applications of geotechnical-adapted LLMs, including geological interpretation, subsurface characterization, site planning, design calculations, numerical modeling, safety and risk assessment, and educational tutoring. It also analyzes benefits and limitations of geotechnical-adapted LLMs, and identifies promising directions for future research in this interdisciplinary discipline. The findings serve as a valuable resource for practitioners seeking to integrate LLMs into geotechnical practice, while also providing a foundation to stimulate further investigation within the academic community.

Updated: 2025-07-08 02:45:44

标题: 大型语言模型在岩土工程应用中的域适应

摘要: 最近大型语言模型（LLMs）的发展为岩土工程和工程地质领域开辟了新的机遇。虽然通用型LLMs具有广泛的能力，但在岩土工程中有效的应用通常需要特定领域的适应性。这种定制的LLMs越来越多地被用来简化岩土工程工作流程。本文介绍了LLMs在岩土工程中的适应和应用的首次调查。它概述了适应到岩土领域的关键方法，包括提示工程、检索增强生成、领域自适应预训练和微调。调查分析了岩土工程适应LLMs的最新应用，包括地质解释、地下特征描述、场地规划、设计计算、数值建模、安全和风险评估以及教育辅导。它还分析了岩土工程适应LLMs的优势和局限性，并确定了未来研究中值得探索的方向。这些发现为寻求将LLMs整合到岩土工程实践中的从业者提供了宝贵的资源，同时也为学术界进一步探讨提供了基础。

更新时间: 2025-07-08 02:45:44

领域: cs.AI

下载: http://arxiv.org/abs/2507.05613v1

Stacked conformal prediction

We consider a method for conformalizing a stacked ensemble of predictive models, showing that the potentially simple form of the meta-learner at the top of the stack enables a procedure with manageable computational cost that achieves approximate marginal validity without requiring the use of a separate calibration sample. Empirical results indicate that the method compares favorably to a standard inductive alternative.

Updated: 2025-07-08 02:34:36

标题: 叠加的一致性预测

摘要: 我们考虑一种将预测模型堆叠集合进行共形化的方法，结果显示在堆叠顶部的元学习器可能具有简单的形式，从而实现了具有可管理计算成本的过程，实现了近似边际有效性，而无需使用单独的校准样本。实证结果表明，该方法与标准的归纳替代方法相比具有优势。

更新时间: 2025-07-08 02:34:36

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2505.12578v3

ILP Techniques for Enhancing Branch and Bound MaxSAT Solvers

This paper investigates the impact of ILP techniques on BnB MaxSAT solvers, particularly ILP preprocessing techniques and various portfolio strategies. Experimental results demonstrate that ILP techniques enable WMaxCDCL-OpenWbo1200 and MaxCDCL-OpenWbo300, the best two solvers in the unweighted track of the MaxSAT evaluation 2024, to solve 27 and 30 additional instances, respectively. Furthermore, although state-of-the-art MaxSAT solvers heavily rely on an ILP solver in their portfolios, our proposed approach uses ILP preprocessing techniques to reduce this dependency. Allocating only a short runtime to the ILP solver within a portfolio that includes (W)MaxCDCL, as proposed in our approach, is sufficient to achieve strong results.

Updated: 2025-07-08 02:29:46

标题: ILP技术用于增强分支定界最大满足性问题求解器

摘要: 这篇论文研究了ILP技术对BnB MaxSAT求解器的影响，特别是ILP预处理技术和各种投资组合策略。实验结果表明，ILP技术使得WMaxCDCL-OpenWbo1200和MaxCDCL-OpenWbo300这两个在MaxSAT评估2024年无权重赛道中表现最好的求解器，分别解决了27个和30个额外的实例。此外，尽管当前最先进的MaxSAT求解器在它们的投资组合中严重依赖于ILP求解器，我们提出的方法使用ILP预处理技术来减少这种依赖性。在我们提出的方法中，将ILP求解器的运行时间分配给包含（W）MaxCDCL在内的投资组合，就足以取得良好的结果。

更新时间: 2025-07-08 02:29:46

领域: cs.AI

下载: http://arxiv.org/abs/2506.06216v2

Privacy-preserving Machine Learning in Internet of Vehicle Applications: Fundamentals, Recent Advances, and Future Direction

Machine learning (ML) in Internet of Vehicles (IoV) applications enhanced intelligent transportation, autonomous driving capabilities, and various connected services within a large, heterogeneous network. However, the increased connectivity and massive data exchange for ML applications introduce significant privacy challenges. Privacy-preserving machine learning (PPML) offers potential solutions to address these challenges by preserving privacy at various stages of the ML pipeline. Despite the rapid development of ML-based IoV applications and the growing data privacy concerns, there are limited comprehensive studies on the adoption of PPML within this domain. Therefore, this study provides a comprehensive review of the fundamentals, recent advancements, and the challenges of integrating PPML into IoV applications. We first review existing surveys of various PPML techniques and their integration into IoV across different scopes. We then categorize IoV applications into three key domains and analyze the privacy challenges in leveraging ML in these application domains. Building on these fundamentals, we review recent advancements in integrating various PPML techniques within IoV applications, discussing their frameworks, key features, and performance in terms of privacy, utility, and efficiency. Finally, we identify current challenges and propose future research directions to enhance privacy and reliability in IoV applications.

Updated: 2025-07-08 02:27:09

标题: 隐私保护机器学习在车联网应用中的应用：基础知识、最新进展和未来方向

摘要: 机器学习（ML）在物联网汽车（IoV）应用中增强了智能交通、自动驾驶能力以及大型异构网络内的各种连接服务。然而，为ML应用增加的连通性和大规模数据交换引入了显著的隐私挑战。隐私保护机器学习（PPML）通过在ML管道的各个阶段保护隐私，提供了解决这些挑战的潜在解决方案。尽管基于ML的IoV应用迅速发展并且数据隐私关切逐渐增加，但在该领域内采用PPML的综合研究有限。因此，本研究全面回顾了将PPML整合到IoV应用中的基础知识、最新进展和挑战。我们首先回顾了各种PPML技术的现有调查及其在不同范围内整合到IoV中的情况。然后将IoV应用分类为三个关键领域，并分析在这些应用领域中利用ML所面临的隐私挑战。在这些基础上，我们回顾了最近在IoV应用中整合各种PPML技术的最新进展，讨论了它们的框架、关键特征以及在隐私、效用和效率方面的表现。最后，我们确定了当前的挑战，并提出了未来研究方向，以增强IoV应用中的隐私和可靠性。

更新时间: 2025-07-08 02:27:09

领域: cs.CR

下载: http://arxiv.org/abs/2503.01089v2

Domain Generalizable Portrait Style Transfer

This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter, with which we obtain a warped reference semantically aligned with the input. To ensure effective yet controllable style transfer, we devise an AdaIN-Wavelet transform to balance content preservation and stylization by blending low-frequency information of the warped reference with high-frequency information of the input in the latent space. A style adapter is also designed to provide style guidance from the warped reference. With the stylized latent from AdaIN-Wavelet transform, we employ a dual-conditional diffusion model that integrates a ControlNet recording high-frequency information and the style guidance to generate the final result. Extensive experiments demonstrate the superiority of our method. Our code and trained model are available at https://github.com/wangxb29/DGPST.

Updated: 2025-07-08 02:18:16

标题: 领域通用的画像风格转移

摘要: 这篇论文提出了一种人像风格转移方法，能够在各种不同领域中得到很好的泛化，同时实现了包括头发、眼睛、睫毛、皮肤、嘴唇和背景在内的高质量语义对齐的风格化。为此，我们提出了一种基于预训练模型和语义适配器的方法，在给定输入和参考人像之间建立密集的语义对应关系，从而获得一个与输入语义对齐的变形参考。为了确保有效而可控的风格转移，我们设计了一种AdaIN-Wavelet转换，通过在潜空间中将变形参考的低频信息与输入的高频信息进行融合，以平衡内容保留和风格化。还设计了一个风格适配器，从变形参考提供风格指导。通过AdaIN-Wavelet转换得到的风格化潜空间，我们使用了一个双条件扩散模型，将记录高频信息的ControlNet和风格指导整合起来，生成最终结果。大量实验证明了我们方法的优越性。我们的代码和训练模型可在https://github.com/wangxb29/DGPST 上获得。

更新时间: 2025-07-08 02:18:16

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2507.04243v2

Self-Review Framework for Enhancing Instruction Following Capability of LLM

Various techniques have been proposed to improve large language models (LLMs) adherence to formatting and instruction constraints. One of the most effective approaches involves utilizing high-quality data generated by powerful models. However, such models often fail to fully comply with complex instructions in a single generation. To address this limitation, iterative revision methods have been introduced. Nevertheless, as the number of data points and revision iterations increases, the associated monetary costs grow significantly. As a resource-efficient alternative, methods have been proposed that leverage high-performance evaluation tools to compensate for the limited self-evaluation capabilities of open-source LLMs. However, these approaches often lead to a degradation in output quality due to excessive revision. To overcome these challenges, we propose Re5, a self-evaluation and revision framework designed to enhance instruction-following performance while preserving the quality of the generated content. Re5 extracts task and constraint components from user instructions, performs structural evaluations to prevent error accumulation, and applies fine-grained constraint-specific content evaluations followed by selective revisions. This process ensures precise and quality-preserving improvements. The final high-quality outputs are used for alignment tuning, enabling long-term alignment improvements through a data-centric iterative refinement loop. Experimental results demonstrate that Re5 achieves instruction-following performance comparable to models trained on data generated by GPT-4o-mini, a high-performance model, even with a small amount of data while maintaining response quality with a 64.24%-win rate over the non-revised initial responses. These results validate Re5 as an efficient and effective solution for enhancing instruction adherence with minimal external supervision.

Updated: 2025-07-08 02:17:18

标题: 自我评估框架，用于增强LLM的指导能力。

摘要: 已经提出了各种技术来改进大型语言模型（LLMs）对格式和指令约束的遵循。其中最有效的方法之一涉及利用由强大模型生成的高质量数据。然而，这些模型通常无法在单次生成中完全遵守复杂的指令。为解决这一限制，引入了迭代修订方法。然而，随着数据点数量和修订迭代次数的增加，相关的经济成本显著增加。作为一种资源高效的替代方法，已经提出了利用高性能评估工具的方法，以补偿开源LLMs有限的自我评估能力。然而，这些方法往往会由于过度修订而导致输出质量下降。为了克服这些挑战，我们提出了Re5，这是一个自我评估和修订框架，旨在提高指令遵循性能，同时保留生成内容的质量。Re5从用户指令中提取任务和约束组件，执行结构评估以防止错误累积，并应用细粒度的约束特定内容评估，然后进行选择性修订。这个过程确保了精确和保持质量的改进。最终的高质量输出用于对齐调整，通过数据中心的迭代精炼循环实现长期对齐改进。实验结果表明，Re5实现了类似于由高性能模型GPT-4o-mini生成的数据训练模型的指令遵循性能，即使只有少量数据，也保持了64.24%的胜率，超过了未修订初始响应。这些结果验证了Re5作为一种有效且高效的解决方案，可以在最少外部监督的情况下增强指令遵循性。

更新时间: 2025-07-08 02:17:18

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2507.05598v1

Efficient Detection of Intermittent Job Failures Using Few-Shot Learning

One of the main challenges developers face in the use of continuous integration (CI) and deployment pipelines is the occurrence of intermittent job failures, which result from unexpected non-deterministic issues (e.g., flaky tests or infrastructure problems) rather than regular code-related errors such as bugs. Prior studies developed machine learning (ML) models trained on large datasets of job logs to classify job failures as either intermittent or regular. As an alternative to costly manual labeling of large datasets, the state-of-the-art (SOTA) approach leveraged a heuristic based on non-deterministic job reruns. However, this method mislabels intermittent job failures as regular in contexts where rerunning suspicious job failures is not an explicit policy, and therefore limits the SOTA's performance in practice. In fact, our manual analysis of 2,125 job failures from 5 industrial and 1 open-source projects reveals that, on average, 32% of intermittent job failures are mislabeled as regular. To address these limitations, this paper introduces a novel approach to intermittent job failure detection using few-shot learning (FSL). Specifically, we fine-tune a small language model using a few number of manually labeled log examples to generate rich embeddings, which are then used to train an ML classifier. Our FSL-based approach achieves 70-88% F1-score with only 12 shots in all projects, outperforming the SOTA, which proved ineffective (34-52% F1-score) in 4 projects. Overall, this study underlines the importance of data quality over quantity and provides a more efficient and practical framework for the detection of intermittent job failures in organizations.

Updated: 2025-07-08 02:06:50

标题: Few-Shot Learning在高效检测间歇性作业故障中的应用

摘要: 开发人员在使用持续集成（CI）和部署流水线时面临的主要挑战之一是间歇性作业失败的发生，这是由于意外的非确定性问题（例如，不稳定的测试或基础设施问题）而不是常规的与代码相关的错误（如错误）导致的。先前的研究开发了机器学习（ML）模型，这些模型在大型作业日志数据集上进行训练，以将作业失败分类为间歇性或常规。作为大型数据集的昂贵手动标记的替代方法，最先进的方法利用了基于非确定性作业重新运行的启发式方法。然而，在不明确规定重新运行可疑作业失败的情况下，这种方法会将间歇性作业故障错误地标记为常规，从而在实践中限制了最先进技术的性能。事实上，我们对来自5个工业项目和1个开源项目的2,125个作业失败进行了手动分析，发现平均有32%的间歇性作业失败被错误地标记为常规。为解决这些限制，本文介绍了一种使用少样本学习（FSL）检测间歇性作业失败的新方法。具体来说，我们使用少量手动标记的日志示例微调一个小语言模型，以生成丰富的嵌入，然后用于训练一个ML分类器。我们基于FSL的方法在所有项目中仅进行了12次试验，就实现了70-88%的F1分数，优于先进技术，在4个项目中证明无效（34-52%的F1分数）。总的来说，这项研究强调数据质量胜过数量的重要性，并提供了一个更高效、更实用的框架，用于组织中检测间歇性作业失败。

更新时间: 2025-07-08 02:06:50

领域: cs.SE,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.04173v2

MLlm-DR: Towards Explainable Depression Recognition with MultiModal Large Language Models

Automated depression diagnosis aims to analyze multimodal information from interview videos to predict participants' depression scores. Previous studies often lack clear explanations of how these scores were determined, limiting their adoption in clinical practice. While the advent of LLMs provides a possible pathway for explainable depression diagnosis, current LLMs capable of processing multimodal data lack training on interview data, resulting in poor diagnostic performance when used directly. In this paper, we propose a novel multimodal large language model (MLlm-DR) that can understand multimodal information inputs and supports explainable depression diagnosis. MLlm-DR integrates a smaller LLMs and a lightweight query module (LQ-former). Specifically, the smaller LLMs is designed to generate depression scores and corresponding evaluation rationales. To enhance its logical reasoning for domain-specific tasks while maintaining practicality, we constructed a robust training dataset to fine-tune it. Meanwhile, the LQ-former captures depression-related features from speech and visual data, aiding the model's ability to process multimodal information, to achieve comprehensive depression diagnosis. Our approach achieves state-of-the-art results on two interview-based benchmark datasets, CMDC and E-DAIC-WOZ, demonstrating its effectiveness and superiority.

Updated: 2025-07-08 01:56:39

标题: MLlm-DR：朝着具有多模态大型语言模型的可解释抑郁症识别方向

摘要: 自动抑郁症诊断旨在分析访谈视频中的多模态信息，以预测参与者的抑郁症分数。先前的研究往往缺乏清晰说明这些分数是如何确定的，从而限制了它们在临床实践中的应用。虽然LLMs的出现为可解释的抑郁症诊断提供了可能的途径，但目前能够处理多模态数据的LLMs在训练访谈数据方面缺乏，直接使用时的诊断性能较差。在本文中，我们提出了一种新颖的多模态大语言模型（MLlm-DR），它可以理解多模态信息输入并支持可解释的抑郁症诊断。MLlm-DR集成了较小的LLMs和轻量级查询模块（LQ-former）。具体而言，较小的LLMs被设计用于生成抑郁症分数和相应的评估理由。为了增强其在特定领域任务的逻辑推理能力，同时保持实用性，我们构建了一个强大的训练数据集对其进行微调。与此同时，LQ-former从语音和视觉数据中捕获与抑郁症相关的特征，帮助模型处理多模态信息，实现全面的抑郁症诊断。我们的方法在两个基于访谈的基准数据集CMDC和E-DAIC-WOZ上取得了最先进的结果，证明了其有效性和优越性。

更新时间: 2025-07-08 01:56:39

领域: cs.AI

下载: http://arxiv.org/abs/2507.05591v1

Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation

Language diversity presents a significant challenge in speech-to-text (S2T) tasks, such as automatic speech recognition and translation. Traditional multi-lingual multi-task training approaches aim to address this by jointly optimising multiple speech recognition and translation tasks across various languages. While models like Whisper, built on these strategies, demonstrate strong performance, they still face issues of high computational cost, language interference, suboptimal training configurations, and limited extensibility. To overcome these challenges, we introduce LoRS-Merging (low-rank and sparse model merging), a novel technique designed to efficiently integrate models trained on different languages or tasks while preserving performance and reducing computational overhead. LoRS-Merging combines low-rank and sparse pruning to retain essential structures while eliminating redundant parameters, mitigating language interference, and enhancing extensibility. Experimental results across 10 languages demonstrate that LoRS-Merging significantly outperforms multi-lingual multi-task training, sequential training, and other merging methods, achieving over 20% improvement in normalised performance. Our findings suggest that model merging, particularly LoRS-Merging, is a scalable and effective complement to traditional multi-lingual training strategies for S2T applications.

Updated: 2025-07-08 01:55:33

标题: 低秩和稀疏模型合并用于多语言语音识别和翻译

摘要: 语言多样性在语音到文本（S2T）任务中提出了重要挑战，例如自动语音识别和翻译。传统的多语言多任务训练方法旨在通过同时优化跨不同语言的多个语音识别和翻译任务来解决这一问题。虽然基于这些策略构建的Whisper等模型表现出较强的性能，但仍然面临计算成本高、语言干扰、次优的训练配置和有限的可扩展性等问题。为了克服这些挑战，我们引入了LoRS-Merging（低秩和稀疏模型合并），这是一种旨在有效地整合在不同语言或任务上训练的模型的新技术，同时保持性能并减少计算开销。LoRS-Merging结合了低秩和稀疏剪枝，保留了基本结构，同时消除了冗余参数，减轻了语言干扰，并增强了可扩展性。对10种语言的实验结果表明，LoRS-Merging在多语言多任务训练、顺序训练和其他合并方法方面表现出色，性能提高超过20%。我们的研究结果表明，模型合并，特别是LoRS-Merging，是传统S2T应用中多语言训练策略的可扩展和有效的补充。

更新时间: 2025-07-08 01:55:33

领域: cs.SD,cs.AI,cs.CL,eess.AS

下载: http://arxiv.org/abs/2502.17380v3

Towards Measurement Theory for Artificial Intelligence

We motivate and outline a programme for a formal theory of measurement of artificial intelligence. We argue that formalising measurement for AI will allow researchers, practitioners, and regulators to: (i) make comparisons between systems and the evaluation methods applied to them; (ii) connect frontier AI evaluations with established quantitative risk analysis techniques drawn from engineering and safety science; and (iii) foreground how what counts as AI capability is contingent upon the measurement operations and scales we elect to use. We sketch a layered measurement stack, distinguish direct from indirect observables, and signpost how these ingredients provide a pathway toward a unified, calibratable taxonomy of AI phenomena.

Updated: 2025-07-08 01:52:37

标题: 《人工智能的测量理论探索》

摘要: 我们提出并概述了一个关于人工智能测量的正式理论计划。我们认为，为人工智能形式化测量将使研究人员、从业者和监管者能够：(i) 对系统和应用于它们的评估方法进行比较；(ii) 将前沿人工智能评估与工程和安全科学中的建立的定量风险分析技术相连接；以及(iii) 强调AI能力的计量操作和比例对我们选择使用的度量操作和比例的依赖性。我们勾勒了一个分层的测量体系，区分了直接和间接可观测量，并指出这些要素如何提供一条通向统一、可校准的AI现象分类法的路径。

更新时间: 2025-07-08 01:52:37

领域: cs.AI

下载: http://arxiv.org/abs/2507.05587v1

The Fourier Spectral Transformer Networks For Efficient and Generalizable Nonlinear PDEs Prediction

In this work we propose a unified Fourier Spectral Transformer network that integrates the strengths of classical spectral methods and attention based neural architectures. By transforming the original PDEs into spectral ordinary differential equations, we use high precision numerical solvers to generate training data and use a Transformer network to model the evolution of the spectral coefficients. We demonstrate the effectiveness of our approach on the two dimensional incompressible Navier-Stokes equations and the one dimensional Burgers' equation. The results show that our spectral Transformer can achieve highly accurate long term predictions even with limited training data, better than traditional numerical methods and machine learning methods in forecasting future flow dynamics. The proposed framework generalizes well to unseen data, bringing a promising paradigm for real time prediction and control of complex dynamical systems.

Updated: 2025-07-08 01:43:33

标题: 傅立叶频谱变换网络用于高效和可推广的非线性偏微分方程预测

摘要: 在这项工作中，我们提出了一个统一的傅立叶谱变换器网络，将经典谱方法和基于注意力的神经架构的优势整合在一起。通过将原始PDE转换为谱常微分方程，我们使用高精度数值求解器生成训练数据，并使用Transformer网络来建模谱系数的演化。我们在二维不可压缩Navier-Stokes方程和一维Burgers方程上展示了我们方法的有效性。结果表明，我们的谱Transformer即使在有限的训练数据下也能实现高精度的长期预测，比传统数值方法和机器学习方法更好地预测未来流体动力学。所提出的框架对未见数据具有很好的泛化能力，为复杂动态系统的实时预测和控制带来了一个有前途的范式。

更新时间: 2025-07-08 01:43:33

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2507.05584v1

Ethical AI for Young Digital Citizens: A Call to Action on Privacy Governance

The rapid expansion of Artificial Intelligence (AI) in digital platforms used by youth has created significant challenges related to privacy, autonomy, and data protection. While AI-driven personalization offers enhanced user experiences, it often operates without clear ethical boundaries, leaving young users vulnerable to data exploitation and algorithmic biases. This paper presents a call to action for ethical AI governance, advocating for a structured framework that ensures youth-centred privacy protections, transparent data practices, and regulatory oversight. We outline key areas requiring urgent intervention, including algorithmic transparency, privacy education, parental data-sharing ethics, and accountability measures. Through this approach, we seek to empower youth with greater control over their digital identities and propose actionable strategies for policymakers, AI developers, and educators to build a fairer and more accountable AI ecosystem.

Updated: 2025-07-08 01:43:24

标题: 年轻数字公民的道德人工智能：隐私治理行动的号召

摘要: 人工智能（AI）在青少年使用的数字平台中的快速扩张已经造成了与隐私、自主权和数据保护相关的重大挑战。虽然基于AI的个性化提供了更好的用户体验，但通常在没有明确的道德边界的情况下运作，使年轻用户容易受到数据利用和算法偏见的影响。本文呼吁进行道德AI治理，倡导建立一个结构化框架，确保以青少年为中心的隐私保护、透明的数据实践和监管监督。我们概述了需要紧急干预的关键领域，包括算法透明度、隐私教育、家长数据共享道德以及问责措施。通过这种方法，我们希望赋予青少年对他们的数字身份更大的控制权，并提出了可操作的策略，以建立一个更加公平和负责任的AI生态系统，供政策制定者、AI开发者和教育者参考。

更新时间: 2025-07-08 01:43:24

领域: cs.CY,cs.AI,cs.LG

下载: http://arxiv.org/abs/2503.11947v2

Model-free Optical Processors using In Situ Reinforcement Learning with Proximal Policy Optimization

Optical computing holds promise for high-speed, energy-efficient information processing, with diffractive optical networks emerging as a flexible platform for implementing task-specific transformations. A challenge, however, is the effective optimization and alignment of the diffractive layers, which is hindered by the difficulty of accurately modeling physical systems with their inherent hardware imperfections, noise, and misalignments. While existing in situ optimization methods offer the advantage of direct training on the physical system without explicit system modeling, they are often limited by slow convergence and unstable performance due to inefficient use of limited measurement data. Here, we introduce a model-free reinforcement learning approach utilizing Proximal Policy Optimization (PPO) for the in situ training of diffractive optical processors. PPO efficiently reuses in situ measurement data and constrains policy updates to ensure more stable and faster convergence. We experimentally validated our method across a range of in situ learning tasks, including targeted energy focusing through a random diffuser, holographic image generation, aberration correction, and optical image classification, demonstrating in each task better convergence and performance. Our strategy operates directly on the physical system and naturally accounts for unknown real-world imperfections, eliminating the need for prior system knowledge or modeling. By enabling faster and more accurate training under realistic experimental constraints, this in situ reinforcement learning approach could offer a scalable framework for various optical and physical systems governed by complex, feedback-driven dynamics.

Updated: 2025-07-08 01:39:36

标题: 无模型光学处理器利用现场强化学习和近端策略优化

摘要: 光计算具有高速、高效的信息处理潜力，其中衍射光网络作为实施特定任务转换的灵活平台逐渐显现出来。然而，一个挑战是有效优化和对齐衍射层，这受到准确建模物理系统的困难的限制，因为这些系统具有固有的硬件缺陷、噪声和错位。虽然现有的原位优化方法具有直接在物理系统上进行训练而无需显式系统建模的优势，但由于不充分利用有限的测量数据而导致收敛速度慢和性能不稳定。在这里，我们介绍了一种利用Proximal Policy Optimization（PPO）进行的无模型强化学习方法，用于原位训练衍射光处理器。PPO有效地重复利用原位测量数据，并限制策略更新，以确保更稳定和更快的收敛。我们在一系列原位学习任务中实验证明了我们的方法，包括通过随机衍射器进行能量聚焦、全息图像生成、像差校正和光学图像分类，在每项任务中均表现出更好的收敛和性能。我们的策略直接在物理系统上运行，并自然考虑未知的现实世界缺陷，消除了对先前系统知识或建模的需求。通过在现实实验约束条件下实现更快、更准确的训练，这种原位强化学习方法可能为受复杂反馈驱动动态控制的各种光学和物理系统提供可扩展的框架。

更新时间: 2025-07-08 01:39:36

领域: cs.LG,cs.NE,physics.app-ph,physics.optics

下载: http://arxiv.org/abs/2507.05583v1

Visual Adaptive Prompting for Compositional Zero-Shot Learning

Vision-Language Models (VLMs) have demonstrated impressive multimodal capabilities in learning joint representations of visual and textual data, making them powerful tools for tasks such as Compositional Zero-Shot Learning (CZSL). CZSL requires models to generalize to novel combinations of visual primitives--such as attributes and objects--that were not explicitly encountered during training. Recent works in prompting for CZSL have focused on modifying inputs for the text encoder, often using static prompts that do not change across varying visual contexts. However, these approaches struggle to fully capture varying visual contexts, as they focus on text adaptation rather than leveraging visual features for compositional reasoning. To address this, we propose a Visual Adaptive Prompting System (VAPS) that leverages a learnable visual prompt repository and similarity-based retrieval mechanism within the framework of VLMs to bridge the gap between semantic and visual features. Our method introduces a dynamic visual prompt repository mechanism that selects the most relevant attribute and object prompts based on the visual features of the image. Our proposed system includes a visual prompt adapter that encourages the model to learn a more generalizable embedding space. Experiments on three CZSL benchmarks, across both closed and open-world scenarios, demonstrate state-of-the-art results.

Updated: 2025-07-08 01:38:49

标题: 视觉自适应提示对于组合式零样本学习的研究

摘要: 视觉语言模型（VLMs）已经展示出在学习视觉和文本数据的联合表示方面具有令人印象深刻的多模态能力，使它们成为诸如构成式零样本学习（CZSL）等任务的强大工具。CZSL要求模型泛化到在训练过程中未显式遇到的视觉原语的新组合，例如属性和对象。最近关于CZSL提示的研究集中在修改文本编码器的输入上，通常使用在不同视觉上下文中不变的静态提示。然而，这些方法很难完全捕捉不同的视觉上下文，因为它们侧重于文本适应而不是利用视觉特征进行构成式推理。为了解决这个问题，我们提出了一个视觉自适应提示系统（VAPS），它利用可学习的视觉提示存储库和在VLM框架内的基于相似性的检索机制来弥合语义和视觉特征之间的差距。我们的方法引入了一个动态视觉提示存储库机制，基于图像的视觉特征选择最相关的属性和对象提示。我们提出的系统包括一个视觉提示适配器，鼓励模型学习更具泛化性的嵌入空间。在三个CZSL基准测试中的实验，涵盖封闭和开放世界场景，展示出了最先进的结果。

更新时间: 2025-07-08 01:38:49

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2502.20292v5

Unifying Explainable Anomaly Detection and Root Cause Analysis in Dynamical Systems

Dynamical systems, prevalent in various scientific and engineering domains, are susceptible to anomalies that can significantly impact their performance and reliability. This paper addresses the critical challenges of anomaly detection, root cause localization, and anomaly type classification in dynamical systems governed by ordinary differential equations (ODEs). We define two categories of anomalies: cyber anomalies, which propagate through interconnected variables, and measurement anomalies, which remain localized to individual variables. To address these challenges, we propose the Interpretable Causality Ordinary Differential Equation (ICODE) Networks, a model-intrinsic explainable learning framework. ICODE leverages Neural ODEs for anomaly detection while employing causality inference through an explanation channel to perform root cause analysis (RCA), elucidating why specific time periods are flagged as anomalous. ICODE is designed to simultaneously perform anomaly detection, RCA, and anomaly type classification within a single, interpretable framework. Our approach is grounded in the hypothesis that anomalies alter the underlying ODEs of the system, manifesting as changes in causal relationships between variables. We provide a theoretical analysis of how perturbations in learned model parameters can be utilized to identify anomalies and their root causes in time series data. Comprehensive experimental evaluations demonstrate the efficacy of ICODE across various dynamical systems, showcasing its ability to accurately detect anomalies, classify their types, and pinpoint their origins.

Updated: 2025-07-08 01:30:48

标题: 在动态系统中统一可解释异常检测和根本原因分析

摘要: 动力系统在各种科学和工程领域中普遍存在，容易受到异常的影响，这可能会显著影响其性能和可靠性。本文探讨了在由常微分方程（ODEs）控制的动力系统中异常检测、根本原因定位和异常类型分类的关键挑战。我们定义了两类异常：网络异常，通过互连变量传播，和测量异常，局限于单个变量。为了解决这些挑战，我们提出了一种内在可解释性学习框架——可解释性因果常微分方程（ICODE）网络。ICODE利用神经ODE进行异常检测，同时通过解释通道进行因果推断来进行根本原因分析（RCA），阐明为什么特定时间段被标记为异常。ICODE旨在在一个单一、可解释的框架内同时进行异常检测、RCA和异常类型分类。我们的方法基于假设，即异常改变了系统的基础ODEs，表现为变量之间因果关系的变化。我们提供了理论分析，说明学习模型参数中的扰动可用于识别时间序列数据中的异常及其根本原因。全面的实验评估证明了ICODE在各种动力系统中的有效性，展示了其准确检测异常、分类类型并定位根源的能力。

更新时间: 2025-07-08 01:30:48

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2502.12086v2

The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they also exhibit memorization of their training data. This phenomenon raises critical questions about model behavior, privacy risks, and the boundary between learning and memorization. Addressing these concerns, this paper synthesizes recent studies and investigates the landscape of memorization, the factors influencing it, and methods for its detection and mitigation. We explore key drivers, including training data duplication, training dynamics, and fine-tuning procedures that influence data memorization. In addition, we examine methodologies such as prefix-based extraction, membership inference, and adversarial prompting, assessing their effectiveness in detecting and measuring memorized content. Beyond technical analysis, we also explore the broader implications of memorization, including the legal and ethical implications. Finally, we discuss mitigation strategies, including data cleaning, differential privacy, and post-training unlearning, while highlighting open challenges in balancing the minimization of harmful memorization with utility. This paper provides a comprehensive overview of the current state of research on LLM memorization across technical, privacy, and performance dimensions, identifying critical directions for future work.

Updated: 2025-07-08 01:30:46

标题: LLMs中的记忆景观：机制、测量和缓解

摘要: 大型语言模型（LLMs）已经展示出在各种任务中的显著能力，但它们也表现出对训练数据的记忆。这种现象引发了关于模型行为、隐私风险以及学习和记忆之间的界限的关键问题。为了解决这些问题，本文综合了最近的研究，并调查了记忆的景观、影响它的因素以及检测和减轻方法。我们探讨了包括训练数据重复、训练动态和微调程序在内的关键驱动因素，这些因素影响了数据的记忆。此外，我们还研究了基于前缀的提取、成员推断和对抗提示等方法，评估它们在检测和衡量记忆内容方面的有效性。除了技术分析，我们还探讨了记忆的更广泛影响，包括法律和道德影响。最后，我们讨论了减轻策略，包括数据清理、差分隐私和训练后的遗忘，同时强调在平衡减少有害记忆和实用性方面所面临的挑战。本文全面概述了目前关于LLM记忆研究的技术、隐私和性能维度，确定了未来工作的关键方向。

更新时间: 2025-07-08 01:30:46

领域: cs.LG,cs.CL,cs.CR

下载: http://arxiv.org/abs/2507.05578v1

Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models

Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM's internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.

Updated: 2025-07-08 01:29:52

标题: 特征提取和导向，用于增强语言模型中的思维链推理

摘要: 大型语言模型(LLMs)展示了使用Chain-of-Thought (CoT)技术解决推理和数学问题的能力。扩展CoT长度，如在DeepSeek-R1等模型中所见，显著增强了对复杂问题的推理能力，但需要昂贵且高质量的长CoT数据和精细调整。受DeepSeek-R1深度思维范式的启发，本研究利用一种引导技术来增强LLM的推理能力，而无需外部数据集。我们的方法首先利用稀疏自动编码器(SAEs)从普通CoT中提取可解释特征。然后利用这些特征在生成过程中引导LLM的内部状态。鉴于许多LLMs没有相应的预训练SAEs，我们进一步引入一种新颖的无SAE引导算法，直接从LLM的剩余激活中计算引导方向，避免了对显式SAE的需求。实验结果表明，我们基于SAE的以及后续的无SAE引导算法显著增强了LLMs的推理能力。

更新时间: 2025-07-08 01:29:52

领域: cs.CL,cs.LG

下载: http://arxiv.org/abs/2505.15634v3

Offline Learning and Forgetting for Reasoning with Large Language Models

Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown reasoning benchmarks show that, replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.

Updated: 2025-07-08 01:26:04

标题: 大型语言模型的离线学习和遗忘用于推理

摘要: 在大型语言模型中利用推理时搜索已被证明可以进一步增强已训练模型解决复杂数学和推理问题的能力。然而，这种方法显著增加了计算成本和推理时间，因为模型必须生成和评估多个候选解以识别可行的推理路径。为了解决这个问题，我们提出了一种有效的方法，通过对来自各种搜索方法的未配对成功（学习）和失败推理路径（遗忘）进行微调，直接将搜索能力整合到模型中。我们发现的一个关键挑战是，简单的微调可能会降低模型的搜索能力；我们展示这可以通过更小的学习率来缓解。在具有挑战性的“24点”和“倒计时”推理基准测试上进行了大量实验，结果显示，用搜索生成的数据替代CoT生成的数据进行离线微调，可以将成功率提高约23%，同时将推理时间缩短180倍。此外，我们的学习和遗忘目标一直优于监督微调和基于偏好的方法。

更新时间: 2025-07-08 01:26:04

领域: cs.LG,cs.AI,cs.CL

下载: http://arxiv.org/abs/2504.11364v3

Beyond Retrieval: Ensembling Cross-Encoders and GPT Rerankers with LLMs for Biomedical QA

Biomedical semantic question answering rooted in information retrieval can play a crucial role in keeping up to date with vast, rapidly evolving and ever-growing biomedical literature. A robust system can help researchers, healthcare professionals and even layman users access relevant knowledge grounded in evidence. The BioASQ 2025 Task13b Challenge serves as an important benchmark, offering a competitive platform for advancement of this space. This paper presents the methodologies and results from our participation in this challenge where we built a Retrieval-Augmented Generation (RAG) system that can answer biomedical questions by retrieving relevant PubMed documents and snippets to generate answers. For the retrieval task, we generated dense embeddings from biomedical articles for initial retrieval, and applied an ensemble of finetuned cross-encoders and large language models (LLMs) for re-ranking to identify top relevant documents. Our solution achieved an MAP@10 of 0.1581, placing 10th on the leaderboard for the retrieval task. For answer generation, we employed few-shot prompting of instruction-tuned LLMs. Our system achieved macro-F1 score of 0.95 for yes/no questions (rank 12), Mean Reciprocal Rank (MRR) of 0.64 for factoid questions (rank 1), mean-F1 score of 0.63 for list questions (rank 5), and ROUGE-SU4 F1 score of 0.29 for ideal answers (rank 11).

Updated: 2025-07-08 01:25:06

标题: 超越检索：利用LLMs集成交叉编码器和GPT重新排序器进行生物医学问答

摘要: 基于信息检索的生物医学语义问答可以在保持与庞大、快速发展和不断增长的生物医学文献同步方面发挥关键作用。一个强大的系统可以帮助研究人员、医疗保健专业人员甚至普通用户获取以证据为基础的相关知识。BioASQ 2025 Task13b挑战作为一个重要的基准，为这一领域的进展提供了一个竞争平台。本文介绍了我们参与此挑战的方法和结果，我们构建了一个检索增强生成（RAG）系统，可以通过检索相关的PubMed文献和片段来生成答案回答生物医学问题。对于检索任务，我们从生物医学文章中生成了密集嵌入，用于初始检索，并应用了一组微调的交叉编码器和大型语言模型（LLMs）进行重新排序，以识别前十个相关文档。我们的解决方案在检索任务的排行榜上取得了0.1581的MAP@10，排名第十。对于生成答案，我们采用了少量提示的指令调整LLMs。我们的系统在是/否问题的宏F1得分为0.95（排名第12），事实问答的平均倒数排名为0.64（排名第1），列表问题的平均F1得分为0.63（排名第5），理想答案的ROUGE-SU4 F1得分为0.29（排名第11）。

更新时间: 2025-07-08 01:25:06

领域: cs.IR,cs.CL,cs.LG

下载: http://arxiv.org/abs/2507.05577v1

iThermTroj: Exploiting Intermittent Thermal Trojans in Multi-Processor System-on-Chips

Thermal Trojan attacks present a pressing concern for the security and reliability of System-on-Chips (SoCs), especially in mobile applications. The situation becomes more complicated when such attacks are more evasive and operate sporadically to stay hidden from detection mechanisms. In this paper, we introduce Intermittent Thermal Trojans (iThermTroj) that exploit the chips' thermal information in a random time-triggered manner. According to our experiments, iThermTroj attack can easily bypass available threshold-based thermal Trojan detection solutions. We investigate SoC vulnerabilities to variations of iThermTroj through an in-depth analysis of Trojan activation and duration scenarios. We also propose a set of tiny Machine Learning classifiers for run-time anomaly detection to protect SoCs against such intermittent thermal Trojan attacks. Compared to existing methods, our approach improves the attack detection rate by 29.4\%, 17.2\%, and 14.3\% in scenarios where iThermTroj manipulates up to 80\%, 60\%, and 40\% of SoC's thermal data, respectively. Additionally, our method increases the full protection resolution to 0.8 degrees Celsius, meaning that any temperature manipulations exceeding $\pm 0.8$ degrees will be detected with 100\% accuracy.

Updated: 2025-07-08 01:24:28

标题: iThermTroj：利用多处理器系统芯片中的间歇性热特洛伊木马

摘要: 热特洛伊攻击对于片上系统（SoC）的安全性和可靠性提出了紧迫的关注，特别是在移动应用中。当这种攻击更加隐蔽并且间歇性操作以避免检测机制时，情况变得更加复杂。在本文中，我们介绍了间歇性热特洛伊（iThermTroj），它以随机时间触发的方式利用芯片的热信息。根据我们的实验，iThermTroj攻击可以轻易绕过可用的基于阈值的热特洛伊检测解决方案。我们通过对特洛伊激活和持续时间情景的深入分析，研究了SoC对iThermTroj变体的脆弱性。我们还提出了一组微型机器学习分类器，用于运行时异常检测，以保护SoC免受这种间歇性热特洛伊攻击。与现有方法相比，我们的方法在iThermTroj操纵多达80％、60％和40％SoC热数据的情景中，攻击检测率提高了29.4％、17.2％和14.3％。此外，我们的方法将全面保护分辨率提高至0.8摄氏度，意味着任何超过±0.8摄氏度的温度操纵都将以100％的准确性被检测到。

更新时间: 2025-07-08 01:24:28

领域: cs.CR,cs.AR

下载: http://arxiv.org/abs/2507.05576v1

Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models

Generative AI is transforming business applications by enabling natural language interfaces and intelligent automation. However, the underlying large language models (LLMs) are evolving rapidly and so prompting them consistently is a challenge. This leads to inconsistent and unpredictable application behavior, undermining the reliability that businesses require for mission-critical workflows. In this paper, we introduce the concept of prompt migration as a systematic approach to stabilizing GenAI applications amid changing LLMs. Using the Tursio enterprise search application as a case study, we analyze the impact of successive GPT model upgrades, detail our migration framework including prompt redesign and a migration testbed, and demonstrate how these techniques restore application consistency. Our results show that structured prompt migration can fully recover the application reliability that was lost due to model drift. We conclude with practical lessons learned, emphasizing the need for prompt lifecycle management and robust testing to ensure dependable GenAI-powered business applications.

Updated: 2025-07-08 01:20:12

标题: 即时迁移：通过不断发展的大型语言模型稳定GenAI应用程序

摘要: 生成式人工智能正在通过实现自然语言界面和智能自动化来改变业务应用。然而，基础的大型语言模型（LLMs）正在快速发展，因此持续地促使它们成为一个挑战。这导致应用行为不一致和不可预测，破坏了企业对于使命关键工作流程所需的可靠性。在本文中，我们介绍了“提示迁移”概念作为在LLMs变化中稳定GenAI应用的系统方法。通过以Tursio企业搜索应用为案例研究，我们分析了连续的GPT模型升级的影响，详细说明了我们的迁移框架，包括提示重新设计和迁移测试平台，并展示了这些技术如何恢复应用程序的一致性。我们的结果表明，结构化的提示迁移可以完全恢复因模型漂移而丢失的应用程序可靠性。最后，我们总结了实践经验，强调了提示生命周期管理和强大测试的必要性，以确保可靠的GenAI驱动的业务应用程序。

更新时间: 2025-07-08 01:20:12

领域: cs.DB,cs.AI,cs.SE

下载: http://arxiv.org/abs/2507.05573v1

Approximating invariant functions with the sorting trick is theoretically justified

Many machine learning models leverage group invariance which is enjoyed with a wide-range of applications. For exploiting an invariance structure, one common approach is known as \emph{frame averaging}. One popular example of frame averaging is the \emph{group averaging}, where the entire group is used to symmetrize a function. The other end of the spectrum is the \emph{canonicalization}, where a frame at each point consists of a single group element which transforms the point to its orbit representative. Compared to group averaging, canonicalization is more efficient computationally. However, it results in non-differentiablity or discontinuity of the canonicalized function. As a result, the theoretical performance of canonicalization has not been given much attention. In this work, we establish an approximation theory for canonicalization. Specifically, we bound the point-wise and $L^2(\mathbb{P})$ approximation errors as well as the kernel's eigenvalue decay rates associated with a canonicalization trick.

Updated: 2025-07-08 01:15:19

标题: 用排序技巧近似不变函数在理论上是合理的

摘要: 许多机器学习模型利用群不变性，在广泛的应用中受益于这种不变性。为了利用一个不变性结构，一种常见的方法被称为\emph{帧平均}。帧平均的一个流行示例是\emph{群平均}，其中整个群被用来使一个函数对称化。另一种极端是\emph{规范化}，其中每个点的一个帧包含一个单个群元素，将该点转换为其轨道代表。与群平均相比，规范化在计算上更有效。然而，它导致规范化函数的不可微性或不连续性。因此，规范化的理论性能并没有受到太多关注。在这项工作中，我们建立了一个规范化的近似理论。具体来说，我们限制了与规范化技巧相关的逐点和$L^2(\mathbb{P})$逼近误差以及核的特征值衰减速率。

更新时间: 2025-07-08 01:15:19

领域: cs.LG,65G05 (Primary), 20B99 (Secondary)

下载: http://arxiv.org/abs/2403.01671v4

ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models

Content-aware layout aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements, leading to structural and diverse problems in layout generation. To address this issue, we introduce ReLayout, a novel method that leverages relation-CoT to generate more reasonable and aesthetically coherent layouts by fundamentally originating from design concepts. Specifically, we enhance layout annotations by introducing explicit relation definitions, such as region, salient, and margin between elements, with the goal of decomposing the layout into smaller, structured, and recursive layouts, thereby enabling the generation of more structured layouts. Furthermore, based on these defined relationships, we introduce a layout prototype rebalance sampler, which defines layout prototype features across three dimensions and quantifies distinct layout styles. This sampler addresses uniformity issues in generation that arise from data bias in the prototype distribution balance process. Extensive experimental results verify that ReLayout outperforms baselines and can generate structural and diverse layouts that are more aligned with human aesthetics and more explainable.

Updated: 2025-07-08 01:13:43

标题: 重新布局：将关系推理与多模态大语言模型集成以生成内容感知布局

摘要: 内容感知布局旨在在给定的画布上适当地安排设计元素，以有效传达信息。最近，这个任务的趋势是利用大型语言模型(LLMs)自动生成布局，取得了显著的表现。然而，现有基于LLM的方法未能充分解释视觉主题和设计元素之间的空间关系，导致布局生成中的结构和多样化问题。为了解决这个问题，我们引入了一种新颖的方法ReLayout，它利用关系-CoT从设计概念根本上生成更合理和美学上连贯的布局。具体来说，我们通过引入明确的关系定义，如元素之间的区域、显著和边距，增强了布局注释，目的是将布局分解为更小、结构化和递归的布局，从而实现更具结构的布局的生成。此外，基于这些定义的关系，我们引入了布局原型重新平衡采样器，该采样器跨三个维度定义了布局原型特征，并量化了不同的布局风格。这个采样器解决了在原型分布平衡过程中由数据偏差引起的生成中的均匀性问题。大量实验结果验证了ReLayout优于基线，并能生成更符合人类审美观和更具解释性的结构化和多样化布局。

更新时间: 2025-07-08 01:13:43

领域: cs.CV,cs.LG

下载: http://arxiv.org/abs/2507.05568v1

SingLoRA: Low Rank Adaptation Using a Single Matrix

Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In this paper, we propose SingLoRA, which reformulates low-rank adaptation by learning the weights update as a decomposition of a single low-rank matrix multiplied by its transpose. This simple design inherently removes inter-matrix scale conflicts, ensuring stable optimization, and roughly halves the parameter count. We analyze SingLoRA within the infinite-width neural network framework, showing that it guarantees stable feature learning by construction. Extensive experiments on multiple tasks validate these benefits. In common sense reasoning, fine-tuning LLama 7B on MNLI with SingLoRA achieves 91.3% accuracy - surpassing LoRA (89.1%) and LoRA+ (90.2%) - while using only 60% of their parameter budget. In image generation, fine-tuning Stable Diffusion with SingLoRA significantly improves image fidelity on DreamBooth, achieving a DINO similarity score of 0.151, compared to scores of 0.148 and 0.143 for DoRA and LoRA, respectively.

Updated: 2025-07-08 01:11:30

标题: SingLoRA：使用单个矩阵进行低秩适应

摘要: 低秩适应（LoRA）显著提高了大型预训练模型的参数有效微调。LoRA通过将两个较小矩阵的乘积添加到模型的预先训练权重中，形成一个低秩矩阵更新。最近的研究表明，这两个矩阵之间的尺度差异经常导致不稳定的训练动态，从而导致次优性能。在本文中，我们提出了SingLoRA，通过将权重更新学习为单个低秩矩阵与其转置的分解，重新构建低秩适应。这种简单的设计本质上消除了矩阵间尺度冲突，确保稳定的优化，并大致减少了参数数量。我们在无限宽度神经网络框架内分析了SingLoRA，表明它通过构造保证了稳定的特征学习。对多个任务的大量实验验证了这些优势。在常识推理中，使用SingLoRA在MNLI上对LLama 7B进行微调，实现了91.3%的准确率，超过了LoRA（89.1%）和LoRA+（90.2%），同时只使用了它们参数预算的60%。在图像生成中，使用SingLoRA对稳定扩散进行微调在DreamBooth上显著提高了图像的保真度，实现了0.151的DINO相似度分数，而DoRA和LoRA的分数分别为0.148和0.143。

更新时间: 2025-07-08 01:11:30

领域: cs.AI

下载: http://arxiv.org/abs/2507.05566v1

Search-based Selection of Metamorphic Relations for Optimized Robustness Testing of Large Language Models

Assessing the trustworthiness of Large Language Models (LLMs), such as robustness, has garnered significant attention. Recently, metamorphic testing that defines Metamorphic Relations (MRs) has been widely applied to evaluate the robustness of LLM executions. However, the MR-based robustness testing still requires a scalable number of MRs, thereby necessitating the optimization of selecting MRs. Most extant LLM testing studies are limited to automatically generating test cases (i.e., MRs) to enhance failure detection. Additionally, most studies only considered a limited test space of single perturbation MRs in their evaluation of LLMs. In contrast, our paper proposes a search-based approach for optimizing the MR groups to maximize failure detection and minimize the LLM execution cost. Moreover, our approach covers the combinatorial perturbations in MRs, facilitating the expansion of test space in the robustness assessment. We have developed a search process and implemented four search algorithms: Single-GA, NSGA-II, SPEA2, and MOEA/D with novel encoding to solve the MR selection problem in the LLM robustness testing. We conducted comparative experiments on the four search algorithms along with a random search, using two major LLMs with primary Text-to-Text tasks. Our statistical and empirical investigation revealed two key findings: (1) the MOEA/D algorithm performed the best in optimizing the MR space for LLM robustness testing, and (2) we identified silver bullet MRs for the LLM robustness testing, which demonstrated dominant capabilities in confusing LLMs across different Text-to-Text tasks. In LLM robustness assessment, our research sheds light on the fundamental problem for optimized testing and provides insights into search-based solutions.

Updated: 2025-07-08 01:11:27

标题: 基于搜索的选择变形关系用于优化大型语言模型的鲁棒性测试

摘要: 评估大型语言模型（LLMs）的可信度，如鲁棒性，引起了广泛关注。最近，定义了变形关系（MRs）的变形测试已被广泛应用于评估LLM执行的鲁棒性。然而，基于MR的鲁棒性测试仍然需要大量可扩展的MRs，因此需要优化选择MRs。大多数现有的LLM测试研究仅限于自动生成测试用例（即MRs）以增强故障检测。此外，大多数研究仅在LLMs的评估中考虑了单一扰动MRs的有限测试空间。相比之下，我们的论文提出了一种基于搜索的方法，用于优化MR组，以最大限度地提高故障检测并最小化LLM执行成本。此外，我们的方法涵盖了MRs中的组合扰动，促进了在鲁棒性评估中测试空间的扩展。我们开发了一个搜索过程，并实施了四种搜索算法：Single-GA，NSGA-II，SPEA2和MOEA/D，采用新的编码解决LLM鲁棒性测试中的MR选择问题。我们对四种搜索算法以及随机搜索进行了比较实验，使用了两个主要的LLM，主要是文本到文本任务。我们的统计和经验调查揭示了两个关键发现：（1）MOEA/D算法在优化LLM鲁棒性测试的MR空间方面表现最佳，（2）我们确定了对LLM鲁棒性测试具有优势能力的银弹MRs，这些MRs在不同的文本到文本任务中混淆LLMs。在LLM鲁棒性评估中，我们的研究为优化测试提供了基本问题的启示，并为基于搜索的解决方案提供了见解。

更新时间: 2025-07-08 01:11:27

领域: cs.SE,cs.AI,cs.NE

下载: http://arxiv.org/abs/2507.05565v1

AbdomenAtlas-8K: Annotating 8,000 CT Volumes for Multi-Organ Segmentation in Three Weeks

Annotating medical images, particularly for organ segmentation, is laborious and time-consuming. For example, annotating an abdominal organ requires an estimated rate of 30-60 minutes per CT volume based on the expertise of an annotator and the size, visibility, and complexity of the organ. Therefore, publicly available datasets for multi-organ segmentation are often limited in data size and organ diversity. This paper proposes an active learning method to expedite the annotation process for organ segmentation and creates the largest multi-organ dataset (by far) with the spleen, liver, kidneys, stomach, gallbladder, pancreas, aorta, and IVC annotated in 8,448 CT volumes, equating to 3.2 million slices. The conventional annotation methods would take an experienced annotator up to 1,600 weeks (or roughly 30.8 years) to complete this task. In contrast, our annotation method has accomplished this task in three weeks (based on an 8-hour workday, five days a week) while maintaining a similar or even better annotation quality. This achievement is attributed to three unique properties of our method: (1) label bias reduction using multiple pre-trained segmentation models, (2) effective error detection in the model predictions, and (3) attention guidance for annotators to make corrections on the most salient errors. Furthermore, we summarize the taxonomy of common errors made by AI algorithms and annotators. This allows for continuous revision of both AI and annotations and significantly reduces the annotation costs required to create large-scale datasets for a wider variety of medical imaging tasks.

Updated: 2025-07-08 01:09:50

标题: AbdomenAtlas-8K：用三周时间为8,000个CT体积标注多器官分割

摘要: 对医学图像进行注释，尤其是用于器官分割的注释，是费时费力的。例如，注释一个腹部器官需要根据注释者的专业知识以及器官的大小、可见性和复杂性估计每个CT体积需要30-60分钟。因此，公开可用的多器官分割数据集通常在数据大小和器官多样性方面受限。本文提出了一种主动学习方法，以加快器官分割的注释过程，并创建了迄今为止最大的多器官数据集，其中在8,448个CT体积中标注了脾脏、肝脏、肾脏、胃、胆囊、胰腺、主动脉和IVC，相当于320万个切片。传统的注释方法将需要一个有经验的注释者花费1600周（或大约30.8年）才能完成这项任务。相比之下，我们的注释方法在三周内完成了这项任务（每周工作8小时，每周五天），同时保持了类似甚至更好的注释质量。这一成就归功于我们的方法的三个独特特性：（1）使用多个预训练的分割模型来减少标签偏差，（2）有效检测模型预测中的错误，以及（3）指导注释者在最突出的错误上进行修正。此外，我们总结了人工智能算法和注释者常见错误的分类。这使得对人工智能和注释进行持续修订，并显著降低了为更广泛的医学成像任务创建大规模数据集所需的注释成本。

更新时间: 2025-07-08 01:09:50

领域: eess.IV,cs.CV,cs.LG

下载: http://arxiv.org/abs/2305.09666v3

Exact and efficient basis pursuit denoising via differential inclusions and a selection principle

Basis pursuit denoising (BPDN) is a cornerstone of compressive sensing, statistics and machine learning. While various algorithms for BPDN have been proposed, they invariably suffer from drawbacks and must either favor efficiency at the expense of accuracy or vice versa. As such, state-of-the-art algorithms remain ineffective for high-dimensional applications that require accurate solutions within a reasonable amount of computational time. In this work, we address this issue and propose an exact and efficient algorithm for BPDN using differential inclusions. Specifically, we prove that a selection principle from the theory of differential inclusions turns the dual problem of BPDN into calculating the trajectory of an \emph{integrable} projected dynamical system, that is, whose trajectory and asymptotic limit can be computed exactly. Our analysis naturally yields an exact algorithm, numerically up to machine precision, that is amenable to computing regularization paths and very fast. Numerical experiments confirm that our algorithm outperforms the state-of-the-art algorithms in both accuracy and efficiency. Moreover, we show that the global continuation of solutions (in terms of the hyperparameter and data) of the projected dynamical system yields a rigorous homotopy algorithm for BPDN, as well as a novel greedy algorithm for computing feasible solutions to basis pursuit in strongly polynomial time. Beyond this work, we expect that our results and analysis can be adapted to compute exact or approximate solutions to a broader class of polyhedral-constrained optimization problems.

Updated: 2025-07-08 01:07:22

标题: 精确高效的基 Pursuit 去噪方法：通过微分包含和选择原则

摘要: 基础追踪去噪（BPDN）是压缩感知、统计学和机器学习的基石。虽然提出了各种BPDN算法，但它们无一例外地存在缺点，必须在效率和准确性之间做出取舍。因此，目前的算法对于需要在合理的计算时间内获得准确解的高维应用仍然无效。在这项工作中，我们解决了这个问题，并提出了一种使用微分包含的精确高效算法来解决BPDN问题。具体而言，我们证明微分包含理论中的选择原则将BPDN的对偶问题转化为计算\emph{可积}投影动力系统的轨迹，即其轨迹和渐近极限可以被精确计算。我们的分析自然产生了一个精确算法，可以在机器精度范围内进行数值计算，适合计算正则化路径并非常快速。数值实验证实了我们的算法在准确性和效率方面优于现有算法。此外，我们展示了投影动力系统解的全局延续（关于超参数和数据）提供了一个严格的BPDN同伦算法，以及一种新颖的贪婪算法，可以在强多项式时间内计算基础追踪的可行解。在这项工作之外，我们期望我们的结果和分析可以被调整以计算更广泛的多面体约束优化问题的精确或近似解。

更新时间: 2025-07-08 01:07:22

领域: math.OC,cs.LG,math.FA,90C25, 65K05, 37N40, 46N10, 34A60, 62J07,G.1.6; I.5.4

下载: http://arxiv.org/abs/2507.05562v1

LATST: Are Transformers Necessarily Complex for Time-Series Forecasting

Transformer-based architectures have achieved remarkable success in natural language processing and computer vision. However, their performance in multivariate long-term forecasting often falls short compared to simpler linear baselines. Previous research has identified the traditional attention mechanism as a key factor limiting their effectiveness in this domain. To bridge this gap, we introduce LATST, a novel approach designed to mitigate entropy collapse and training instability common challenges in Transformer-based time series forecasting. We rigorously evaluate LATST across multiple real-world multivariate time series datasets, demonstrating its ability to outperform existing state-of-the-art Transformer models. Notably, LATST manages to achieve competitive performance with fewer parameters than some linear models on certain datasets, highlighting its efficiency and effectiveness.

Updated: 2025-07-08 00:59:53

标题: LATST：Transformer在时间序列预测中是否必须复杂？

摘要: 基于Transformer的架构在自然语言处理和计算机视觉领域取得了显著的成功。然而，在多变量长期预测方面，它们的表现往往不及更简单的线性基准。先前的研究已经确定传统的注意力机制是限制它们在这一领域有效性的关键因素。为了弥合这一差距，我们引入了LATST，一种新颖的方法，旨在减轻Transformer-based时间序列预测中常见的熵崩溃和训练不稳定性挑战。我们在多个真实世界的多变量时间序列数据集上对LATST进行了严格评估，展示了其超越现有最先进的Transformer模型的能力。值得注意的是，LATST在某些数据集上达到了与一些线性模型相当的性能，突显了其效率和有效性。

更新时间: 2025-07-08 00:59:53

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2410.23749v9

Preemptive Solving of Future Problems: Multitask Preplay in Humans and Machines

Humans can pursue a near-infinite variety of tasks, but typically can only pursue a small number at the same time. We hypothesize that humans leverage experience on one task to preemptively learn solutions to other tasks that were accessible but not pursued. We formalize this idea as Multitask Preplay, a novel algorithm that replays experience on one task as the starting point for "preplay" -- counterfactual simulation of an accessible but unpursued task. Preplay is used to learn a predictive representation that can support fast, adaptive task performance later on. We first show that, compared to traditional planning and predictive representation methods, multitask preplay better predicts how humans generalize to tasks that were accessible but not pursued in a small grid-world, even when people didn't know they would need to generalize to these tasks. We then show these predictions generalize to Craftax, a partially observable 2D Minecraft environment. Finally, we show that Multitask Preplay enables artificial agents to learn behaviors that transfer to novel Craftax worlds sharing task co-occurrence structure. These findings demonstrate that Multitask Preplay is a scalable theory of how humans counterfactually learn and generalize across multiple tasks; endowing artificial agents with the same capacity can significantly improve their performance in challenging multitask environments.

Updated: 2025-07-08 00:55:47

标题: 未来问题的预防性解决：人类和机器的多任务预演

摘要: 人类可以追求几乎无限种类的任务，但通常只能同时追求少数任务。我们假设人类利用在一个任务上的经验，预先学习其他可访问但未追求的任务的解决方案。我们将这个想法形式化为多任务预演（Multitask Preplay），这是一种新颖的算法，它将在一个任务上的经验重新播放，作为“预演”的起点--对可访问但未追求的任务进行反事实模拟。预演被用来学习可以支持后续快速、自适应任务表现的预测表示。我们首先展示，与传统的规划和预测表示方法相比，多任务预演更好地预测人类在一个小的格子世界中如何推广到可访问但未追求的任务，即使人们不知道他们将需要推广到这些任务。然后我们展示这些预测可以推广到Craftax，一个部分可观察的2D Minecraft环境。最后，我们展示多任务预演使人工智能代理能够学习行为，这些行为可以转移到共享任务共现结构的新颖Craftax世界。这些发现表明，多任务预演是一个可扩展的理论，说明了人类如何在多个任务中反事实地学习和推广；赋予人工智能代理相同的能力可以显著提高它们在具有挑战性的多任务环境中的表现。

更新时间: 2025-07-08 00:55:47

领域: cs.LG,q-bio.NC

下载: http://arxiv.org/abs/2507.05561v1

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences -- a more challenging task than fixed-order generation, RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained from random orders acquire new capabilities. For the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at https://rand-ar.github.io/.

Updated: 2025-07-08 00:51:16

标题: RandAR: 在随机顺序中仅解码自回归视觉生成

摘要: 我们介绍了RandAR，这是一种仅解码的视觉自回归（AR）模型，能够以任意令牌顺序生成图像。与先前依赖预定义生成顺序的仅解码AR模型不同，RandAR消除了这种归纳偏见，从而在仅解码生成中获得了新的能力。我们的基本设计通过在要预测的每个图像令牌前插入“位置指令令牌”，表示下一个图像令牌的空间位置，实现了随机顺序。在随机排列的令牌序列上训练，比固定顺序生成更具挑战性，RandAR实现了与传统光栅顺序对应物相近的性能。更重要的是，从随机顺序训练的仅解码变压器获得了新的能力。对于AR模型的效率瓶颈，RandAR在推断时采用了KV-Cache并行解码，享受了2.5倍的加速而不牺牲生成质量。此外，RandAR以零样本方式支持修复、扩展和分辨率外推。我们希望RandAR能启发仅解码的视觉生成模型的新方向，拓展它们在各种场景中的应用。我们的项目页面位于https://rand-ar.github.io/。

更新时间: 2025-07-08 00:51:16

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2412.01827v2

MP-ALOE: An r2SCAN dataset for universal machine learning interatomic potentials

We present MP-ALOE, a dataset of nearly 1 million DFT calculations using the accurate r2SCAN meta-generalized gradient approximation. Covering 89 elements, MP-ALOE was created using active learning and primarily consists of off-equilibrium structures. We benchmark a machine learning interatomic potential trained on MP-ALOE, and evaluate its performance on a series of benchmarks, including predicting the thermochemical properties of equilibrium structures; predicting forces of far-from-equilibrium structures; maintaining physical soundness under static extreme deformations; and molecular dynamic stability under extreme temperatures and pressures. MP-ALOE shows strong performance on all of these benchmarks, and is made public for the broader community to utilize.

Updated: 2025-07-08 00:45:32

标题: MP-ALOE：用于通用机器学习原子间势的r2SCAN数据集

摘要: 我们介绍了MP-ALOE，这是一个包含近100万个使用准确的r2SCAN元广义梯度近似进行的密度泛函理论计算的数据集。涵盖了89种元素，MP-ALOE是使用主动学习创建的，主要由非平衡结构组成。我们对在MP-ALOE上训练的机器学习原子间势进行了基准测试，并评估其在一系列基准测试中的表现，包括预测平衡结构的热化学性质；预测远离平衡结构的力；在静态极端变形下保持物理合理性；以及在极端温度和压力下的分子动力学稳定性。MP-ALOE在所有这些基准测试中表现出色，并对更广泛的社区公开使用。

更新时间: 2025-07-08 00:45:32

领域: cond-mat.mtrl-sci,cs.AI,physics.comp-ph

下载: http://arxiv.org/abs/2507.05559v1

Neural Network-Based Parameter Estimation for Non-Autonomous Differential Equations with Discontinuous Signals

Non-autonomous differential equations are crucial for modeling systems influenced by external signals, yet fitting these models to data becomes particularly challenging when the signals change abruptly. To address this problem, we propose a novel parameter estimation method utilizing functional approximations with artificial neural networks. Our approach, termed Harmonic Approximation of Discontinuous External Signals using Neural Networks (HADES-NN), operates in two iterated stages. In the first stage, the algorithm employs a neural network to approximate the discontinuous signal with a smooth function. In the second stage, it uses this smooth approximate signal to estimate model parameters. HADES-NN gives highly accurate and precise parameter estimates across various applications, including circadian clock systems regulated by external light inputs measured via wearable devices and the mating response of yeast to external pheromone signals. HADES-NN greatly extends the range of model systems that can be fit to real-world measurements.

Updated: 2025-07-08 00:42:42

标题: 基于神经网络的参数估计方法用于具有不连续信号的非自治微分方程

摘要: 非自治微分方程对于建模受外部信号影响的系统至关重要，然而，当信号突然改变时，将这些模型拟合到数据变得特别具有挑战性。为了解决这个问题，我们提出了一种利用人工神经网络进行函数逼近的新颖参数估计方法。我们的方法，称为利用神经网络对不连续外部信号进行谐波逼近（HADES-NN），分为两个迭代阶段。在第一阶段，算法使用神经网络来逼近不连续信号为平滑函数。在第二阶段，它利用这个平滑近似信号来估计模型参数。HADES-NN在各种应用中都能给出高度准确和精确的参数估计，包括通过可穿戴设备测量的受外部光输入调节的昼夜节律系统和酵母对外部信息素信号的交配反应。HADES-NN极大地扩展了可以适应真实世界测量的模型系统范围。

更新时间: 2025-07-08 00:42:42

领域: cs.LG,34C60, 92B05, 68T07, 93C15, 65K10

下载: http://arxiv.org/abs/2507.06267v1

Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads

Per-Row Activation Counting (PRAC), a DRAM read disturbance mitigation method, modifies key DRAM timing parameters, reportedly causing significant performance overheads in simulator-based studies. However, given known discrepancies between simulators and real hardware, real-machine experiments are vital for accurate PRAC performance estimation. We present the first real-machine performance analysis of PRAC. After verifying timing modifications on the latest CPUs using microbenchmarks, our analysis shows that PRAC's average and maximum overheads are just 1.06% and 3.28% for the SPEC CPU2017 workloads -- up to 9.15x lower than simulator-based reports. Further, we show that the close page policy minimizes this overhead by effectively hiding the elongated DRAM row precharge operations due to PRAC from the critical path.

Updated: 2025-07-08 00:38:44

标题: 在真实硬件上进行逐行激活计数：揭开性能开销的神秘面纱

摘要: Per-Row Activation Counting（PRAC）是一种DRAM读取干扰缓解方法，修改关键的DRAM定时参数，据报道在基于模拟器的研究中导致显著的性能开销。然而，鉴于模拟器和实际硬件之间已知的差异，进行真实机器实验对于准确估计PRAC性能至关重要。我们提出了PRAC的首个真实机器性能分析。在使用微基准测试验证最新CPU上的定时修改之后，我们的分析显示，对于SPEC CPU2017工作负载，PRAC的平均和最大开销仅为1.06%和3.28% -- 高达模拟器报告的最低9.15倍。此外，我们展示了紧密页策略通过有效地隐藏由于PRAC导致的DRAM行预充电操作的延长，最小化了这种开销对关键路径的影响。

更新时间: 2025-07-08 00:38:44

领域: cs.AR,cs.CR

下载: http://arxiv.org/abs/2507.05556v1

Insuring Uninsurable Risks from AI: The State as Insurer of Last Resort

Many experts believe that AI systems will sooner or later pose uninsurable risks, including existential risks. This creates an extreme judgment-proof problem: few if any parties can be held accountable ex post in the event of such a catastrophe. This paper proposes a novel solution: a government-provided, mandatory indemnification program for AI developers. The program uses risk-priced indemnity fees to induce socially optimal levels of care. Risk-estimates are determined by surveying experts, including indemnified developers. The Bayesian Truth Serum mechanism is employed to incent honest and effortful responses. Compared to alternatives, this approach arguably better leverages all private information, and provides a clearer signal to indemnified developers regarding what risks they must mitigate to lower their fees. It's recommended that collected fees be used to help fund the safety research developers need, employing a fund matching mechanism (Quadratic Financing) to induce an optimal supply of this public good. Under Quadratic Financing, safety research projects would compete for private contributions from developers, signaling how much each is to be supplemented with public funds.

Updated: 2025-07-08 00:23:00

标题: 使用人工智能保险无法保险的风险：国家作为最终保险人

摘要: 许多专家认为，人工智能系统迟早会带来无法保险的风险，包括存在性风险。这造成了一个极端的无法判断的问题：在发生这样的灾难时，几乎没有任何一方可以承担事后责任。本文提出了一个新颖的解决方案：政府提供的、强制性的人工智能开发者赔偿计划。该计划利用风险定价赔偿费来诱导社会最优水平的关注。风险估计是通过对专家进行调查来确定的，包括被赔偿的开发者。采用贝叶斯真相血清机制来激励诚实和努力的回应。与其他选择相比，这种方法可能更好地利用所有私人信息，并为被赔偿的开发者提供了更清晰的信号，告诉他们必须降低费用以减少风险。建议使用收集的费用来帮助资助开发者所需的安全研究，采用基金配对机制（二次融资）来诱导这种公共产品的最佳供应。在二次融资下，安全研究项目将竞争开发者的私人捐款，表明每个项目应该用公共资金补充多少。

更新时间: 2025-07-08 00:23:00

领域: cs.CY,cs.AI,cs.LG,q-fin.RM

下载: http://arxiv.org/abs/2409.06672v2

Machine Learning based Enterprise Financial Audit Framework and High Risk Identification

In the face of global economic uncertainty, financial auditing has become essential for regulatory compliance and risk mitigation. Traditional manual auditing methods are increasingly limited by large data volumes, complex business structures, and evolving fraud tactics. This study proposes an AI-driven framework for enterprise financial audits and high-risk identification, leveraging machine learning to improve efficiency and accuracy. Using a dataset from the Big Four accounting firms (EY, PwC, Deloitte, KPMG) from 2020 to 2025, the research examines trends in risk assessment, compliance violations, and fraud detection. The dataset includes key indicators such as audit project counts, high-risk cases, fraud instances, compliance breaches, employee workload, and client satisfaction, capturing both audit behaviors and AI's impact on operations. To build a robust risk prediction model, three algorithms - Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN) - are evaluated. SVM uses hyperplane optimization for complex classification, RF combines decision trees to manage high-dimensional, nonlinear data with resistance to overfitting, and KNN applies distance-based learning for flexible performance. Through hierarchical K-fold cross-validation and evaluation using F1-score, accuracy, and recall, Random Forest achieves the best performance, with an F1-score of 0.9012, excelling in identifying fraud and compliance anomalies. Feature importance analysis reveals audit frequency, past violations, employee workload, and client ratings as key predictors. The study recommends adopting Random Forest as a core model, enhancing features via engineering, and implementing real-time risk monitoring. This research contributes valuable insights into using machine learning for intelligent auditing and risk management in modern enterprises.

Updated: 2025-07-08 00:22:49

标题: 基于机器学习的企业财务审计框架及高风险识别

摘要: 在全球经济不确定性的面前，财务审计已成为监管合规和风险缓解的必要手段。传统手动审计方法受到大数据量、复杂业务结构和不断发展的欺诈策略的限制越来越大。本研究提出了一个基于人工智能的企业财务审计和高风险识别框架，利用机器学习提高效率和准确性。利用2020年至2025年来自四大会计师事务所（安永、普华永道、德勤、毕马威）的数据集，该研究探讨了风险评估、合规违规和欺诈检测的趋势。数据集包括审计项目数量、高风险案例、欺诈实例、合规违规、员工工作量和客户满意度等关键指标，捕捉了审计行为和人工智能对运营的影响。为建立健壮的风险预测模型，评估了三种算法——支持向量机（SVM）、随机森林（RF）和K最近邻居（KNN）。SVM利用超平面优化进行复杂分类，RF结合决策树处理高维度、非线性数据并抗过拟合，KNN应用基于距离的学习实现灵活性能。通过层次K折交叉验证和使用F1分数、准确率和召回率进行评估，随机森林获得了最佳性能，F1分数为0.9012，在识别欺诈和合规异常方面表现出色。特征重要性分析显示审计频率、过去违规、员工工作量和客户评级是关键预测因子。研究建议采用随机森林作为核心模型，通过工程化增强特征，并实施实时风险监控。这项研究为在现代企业中利用机器学习进行智能审计和风险管理提供了宝贵的见解。

更新时间: 2025-07-08 00:22:49

领域: q-fin.RM,cs.AI,cs.LG,stat.AP

下载: http://arxiv.org/abs/2507.06266v1

Liability and Insurance for Catastrophic Losses: the Nuclear Power Precedent and Lessons for AI

As AI systems become more autonomous and capable, experts warn of them potentially causing catastrophic losses. Drawing on the successful precedent set by the nuclear power industry, this paper argues that developers of frontier AI models should be assigned limited, strict, and exclusive third party liability for harms resulting from Critical AI Occurrences (CAIOs) - events that cause or easily could have caused catastrophic losses. Mandatory insurance for CAIO liability is recommended to overcome developers' judgment-proofness, mitigate winner's curse dynamics, and leverage insurers' quasi-regulatory abilities. Based on theoretical arguments and observations from the analogous nuclear power context, insurers are expected to engage in a mix of causal risk-modeling, monitoring, lobbying for stricter regulation, and providing loss prevention guidance in the context of insuring against heavy-tail risks from AI. While not a substitute for regulation, clear liability assignment and mandatory insurance can help efficiently allocate resources to risk-modeling and safe design, facilitating future regulatory efforts.

Updated: 2025-07-08 00:21:48

标题: 巨大损失的责任与保险：核能先例和人工智能的教训

摘要: 随着人工智能系统变得更加自主和强大，专家们警告称它们可能会造成灾难性的损失。本文借鉴核能行业成功的先例，提出开发前沿人工智能模型的开发者应当被指定为有限、严格和独家的第三方责任，以承担由关键人工智能事件（CAIOs）造成的损害 - 这些事件会造成或很容易造成灾难性损失。建议对CAIO责任进行强制性保险，以克服开发者的无法赔偿能力、缓解赢家诅咒动态，并利用保险公司的准监管能力。根据理论论据和类似核能背景下的观察，保险公司预计将在针对人工智能的重尾风险进行保险时参与一系列因果风险建模、监测、游说更严格法规和提供损失预防指导的活动。虽然不是对法规的替代，明确的责任分配和强制性保险可以帮助有效地分配资源到风险建模和安全设计，促进未来的监管努力。

更新时间: 2025-07-08 00:21:48

领域: cs.CY,cs.AI,cs.LG

下载: http://arxiv.org/abs/2409.06673v2

A Malliavin calculus approach to score functions in diffusion generative models

Score-based diffusion generative models have recently emerged as a powerful tool for modelling complex data distributions. These models aim at learning the score function, which defines a map from a known probability distribution to the target data distribution via deterministic or stochastic differential equations (SDEs). The score function is typically estimated from data using a variety of approximation techniques, such as denoising or sliced score matching, Hyv\"arien's method, or Schr\"odinger bridges. In this paper, we derive an exact, closed form, expression for the score function for a broad class of nonlinear diffusion generative models. Our approach combines modern stochastic analysis tools such as Malliavin derivatives and their adjoint operators (Skorokhod integrals or Malliavin Divergence) with a new Bismut-type formula. The resulting expression for the score function can be written entirely in terms of the first and second variation processes, with all Malliavin derivatives systematically eliminated, thereby enhancing its practical applicability. The theoretical framework presented in this work offers a principled foundation for advancing score estimation methods in generative modelling, enabling the design of new sampling algorithms for complex probability distributions. Our results can be extended to broader classes of stochastic differential equations, opening new directions for the development of score-based diffusion generative models.

Updated: 2025-07-08 00:20:57

标题: 一个马里纳文演算法方法在扩散生成模型中的得分函数

摘要: 基于分数的扩散生成模型最近已成为建模复杂数据分布的强大工具。这些模型旨在学习分数函数，该函数通过确定性或随机微分方程（SDEs）定义了从已知概率分布到目标数据分布的映射。分数函数通常通过各种逼近技术（如去噪或切片分数匹配、Hyv\"arien方法或Schr\"odinger桥）从数据中估计出来。在本文中，我们为一类广泛的非线性扩散生成模型推导了分数函数的精确、闭合形式表达式。我们的方法结合了现代随机分析工具，如Malliavin导数及其伴随算子（Skorokhod积分或Malliavin散度），以及一种新的Bismut类型公式。所得的分数函数表达式完全可以用第一和第二变化过程来写出，所有Malliavin导数都被系统地消除，从而增强了其实际适用性。本文提出的理论框架为推进生成模型中的分数估计方法提供了一个有原则的基础，可以设计出用于复杂概率分布的新采样算法。我们的结果可以扩展到更广泛的随机微分方程类，为发展基于分数的扩散生成模型开辟了新方向。

更新时间: 2025-07-08 00:20:57

领域: stat.ML,cs.LG,math.PR

下载: http://arxiv.org/abs/2507.05550v1

An AI Theory of Mind Will Enhance Our Collective Intelligence

Collective intelligence plays a central role in many fields, from economics and evolutionary theory to neural networks and eusocial insects, and is also core to work on emergence and self-organisation in complex-systems theory. However, in human collective intelligence there is still much to understand about how specific psychological processes at the individual level give rise to self-organised structures at the social level. Psychological factors have so far played a minor role in collective-intelligence studies because the principles are often general and applicable to agents without sophisticated psychologies. We emphasise, with examples from other complex adaptive systems, the broad applicability of collective-intelligence principles, while noting that mechanisms and time scales differ markedly between cases. We review evidence that flexible collective intelligence in human social settings is improved by a particular cognitive tool: our Theory of Mind. We then hypothesise that AIs equipped with a theory of mind will enhance collective intelligence in ways similar to human contributions. To make this case, we step back from the algorithmic basis of AI psychology and consider the large-scale impact AI can have as agential actors in a 'social ecology' rather than as mere technological tools. We identify several key characteristics of psychologically mediated collective intelligence and show that the development of a Theory of Mind is crucial in distinguishing human social collective intelligence from more general forms. Finally, we illustrate how individuals, human or otherwise, integrate within a collective not by being genetically or algorithmically programmed, but by growing and adapting into the socio-cognitive niche they occupy. AI can likewise inhabit one or multiple such niches, facilitated by a Theory of Mind.

Updated: 2025-07-08 00:18:46

标题: 一个人工智能心智理论将提升我们的集体智慧

摘要: 集体智慧在许多领域中起着核心作用，从经济学和进化论到神经网络和社会性昆虫，同时也是复杂系统理论中关于新兴和自组织工作的核心。然而，在人类集体智慧中，关于个体层面的特定心理过程如何导致社会层面的自组织结构仍有许多需要理解的地方。由于这些原则通常是普遍的并适用于没有复杂心理学的代理人，因此心理因素在集体智慧研究中迄今为止扮演了较小的角色。我们强调，通过其他复杂适应系统的例子，集体智慧原则的广泛适用性，同时指出机制和时间尺度在不同案例之间存在明显差异。我们回顾了证据，表明人类社会环境中的灵活集体智慧通过一种特定的认知工具得到改善：我们的心灵理论。然后，我们假设配备心灵理论的人工智能将以类似于人类贡献的方式增强集体智慧。为了证明这一点，我们从人工智能心理学的算法基础中退后一步，考虑人工智能作为“社会生态”中的行动者而不仅仅是技术工具所能产生的大规模影响。我们确定了心理调解的集体智慧的几个关键特征，并展示了理解人类社会集体智慧与更一般形式之间的差异中，心灵理论的发展是至关重要的。最后，我们说明了个体，无论是人类还是其他生物，是通过成长和适应到他们所占据的社会认知空间中而不是通过遗传或算法编程来融入集体的。人工智能同样可以在一个或多个这样的空间中生存，通过心灵理论的帮助。

更新时间: 2025-07-08 00:18:46

领域: cs.MA,cs.AI,cs.CY,cs.GT,nlin.AO

下载: http://arxiv.org/abs/2411.09168v2

The Ethical Implications of AI in Creative Industries: A Focus on AI-Generated Art

As Artificial Intelligence (AI) continues to grow daily, more exciting (and somewhat controversial) technology emerges every other day. As we see the advancements in AI, we see more and more people becoming skeptical of it. This paper explores the complications and confusion around the ethics of generative AI art. We delve deep into the ethical side of AI, specifically generative art. We step back from the excitement and observe the impossible conundrums that this impressive technology produces. Covering environmental consequences, celebrity representation, intellectual property, deep fakes, and artist displacement. Our research found that generative AI art is responsible for increased carbon emissions, spreading misinformation, copyright infringement, unlawful depiction, and job displacement. In light of this, we propose multiple possible solutions for these problems. We address each situation's history, cause, and consequences and offer different viewpoints. At the root of it all, though, the central theme is that generative AI Art needs to be correctly legislated and regulated.

Updated: 2025-07-08 00:16:38

标题: 人工智能在创意产业中的道德影响：聚焦于人工智能生成的艺术

摘要: 随着人工智能（AI）每天不断发展，更令人兴奋（也有些争议的）技术每隔一天就会出现。随着我们看到AI的进步，我们看到越来越多的人对其持怀疑态度。本文探讨了生成式AI艺术伦理的复杂性和困惑。我们深入探讨了AI的伦理方面，特别是生成式艺术。我们从兴奋中退后一步，观察这项令人印象深刻的技术产生的不可能的困境。涵盖环境后果、名人代表、知识产权、深度伪造和艺术家替换。我们的研究发现，生成式AI艺术导致碳排放增加、传播错误信息、侵犯版权、非法描绘和工作岗位替换。鉴于此，我们提出了多种可能的解决方案。我们探讨了每种情况的历史、原因和后果，并提出不同的观点。然而，总的来说，生成式AI艺术需要得到正确的立法和监管。

更新时间: 2025-07-08 00:16:38

领域: cs.CY,cs.AI,cs.HC,I.2.0

下载: http://arxiv.org/abs/2507.05549v1

SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning

Training large language models with reinforcement learning (RL) against verifiable rewards significantly enhances their reasoning abilities, yet remains computationally expensive due to inefficient uniform prompt sampling. We introduce Selective Prompting with Efficient Estimation of Difficulty (SPEED), an adaptive online RL curriculum that selectively chooses training examples of intermediate difficulty to maximize learning efficiency. Theoretically, we establish that intermediate-difficulty prompts improve the gradient estimator's signal-to-noise ratio, accelerating convergence. Empirically, our efficient implementation leads to 2x to 6x faster training without degrading accuracy, requires no manual tuning, and integrates seamlessly into standard RL algorithms.

Updated: 2025-07-08 00:02:06

标题: SPEED-RL：通过在线课程学习加快推理模型的训练速度

摘要: 使用强化学习（RL）对大型语言模型进行训练，以对可验证的奖励进行增强，显著提升其推理能力，但由于效率低下的统一提示抽样，仍然需要耗费大量计算资源。我们引入了具有高效难度估计的选择性提示（SPEED），这是一种自适应在线RL课程，通过选择中等难度的训练示例来最大化学习效率。从理论上讲，我们建立了中等难度提示改进了梯度估计器的信噪比，加快了收敛速度。在实证方面，我们的高效实现导致训练速度提高2倍至6倍，而不降低准确性，无需手动调整，并且能够无缝集成到标准RL算法中。

更新时间: 2025-07-08 00:02:06

领域: cs.LG

下载: http://arxiv.org/abs/2506.09016v2

Gait-Based Hand Load Estimation via Deep Latent Variable Models with Auxiliary Information

Machine learning methods are increasingly applied to ergonomic risk assessment in manual material handling, particularly for estimating carried load from gait motion data collected from wearable sensors. However, existing approaches often rely on direct mappings from loaded gait to hand load, limiting generalization and predictive accuracy. In this study, we propose an enhanced load estimation framework that incorporates auxiliary information, including baseline gait patterns during unloaded walking and carrying style. While baseline gait can be automatically captured by wearable sensors and is thus readily available at inference time, carrying style typically requires manual labeling and is often unavailable during deployment. Our model integrates deep latent variable modeling with temporal convolutional networks and bi-directional cross-attention to capture gait dynamics and fuse loaded and unloaded gait patterns. Guided by domain knowledge, the model is designed to estimate load magnitude conditioned on carrying style, while eliminating the need for carrying style labels at inference time. Experiments using real-world data collected from inertial measurement units attached to participants demonstrate substantial accuracy gains from incorporating auxiliary information and highlight the importance of explicit fusion mechanisms over naive feature concatenation.

Updated: 2025-07-08 00:00:12

标题: 基于步态的手部负载估算：利用深度潜变量模型和辅助信息

摘要: 机器学习方法越来越多地应用于人体工程学风险评估，特别是用于从可穿戴传感器收集的步行动作数据估计手提载荷的手动搬运中。然而，现有方法通常依赖于从负载步行到手提载荷的直接映射，从而限制了泛化和预测准确性。在这项研究中，我们提出了一个增强的载荷估计框架，其中包括基线步行模式和携带风格等辅助信息。虽然基线步行可以通过可穿戴传感器自动捕捉，并且因此在推断时可立即获得，但携带风格通常需要手动标记，并且在部署期间通常不可用。我们的模型将深层潜在变量建模与时间卷积网络和双向交叉注意力相结合，以捕捉步行动态并融合负载和未负载步行模式。在领域知识的指导下，该模型被设计为在携带风格条件下估计载荷大小，同时在推断时消除了对携带风格标签的需求。使用附加信息收集的真实世界数据进行的实验表明，通过整合辅助信息可获得显著的准确性提升，并突显了明确的融合机制比天真的特征串联的重要性。

更新时间: 2025-07-08 00:00:12

领域: cs.LG

下载: http://arxiv.org/abs/2507.05544v1