    _              _         ____
   / \   _ ____  _(_)_   __ |  _ \  __ _ _   _ 
  / _ \ | '__\ \/ / \ \ / / | | | |/ _` | | | |
 / ___ \| |   >  <| |\ V /  | |_| | (_| | |_| |
/_/   \_\_|  /_/\_\_| \_/   |____/ \__,_|\__, |
                                         |___/ 
        


Adaptive Anomaly Detection in Evolving Network Environments

Distribution shift, a change in the statistical properties of data over time, poses a critical challenge for deep learning anomaly detection systems. Existing anomaly detection systems often struggle to adapt to these shifts. Specifically, systems based on supervised learning require costly manual labeling, while those based on unsupervised learning rely for shift adaptation on clean data, which is difficult to obtain. Both of these requirements are challenging to meet in practice. In this paper, we introduce NetSight, a framework for supervised anomaly detection in network data that continually detects and adapts to distribution shifts in an online manner. NetSight eliminates manual intervention through a novel pseudo-labeling technique and uses a knowledge distillation-based adaptation strategy to prevent catastrophic forgetting. Evaluated on three long-term network datasets, NetSight demonstrates superior adaptation performance compared to state-of-the-art methods that rely on manual labeling, achieving F1-score improvements of up to 11.72%. These results demonstrate its robustness and effectiveness in dynamic networks that experience distribution shifts over time.

Updated: 2025-08-20 22:31:57

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2508.15100v1
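
The knowledge-distillation idea mentioned in the abstract can be sketched generically (this is the standard distillation regularizer, not NetSight's published training code; the function names, logits, and temperature value below are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened outputs. Penalizing the
    adapted (student) model for drifting from the pre-shift (teacher) model is
    the standard distillation recipe against catastrophic forgetting."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float((p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean() * T * T)

teacher = np.array([[2.0, 0.5, -1.0]])  # logits from the model before the shift
student = np.array([[1.8, 0.7, -0.9]])  # logits from the adapting model
reg = distillation_loss(student, teacher)
```

During adaptation, a term like this would be added to the task loss computed on pseudo-labeled samples, so the model tracks the new distribution without erasing previously learned behavior.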

Tighter Privacy Analysis for Truncated Poisson Sampling

We give a new privacy amplification analysis for truncated Poisson sampling, a Poisson sampling variant that truncates a batch if it exceeds a given maximum batch size.

Updated: 2025-08-20 22:00:23

Categories: cs.CR

Download: http://arxiv.org/abs/2508.15089v1
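
The sampler itself is simple enough to sketch (a minimal illustration of truncated Poisson sampling as the abstract defines it; the function and parameter names are mine, and dropping a uniform random subset on truncation is one plausible convention):

```python
import random

def truncated_poisson_sample(n, q, max_batch, seed=0):
    """Poisson sampling: include each of n records independently with
    probability q; truncate to max_batch records if the batch exceeds it."""
    rng = random.Random(seed)
    batch = [i for i in range(n) if rng.random() < q]
    if len(batch) > max_batch:
        batch = rng.sample(batch, max_batch)  # keep a uniform subset
    return sorted(batch)

batch = truncated_poisson_sample(n=1000, q=0.05, max_batch=40)
```

The privacy analysis in the paper concerns exactly this truncation step: it caps the batch size (useful for fixed-memory training) but changes the amplification guarantee relative to plain Poisson sampling.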

Fortifying the Agentic Web: A Unified Zero-Trust Architecture Against Logic-layer Threats

This paper presents a Unified Security Architecture that fortifies the Agentic Web through a Zero-Trust IAM framework. This architecture is built on a foundation of rich, verifiable agent identities using Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs), with discovery managed by a protocol-agnostic Agent Name Service (ANS). Security is operationalized through a multi-layered Trust Fabric which introduces significant innovations, including Trust-Adaptive Runtime Environments (TARE), Causal Chain Auditing, and Dynamic Identity with Behavioral Attestation. By explicitly linking the LPCI threat to these enhanced architectural countermeasures within a formal security model, we propose a comprehensive and forward-looking blueprint for a secure, resilient, and trustworthy agentic ecosystem. Our formal analysis demonstrates that the proposed architecture provides provable security guarantees against LPCI attacks with bounded probability of success.

Updated: 2025-08-20 21:14:55

Categories: cs.CR,cs.AI,cs.ET

Download: http://arxiv.org/abs/2508.12259v3

When Machine Learning Meets Vulnerability Discovery: Challenges and Lessons Learned

In recent years, machine learning has demonstrated impressive results in various fields, including software vulnerability detection. Nonetheless, using machine learning to identify software vulnerabilities presents new challenges, especially regarding the scale of data involved, which was not a factor in traditional methods. Consequently, in spite of the rise of new machine-learning-based approaches in that space, important shortcomings persist regarding their evaluation. First, researchers often fail to provide concrete statistics about their training datasets, such as the number of samples for each type of vulnerability. Moreover, many methods rely on training with semantically similar functions rather than directly on vulnerable programs. This leads to uncertainty about the suitability of the datasets currently used for training. Second, the choice of a model and the level of granularity at which models are trained also affect the effectiveness of such vulnerability discovery approaches. In this paper, we explore the challenges of applying machine learning to vulnerability discovery. We also share insights from our two previous research papers, Bin2vec and BinHunter, which could enhance future research in this field.

Updated: 2025-08-20 20:09:49

Categories: cs.CR

Download: http://arxiv.org/abs/2508.15042v1

MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs

The transformer architecture has become a cornerstone of modern AI, fueling remarkable progress across applications in natural language processing, computer vision, and multimodal learning. As these models continue to scale explosively for performance, implementation efficiency remains a critical challenge. Mixture of Experts (MoE) architectures, which selectively activate specialized subnetworks (experts), offer a unique balance between model accuracy and computational cost. However, the adaptive routing in MoE architectures, where input tokens are dynamically directed to specialized experts based on their semantic meaning, inadvertently opens up a new attack surface for privacy breaches. These input-dependent activation patterns leave distinctive temporal and spatial traces in hardware execution, which adversaries can exploit to deduce sensitive user data. In this work, we propose MoEcho, which uncovers a side-channel-analysis-based attack surface that compromises user privacy in MoE-based systems. Specifically, in MoEcho, we introduce four novel architectural side channels on different computing platforms: Cache Occupancy Channels and Pageout+Reload on CPUs, and Performance Counter and TLB Evict+Reload on GPUs. Exploiting these vulnerabilities, we propose four attacks that effectively breach user privacy in large language models (LLMs) and vision language models (VLMs) based on MoE architectures: Prompt Inference Attack, Response Reconstruction Attack, Visual Inference Attack, and Visual Reconstruction Attack. MoEcho is the first runtime architecture-level security analysis of the popular MoE structure common in modern transformers, highlighting a serious security and privacy threat and calling for effective and timely safeguards when harnessing MoE-based models to build efficient large-scale AI services.

Updated: 2025-08-20 20:02:35

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2508.15036v1

Securing Swarms: Cross-Domain Adaptation for ROS2-based CPS Anomaly Detection

Cyber-physical systems (CPS) are being increasingly utilized for critical applications. CPS combines sensing and computing elements, often having multi-layer designs with networking, computational, and physical interfaces, which provide them with enhanced capabilities for a variety of application scenarios. However, the combination of physical and computational elements also makes CPS more vulnerable to attacks compared to network-only systems, and the resulting impacts of CPS attacks can be substantial. Intelligent intrusion detection systems (IDS) are an effective mechanism by which CPS can be secured, but the majority of current solutions often train and validate on network traffic-only datasets, ignoring the distinct attacks that may occur on other system layers. In order to address this, we develop an adaptable CPS anomaly detection model that can detect attacks within CPS without the need for previously labeled data. To achieve this, we utilize domain adaptation techniques that allow us to transfer known attack knowledge from a network traffic-only environment to a CPS environment. We validate our approach using a state-of-the-art CPS intrusion dataset that combines network, operating system (OS), and Robot Operating System (ROS) data. Through this dataset, we are able to demonstrate the effectiveness of our model across network traffic-only and CPS environments with distinct attack types and its ability to outperform other anomaly detection methods.

Updated: 2025-08-20 20:02:28

Categories: cs.CR,cs.AI

Download: http://arxiv.org/abs/2508.15865v1
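
The abstract does not specify which domain-adaptation technique is used, but one common baseline, CORAL-style second-order feature alignment, gives a feel for how knowledge learned on network-traffic features can be carried over to a differently distributed CPS feature space (everything below is an illustrative sketch, not the paper's method):

```python
import numpy as np

def coral_align(source, target, eps=1e-5):
    """Align second-order statistics: whiten the source features, then
    re-color them with the target covariance (CORAL)."""
    d = source.shape[1]
    cs = np.cov(source, rowvar=False) + eps * np.eye(d)
    ct = np.cov(target, rowvar=False) + eps * np.eye(d)

    def mat_pow(m, p):  # symmetric matrix power via eigendecomposition
        w, v = np.linalg.eigh(m)
        return (v * np.maximum(w, eps) ** p) @ v.T

    return source @ mat_pow(cs, -0.5) @ mat_pow(ct, 0.5)

rng = np.random.default_rng(0)
src = rng.normal(size=(400, 4)) * np.array([1.0, 2.0, 0.5, 3.0])  # "network" domain
tgt = rng.normal(size=(400, 4))                                   # "CPS" domain
aligned = coral_align(src, tgt)
```

After alignment, a detector trained on the source features can be applied in the target domain without target labels, which matches the paper's goal of avoiding previously labeled CPS data.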

Bridging the Mobile Trust Gap: A Zero Trust Framework for Consumer-Facing Applications

Zero Trust Architecture (ZTA) has become a widely adopted model for securing enterprise environments, promoting continuous verification and minimal trust across systems. However, its application in mobile contexts remains limited, despite mobile applications now accounting for most global digital interactions and being increasingly targeted by sophisticated threats. Existing Zero Trust frameworks developed by organisations such as the National Institute of Standards and Technology (NIST) and the Cybersecurity and Infrastructure Security Agency (CISA) primarily focus on enterprise-managed infrastructure, assuming organisational control over devices, networks, and identities. This paper addresses a critical gap by proposing an extended Zero Trust model designed for mobile applications operating in untrusted, user-controlled environments. Using a design science methodology, the study introduced a six-pillar framework that supports runtime enforcement of trust through controls including device integrity, user identity validation, data protection, secure application programming interface (API) usage, behavioural monitoring, and live application protection. Each pillar was mapped to relevant regulatory and security standards to support compliance. A phased implementation roadmap and maturity assessment model were also developed to guide adoption across varying organisational contexts. The proposed model offers a practical and standards-aligned approach to securing mobile applications beyond pre-deployment controls, aligning real-time enforcement with Zero Trust principles. This contribution expands the operational boundaries of ZTA and provides organisations with a deployable path to reduce fraud, enhance compliance, and address emerging mobile security challenges. Future research may include empirical validation of the framework and cross-sector application testing.

Updated: 2025-08-20 18:42:36

Categories: cs.CR,cs.CY,cs.NI,cs.SE,K.6.5; C.2.0; D.4.6

Download: http://arxiv.org/abs/2508.16662v1

Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs

Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community.

Updated: 2025-08-20 17:59:51

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.14896v1
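
The activation-outlier problem described in the abstract is easy to demonstrate with a plain symmetric per-tensor quantizer (a generic sketch, not one of the PTQ methods evaluated in the paper): one abnormally large activation inflates the scale, so ordinary values lose almost all precision.

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    """Symmetric per-tensor quantization with the scale set by max |x|."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

acts = np.random.default_rng(0).normal(0.0, 1.0, 1024)
err_clean = np.abs(quantize_dequantize(acts) - acts).mean()

acts_outlier = acts.copy()
acts_outlier[0] = 100.0  # a single outlier now dominates the dynamic range
dq = quantize_dequantize(acts_outlier)
err_outlier = np.abs(dq[1:] - acts_outlier[1:]).mean()  # error on ordinary values
```

With a 4-bit grid sized to the outlier, nearly every ordinary activation collapses into the same quantization bin, which is why the abstract identifies outliers as the key obstacle to low-bit quantization.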

Compute-Optimal Scaling for Value-Based Deep RL

As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per unit of compute. While such scaling has been well studied for language modeling, reinforcement learning (RL) has received less attention in this regard. In this paper, we investigate compute scaling for online, value-based deep RL. These methods present two primary axes for compute allocation: model capacity and the update-to-data (UTD) ratio. Given a fixed compute budget, we ask: how should resources be partitioned across these axes to maximize sample efficiency? Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch size quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch sizes at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning.

Updated: 2025-08-20 17:54:21

Categories: cs.LG

Download: http://arxiv.org/abs/2508.14881v1

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.

Updated: 2025-08-20 17:53:09

Categories: cs.CV,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.13968v2

Squeezed Diffusion Models

Diffusion models typically inject isotropic Gaussian noise, disregarding structure in the data. Motivated by the way quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle, we introduce Squeezed Diffusion Models (SDM), which scale noise anisotropically along the principal component of the training distribution. As squeezing enhances the signal-to-noise ratio in physics, we hypothesize that scaling noise in a data-dependent manner can better assist diffusion models in learning important data features. We study two configurations: (i) a Heisenberg diffusion model that compensates the scaling on the principal axis with inverse scaling on orthogonal directions and (ii) a standard SDM variant that scales only the principal axis. Counterintuitively, on CIFAR-10/100 and CelebA-64, mild antisqueezing - i.e. increasing variance on the principal axis - consistently improves FID by up to 15% and shifts the precision-recall frontier toward higher recall. Our results demonstrate that simple, data-aware noise shaping can deliver robust generative gains without architectural changes.

Updated: 2025-08-20 17:37:53

Categories: cs.LG,cs.CV

Download: http://arxiv.org/abs/2508.14871v1
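
The core operation, scaling noise anisotropically along the leading principal component, can be sketched as follows (an illustration of the stated idea; the function name and exact parameterization of the squeeze factor s are assumptions, with s > 1 corresponding to the "antisqueezing" the authors found helpful):

```python
import numpy as np

def squeezed_noise(data, shape, s=1.5, rng=None):
    """Gaussian noise with standard deviation s along the first principal
    component of `data` and 1 along all orthogonal directions."""
    rng = rng or np.random.default_rng(0)
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[0]                                   # unit principal axis
    eps = rng.standard_normal(shape)
    proj = eps @ v                              # component of each sample along v
    return eps + (s - 1.0) * np.outer(proj, v)  # rescale that component by s

data = np.random.default_rng(1).normal(size=(500, 8))
noise = squeezed_noise(data, (10000, 8), s=1.5)
```

A Heisenberg-style variant would additionally shrink the orthogonal components by a compensating factor so the total uncertainty budget is preserved.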

GenVC: Self-Supervised Zero-Shot Voice Conversion

Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner. GenVC leverages speech tokenizers and an autoregressive, Transformer-based language model as its backbone for speech generation. This design supports large-scale training while enhancing both source speaker privacy protection and target speaker cloning fidelity. Experimental results demonstrate that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches. Moreover, due to its autoregressive formulation, GenVC introduces flexibility in temporal alignment, reducing the preservation of source prosody and speaker-specific traits, and making it highly effective for voice anonymization.

Updated: 2025-08-20 17:34:21

Categories: eess.AS,cs.LG

Download: http://arxiv.org/abs/2502.04519v2

LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization

Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA (Low-Resources Subnet Integration Adaptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about 27% compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training. The source code is available at https://github.com/KlozeWang/LoSiA.

Updated: 2025-08-20 17:33:10

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.04487v3
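
The sub-network localization step can be illustrated with a toy gradient-magnitude selection (the general principle only; LoSiA's actual gradient sparsity analysis and integration scheme are more involved, and all names here are mine):

```python
import numpy as np

def select_subnet_mask(grads, keep_fraction=0.1):
    """Boolean mask keeping the parameters with the largest |gradient|."""
    flat = np.abs(grads).ravel()
    k = max(1, int(keep_fraction * flat.size))
    threshold = np.partition(flat, -k)[-k]  # k-th largest magnitude
    return np.abs(grads) >= threshold

grads = np.random.default_rng(0).normal(size=(64, 64))  # gradients of one weight matrix
mask = select_subnet_mask(grads, keep_fraction=0.1)
sparse_update = np.where(mask, grads, 0.0)  # only the sub-network is trained
```

Updating only such a sub-network keeps the adaptation high-rank while avoiding the extra low-rank matrix multiplications that adapter-style methods like LoRA insert into the computation.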

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents' interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/JARVIS-Xs/SE-Agent.

Updated: 2025-08-20 17:19:48

Categories: cs.AI

Download: http://arxiv.org/abs/2508.02085v4

Graph Structure Learning with Temporal Graph Information Bottleneck for Inductive Representation Learning

Temporal graph learning is crucial for dynamic networks where nodes and edges evolve over time and new nodes continuously join the system. Inductive representation learning in such settings faces two major challenges: effectively representing unseen nodes and mitigating noisy or redundant graph information. We propose GTGIB, a versatile framework that integrates Graph Structure Learning (GSL) with Temporal Graph Information Bottleneck (TGIB). We design a novel two-step GSL-based structural enhancer to enrich and optimize node neighborhoods and demonstrate its effectiveness and efficiency through theoretical proofs and experiments. The TGIB refines the optimized graph by extending the information bottleneck principle to temporal graphs, regularizing both edges and features based on our derived tractable TGIB objective function via variational approximation, enabling stable and efficient optimization. GTGIB-based models are evaluated to predict links on four real-world datasets; they outperform existing methods in all datasets under the inductive setting, with significant and consistent improvement in the transductive setting.

Updated: 2025-08-20 17:13:19

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.14859v1

It Takes Two: A Peer-Prediction Solution for Blockchain Verifier's Dilemma

The security of blockchain systems is fundamentally based on decentralized consensus, in which the majority of parties behave honestly, and the content verification process is essential to maintaining the robustness of blockchain systems. However, a rational verifier may lack the incentive to honestly perform costly verification, a phenomenon referred to as the Verifier's Dilemma, which can encourage lazy reporting and undermine the fundamental security of blockchain systems, particularly for verification-expensive decentralized AI applications. In this paper, we initiate this line of research by developing a Byzantine-robust peer prediction framework for designing one-phase Bayesian truthful mechanisms for decentralized verification games among multiple verifiers. The framework incentivizes all verifiers to perform honest verification without access to the ground truth, even in the presence of noisy observations, malicious players, and inaccurate priors in the verification process, and we propose compactness criteria that ensure these robustness guarantees. With robust incentive guarantees and budget efficiency, our study provides a framework of incentive design for decentralized verification protocols that enhances the security and robustness of the blockchain, decentralized AI, and potentially other decentralized systems.

Updated: 2025-08-20 17:12:12

Categories: cs.CR,cs.GT

Download: http://arxiv.org/abs/2406.01794v5

Security Concerns for Large Language Models: A Survey

Large Language Models (LLMs) such as ChatGPT and its competitors have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. This survey provides a comprehensive overview of these emerging concerns, categorizing threats into several key areas: prompt injection and jailbreaking; adversarial attacks, including input perturbations and data poisoning; misuse by malicious actors to generate disinformation, phishing emails, and malware; and the worrisome risks inherent in autonomous LLM agents. Recently, a significant focus is increasingly being placed on the latter, exploring goal misalignment, emergent deception, self-preservation instincts, and the potential for LLMs to develop and pursue covert, misaligned objectives, a behavior known as scheming, which may even persist through safety training. We summarize recent academic and industrial studies from 2022 to 2025 that exemplify each threat, analyze proposed defenses and their limitations, and identify open challenges in securing LLM-based applications. We conclude by emphasizing the importance of advancing robust, multi-layered security strategies to ensure LLMs are safe and beneficial.

Updated: 2025-08-20 17:03:48

Fields: cs.CR, cs.AI

Download: http://arxiv.org/abs/2505.18889v4

Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent

As large language models (LLMs) are increasingly deployed in critical applications, ensuring their robustness and safety alignment remains a major challenge. Despite the overall success of alignment techniques such as reinforcement learning from human feedback (RLHF) on typical prompts, LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts. Most existing jailbreak methods either rely on inefficient searches over discrete token spaces or direct optimization of continuous embeddings. While continuous embeddings can be given directly to selected open-source models as input, doing so is not feasible for proprietary models. On the other hand, projecting these embeddings back into valid discrete tokens introduces additional complexity and often reduces attack effectiveness. We propose an intrinsic optimization method which directly optimizes relaxed one-hot encodings of the adversarial suffix tokens using exponentiated gradient descent coupled with Bregman projection, ensuring that the optimized one-hot encoding of each token always remains within the probability simplex. We provide theoretical proof of convergence for our proposed method and implement an efficient algorithm that effectively jailbreaks several widely used LLMs. Our method achieves higher success rates and faster convergence compared to three state-of-the-art baselines, evaluated on five open-source LLMs and four adversarial behavior datasets curated for evaluating jailbreak methods. In addition to individual prompt attacks, we also generate universal adversarial suffixes effective across multiple prompts and demonstrate transferability of optimized suffixes to different LLMs.
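
The core update can be illustrated in a few lines. The sketch below is a toy version: it optimizes a single relaxed one-hot vector against a simple cross-entropy objective rather than an LLM's adversarial loss, and the vocabulary size, target token, and learning rate are invented. The multiplicative exponentiated-gradient step followed by renormalization can be viewed as the KL (Bregman) projection of the unconstrained step back onto the probability simplex, which is what keeps the encoding valid throughout.

```python
import numpy as np

def eg_step(p, grad, eta):
    """One exponentiated-gradient step: the multiplicative update
    followed by renormalization keeps p on the probability simplex."""
    p_new = p * np.exp(-eta * grad)
    return p_new / p_new.sum()

# Toy objective: pull the relaxed one-hot vector towards a target token
# (a stand-in for the adversarial loss over LLM logits in the paper).
vocab_size, target = 8, 3
p = np.full(vocab_size, 1.0 / vocab_size)  # uniform relaxed one-hot

for _ in range(200):
    # Gradient of -log(p[target]) with respect to p.
    grad = np.zeros(vocab_size)
    grad[target] = -1.0 / p[target]
    p = eg_step(p, grad, eta=0.1)

# The iterate never leaves the simplex, and mass concentrates on the target.
assert abs(p.sum() - 1.0) < 1e-9 and (p >= 0).all()
print(p.argmax())  # -> 3
```

At the end of optimization, the relaxed encoding can be discretized by taking the argmax per position, which is where a valid token sequence for black-box models would come from.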

Updated: 2025-08-20 17:03:32

Fields: cs.LG

Download: http://arxiv.org/abs/2508.14853v1

Multimodal Quantum Vision Transformer for Enzyme Commission Classification from Biochemical Representations

Accurately predicting enzyme functionality remains one of the major challenges in computational biology, particularly for enzymes with limited structural annotations or sequence homology. We present a novel multimodal Quantum Machine Learning (QML) framework that enhances Enzyme Commission (EC) classification by integrating four complementary biochemical modalities: protein sequence embeddings, quantum-derived electronic descriptors, molecular graph structures, and 2D molecular image representations. These modalities are processed by a Quantum Vision Transformer (QVT) backbone equipped with modality-specific encoders and a unified cross-attention fusion module. By integrating graph features and spatial patterns, our method captures key stereoelectronic interactions behind enzyme function. Experimental results demonstrate that our multimodal QVT model achieves a top-1 accuracy of 85.1%, outperforming sequence-only baselines by a substantial margin and achieving better performance than other QML models.

Updated: 2025-08-20 16:56:41

Fields: cs.LG

Download: http://arxiv.org/abs/2508.14844v1

Action Engine: Automatic Workflow Generation in FaaS

Function as a Service (FaaS) is poised to become the foundation of the next generation of cloud systems due to its inherent advantages in scalability, cost-efficiency, and ease of use. However, challenges such as the need for specialized knowledge, platform dependence, and difficulty in scaling functional workflows persist for cloud-native application developers. To overcome these challenges and mitigate the burden of developing FaaS-based applications, in this paper, we propose a mechanism called Action Engine, which uses tool-augmented large language models (LLMs) at its kernel to interpret human language queries and automate FaaS workflow generation, thereby reducing the need for specialized expertise and manual design. Action Engine includes modules to identify relevant functions from the FaaS repository and seamlessly manage the data dependency between them, ensuring the developer's query is processed and resolved. Beyond that, Action Engine can execute the generated workflow by injecting the user-provided arguments. On another front, this work addresses a gap in tool-augmented LLM research via adopting an Automatic FaaS Workflow Generation perspective to systematically evaluate methodologies across four fundamental sub-processes. Through benchmarking various parameters, this research provides critical insights into streamlining workflow automation for real-world applications, specifically in the FaaS continuum. Our evaluations demonstrate that the Action Engine achieves comparable performance to the few-shot learning approach while maintaining platform- and language-agnosticism, thereby mitigating provider-specific dependencies in workflow generation. We note that Action Engine can unlock FaaS workflow generation for non-cloud-savvy developers and expedite the development cycles of cloud-native applications.
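
Managing the data dependencies between functions reduces, at execution time, to ordering the workflow DAG so that every function runs after the functions whose outputs it consumes. A minimal sketch (the function names and dependency table below are hypothetical; Action Engine would infer such links from the FaaS repository and the user's query):

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each FaaS function maps to the set of functions
# whose outputs it consumes (illustrative names, not from the paper).
deps = {
    "fetch_orders":   set(),
    "fetch_prices":   set(),
    "join_data":      {"fetch_orders", "fetch_prices"},
    "compute_report": {"join_data"},
}

# A topological order gives a valid execution schedule for the workflow.
order = list(TopologicalSorter(deps).static_order())

# Every function appears after all of its dependencies.
for fn, parents in deps.items():
    assert all(order.index(p) < order.index(fn) for p in parents)
print(order)
```

`TopologicalSorter` also supports incremental scheduling (`prepare()` / `get_ready()`), which would map naturally onto executing independent functions concurrently.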

Updated: 2025-08-20 16:32:06

Fields: cs.DC, cs.AI, cs.LG, cs.SE

Download: http://arxiv.org/abs/2411.19485v2

TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis

Satellite remote sensing enables a wide range of downstream applications, including habitat mapping, carbon accounting, and strategies for conservation and sustainable land use. However, satellite time series are voluminous and often corrupted, making them challenging to use. We present TESSERA, an open, global, land-oriented remote sensing foundation model that uses self-supervised learning to generate 'ready-to-use' embeddings at 10 m scale from pixel-level satellite time-series data. TESSERA uses two encoders to combine optical data with synthetic aperture radar backscatter coefficients at 10 m resolution to create embeddings that are fused with a multilayer perceptron to create annual global embedding maps. We compare our work with state-of-the-art task-specific models and other foundation models in five diverse downstream tasks and find that TESSERA closely matches or outperforms these baselines. We believe that TESSERA's ease of use, state-of-the-art performance, openness, and computation- and labelled-data efficiency will prove transformative in a wide range of ecological applications.

Updated: 2025-08-20 16:28:55

Fields: cs.LG

Download: http://arxiv.org/abs/2506.20380v4

On Defining Neural Averaging

What does it even mean to average neural networks? We investigate the problem of synthesizing a single neural network from a collection of pretrained models, each trained on disjoint data shards, using only their final weights and no access to training data. In forming a definition of neural averaging, we take insight from model soup, which appears to aggregate multiple models into a singular model while enhancing generalization performance. In this work, we reinterpret model souping as a special case of a broader framework: Amortized Model Ensembling (AME) for neural averaging, a data-free meta-optimization approach that treats model differences as pseudogradients to guide neural weight updates. We show that this perspective not only recovers model soup but enables more expressive and adaptive ensembling strategies. Empirically, AME produces averaged neural solutions that outperform both individual experts and model soup baselines, especially in out-of-distribution settings. Our results suggest a principled and generalizable notion of data-free model weight aggregation and define, in one sense, how to perform neural averaging.
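
The pseudogradient view can be stated concretely. Under the simplifying assumption of a single weight tensor per model, treating each expert's difference from a shared base as a pseudogradient and taking a unit step recovers the uniform model soup; other step sizes or per-expert weightings give the more adaptive variants the abstract alludes to. This snippet is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def ame_step(theta0, experts, eta=1.0):
    """Treat expert-minus-base weight differences as pseudogradients:
    theta = theta0 + eta * mean_i(theta_i - theta0).
    With eta = 1 this reduces to the uniform model soup (plain average)."""
    diffs = [expert - theta0 for expert in experts]
    return theta0 + eta * np.mean(diffs, axis=0)

theta0 = np.zeros(4)  # shared initialization / base weights
experts = [np.array([1.0, 0.0, 2.0, 0.0]),   # expert trained on shard A
           np.array([3.0, 2.0, 0.0, 0.0])]   # expert trained on shard B

soup = ame_step(theta0, experts, eta=1.0)
assert np.allclose(soup, np.mean(experts, axis=0))  # recovers model soup
print(soup)  # -> [2. 1. 1. 0.]
```

In practice the same update would be applied parameter-tensor by parameter-tensor across the network, and the step size (or a schedule of steps) becomes the meta-optimization knob.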

Updated: 2025-08-20 16:28:08

Fields: cs.LG

Download: http://arxiv.org/abs/2508.14832v1

${\rm TIME}[t] \subseteq {\rm SPACE}[O(\sqrt{t})]$ via Tree Height Compression

We prove a square-root space simulation for deterministic multitape Turing machines, showing ${\rm TIME}[t] \subseteq {\rm SPACE}[O(\sqrt{t})]$. The key step is a Height Compression Theorem that uniformly (and in logspace) reshapes the canonical left-deep succinct computation tree for a block-respecting run into a binary tree whose evaluation-stack depth along any DFS path is $O(\log T)$ for $T = \lceil t/b \rceil$, while preserving $O(b)$ work at leaves, $O(1)$ at internal nodes, and edges that are logspace-checkable; semantic correctness across merges is witnessed by an exact $O(b)$ window replay at the unique interface. The proof uses midpoint (balanced) recursion, a per-path potential that bounds simultaneously active interfaces by $O(\log T)$, and an indegree-capping replacement of multiway merges by balanced binary combiners. Algorithmically, an Algebraic Replay Engine with constant-degree maps over a constant-size field, together with pointerless DFS and index-free streaming, ensures constant-size per-level tokens and eliminates wide counters, yielding the additive tradeoff $S(b)=O(b + \log(t/b))$ for block sizes $b \ge b_0$ with $b_0 = \Theta(\log t)$, which at the canonical choice $b = \Theta(\sqrt{t})$ gives $O(\sqrt{t})$ space; the $b_0$ threshold rules out degenerate blocks where addressing scratch would dominate the window footprint. The construction is uniform, relativizes, and is robust to standard model choices. Consequences include branching-program upper bounds $2^{O(\sqrt{s})}$ for size-$s$ bounded-fan-in circuits, tightened quadratic-time lower bounds for ${\rm SPACE}[n]$-complete problems via the standard hierarchy argument, and $O(\sqrt{t})$-space certifying interpreters; under explicit locality assumptions, the framework extends to geometric $d$-dimensional models.
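
For reference, evaluating the stated additive tradeoff at the canonical block size confirms the headline bound:

```latex
S(b) = O\bigl(b + \log(t/b)\bigr), \qquad
S\bigl(\Theta(\sqrt{t})\bigr)
  = O\bigl(\sqrt{t} + \log\sqrt{t}\,\bigr)
  = O\bigl(\sqrt{t} + \tfrac{1}{2}\log t\bigr)
  = O(\sqrt{t}).
```

Here the first term is the per-block window cost and the second is the stack depth contribution; at $b = \Theta(\sqrt{t})$ the window term dominates.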

Updated: 2025-08-20 16:27:53

Fields: cs.CC, cs.AI, cs.DS

Download: http://arxiv.org/abs/2508.14831v1

Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Synthetic data generation has recently emerged as a promising approach for enhancing the capabilities of large language models (LLMs) without the need for expensive human annotations. However, existing methods often generate data that can be low quality or contrived. In this paper, we introduce Source2Synth, a scalable approach for synthetic data generation and curation that is grounded in real-world data sources. Source2Synth takes as input a custom data source and produces synthetic data examples with intermediate reasoning steps. Our method improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two tasks that leverage two different types of data: multi-hop question answering (MHQA), where we test complex reasoning abilities leveraging documents, and tabular question answering (TQA), where we test tool usage leveraging tables. Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotpotQA compared to the fine-tuned baselines.

Updated: 2025-08-20 16:27:42

Fields: cs.CL, cs.AI

Download: http://arxiv.org/abs/2409.08239v2

Long Chain-of-Thought Reasoning Across Languages

Scaling inference through long chains-of-thought (CoTs) has unlocked impressive reasoning capabilities in large language models (LLMs), yet the reasoning process remains almost exclusively English-centric. We construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT generation across French, Japanese, Latvian, and Swahili. Our experiments reveal three key findings. First, the efficacy of using English as a pivot language varies by language: it provides no benefit for French, improves performance when used as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili where both task comprehension and reasoning remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. A lightweight fine-tune using only 1k traces still improves performance by over 30% in Swahili. Third, data quality versus scale trade-offs are language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora prove more effective for Swahili and Latvian. Together, these results clarify when and why long CoTs transfer across languages and provide translated datasets to foster equitable multilingual reasoning research.

Updated: 2025-08-20 16:22:51

Fields: cs.CL, cs.AI, cs.LG

Download: http://arxiv.org/abs/2508.14828v1

Fragile, Robust, and Antifragile: A Perspective from Parameter Responses in Reinforcement Learning Under Stress

This paper explores Reinforcement learning (RL) policy robustness by systematically analyzing network parameters under internal and external stresses. Inspired by synaptic plasticity in neuroscience, synaptic filtering introduces internal stress by selectively perturbing parameters, while adversarial attacks apply external stress through modified agent observations. This dual approach enables the classification of parameters as fragile, robust, or antifragile, based on their influence on policy performance in clean and adversarial settings. Parameter scores are defined to quantify these characteristics, and the framework is validated on PPO-trained agents in Mujoco continuous control environments. The results highlight the presence of antifragile parameters that enhance policy performance under stress, demonstrating the potential of targeted filtering techniques to improve RL policy adaptability. These insights provide a foundation for future advancements in the design of robust and antifragile RL systems.
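
The resulting taxonomy can be sketched as a simple decision rule over performance deltas. Everything below is illustrative: the thresholds and scores are invented, whereas the paper derives its parameter scores from synaptic filtering and adversarial evaluation of trained PPO agents.

```python
import numpy as np

def classify_parameters(clean_delta, adv_delta, tol=1e-3):
    """Toy version of the fragile/robust/antifragile taxonomy.
    Each delta is the change in policy performance when a parameter is
    filtered (perturbed), measured in clean and adversarial settings.
    Negative delta means performance drops under perturbation."""
    labels = []
    for dc, da in zip(clean_delta, adv_delta):
        if dc < -tol or da < -tol:
            labels.append("fragile")       # perturbation hurts the policy
        elif da > tol:
            labels.append("antifragile")   # performance improves under stress
        else:
            labels.append("robust")        # largely insensitive
    return labels

# Hypothetical per-parameter performance deltas.
deltas_clean = np.array([-0.20, 0.00, 0.01])
deltas_adv   = np.array([-0.15, 0.00, 0.08])
print(classify_parameters(deltas_clean, deltas_adv))
# -> ['fragile', 'robust', 'antifragile']
```

The interesting population identified by the paper is the third class: parameters whose perturbation actually improves performance under adversarial stress, which motivates targeted filtering.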

Updated: 2025-08-20 16:21:01

Fields: cs.LG, cs.SY, eess.SY

Download: http://arxiv.org/abs/2506.23036v2

From Passive Tool to Socio-cognitive Teammate: A Conceptual Framework for Agentic AI in Human-AI Collaborative Learning

The role of Artificial Intelligence (AI) in education is undergoing a rapid transformation, moving beyond its historical function as an instructional tool towards a new potential as an active participant in the learning process. This shift is driven by the emergence of agentic AI, autonomous systems capable of proactive, goal-directed action. However, the field lacks a robust conceptual framework to understand, design, and evaluate this new paradigm of human-AI interaction in learning. This paper addresses this gap by proposing a novel conceptual framework (the APCP framework) that charts the transition from AI as a tool to AI as a collaborative partner. We present a four-level model of escalating AI agency within human-AI collaborative learning: (1) the AI as an Adaptive Instrument, (2) the AI as a Proactive Assistant, (3) the AI as a Co-Learner, and (4) the AI as a Peer Collaborator. Grounded in sociocultural theories of learning and Computer-Supported Collaborative Learning (CSCL), this framework provides a structured vocabulary for analysing the shifting roles and responsibilities between human and AI agents. The paper further engages in a critical discussion of the philosophical underpinnings of collaboration, examining whether an AI, lacking genuine consciousness or shared intentionality, can be considered a true collaborator. We conclude that while AI may not achieve authentic phenomenological partnership, it can be designed as a highly effective functional collaborator. This distinction has significant implications for pedagogy, instructional design, and the future research agenda for AI in education, urging a shift in focus towards creating learning environments that harness the complementary strengths of both human and AI.

Updated: 2025-08-20 16:17:32

Fields: cs.HC, cs.AI

Download: http://arxiv.org/abs/2508.14825v1

The C-index Multiverse

Quantifying out-of-sample discrimination performance for time-to-event outcomes is a fundamental step for model evaluation and selection in the context of predictive modelling. The concordance index, or C-index, is a widely used metric for this purpose, particularly with the growing development of machine learning methods. Beyond differences between proposed C-index estimators (e.g. Harrell's, Uno's and Antolini's), we demonstrate the existence of a C-index multiverse among available R and python software, where seemingly equal implementations can yield different results. This can undermine reproducibility and complicate fair comparisons across models and studies. Key variation sources include tie handling and adjustment to censoring. Additionally, the absence of a standardised approach to summarising risk from survival distributions results in another source of variation dependent on input types. We demonstrate the consequences of the C-index multiverse when quantifying predictive performance for several survival models (from Cox proportional hazards to recent deep learning approaches) on publicly available breast cancer data, and semi-synthetic examples. Our work emphasises the need for better reporting to improve transparency and reproducibility. This article aims to be a useful guideline, helping analysts when navigating the multiverse, providing unified documentation and highlighting potential pitfalls of existing software. All code is publicly available at: www.github.com/BBolosSierra/CindexMultiverse.
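
One concrete source of the multiverse, tie handling, fits in a short function. The sketch below is a naive O(n^2) Harrell-style estimator on invented toy data, not any particular package's implementation; on the same inputs, crediting tied risk scores 0.5 versus dropping tied pairs already yields different C-index values.

```python
def c_index(times, events, risks, ties="half"):
    """Naive Harrell-style concordance index for right-censored data.
    A pair (i, j) is comparable when subject i has the shorter observed
    time and experienced the event. `ties` picks one convention for tied
    risk scores: 'half' credits 0.5, 'drop' removes the pair entirely."""
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                if risks[i] == risks[j]:
                    if ties == "half":
                        concordant += 0.5
                        comparable += 1.0
                    # 'drop': tied pair excluded from both counts
                else:
                    comparable += 1.0
                    if risks[i] > risks[j]:
                        concordant += 1.0
    return concordant / comparable

times  = [2, 4, 4, 6, 8]   # observed times (toy data)
events = [1, 1, 0, 1, 0]   # 1 = event, 0 = censored
risks  = [0.9, 0.7, 0.7, 0.7, 0.1]

print(c_index(times, events, risks, ties="half"))  # 6.5/7 ≈ 0.9286
print(c_index(times, events, risks, ties="drop"))  # 6/6 = 1.0
```

Two "equal" implementations differing only in this convention would report 0.93 versus 1.00 for the same model, which is precisely the reproducibility hazard the paper documents.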

Updated: 2025-08-20 16:11:10

Fields: stat.ML, cs.LG, stat.AP

Download: http://arxiv.org/abs/2508.14821v1

Successive Halving with Learning Curve Prediction via Latent Kronecker Gaussian Processes

Successive Halving is a popular algorithm for hyperparameter optimization which allocates exponentially more resources to promising candidates. However, the algorithm typically relies on intermediate performance values to make resource allocation decisions, which can cause it to prematurely prune slow starters that would eventually become the best candidate. We investigate whether guiding Successive Halving with learning curve predictions based on Latent Kronecker Gaussian Processes can overcome this limitation. In a large-scale empirical study involving different neural network architectures and a click prediction dataset, we compare this predictive approach to the standard approach based on current performance values. Our experiments show that, although the predictive approach achieves competitive performance, it is not Pareto optimal compared to investing more resources into the standard approach, because it requires fully observed learning curves as training data. However, this downside could be mitigated by leveraging existing learning curve data.
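
The pruning mechanism, and the failure mode the paper targets, can be seen in a toy run. The sketch below is not the paper's implementation: the exponential learning curves and candidate names are invented and evaluation is deterministic. A "slow starter" with the best asymptote is eliminated in the first round because its intermediate score is worst.

```python
import math

def successive_halving(candidates, evaluate, min_budget=1, eta=2, rounds=2):
    """Classic Successive Halving: each round, evaluate all survivors
    with eta-times more budget and keep the best 1/eta of them."""
    budget = min_budget
    survivors = list(candidates)
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]

# Toy learning curves: score(c, budget) = asymptote * (1 - exp(-rate * budget)).
curves = {
    "fast_mediocre": (0.70, 2.0),   # converges quickly, low ceiling
    "slow_best":     (0.95, 0.1),   # slow starter, best ceiling
    "steady":        (0.80, 0.5),
}

def evaluate(name, budget):
    asymptote, rate = curves[name]
    return asymptote * (1 - math.exp(-rate * budget))

winner = successive_halving(list(curves), evaluate)
print(winner)  # -> 'fast_mediocre': the eventual best candidate was pruned early
```

Replacing the intermediate score with a prediction of the curve's asymptote, which is what the Latent Kronecker Gaussian Process supplies, is exactly the change that would rescue `slow_best` here.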

Updated: 2025-08-20 16:10:23

Fields: cs.LG

Download: http://arxiv.org/abs/2508.14818v1

Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs

Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution for extracting and reasoning over this unstructured text, but the length of clinical notes often exceeds even state-of-the-art models' extended context windows. Retrieval-augmented generation (RAG) offers an alternative by retrieving task-relevant passages from across the entire EHR, potentially reducing the amount of required input tokens. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models' full context while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly longer amounts of text.

Updated: 2025-08-20 16:09:37

Fields: cs.CL, cs.AI

Download: http://arxiv.org/abs/2508.14817v1

A Lightweight Privacy-Preserving Smart Metering Billing Protocol with Dynamic Tariff Policy Adjustment

The integration of information and communication technology (ICT) with traditional power grids has led to the emergence of smart grids. Advanced metering infrastructure (AMI) plays a crucial role in smart grids by facilitating two-way communication between smart meters and the utility provider. This bidirectional communication allows intelligent meters to report fine-grained consumption data at predefined intervals, enabling accurate billing, efficient grid monitoring and management, and rapid outage detection. However, the collection of detailed consumption data can inadvertently disclose consumers' daily activities, raising privacy concerns and potentially leading to privacy violations. To address these issues and preserve individuals' privacy, we propose a lightweight privacy-preserving smart metering protocol specifically designed to support real-time tariff billing service with dynamic policy adjustment. Our scheme employs an efficient data perturbation technique to obscure precise energy usage data from internal adversaries, including the intermediary gateways and the utility provider. Subsequently, we validate the efficiency and security of our protocol through comprehensive performance and privacy evaluations. We examined the computational, memory, and communication overhead of the proposed scheme. The execution time of our secure and privacy-aware billing system is approximately 3.94540 seconds for a complete year. Furthermore, we employed the Jensen-Shannon divergence as a privacy metric to demonstrate that our protocol can effectively safeguard users' privacy by increasing the noise scale.
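
The privacy metric itself is compact. Below is a minimal sketch of the Jensen-Shannon divergence (natural-log base, so it is bounded by ln 2) applied to hypothetical normalised consumption profiles; the profiles and noise scales are invented, but they illustrate the point that a larger noise scale moves the reported distribution further from the true one.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence with natural log: symmetric, bounded
    by ln(2), and zero exactly when the two distributions coincide."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical interval consumption profiles, normalised to distributions.
true_usage      = [0.10, 0.40, 0.30, 0.20]
perturbed_light = [0.12, 0.38, 0.29, 0.21]   # small noise scale
perturbed_heavy = [0.25, 0.25, 0.25, 0.25]   # large noise scale

assert jsd(true_usage, true_usage) == 0.0
assert jsd(true_usage, perturbed_light) < jsd(true_usage, perturbed_heavy)
```

A higher divergence between the true and reported profiles means an internal adversary (gateway or utility) learns less about the household's actual activity pattern, which is the sense in which increasing the noise scale strengthens privacy.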

Updated: 2025-08-20 16:06:19

Fields: cs.CR

Download: http://arxiv.org/abs/2508.14815v1

TransLight: Image-Guided Customized Lighting Control with Generative Decoupling

Most existing illumination-editing approaches fail to simultaneously provide customized control of light effects and preserve content integrity. This makes them less effective for practical lighting stylization requirements, especially in the challenging task of transferring complex light effects from a reference image to a user-specified target image. To address this problem, we propose TransLight, a novel framework that enables high-fidelity and high-freedom transfer of light effects. Extracting the light effect from the reference image is the most critical and challenging step in our method. The difficulty lies in the complex geometric structure features embedded in light effects that are highly coupled with content in real-world scenarios. To achieve this, we first present Generative Decoupling, where two fine-tuned diffusion models are used to accurately separate image content and light effects, generating a newly curated, million-scale dataset of image-content-light triplets. Then, we employ IC-Light as the generative model and train our model with our triplets, injecting the reference lighting image as an additional conditioning signal. The resulting TransLight model enables customized and natural transfer of diverse light effects. Notably, by thoroughly disentangling light effects from reference images, our generative decoupling strategy endows TransLight with highly flexible illumination control. Experimental results establish TransLight as the first method to successfully transfer light effects across disparate images, delivering more customized illumination control than existing techniques and charting new directions for research in illumination harmonization and editing.

Updated: 2025-08-20 16:05:12

标题: TransLight:具有生成解耦的图像引导定制照明控制

摘要: 大多数现有的光照编辑方法无法同时提供光效的定制控制和内容完整性的保留。这使得它们难以满足实际的光照风格化需求,特别是在将复杂光效从参考图像迁移到用户指定目标图像这一具有挑战性的任务中。为了解决这一问题,我们提出了TransLight,一个能够实现高保真、高自由度光效迁移的新颖框架。从参考图像中提取光效是我们方法中最关键也最具挑战性的一步。困难在于,现实场景中嵌入在光效里的复杂几何结构特征与内容高度耦合。为此,我们首先提出了生成解耦(Generative Decoupling),使用两个经过微调的扩散模型准确地分离图像内容和光效,构建了一个新整理的、百万规模的图像-内容-光效三元组数据集。然后,我们采用IC-Light作为生成模型,并用我们的三元组数据训练模型,将参考光照图像作为额外的条件信号注入。由此得到的TransLight模型能够实现多样光效的定制化、自然迁移。值得注意的是,通过彻底将光效从参考图像中解耦,我们的生成解耦策略赋予了TransLight高度灵活的光照控制。实验结果确立了TransLight作为首个成功在不同图像之间迁移光效的方法,提供了比现有技术更定制化的光照控制,并为光照协调与编辑研究开辟了新方向。

更新时间: 2025-08-20 16:05:12

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.14814v1

JudgeLRM: Large Reasoning Models as a Judge

The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) for judges approaches often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples - highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.

Updated: 2025-08-20 16:01:52

标题: JudgeLRM:作为评判者的大型推理模型

摘要: 大型语言模型(LLMs)作为评估者的兴起为人工标注提供了可扩展的替代方案,然而现有的针对评判模型(judge)的监督微调(SFT)方法在需要复杂推理的领域往往表现不佳。在这项工作中,我们调查了LLM评判模型是否真正受益于增强的推理能力。通过对评估任务中推理要求的详细分析,我们揭示了SFT性能提升与需要推理的样本比例之间的负相关关系,突显了SFT在此类场景下的局限性。为了解决这个问题,我们引入了JudgeLRM,这是一组面向评判任务的LLM,使用带有按评判划分、结果驱动奖励(judge-wise, outcome-driven rewards)的强化学习(RL)进行训练。JudgeLRM模型始终优于经过SFT微调的模型和最先进的推理模型。值得注意的是,JudgeLRM-3B超越了GPT-4,而JudgeLRM-7B在F1得分上比DeepSeek-R1高出2.79%,尤其在需要深入推理的评判任务中表现出色。

更新时间: 2025-08-20 16:01:52

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2504.00050v2

DINOv3 with Test-Time Training for Medical Image Registration

Prior medical image registration approaches, particularly learning-based methods, often require large amounts of training data, which constrains clinical adoption. To overcome this limitation, we propose a training-free pipeline that relies on a frozen DINOv3 encoder and test-time optimization of the deformation field in feature space. Across two representative benchmarks, the method is accurate and yields regular deformations. On Abdomen MR-CT, it attained the best mean Dice score (DSC) of 0.790 together with the lowest 95th percentile Hausdorff Distance (HD95) of 4.9±5.0 and the lowest standard deviation of Log-Jacobian (SDLogJ) of 0.08±0.02. On ACDC cardiac MRI, it improves mean DSC to 0.769 and reduces SDLogJ to 0.11 and HD95 to 4.8, a marked gain over the initial alignment. The results indicate that operating in a compact foundation feature space at test time offers a practical and general solution for clinical registration without additional training.

Updated: 2025-08-20 15:58:19

标题: DINOv3结合测试时间训练用于医学图像配准

摘要: 以往的医学图像配准方法,特别是基于学习的方法,通常需要大量训练数据,这限制了临床采用。为了克服这一限制,我们提出了一种无需训练的流程,依赖于冻结的DINOv3编码器和在特征空间中对形变场的测试时优化。在两个代表性基准上,该方法准确且产生规则的形变。在腹部MR-CT上,它取得了最佳的平均Dice分数(DSC)0.790,以及最低的第95百分位Hausdorff距离(HD95)4.9±5.0和最低的对数雅可比标准差(SDLogJ)0.08±0.02。在ACDC心脏MRI上,它将平均DSC提高到0.769,并将SDLogJ降低到0.11、HD95降至4.8,相比初始对齐有明显提升。结果表明,在测试时于紧凑的基础模型特征空间中操作,为临床配准提供了一种无需额外训练的实用且通用的解决方案。

更新时间: 2025-08-20 15:58:19

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.14809v1

Enhancing Contrastive Link Prediction With Edge Balancing Augmentation

Link prediction is one of the most fundamental tasks in graph mining, which motivates the recent studies of leveraging contrastive learning to enhance the performance. However, we observe two major weaknesses of these studies: i) the lack of theoretical analysis for contrastive learning on link prediction, and ii) inadequate consideration of node degrees in contrastive learning. To address the above weaknesses, we provide the first formal theoretical analysis for contrastive learning on link prediction, where our analysis results can generalize to the autoencoder-based link prediction models with contrastive learning. Motivated by our analysis results, we propose a new graph augmentation approach, Edge Balancing Augmentation (EBA), which adjusts the node degrees in the graph as the augmentation. We then propose a new approach, named Contrastive Link Prediction with Edge Balancing Augmentation (CoEBA), that integrates the proposed EBA and the proposed new contrastive losses to improve the model performance. We conduct experiments on 8 benchmark datasets. The results demonstrate that our proposed CoEBA significantly outperforms the other state-of-the-art link prediction models.
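
To make the idea of degree adjustment concrete, here is a hypothetical toy version of a degree-balancing augmentation on an undirected graph: it repeatedly removes an edge at the highest-degree node and attaches the lowest-degree node elsewhere. The actual EBA procedure and its contrastive losses are defined in the paper; the function name and move budget below are illustrative only.

```python
import random

def edge_balancing_augmentation(adj, n_moves=3, seed=0):
    """Toy degree-balancing augmentation on an undirected graph.

    adj: dict mapping node -> set of neighbours. Returns a rewired copy
    whose degree distribution is flatter than the input's.
    """
    rng = random.Random(seed)
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    for _ in range(n_moves):
        nodes = sorted(adj, key=lambda u: len(adj[u]))
        lo, hi = nodes[0], nodes[-1]
        if len(adj[hi]) - len(adj[lo]) <= 1:
            break  # already balanced
        # remove one edge incident to the highest-degree node ...
        v = rng.choice(sorted(adj[hi]))
        adj[hi].discard(v)
        adj[v].discard(hi)
        # ... and add one edge at the lowest-degree node (never back to hi)
        candidates = [w for w in adj if w not in (lo, hi) and w not in adj[lo]]
        if candidates:
            w = rng.choice(candidates)
            adj[lo].add(w)
            adj[w].add(lo)
    return adj
```

On a star graph, each move shaves one edge off the hub and reattaches a leaf, which is the flattening effect the augmentation is after.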

Updated: 2025-08-20 15:58:01

标题: 利用边平衡增强提升对比链路预测

摘要: 链路预测是图挖掘中最基本的任务之一,这促使近来的研究利用对比学习来提高性能。然而,我们观察到这些研究存在两个主要弱点:i)缺乏针对链路预测上对比学习的理论分析,ii)对比学习中对节点度的考虑不足。为了解决上述弱点,我们首次为链路预测中的对比学习提供了正式的理论分析,其分析结果可以推广到结合对比学习的基于自编码器的链路预测模型。受分析结果的启发,我们提出了一种新的图增强方法,即边平衡增强(Edge Balancing Augmentation, EBA),它通过调整图中的节点度来实现增强。随后我们提出了一种新方法,称为结合边平衡增强的对比链路预测(Contrastive Link Prediction with Edge Balancing Augmentation, CoEBA),它整合了所提出的EBA和新的对比损失以提高模型性能。我们在8个基准数据集上进行实验。结果表明,我们提出的CoEBA显著优于其他最先进的链路预测模型。

更新时间: 2025-08-20 15:58:01

领域: cs.LG

下载: http://arxiv.org/abs/2508.14808v1

Source-Guided Flow Matching

Guidance of generative models is typically achieved by modifying the probability flow vector field through the addition of a guidance field. In this paper, we instead propose the Source-Guided Flow Matching (SGFM) framework, which modifies the source distribution directly while keeping the pre-trained vector field intact. This reduces the guidance problem to a well-defined problem of sampling from the source distribution. We theoretically show that SGFM recovers the desired target distribution exactly. Furthermore, we provide bounds on the Wasserstein error for the generated distribution when using an approximate sampler of the source distribution and an approximate vector field. The key benefit of our approach is that it allows the user to flexibly choose the sampling method depending on their specific problem. To illustrate this, we systematically compare different sampling methods and discuss conditions for asymptotically exact guidance. Moreover, our framework integrates well with optimal flow matching models since the straight transport map generated by the vector field is preserved. Experimental results on synthetic 2D benchmarks, image datasets, and physics-informed generative tasks demonstrate the effectiveness and flexibility of the proposed framework.
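
In flow-matching notation, the setup the abstract describes can be summarized as follows (our paraphrase, not the paper's exact statement): writing $\Phi_1$ for the time-one flow map of the pre-trained vector field $v_t$, the model generates the pushforward of the source, so steering toward a desired target $\tilde{p}_1$ while keeping $v_t$ fixed reduces to sampling from a modified source:

```latex
\dot{x}_t = v_t(x_t), \qquad
p_1 = (\Phi_1)_{\#}\, p_0, \qquad
\tilde{p}_0 := (\Phi_1^{-1})_{\#}\, \tilde{p}_1
\;\Longrightarrow\;
(\Phi_1)_{\#}\, \tilde{p}_0 = \tilde{p}_1 .
```

This is the "well-defined problem of sampling from the source distribution" referred to above.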

Updated: 2025-08-20 15:56:25

标题: 来源导向的流匹配

摘要: 生成模型的引导通常通过添加引导场来修改概率流向量场实现。在本文中,我们提出了源引导流匹配(Source-Guided Flow Matching, SGFM)框架,该框架直接修改源分布,同时保持预训练的向量场不变。这将引导问题归结为一个定义良好的问题:从源分布中采样。我们在理论上证明了SGFM可以精确恢复所需的目标分布。此外,当使用源分布的近似采样器和近似向量场时,我们给出了生成分布的Wasserstein误差界。我们方法的关键优点是允许用户根据具体问题灵活选择采样方法。为了说明这一点,我们系统比较了不同的采样方法,并讨论了渐近精确引导的条件。此外,由于向量场生成的直线传输映射(straight transport map)得以保留,我们的框架能与最优流匹配模型很好地结合。在合成2D基准、图像数据集和物理信息(physics-informed)生成任务上的实验结果表明了所提框架的有效性和灵活性。

更新时间: 2025-08-20 15:56:25

领域: cs.LG

下载: http://arxiv.org/abs/2508.14807v1

Learning from user's behaviour of some well-known congested traffic networks

We consider the problem of predicting users' behavior of a congested traffic network under an equilibrium condition, the traffic assignment problem. We propose a two-stage machine learning approach which couples a neural network with a fixed point algorithm, and we evaluate its performance along several classical congested traffic networks.
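
The second stage of such a two-stage approach is a standard fixed-point iteration; a minimal damped version (illustrative only, since the paper's algorithm and its coupling with the neural network are not specified here) looks like:

```python
import math

def damped_fixed_point(f, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    # Iterate x <- (1 - damping) * x + damping * f(x) until the update
    # falls below tol; damping helps convergence for non-contractive maps.
    x = x0
    for _ in range(max_iter):
        x_new = (1 - damping) * x + damping * f(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

For instance, `damped_fixed_point(math.cos, 1.0)` converges to the unique solution of x = cos(x); in the traffic setting, `f` would be the (learned or simulated) network loading map whose fixed point is the equilibrium assignment.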

Updated: 2025-08-20 15:53:13

标题: 学习一些知名拥堵交通网络用户行为

摘要: 我们考虑在均衡条件下预测拥挤交通网络用户行为的问题,即交通分配问题。我们提出了一种两阶段的机器学习方法,将神经网络与不动点算法相结合,并在几个经典的拥挤交通网络上评估其性能。

更新时间: 2025-08-20 15:53:13

领域: math.OC,cs.LG,90B20, 68T20, 90C33

下载: http://arxiv.org/abs/2508.14804v1

Privileged Self-Access Matters for Introspection in AI

Whether AI models can introspect is an increasingly important practical question. But there is no consensus on how introspection is to be defined. Beginning from a recently proposed ''lightweight'' definition, we argue instead for a thicker one. According to our proposal, introspection in AI is any process which yields information about internal states through a process more reliable than one with equal or lower computational cost available to a third party. Using experiments where LLMs reason about their internal temperature parameters, we show they can appear to have lightweight introspection while failing to meaningfully introspect per our proposed definition.

Updated: 2025-08-20 15:52:34

标题: AI中特权的自我访问对内省至关重要

摘要: AI模型是否能够内省是一个日益重要的实际问题。但关于如何定义内省,目前尚无共识。从最近提出的“轻量级”定义出发,我们主张采用一个更厚实的定义。根据我们的提议,AI中的内省是指任何这样的过程:它产生关于内部状态的信息,且比第三方在相同或更低计算成本下可用的任何过程都更可靠。通过让LLMs推理其内部温度参数的实验,我们展示了它们可以表现出轻量级的内省,但按照我们提出的定义,却未能进行有意义的内省。

更新时间: 2025-08-20 15:52:34

领域: cs.AI,cs.CL

下载: http://arxiv.org/abs/2508.14802v1

A Guide for Manual Annotation of Scientific Imagery: How to Prepare for Large Projects

Despite the high demand for manually annotated image data, managing complex and costly annotation projects remains under-discussed. This is partly due to the fact that leading such projects requires dealing with a set of diverse and interconnected challenges which often fall outside the expertise of specific domain experts, leaving practical guidelines scarce. These challenges range widely from data collection to resource allocation and recruitment, from mitigation of biases to effective training of the annotators. This paper provides a domain-agnostic preparation guide for annotation projects, with a focus on scientific imagery. Drawing from the authors' extensive experience in managing a large manual annotation project, it addresses fundamental concepts including success measures, annotation subjects, project goals, data availability, and essential team roles. Additionally, it discusses various human biases and recommends tools and technologies to improve annotation quality and efficiency. The goal is to encourage further research and frameworks for creating a comprehensive knowledge base to reduce the costs of manual annotation projects across various fields.

Updated: 2025-08-20 15:52:10

标题: 科学图像手动标注指南:如何准备大型项目

摘要: 尽管手动标注图像数据的需求很高,但管理复杂且昂贵的标注项目仍未得到充分讨论。这部分是因为领导这类项目需要应对一系列多样且相互关联的挑战,这些挑战通常超出特定领域专家的专业知识范围,导致实用指南匮乏。这些挑战范围广泛,从数据收集到资源分配和招募,从减少偏见到有效培训标注人员。本文提供了一个面向科学图像的领域无关的标注项目准备指南。借鉴作者在管理大规模手动标注项目方面的丰富经验,它涉及包括成功衡量标准、标注主题、项目目标、数据可用性和关键团队角色在内的基本概念。此外,它讨论了各种人类偏见,并推荐工具和技术以提高标注质量和效率。其目标是鼓励进一步研究和框架,创建一个全面的知识库,以降低各个领域手动标注项目的成本。

更新时间: 2025-08-20 15:52:10

领域: cs.LG

下载: http://arxiv.org/abs/2508.14801v1

Towards Understanding Gradient Dynamics of the Sliced-Wasserstein Distance via Critical Point Analysis

In this paper, we investigate the properties of the Sliced Wasserstein Distance (SW) when employed as an objective functional. The SW metric has gained significant interest in the optimal transport and machine learning literature, due to its ability to capture intricate geometric properties of probability distributions while remaining computationally tractable, making it a valuable tool for various applications, including generative modeling and domain adaptation. Our study aims to provide a rigorous analysis of the critical points arising from the optimization of the SW objective. By computing explicit perturbations, we establish that stable critical points of SW cannot concentrate on segments. This stability analysis is crucial for understanding the behaviour of optimization algorithms for models trained using the SW objective. Furthermore, we investigate the properties of the SW objective, shedding light on the existence and convergence behavior of critical points. We illustrate our theoretical results through numerical experiments.
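
For intuition, the SW distance underlying this analysis can be estimated by Monte Carlo over random one-dimensional projections; a rough sketch for equal-size 2-D point clouds (not the paper's code) is:

```python
import math
import random

def sliced_wasserstein_2(X, Y, n_proj=500, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance between
    two equal-size point clouds in R^2: project onto random directions,
    then use the closed-form 1-D optimal transport (sorted matching)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        ux, uy = math.cos(theta), math.sin(theta)
        px = sorted(x * ux + y * uy for x, y in X)
        py = sorted(x * ux + y * uy for x, y in Y)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_proj)
```

Shifting a cloud by a unit vector yields a distance near sqrt(1/2), since each projection contributes cos²θ on average, which is the kind of geometric sensitivity the abstract credits the SW metric with.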

Updated: 2025-08-20 15:52:07

标题: 朝向通过临界点分析理解切片-瓦瑟斯坦距离的梯度动力学

摘要: 在本文中,我们研究了切片Wasserstein距离(SW)作为目标泛函时的性质。SW度量能够捕捉概率分布的复杂几何特性,同时保持计算上的可行性,因而在最优输运和机器学习文献中引起了广泛兴趣,成为生成建模和领域自适应等各种应用的有价值工具。我们的研究旨在对优化SW目标所产生的临界点进行严格分析。通过计算显式扰动,我们证明了SW的稳定临界点不能集中在线段上。这种稳定性分析对于理解使用SW目标训练的模型的优化算法行为至关重要。此外,我们研究了SW目标的性质,阐明了临界点的存在性和收敛行为。我们通过数值实验说明了我们的理论结果。

更新时间: 2025-08-20 15:52:07

领域: stat.ML,cs.LG

下载: http://arxiv.org/abs/2502.06525v2

TASER: Table Agents for Schema-guided Extraction and Recommendation

Real-world financial documents report essential information about an entity's financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes with the maximum number of rows amounting to 426 per table across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation) that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents execute on table detection, classification, extraction, and recommendations by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we have manually labeled 22,584 pages (28,150,449 tokens), 3,213 tables for $731,685,511,687 of holdings culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.

Updated: 2025-08-20 15:50:21

标题: TASER:用于模式引导的抽取和推荐的表代理

摘要: 真实世界的财务文件报告了实体财务持仓的关键信息,这些持仓可能涵盖数百万种不同的金融工具类型。然而,这些细节通常埋藏在混乱、多页、碎片化的表格中 - 例如,我们数据集中99.4%的表格没有边界框,单个表格最多有426行,跨越44页。为了应对真实世界表格带来的这些独特挑战,我们提出了一个持续学习的agent式表格提取系统TASER(Table Agents for Schema-guided Extraction and Recommendation),它将高度非结构化、多页、异构的表格提取为标准化、符合模式的输出。我们的表格agent利用初始模式执行表格检测、分类、提取和推荐。随后,我们的推荐agent(Recommender Agent)审查输出、建议模式修订并决定最终推荐,使TASER比Table Transformer等现有表格检测模型性能提高10.1%。在这一持续学习过程中,我们强调更大的批量大小使可操作且被采用的模式推荐增加104.3%,并使提取到的持仓增加9.8%,突显了持续学习过程的重要性。为了训练TASER,我们手动标注了22,584页(28,150,449个词元)、3,213个表格,涉及731,685,511,687美元的持仓,构成了最早的真实财务表格数据集之一。我们发布了数据集TASERTab,使研究社区能够获取真实世界的财务表格和输出。我们的结果突显了agent式、模式引导的提取系统在稳健理解真实世界财务表格方面的潜力。

更新时间: 2025-08-20 15:50:21

领域: cs.AI,cs.CL,cs.IR,cs.LG

下载: http://arxiv.org/abs/2508.13404v2

A Guide to Stakeholder Analysis for Cybersecurity Researchers

Stakeholder-based ethics analysis is now a formal requirement for submissions to top cybersecurity research venues. This requirement reflects a growing consensus that cybersecurity researchers must go beyond providing capabilities to anticipating and mitigating the potential harms thereof. However, many cybersecurity researchers may be uncertain about how to proceed in an ethics analysis. In this guide, we provide practical support for that requirement by enumerating stakeholder types and mapping them to common empirical research methods. We also offer worked examples to demonstrate how researchers can identify likely stakeholder exposures in real-world projects. Our goal is to help research teams meet new ethics mandates with confidence and clarity, not confusion.

Updated: 2025-08-20 15:48:19

标题: 一份针对网络安全研究人员的利益相关者分析指南

摘要: 基于利益相关者的伦理分析现在是提交顶级网络安全研究论文的正式要求。这一要求反映了一个越来越普遍的共识,即网络安全研究人员必须超越提供能力,而是要预见和减轻潜在的危害。然而,许多网络安全研究人员可能对如何进行伦理分析感到不确定。在本指南中,我们通过列举利益相关者类型并将它们映射到常见的实证研究方法,为这一要求提供实际支持。我们还提供了工作示例,以展示研究人员如何在现实项目中确定可能的利益相关者暴露。我们的目标是帮助研究团队以信心和清晰度,而不是困惑,来满足新的伦理要求。

更新时间: 2025-08-20 15:48:19

领域: cs.CR,cs.SE

下载: http://arxiv.org/abs/2508.14796v1

Coupling without Communication and Drafter-Invariant Speculative Decoding

Suppose Alice has a distribution $P$ and Bob has a distribution $Q$. Alice wants to draw a sample $a\sim P$ and Bob a sample $b \sim Q$ such that $a = b$ with as high a probability as possible. It is well-known that, by sampling from an optimal coupling between the distributions, Alice and Bob can achieve $\Pr[a = b] = 1 - D_{TV}(P,Q)$, where $D_{TV}(P,Q)$ is the total variation distance between $P$ and $Q$. What if Alice and Bob must solve this same problem \emph{without communicating at all?} Perhaps surprisingly, with access to public randomness, they can still achieve $\Pr[a = b] \geq \frac{1 - D_{TV}(P,Q)}{1 + D_{TV}(P,Q)} \geq 1-2D_{TV}(P,Q)$ using a simple protocol based on the Weighted MinHash algorithm. This bound was shown to be optimal in the worst-case by [Bavarian et al., 2020]. In this work, we revisit the communication-free coupling problem. We provide a simpler proof of the optimality result from [Bavarian et al., 2020]. We show that, while the worst-case success probability of Weighted MinHash cannot be improved, an equally simple protocol based on Gumbel sampling offers a Pareto improvement: for every pair of distributions $P, Q$, Gumbel sampling achieves an equal or higher value of $\Pr[a = b]$ than Weighted MinHash. Importantly, this improvement translates to practice. We demonstrate an application of communication-free coupling to \emph{speculative decoding}, a recent method for accelerating autoregressive large language models [Leviathan, Kalman, Matias, ICML 2023]. We show that communication-free protocols can be used to construct \emph{drafter-invariant speculative decoding} schemes, which have the desirable property that their output is fixed given a fixed random seed, regardless of what drafter is used for speculation. In experiments on a language generation task, Gumbel sampling outperforms Weighted MinHash. Code is available at https://github.com/majid-daliri/DISD.
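
The Gumbel-based protocol is easy to sketch: both parties seed identical Gumbel noise from the public randomness, each adds it to the log-probabilities of their own distribution over an agreed outcome ordering, and each outputs the argmax. A minimal illustration (not the authors' code; `universe` is an assumed public ordering of outcomes):

```python
import math
import random

def gumbel_coupled_sample(probs, universe, shared_seed):
    """Sample from `probs` using public randomness `shared_seed`.

    Two parties calling this with the same seed and universe, but
    different distributions P and Q, agree on the output with
    probability at least (1 - D_TV(P, Q)) / (1 + D_TV(P, Q))."""
    rng = random.Random(shared_seed)
    best, best_score = None, -math.inf
    for x in universe:  # fixed public order: both parties draw identical noise
        g = -math.log(-math.log(rng.random()))  # Gumbel(0, 1) noise for x
        p = probs.get(x, 0.0)
        if p > 0 and math.log(p) + g > best_score:
            best, best_score = x, math.log(p) + g
    return best
```

By the Gumbel-max trick each party's output is an exact sample from its own distribution, and because the noise is shared, nearby distributions agree on most seeds.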

Updated: 2025-08-20 15:38:08

标题: 没有通信的耦合和草稿者不变的推测解码

摘要: 假设Alice有一个分布$P$,Bob有一个分布$Q$。Alice想要抽取一个样本$a\sim P$,Bob抽取一个样本$b \sim Q$,使得$a = b$的概率尽可能高。众所周知,通过从两个分布之间的最优耦合中抽样,Alice和Bob可以实现$\Pr[a = b] = 1 - D_{TV}(P,Q)$,其中$D_{TV}(P,Q)$是$P$和$Q$之间的总变差距离。如果Alice和Bob必须在完全不进行通信的情况下解决这个问题呢?也许令人惊讶的是,借助公共随机性,他们仍然可以使用基于加权最小哈希(Weighted MinHash)算法的简单协议实现$\Pr[a = b] \geq \frac{1 - D_{TV}(P,Q)}{1 + D_{TV}(P,Q)} \geq 1-2D_{TV}(P,Q)$。[Bavarian等人,2020]证明了这一界限在最坏情况下是最优的。在这项工作中,我们重新审视了无通信耦合问题。我们为[Bavarian等人,2020]的最优性结果提供了一个更简单的证明。我们证明,虽然加权最小哈希的最坏情况成功概率无法改进,但基于Gumbel抽样的同样简单的协议提供了帕累托改进:对于每一对分布$P, Q$,Gumbel抽样实现的$\Pr[a = b]$不低于加权最小哈希。重要的是,这种改进可以转化为实践。我们展示了无通信耦合在“推测解码”(speculative decoding)中的应用,这是一种加速自回归大型语言模型的最新方法[Leviathan, Kalman, Matias, ICML 2023]。我们证明,无通信协议可用于构建“草稿模型不变的推测解码”(drafter-invariant speculative decoding)方案,其具有一种理想的性质:在给定固定随机种子的情况下,无论推测使用哪个草稿模型(drafter),输出都是固定的。在一个语言生成任务的实验中,Gumbel抽样优于加权最小哈希。代码可在https://github.com/majid-daliri/DISD获取。

更新时间: 2025-08-20 15:38:08

领域: cs.DS,cs.CL,cs.LG

下载: http://arxiv.org/abs/2408.07978v4

Learning to Solve Related Linear Systems

Solving multiple parametrised related systems is an essential component of many numerical tasks, and learning from the already solved systems will make this process faster. In this work, we propose a novel probabilistic linear solver over the parameter space. This leverages information from the solved linear systems in a regression setting to provide an efficient posterior mean and covariance. We advocate using this as companion regression model for the preconditioned conjugate gradient method, and discuss the favourable properties of the posterior mean and covariance as the initial guess and preconditioner. We also provide several design choices for this companion solver. Numerical experiments showcase the benefits of using our novel solver in a hyperparameter optimisation problem.
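
The practical payoff of using a posterior mean as the initial guess is fewer conjugate-gradient iterations. A bare-bones CG (no preconditioning; illustrative only) showing that effect on a tiny SPD system:

```python
import math

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=100):
    """Plain CG for SPD A (lists of lists); returns (solution, iterations)."""
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    x = list(x0)
    Ax = matvec(x)
    r = [b[i] - Ax[i] for i in range(n)]  # initial residual
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for k in range(max_iter):
        if math.sqrt(rs) < tol:
            return x, k
        Ap = matvec(p)
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x, max_iter
```

Seeding CG with a regression-based guess near the true solution shrinks the initial residual and therefore the iteration count, which is the "initial guess" role the abstract assigns to the posterior mean (the posterior covariance as a preconditioner is not sketched here).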

Updated: 2025-08-20 15:37:36

标题: 学习解决相关的线性系统

摘要: 解决多个参数化相关系统是许多数值任务的重要组成部分,从已解决的系统中学习将使这个过程更快。在这项工作中,我们提出了一种新颖的概率线性求解器,它在参数空间中利用已解决的线性系统的信息,以在回归设置中提供高效的后验均值和协方差。我们主张将其用作预处理共轭梯度法的伴随回归模型,并讨论后验均值和协方差作为初始猜测和预处理器的有利特性。我们还提供了几种设计选择供这个伴随求解器使用。数值实验展示了在超参数优化问题中使用我们的新颖求解器的好处。

更新时间: 2025-08-20 15:37:36

领域: stat.ML,cs.LG,cs.NA,math.NA

下载: http://arxiv.org/abs/2503.17265v2

Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method

Model distillation enables the transfer of knowledge from large-scale models to compact student models, facilitating deployment in resource-constrained environments. However, conventional distillation approaches often suffer from computational overhead and limited generalization. We propose a novel adaptive distillation framework that dynamically augments training data in regions of high student model loss. Using UMAP-based dimensionality reduction and nearest neighbor sampling, our method identifies underperforming regions in the embedding space and generates targeted synthetic examples to guide student learning. To further improve efficiency, we introduce a lightweight teacher-student interface that bypasses the teacher's input layer, enabling direct distillation on vectorized representations. Experiments across standard NLP benchmarks demonstrate that our 66M-parameter student model consistently matches or surpasses established baselines, achieving 91.2% on QNLI and 92.3% on SST-2, while training with fewer epochs. These results highlight the promise of loss-aware data augmentation and vectorized distillation for efficient and effective model compression.
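
The core loop (find high-loss regions, synthesize nearby examples) can be caricatured in a few lines. This is a hypothetical simplification: the paper works in a UMAP-reduced embedding space with nearest-neighbour sampling, whereas here we simply interpolate between the hardest points.

```python
import random

def loss_aware_augment(points, losses, hard_frac=0.25, n_new=10, seed=0):
    """Synthesize new samples by interpolating between the highest-loss
    points (a stand-in for targeted augmentation in embedding space)."""
    rng = random.Random(seed)
    ranked = sorted(range(len(points)), key=lambda i: -losses[i])
    n_hard = max(2, int(hard_frac * len(points)))
    hard = [points[i] for i in ranked[:n_hard]]
    out = []
    for _ in range(n_new):
        a, b = rng.sample(hard, 2)
        t = rng.random()  # convex combination stays between the two parents
        out.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return out
```

Because every synthetic point is a convex combination of two high-loss parents, the new data concentrates exactly where the student is underperforming.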

Updated: 2025-08-20 15:29:00

标题: 合成自适应引导嵌入(SAGE):一种新颖的知识蒸馏方法

摘要: 模型蒸馏能够将大规模模型的知识迁移到紧凑的学生模型中,便于在资源受限环境中部署。然而,传统的蒸馏方法通常存在计算开销高和泛化能力有限的问题。我们提出了一种新颖的自适应蒸馏框架,在学生模型损失较高的区域动态增强训练数据。通过基于UMAP的降维和最近邻采样,我们的方法识别嵌入空间中表现不佳的区域,并生成有针对性的合成样本来引导学生学习。为了进一步提高效率,我们引入了一个轻量级的师生接口,绕过教师模型的输入层,实现对向量化表示的直接蒸馏。在标准NLP基准测试中,我们的6600万参数学生模型始终能够达到或超越已有基线,在QNLI上达到91.2%,在SST-2上达到92.3%,同时训练轮数更少。这些结果突显了损失感知的数据增强和向量化蒸馏在高效且有效的模型压缩方面的潜力。

更新时间: 2025-08-20 15:29:00

领域: cs.LG

下载: http://arxiv.org/abs/2508.14783v1

A Novel Vascular Risk Scoring Framework for Quantifying Sex-Specific Cerebral Perfusion from 3D pCASL MRI

We present a novel framework that leverages 3D pseudo-continuous arterial spin labeling (pCASL) MRI to investigate sex- and age-dependent heterogeneity in cerebral perfusion and to establish a biologically informed vascular risk quantification metric. A custom convolutional neural network was trained on ASL-derived cerebral blood flow (CBF) maps from 186 cognitively healthy individuals (89 males and 97 females, ages 8-92 years), achieving 95% accuracy in sex classification and revealing robust sex-specific perfusion signatures. Regional analyses identified significantly elevated CBF in females across medial Brodmann areas 6 and 10, the visual area of the cortex, the polar occipital cortex, and both ventral and dorsal dysgranular insula, highlighting sex-specific neurovascular specialization in motor, cognitive, sensory, and affective domains. In addition, we observed a consistent global age-related decline in CBF across both sexes, reflecting progressive cerebrovascular aging. To integrate these findings, we propose a biologically informed Vascular Risk Score (VRS) derived from age- and sex-stratified normative CBF distributions. The VRS enables individualized assessment of cerebral perfusion integrity by quantifying deviations from expected normative patterns. This metric offers a sensitive, personalized biomarker for detecting early hypoperfusion and stratifying vascular contributions to neurodegenerative diseases, including Alzheimer's disease, thereby advancing the goals of precision neurology.
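
The abstract describes the VRS as quantifying deviation from age- and sex-stratified normative CBF distributions. A deliberately simple reading of that idea (the paper's actual formula is not given here; the function and its sign convention are hypothetical) is a stratified z-score that flags hypoperfusion when CBF falls below the matched norm:

```python
def vascular_risk_score(cbf, age_band, sex, norms):
    """Hypothetical VRS sketch: deviation of a subject's cerebral blood flow
    from the (age band, sex)-matched normative mean, in standard deviations.
    `norms` maps (age_band, sex) -> (mean, std); positive = hypoperfusion."""
    mean, std = norms[(age_band, sex)]
    return (mean - cbf) / std
```

Stratifying the lookup table by age band and sex is what makes the score "biologically informed": the same absolute CBF value can be normal for one group and a risk signal for another.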

Updated: 2025-08-20 15:28:01

标题: 一种用于量化性别特异性脑灌注的新型血管风险评分框架,基于3D pCASL MRI

摘要: 我们提出了一个新颖的框架,利用3D伪连续动脉自旋标记(pCASL)MRI研究脑灌注中与性别和年龄相关的异质性,并建立一个具有生物学依据的血管风险量化指标。我们在来自186名认知健康个体(89名男性和97名女性,年龄8-92岁)的ASL衍生脑血流(CBF)图上训练了一个定制的卷积神经网络,实现了95%的性别分类准确率,并揭示了稳健的性别特异性灌注特征。区域分析发现,女性在内侧布罗德曼6区和10区、皮层视觉区、极枕叶皮层以及腹侧和背侧非颗粒性岛叶中的CBF显著升高,突显了运动、认知、感觉和情感领域中性别特异的神经血管专门化。此外,我们观察到两性的CBF均随年龄增长呈现一致的全局性下降,反映了渐进性的脑血管老化。为了整合这些发现,我们提出了一个由年龄和性别分层的正常CBF分布导出的、具有生物学依据的血管风险评分(VRS)。VRS通过量化与预期正常模式的偏差,实现对脑灌注完整性的个体化评估。该指标为检测早期低灌注和区分血管因素对包括阿尔茨海默病在内的神经退行性疾病的贡献提供了一种敏感的、个性化的生物标志物,从而推进精准神经学的目标。

更新时间: 2025-08-20 15:28:01

领域: eess.IV,cs.LG

下载: http://arxiv.org/abs/2508.13173v2

TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting

Urban transportation systems encounter diverse challenges across multiple tasks, such as traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches suffer from two key limitations: small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios, while large language models (LLMs), despite offering flexibility through natural language interfaces, struggle with structured spatiotemporal data and numerical reasoning in transportation domains. To address these limitations, we propose TransLLM, a unified foundation framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks, seamlessly interfacing with LLMs through structured embeddings. A novel instance-level prompt routing mechanism, trained via reinforcement learning, dynamically personalizes prompts based on input characteristics, moving beyond fixed task-specific templates. The framework operates by encoding spatiotemporal patterns into contextual representations, dynamically composing personalized prompts to guide LLM reasoning, and projecting the resulting representations through specialized output layers to generate task-specific predictions. Experiments across seven datasets and three tasks demonstrate the exceptional effectiveness of TransLLM in both supervised and zero-shot settings. Compared to ten baseline models, it delivers competitive performance on both regression and planning problems, showing strong generalization and cross-task adaptability. Our code is available at https://github.com/BiYunying/TransLLM.

Updated: 2025-08-20 15:27:49

标题: TransLLM:一种通过可学习提示实现城市交通的统一多任务基础框架

摘要: 城市交通系统在多个任务中面临各种挑战,例如交通预测、电动汽车充电需求预测和出租车调度。现有方法存在两个关键限制:小规模深度学习模型具有特定任务且数据需求量大,限制了它们在不同情景下的泛化能力;而大型语言模型(LLMs)尽管通过自然语言接口提供了灵活性,但在交通领域的结构化时空数据和数值推理方面存在困难。为了解决这些限制,我们提出了TransLLM,这是一个通过可学习提示组合将时空建模与大型语言模型相结合的统一基础框架。我们的方法采用轻量级时空编码器,通过扩张时间卷积和双邻接图注意力网络捕获复杂依赖关系,并通过结构化嵌入与LLMs无缝对接。一种新颖的实例级提示路由机制通过强化学习训练,根据输入特征动态个性化提示,超越固定的特定任务模板。该框架将时空模式编码为上下文表示,动态组合个性化提示以指导LLM推理,并通过专门的输出层投影生成特定任务的预测结果。在七个数据集和三个任务上的实验表明,TransLLM在监督和零样本设置中均表现出色。与十个基线模型相比,它在回归和规划问题上均具有竞争力,显示出强大的泛化和跨任务适应性。我们的代码可在https://github.com/BiYunying/TransLLM上找到。

更新时间: 2025-08-20 15:27:49

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.14782v1

Context Steering: A New Paradigm for Compression-based Embeddings by Synthesizing Relevant Information Features

Compression-based distances (CD) offer a flexible and domain-agnostic means of measuring similarity by identifying implicit information through redundancies between data objects. However, as similarity features are derived from the data, rather than defined as an input, it often proves difficult to align with the task at hand, particularly in complex clustering or classification settings. To address this issue, we introduce "context steering," a novel methodology that actively guides the feature-shaping process. Instead of passively accepting the emergent data structure (typically a hierarchy derived from clustering CDs), our approach "steers" the process by systematically analyzing how each object influences the relational context within a clustering framework. This process generates a custom-tailored embedding that isolates and amplifies class-distinctive information. We validate the capabilities of this strategy using Normalized Compression Distance (NCD) and Relative Compression Distance (NRC) with common hierarchical clustering, providing an effective alternative to common transductive methods. Experimental results across heterogeneous datasets-from text to real-world audio-validate the robustness and generality of context steering, marking a fundamental shift in their application: from merely discovering inherent data structures to actively shaping a feature space tailored to a specific objective.
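
Compression-based distances like the NCD named above are easy to reproduce with a general-purpose compressor. A standard zlib-based sketch (the paper's experiments may use other compressors or parameters):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length at maximum compression level."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Concatenating an object with itself barely increases the compressed size, so the distance is near 0; unrelated objects share no redundancy, pushing the distance toward 1. Context steering then reshapes how these pairwise values are turned into an embedding, rather than changing the distance itself.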

Updated: 2025-08-20 15:26:52

标题: 上下文引导:通过合成相关信息特征进行基于压缩的嵌入的新范式

摘要: 基于压缩的距离(CD)通过识别数据对象之间冗余所蕴含的信息,提供了一种灵活且与领域无关的相似度测量方法。然而,由于相似度特征是从数据中导出的,而不是作为输入定义的,它们往往难以与手头的任务对齐,特别是在复杂的聚类或分类设置中。为了解决这个问题,我们引入了“上下文引导”(context steering),一种主动引导特征塑造过程的新颖方法论。我们的方法不是被动地接受涌现的数据结构(通常是由CD聚类得到的层次结构),而是通过系统地分析每个对象如何影响聚类框架内的关系上下文来“引导”这一过程。该过程生成一个定制的嵌入,隔离并放大类别区分性信息。我们使用归一化压缩距离(NCD)和相对压缩距离(NRC)结合常见的层次聚类验证了这一策略的能力,为常见的直推式(transductive)方法提供了有效的替代方案。在从文本到真实世界音频的异构数据集上的实验结果验证了上下文引导的稳健性和普适性,标志着其应用的根本性转变:从仅仅发现数据的固有结构,转向主动塑造针对特定目标的特征空间。

更新时间: 2025-08-20 15:26:52

领域: cs.LG,cs.IT,math.IT

下载: http://arxiv.org/abs/2508.14780v1

Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data

Federated distillation has emerged as a promising collaborative machine learning approach, offering enhanced privacy protection and reduced communication compared to traditional federated learning by exchanging model outputs (soft logits) rather than full model parameters. However, existing methods employ complex selective knowledge-sharing strategies that require clients to identify in-distribution proxy data through computationally expensive statistical density ratio estimators. Additionally, server-side filtering of ambiguous knowledge introduces latency to the process. To address these challenges, we propose a robust, resource-efficient EdgeFD method that reduces the complexity of the client-side density ratio estimation and removes the need for server-side filtering. EdgeFD introduces an efficient KMeans-based density ratio estimator for effectively filtering both in-distribution and out-of-distribution proxy data on clients, significantly improving the quality of knowledge sharing. We evaluate EdgeFD across diverse practical scenarios, including strong non-IID, weak non-IID, and IID data distributions on clients, without requiring a pre-trained teacher model on the server for knowledge distillation. Experimental results demonstrate that EdgeFD outperforms state-of-the-art methods, consistently achieving accuracy levels close to IID scenarios even under heterogeneous and challenging conditions. The significantly reduced computational overhead of the KMeans-based estimator is suitable for deployment on resource-constrained edge devices, thereby enhancing the scalability and real-world applicability of federated distillation. The code is available online for reproducibility.
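
As a toy illustration of the client-side idea (cluster the local data, then keep only proxy samples that look in-distribution), here is a 1-D KMeans filter. The function names, k, and the distance threshold are all illustrative; EdgeFD's estimator operates on model features and computes density ratios rather than raw distances.

```python
import random

def kmeans_1d(xs, k=2, iters=20, seed=0):
    """Tiny 1-D KMeans: returns k cluster centres."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = {i: [] for i in range(k)}
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # recompute each centre; keep the old one if its group emptied
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in groups.items()]
    return centers

def filter_in_distribution(proxy, local, threshold=1.0):
    """Keep proxy samples that fall near a cluster centre of the client's
    local data: a toy stand-in for a KMeans-based density-ratio test."""
    centers = kmeans_1d(local)
    return [x for x in proxy if min(abs(x - c) for c in centers) < threshold]
```

Because the clustering runs once per client on its own data, the filter adds little overhead, which is the property that makes this style of estimator attractive on edge devices.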

Updated: 2025-08-20 15:17:59

标题: 边缘设备上的联合蒸馏:非独立同分布数据的高效客户端过滤

摘要: 联邦蒸馏已成为一种有前途的协作式机器学习方法,通过交换模型输出(软逻辑值,soft logits)而非完整的模型参数,相比传统联邦学习提供了更强的隐私保护和更少的通信量。然而,现有方法采用复杂的选择性知识共享策略,要求客户端通过计算代价高昂的统计密度比估计器来识别分布内的代理数据。此外,服务器端对模糊知识的过滤会引入延迟。为了解决这些挑战,我们提出了一种稳健、资源高效的EdgeFD方法,它降低了客户端密度比估计的复杂性,并消除了对服务器端过滤的需求。EdgeFD引入了一种基于KMeans的高效密度比估计器,可以有效地在客户端过滤分布内和分布外的代理数据,显著提高知识共享的质量。我们在多种实际场景中评估了EdgeFD,包括客户端上的强非IID、弱非IID和IID数据分布,且无需在服务器上为知识蒸馏准备预训练的教师模型。实验证明,EdgeFD优于最先进的方法,即使在异构且具有挑战性的条件下,也能一致地达到接近IID场景的准确率水平。基于KMeans的估计器显著降低的计算开销适合部署在资源受限的边缘设备上,从而提高了联邦蒸馏的可扩展性和现实世界适用性。代码已在线发布以便复现。

更新时间: 2025-08-20 15:17:59

领域: cs.LG,cs.DC

下载: http://arxiv.org/abs/2508.14769v1

PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning

Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence generation, enabling interpretable design choices while optimizing for multiple pharmacological properties. Guided by a tailored reward function balancing chemical validity and property improvements, the model autonomously explores diverse sequence variants. We demonstrate that PepThink-R1 generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing general LLMs (e.g., GPT-5) and domain-specific baseline in both optimization success and interpretability. To our knowledge, this is the first LLM-based peptide design framework that combines explicit reasoning with RL-driven property control, marking a step toward reliable and transparent peptide optimization for therapeutic discovery.
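
A reward of the kind described ("balancing chemical validity and property improvements") might look like the following toy sketch; the weights, penalty value, and property names are invented for illustration and are not from the paper:

```python
def peptide_reward(is_valid, property_deltas, weights=None):
    """Toy multi-objective reward: hard penalty for chemically invalid
    sequences, otherwise a weighted sum of per-property improvements."""
    if not is_valid:
        return -1.0
    weights = weights or {}
    return sum(weights.get(name, 1.0) * delta
               for name, delta in property_deltas.items())
```

Gating the property terms behind a validity check keeps the RL policy from trading chemical plausibility for property gains, which is the balance the abstract describes.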

Updated: 2025-08-20 15:13:52

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.14765v1

Distributional Adversarial Attacks and Training in Deep Hedging

In this paper, we study the robustness of classical deep hedging strategies under distributional shifts by leveraging the concept of adversarial attacks. We first demonstrate that standard deep hedging models are highly vulnerable to small perturbations in the input distribution, resulting in significant performance degradation. Motivated by this, we propose an adversarial training framework tailored to increase the robustness of deep hedging strategies. Our approach extends pointwise adversarial attacks to the distributional setting and introduces a computationally tractable reformulation of the adversarial optimization problem over a Wasserstein ball. This enables the efficient training of hedging strategies that are resilient to distributional perturbations. Through extensive numerical experiments, we show that adversarially trained deep hedging strategies consistently outperform their classical counterparts in terms of out-of-sample performance and resilience to model misspecification. Our findings establish a practical and effective framework for robust deep hedging under realistic market uncertainties.
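
The paper's Wasserstein-ball reformulation is beyond an abstract-level sketch, but its pointwise building block can be illustrated: a one-step sign-gradient (FGSM-style) perturbation of simulated price paths under a toy hedging loss. The static delta hedge, the loss, and every parameter below are simplifying assumptions, not the paper's model:

```python
import numpy as np

def hedging_loss(paths, delta=0.5, strike=1.0):
    """Mean squared terminal hedging error of a static delta hedge of a call."""
    pnl = delta * (paths[:, -1] - paths[:, 0])       # hedge profit and loss
    payoff = np.maximum(paths[:, -1] - strike, 0.0)  # option payoff at maturity
    return np.mean((payoff - pnl) ** 2)

def sign_gradient_attack(paths, eps=0.02, h=1e-5):
    """One-step sign-gradient perturbation of the input paths, with the
    gradient estimated by central-free forward finite differences."""
    grad = np.zeros_like(paths)
    base = hedging_loss(paths)
    flat = paths.ravel()
    gflat = grad.ravel()
    for i in range(flat.size):
        flat[i] += h
        gflat[i] = (hedging_loss(paths) - base) / h
        flat[i] -= h
    return paths + eps * np.sign(grad)

rng = np.random.default_rng(0)
paths = np.cumprod(1 + 0.01 * rng.normal(size=(50, 10)), axis=1)  # toy price paths
adv = sign_gradient_attack(paths)
print(hedging_loss(paths), hedging_loss(adv))  # the attack increases the loss
```

The distributional version replaces the per-sample epsilon-ball with a Wasserstein ball around the input distribution, and adversarial training alternates such attacks with strategy updates.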

Updated: 2025-08-20 14:59:32

Categories: math.OC,cs.LG

Download: http://arxiv.org/abs/2508.14757v1

Reliable generation of isomorphic physics problems using ChatGPT with prompt-chaining and tool use

We present a method for generating large numbers of isomorphic physics problems using ChatGPT through prompt chaining and tool use. This approach enables precise control over structural variations, such as numeric values and spatial relations, while supporting diverse contextual variations in the problem body. By utilizing the Python code interpreter, the method supports automatic solution validation and simple diagram generation, addressing key limitations in existing LLM-based methods. We generated two example isomorphic problem banks and compared the outcome against simpler prompt-based approaches. Results show that prompt-chaining produces significantly higher quality and more consistent outputs than simpler, non-chaining prompts. This work demonstrates a promising method for efficient problem creation accessible to the average instructor, which opens new possibilities for personalized adaptive testing and automated content development.
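
The actual pipeline chains ChatGPT prompts and validates answers with the code interpreter; the underlying idea, isomorphic variation with a computed (and therefore checkable) answer key, can be sketched without any LLM at all. The template, values, and contexts below are invented for illustration:

```python
import random

random.seed(3)

TEMPLATE = ("A ball is dropped from rest from a height of {h} m on {planet}. "
            "Taking g = {g} m/s^2, how long does it take to reach the ground?")

def make_isomorphic_problem(context):
    """Structural variation: the numeric height; contextual variation: the
    setting. The answer key is computed, not generated, so it can be checked."""
    h = random.choice([5, 10, 20, 45, 80])
    t = (2 * h / context["g"]) ** 0.5   # from h = (1/2) g t^2
    return {"problem": TEMPLATE.format(h=h, g=context["g"], planet=context["planet"]),
            "h": h, "g": context["g"], "answer_s": round(t, 2)}

contexts = [{"planet": "Earth", "g": 9.8}, {"planet": "the Moon", "g": 1.6}]
bank = [make_isomorphic_problem(c) for c in contexts for _ in range(3)]
print(len(bank), bank[0]["problem"])
```

In the paper's workflow the body text itself is also varied by the LLM, and the code interpreter plays the role of the `answer_s` computation here, verifying each generated solution.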

Updated: 2025-08-20 14:58:05

Categories: physics.ed-ph,cs.AI

Download: http://arxiv.org/abs/2508.14755v1

Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement

Translating chart images into executable plotting scripts (referred to as the chart-to-code generation task) requires Multimodal Large Language Models (MLLMs) to perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. However, this task is inherently under-constrained: multiple valid code implementations can produce the same visual chart, and evaluation must consider both code correctness and visual fidelity across diverse dimensions. This makes it difficult to learn accurate and generalizable mappings through standard supervised fine-tuning. To address these challenges, we propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our approach introduces a structured variant generation strategy and a visual reward model to efficiently produce high-quality, aspect-aware preference pairs, making preference collection scalable and supervision more targeted. These preferences are used in an offline reinforcement learning setup to optimize the model toward multi-dimensional fidelity. Experimental results show that our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code that rivals specialized chart-centric models and even some proprietary systems. The code and datasets are publicly available at https://github.com/Zhihan72/Chart2Code.

Updated: 2025-08-20 14:56:28

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2504.02906v2

Low-rank bias, weight decay, and model merging in neural networks

We explore the low-rank structure of the weight matrices in neural networks at the stationary points (limiting solutions of optimization algorithms) with $L_2$ regularization (also known as weight decay). We show several properties of such deep neural networks, induced by $L_2$ regularization. In particular, for a stationary point we show alignment of the parameters and the gradient, norm preservation across layers, and low-rank bias: properties previously known in the context of solutions of gradient descent/flow-type algorithms. Experiments show that the assumptions made in the analysis only mildly affect the observations. In addition, we investigate a multitask learning phenomenon enabled by $L_2$ regularization and low-rank bias. In particular, we show that if two networks are trained, such that the inputs in the training set of one network are approximately orthogonal to the inputs in the training set of the other network, the new network obtained by simply summing the weights of the two networks will perform as well on both training sets as the respective individual networks. We demonstrate this for shallow ReLU neural networks trained by gradient descent, as well as deep linear networks trained by gradient flow.
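
The merging claim can be reproduced in a few lines for linear maps: a minimum-Frobenius-norm least-squares fit (a stand-in for training with weight decay) is low-rank and supported on its task's input subspace, so summing the two weight matrices leaves each task's predictions untouched. This is an illustrative construction, not the paper's gradient-descent experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two training sets with orthogonal inputs: task 1 lives in the span of the
# first two coordinates, task 2 in the span of the last two.
X1 = np.zeros((20, 4)); X1[:, :2] = rng.normal(size=(20, 2))
X2 = np.zeros((20, 4)); X2[:, 2:] = rng.normal(size=(20, 2))
Y1 = rng.normal(size=(20, 3))
Y2 = rng.normal(size=(20, 3))

# Minimum-norm linear fits: each weight matrix is rank <= 2 and vanishes on
# the other task's input subspace.
W1 = Y1.T @ np.linalg.pinv(X1.T)   # solves W X^T ~= Y^T with minimum norm
W2 = Y2.T @ np.linalg.pinv(X2.T)

W_merged = W1 + W2                 # model merging by summing weights

# The merged model reproduces each individual model on its own inputs.
print(np.abs(W_merged @ X1.T - W1 @ X1.T).max())
print(np.abs(W_merged @ X2.T - W2 @ X2.T).max())
```

The mechanism is exactly the low-rank bias: because `W2` annihilates task 1's subspace (and vice versa), the cross-terms in the sum contribute nothing on either training set.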

Updated: 2025-08-20 14:53:28

Categories: cs.LG

Download: http://arxiv.org/abs/2502.17340v2

Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data

The preservation of early visual arts, particularly color photographs, is challenged by deterioration caused by aging and improper storage, leading to issues like blurring, scratches, color bleeding, and fading defects. Despite great advances in image restoration and enhancement in recent years, such systematic defects often cannot be restored by current state-of-the-art software features, such as those available in Adobe Photoshop, but would require the incorporation of defect-aware priors into the underlying machine learning techniques. However, there are no publicly available datasets of autochromes with defect annotations. In this paper, we address these limitations and present the first approach that allows the automatic removal of greening color defects in digitized autochrome photographs. For this purpose, we introduce an approach for accurately simulating such defects and use the resulting synthesized data, with its ground-truth defect annotations, to train a generative AI model with a carefully designed loss function that accounts for color imbalances between defective and non-defective areas. As demonstrated in our evaluation, our approach allows for the efficient and effective restoration of the considered defects, thereby overcoming limitations of alternative techniques that struggle with accurately reproducing original colors and may require significant manual effort.

Updated: 2025-08-20 14:51:09

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2505.22291v2

MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling

Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle (Explore, Examine, and Enhance) to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling, enriching inspirations and creativity for users. To the best of our knowledge, MAViS is the only framework that provides multimodal design output: videos with narratives and background music.

Updated: 2025-08-20 14:50:55

Categories: cs.CV,cs.AI,cs.MA

Download: http://arxiv.org/abs/2508.08487v3

HERAKLES: Hierarchical Skill Compilation for Open-ended LLM Agents

Open-ended AI agents need to be able to efficiently learn goals of increasing complexity, abstraction and heterogeneity over their lifetime. Beyond sampling their own goals efficiently, autotelic agents specifically need to be able to keep the growing complexity of goals under control, limiting the associated growth in sample and computational complexity. To address this challenge, recent approaches have leveraged hierarchical reinforcement learning (HRL) and language, capitalizing on language's compositional and combinatorial generalization capabilities to acquire temporally extended reusable behaviours. Existing approaches use expert-defined spaces of subgoals over which they instantiate a hierarchy, and often assume pre-trained associated low-level policies. Such designs are inadequate in open-ended scenarios, where goal spaces naturally diversify across a broad spectrum of difficulties. We introduce HERAKLES, a framework that enables a two-level hierarchical autotelic agent to continuously compile mastered goals into the low-level policy, executed by a small, fast neural network, dynamically expanding the set of subgoals available to the high-level policy. We train a Large Language Model (LLM) to serve as the high-level controller, exploiting its strengths in goal decomposition and generalization to operate effectively over this evolving subgoal space. We evaluate HERAKLES in the open-ended Crafter environment and show that it scales effectively with goal complexity, improves sample efficiency through skill compilation, and enables the agent to adapt robustly to novel challenges over time.

Updated: 2025-08-20 14:50:28

Categories: cs.LG

Download: http://arxiv.org/abs/2508.14751v1

Cross-Modality Controlled Molecule Generation with Diffusion Language Model

Current SMILES-based diffusion models for molecule generation typically support only a single, unimodal constraint. They inject conditioning signals at the start of the training process and require retraining a new model from scratch whenever the constraint changes. However, real-world applications often involve multiple constraints across different modalities, and additional constraints may emerge over the course of a study. This raises a challenge: how to extend a pre-trained diffusion model not only to support cross-modality constraints but also to incorporate new ones without retraining. To tackle this problem, we propose the Cross-Modality Controlled Molecule Generation with Diffusion Language Model (CMCM-DLM), demonstrated on two distinct modalities: molecular structure and chemical properties. Our approach builds upon a pre-trained diffusion model, incorporating two trainable modules, the Structure Control Module (SCM) and the Property Control Module (PCM), and operates in two distinct phases during the generation process. In Phase I, we employ the SCM to inject structural constraints during the early diffusion steps, effectively anchoring the molecular backbone. Phase II builds on this by further introducing the PCM to guide the later stages of inference to refine the generated molecules, ensuring their chemical properties match the specified targets. Experimental results on multiple datasets demonstrate the efficiency and adaptability of our approach, highlighting CMCM-DLM's significant advancement in molecular generation for drug discovery applications.

Updated: 2025-08-20 14:48:44

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.14748v1

MissionHD: Data-Driven Refinement of Reasoning Graph Structure through Hyperdimensional Causal Path Encoding and Decoding

Reasoning graphs from Large Language Models (LLMs) are often misaligned with downstream visual tasks such as video anomaly detection (VAD). Existing Graph Structure Refinement (GSR) methods are ill-suited for these novel, dataset-less graphs. We introduce Data-driven GSR (D-GSR), a new paradigm that directly optimizes graph structure using downstream task data, and propose MissionHD, a hyperdimensional computing (HDC) framework to operationalize it. MissionHD uses an efficient encode-decode process to refine the graph, guided by the downstream task signal. Experiments on challenging VAD and VAR benchmarks show significant performance improvements when using our refined graphs, validating our approach as an effective pre-processing step.
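
Hyperdimensional encoding of graph structure generally rests on binding and bundling of random hypervectors; the abstract does not detail MissionHD's encoder, so the sketch below shows only that generic encode-decode idea, with invented node names:

```python
import numpy as np

D = 10_000                       # hypervector dimensionality
rng = np.random.default_rng(42)

def hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice([-1, 1], size=D)

# item memory: one random hypervector per graph node
nodes = {name: hv() for name in ["smoke", "fire", "alarm", "flood"]}

def encode_edge(a, b):
    return nodes[a] * nodes[b]   # binding: elementwise product

def encode_paths(edges):
    return np.sign(sum(encode_edge(a, b) for a, b in edges))  # bundling

def similarity(x, y):
    return (x @ y) / D           # normalized dot product, in [-1, 1]

graph_hv = encode_paths([("smoke", "fire"), ("fire", "alarm")])

# decoding: a stored edge is far more similar to the bundle than a novel edge
s_in = similarity(graph_hv, encode_edge("smoke", "fire"))
s_out = similarity(graph_hv, encode_edge("flood", "alarm"))
print(round(s_in, 3), round(s_out, 3))
```

A refinement loop in this style would score candidate edges against a task-derived hypervector and keep only edges whose decoded similarity supports the downstream signal; how MissionHD does this concretely is not specified in the abstract.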

Updated: 2025-08-20 14:43:04

Categories: cs.LG

Download: http://arxiv.org/abs/2508.14746v1

Identity Preserving 3D Head Stylization with Multiview Score Distillation

3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Please visit https://three-bee.github.io/head_stylization for more visuals.

Updated: 2025-08-20 14:41:03

Categories: cs.CV,cs.AI,cs.GR,cs.LG,cs.MM

Download: http://arxiv.org/abs/2411.13536v3

A Collusion-Resistance Privacy-Preserving Smart Metering Protocol for Operational Utility

Modern grids have adopted advanced metering infrastructure (AMI) to facilitate bidirectional communication between smart meters and control centers. This enables smart meters to report consumption values at predefined intervals to utility providers for purposes including demand balancing, load forecasting, dynamic billing, and operational efficiency. Compared to traditional power grids, smart grids offer advantages such as enhanced reliability, improved energy efficiency, and increased security. However, utility providers can compromise user privacy by analyzing fine-grained readings and extracting individuals' daily activities from this time-series data. To address this concern, we propose a collusion-resistant, privacy-preserving aggregation protocol for smart metering in operational services. Our protocol ensures privacy by leveraging techniques such as partially additive homomorphic encryption, aggregation, data perturbation, and data minimization. The scheme aggregates perturbed readings using the additive homomorphic property of the Paillier cryptosystem to provide results for multiple operational purposes. We evaluate the protocol in terms of both performance and privacy. Computational, memory, and communication overheads were examined. The total execution time with a 1024-bit key size is about 2.21 seconds. We also evaluated privacy through the normalized conditional entropy (NCE) metric. Higher NCE values, closer to 1, indicate stronger privacy. As the noise scale increases, the NCE value rises, showing that the perturbed values retain minimal information about the original readings, thereby reducing risk. Overall, the evaluation demonstrates the protocol's efficiency while employing various privacy-preserving techniques.
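
The aggregation step the protocol relies on can be illustrated with a textbook Paillier instance; the parameters below are deliberately tiny and insecure, purely to show the additive homomorphic property (multiplying ciphertexts adds plaintexts):

```python
import math
from functools import reduce

# Toy textbook Paillier (tiny, insecure parameters, for illustration only;
# a real deployment would use at least a 1024-bit modulus, as in the paper).
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1                                      # standard generator choice
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)    # L(g^lambda mod n^2)^-1 mod n

def encrypt(m, r):
    """c = g^m * r^n mod n^2, with r coprime to n."""
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """m = L(c^lambda mod n^2) * mu mod n, where L(x) = (x - 1) / n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Each meter encrypts its (perturbed) reading; the aggregator multiplies the
# ciphertexts, which adds the plaintexts under the hood.
readings = [17, 42, 5, 130]
cts = [encrypt(m, r) for m, r in zip(readings, [12, 34, 56, 78])]
aggregate = reduce(lambda a, b: (a * b) % n2, cts)
print(decrypt(aggregate))  # 194 == sum(readings)
```

The utility thus learns only the aggregate of the perturbed readings, never an individual meter's value; the paper's collusion resistance and data minimization layers sit on top of this primitive.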

Updated: 2025-08-20 14:40:33

Categories: cs.CR

Download: http://arxiv.org/abs/2508.14744v1

CaTE Data Curation for Trustworthy AI

This report provides practical guidance to teams designing or developing AI-enabled systems for how to promote trustworthiness during the data curation phase of development. In this report, the authors first define data, the data curation phase, and trustworthiness. We then describe a series of steps that the development team, especially data scientists, can take to build a trustworthy AI-enabled system. We enumerate the sequence of core steps and trace parallel paths where alternatives exist. The descriptions of these steps include strengths, weaknesses, preconditions, outcomes, and relevant open-source software tool implementations. In total, this report is a synthesis of data curation tools and approaches from relevant academic literature, and our goal is to equip readers with a diverse yet coherent set of practices for improving AI trustworthiness.

Updated: 2025-08-20 14:40:21

Categories: cs.LG

Download: http://arxiv.org/abs/2508.14741v1

Don't Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning

Machine Learning (ML) has revolutionized various domains, offering predictive capabilities in several areas. However, with the increasing accessibility of ML tools, many practitioners, lacking deep ML expertise, adopt a "push the button" approach, utilizing user-friendly interfaces without a thorough understanding of underlying algorithms. While this approach provides convenience, it raises concerns about the reliability of outcomes, leading to challenges such as incorrect performance evaluation. This paper addresses a critical issue in ML, known as data leakage, where unintended information contaminates the training data, impacting model performance evaluation. Users, due to a lack of understanding, may inadvertently overlook crucial steps, leading to optimistic performance estimates that may not hold in real-world scenarios. The discrepancy between evaluated and actual performance on new data is a significant concern. In particular, this paper categorizes data leakage in ML, discussing how certain conditions can propagate through the ML workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks. The conclusion summarizes key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications.

Updated: 2025-08-20 14:39:43

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2401.13796v5

Sample Selection Bias in Machine Learning for Healthcare

While machine learning algorithms hold promise for personalised medicine, their clinical adoption remains limited, partly due to biases that can compromise the reliability of predictions. In this paper, we focus on sample selection bias (SSB), a specific type of bias where the study population is less representative of the target population, leading to biased and potentially harmful decisions. Despite being well-known in the literature, SSB remains scarcely studied in machine learning for healthcare. Moreover, the existing machine learning techniques try to correct the bias mostly by balancing distributions between the study and the target populations, which may result in a loss of predictive performance. To address these problems, our study illustrates the potential risks associated with SSB by examining SSB's impact on the performance of machine learning algorithms. Most importantly, we propose a new research direction for addressing SSB, based on the target population identification rather than the bias correction. Specifically, we propose two independent networks (T-Net) and a multitasking network (MT-Net) for addressing SSB, where one network/task identifies the target subpopulation which is representative of the study population and the second makes predictions for the identified subpopulation. Our empirical results with synthetic and semi-synthetic datasets highlight that SSB can lead to a large drop in the performance of an algorithm for the target population as compared with the study population, as well as a substantial difference in the performance for the target subpopulations that are representative of the selected and the non-selected patients from the study population. Furthermore, our proposed techniques demonstrate robustness across various settings, including different dataset sizes, event rates, and selection rates, outperforming the existing bias correction techniques.

Updated: 2025-08-20 14:33:07

Categories: cs.LG

Download: http://arxiv.org/abs/2405.07841v3

Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference

Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise-hypothesis pairs and translates them into a typologically diverse set of languages. This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: https://github.com/KurbanIntelligenceLab/nli-stress-testing
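
A minimal version of logic-based pair generation, using a single syllogism template with an invented toy vocabulary (the paper's actual templates and language set are not given in the abstract), could look like this; translating the resulting pairs into multiple languages, or mixing languages within a pair, would then produce the monolingual and code-switched conditions:

```python
import itertools

CATEGORIES = ["dogs", "mammals", "animals"]   # invented toy vocabulary
ENTITIES = ["Rex", "Milo"]

def make_pairs():
    """Entailment / contradiction / neutral pairs from one syllogism
    template: 'All X are Y. Z is one of the X.'"""
    pairs = []
    for x, y, z in itertools.product(CATEGORIES[:2], CATEGORIES[1:], ENTITIES):
        if x == y:
            continue
        premise = f"All {x} are {y}. {z} is one of the {x}."
        pairs.append((premise, f"{z} is one of the {y}.", "entailment"))
        pairs.append((premise, f"{z} is not one of the {y}.", "contradiction"))
        pairs.append((premise, f"{z} likes swimming.", "neutral"))
    return pairs

pairs = make_pairs()
print(len(pairs), pairs[0])
```

Because the labels follow from the template rather than from annotation, semantic relations are controlled exactly, which is what makes the cross-lingual comparison clean.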

Updated: 2025-08-20 14:30:34

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.14735v1

AFABench: A Generic Framework for Benchmarking Active Feature Acquisition

In many real-world scenarios, acquiring all features of a data instance can be expensive or impractical due to monetary cost, latency, or privacy concerns. Active Feature Acquisition (AFA) addresses this challenge by dynamically selecting a subset of informative features for each data instance, trading predictive performance against acquisition cost. While numerous methods have been proposed for AFA, ranging from greedy information-theoretic strategies to non-myopic reinforcement learning approaches, fair and systematic evaluation of these methods has been hindered by the lack of standardized benchmarks. In this paper, we introduce AFABench, the first benchmark framework for AFA. Our benchmark includes a diverse set of synthetic and real-world datasets, supports a wide range of acquisition policies, and provides a modular design that enables easy integration of new methods and tasks. We implement and evaluate representative algorithms from all major categories, including static, greedy, and reinforcement learning-based approaches. To test the lookahead capabilities of AFA policies, we introduce a novel synthetic dataset, AFAContext, designed to expose the limitations of greedy selection. Our results highlight key trade-offs between different AFA strategies and provide actionable insights for future research. The benchmark code is available at: https://github.com/Linusaronsson/AFA-Benchmark.
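
A greedy information-theoretic acquisition policy, one of the baseline families the benchmark covers, can be sketched as follows. Note that real AFA methods condition on the feature values already acquired for each instance; this static toy shows only the greedy information-per-cost scoring step, with an invented dataset:

```python
import math
from collections import Counter

# toy discrete dataset: feature 0 determines the label, feature 1 is noise
data = [
    ((0, 0), 0), ((0, 1), 0), ((0, 0), 0), ((0, 1), 0),
    ((1, 0), 1), ((1, 1), 1), ((1, 0), 1), ((1, 1), 1),
]

def mutual_information(i, rows):
    """Empirical I(X_i; Y) in bits."""
    n = len(rows)
    joint = Counter((f[i], y) for f, y in rows)
    px = Counter(f[i] for f, _ in rows)
    py = Counter(y for _, y in rows)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def greedy_acquire(rows, budget, cost=lambda i: 1.0):
    """Acquire features one at a time, highest information-per-cost first."""
    acquired = []
    remaining = list(range(len(rows[0][0])))
    while remaining and len(acquired) < budget:
        best = max(remaining, key=lambda i: mutual_information(i, rows) / cost(i))
        acquired.append(best)
        remaining.remove(best)
    return acquired

print(greedy_acquire(data, budget=1))  # [0]: the informative feature wins
```

The benchmark's AFAContext dataset is designed precisely to break such myopic scoring: when the value of one feature determines which other features become informative, only lookahead (e.g., reinforcement learning) policies acquire the right subset.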

Updated: 2025-08-20 14:29:16

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.14734v1

Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we outline a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
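
A tunable ensemble of response-level confidence scores reduces to a weighted combination whose weights are fit on labeled calibration data. The sketch below illustrates the idea with invented scorer values and a simple grid search; it is not the UQLM API:

```python
import itertools

# Confidence scores in [0, 1] from three illustrative scorers, plus a label:
# 1 = the response was factual, 0 = hallucinated (a small calibration set).
scores = [
    # (black-box, white-box, judge), correct?
    ((0.9, 0.2, 0.6), 1), ((0.4, 0.9, 0.7), 1), ((0.7, 0.6, 0.4), 1),
    ((0.6, 0.3, 0.2), 0), ((0.2, 0.6, 0.3), 0), ((0.3, 0.2, 0.6), 0),
]

def accuracy(weights, threshold=0.5):
    hits = 0
    for comps, label in scores:
        ensemble = sum(w * c for w, c in zip(weights, comps))
        hits += int((ensemble >= threshold) == bool(label))
    return hits / len(scores)

# Tune the ensemble: grid-search convex weights on the calibration set.
grid = [i / 10 for i in range(11)]
best = max(((w1, w2, 1 - w1 - w2)
            for w1, w2 in itertools.product(grid, grid) if w1 + w2 <= 1),
           key=accuracy)
print(best, accuracy(best))
```

In this toy set each scorer alone misclassifies some responses, but a convex combination separates them all, which mirrors the paper's finding that the tuned ensemble typically surpasses its individual components.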

Updated: 2025-08-20 14:26:48

Domains: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.19254v3

Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models

Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, including admission reasons, major in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization. Our results reveal that while the LLMs (e.g., Qwen2.5 and DeepSeek-v2) perform quite well in capturing admission reasons and hospitalization events, they are generally less consistent when it comes to identifying follow-up recommendations, highlighting broader challenges in leveraging LLMs for comprehensive summarization.

Updated: 2025-08-20 14:24:25

Domains: cs.CL,cs.AI,cs.HC

Download: http://arxiv.org/abs/2504.19061v3

Multi-agent Auditory Scene Analysis

Auditory scene analysis (ASA) aims to retrieve information from the acoustic environment by carrying out three main tasks: sound source location, separation, and classification. These tasks are traditionally executed with a linear data flow, where the sound sources are first located; then, using their location, each source is separated into its own audio stream; from each of which information is extracted that is relevant to the application scenario (audio event detection, speaker identification, emotion classification, etc.). However, running these tasks linearly increases the overall response time, while making the last tasks (separation and classification) highly sensitive to errors of the first task (location). Considerable effort and computational complexity have been invested in the state of the art to develop techniques that are as error-free as possible. However, doing so gives rise to an ASA system that is non-viable in many applications that require a small computational footprint and a low response time, such as bioacoustics, hearing-aid design, search and rescue, human-robot interaction, etc. To this end, in this work, a multi-agent approach is proposed to carry out ASA where the tasks are run in parallel, with feedback loops between them to compensate for local errors, such as: using the quality of the separation output to correct the location error; and using the classification result to reduce the localization's sensitivity towards interferences. The result is a multi-agent auditory scene analysis (MASA) system that is robust against local errors, without a considerable increase in complexity, and with a low response time. The complete proposed MASA system is provided as a publicly available framework that uses open-source tools for sound acquisition and reproduction (JACK) and inter-agent communication (ROS2), allowing users to add their own agents.
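
The feedback idea can be caricatured in a few lines: a toy loop in which poor separation quality re-aims the localizer, which in turn improves classification. All functions, thresholds, and numbers below are invented for illustration; the real MASA agents run in parallel over JACK/ROS2 rather than in a single loop.

```python
def masa_step(location_deg, separate, classify, step=5.0):
    quality = separate(location_deg)      # separation quality in [0, 1]
    if quality < 0.7:                     # feedback loop: re-aim the localizer
        location_deg += step
    label = classify(location_deg)
    return location_deg, quality, label

# Toy stand-ins: separation works best when aimed near 30 degrees.
separate = lambda deg: max(0.0, 1.0 - abs(deg - 30.0) / 30.0)
classify = lambda deg: "speech" if separate(deg) > 0.5 else "unknown"

loc = 10.0
for _ in range(4):
    loc, quality, label = masa_step(loc, separate, classify)
print(round(loc), label)  # 25 speech
```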

Updated: 2025-08-20 14:18:03

Domains: eess.AS,cs.AI

Download: http://arxiv.org/abs/2507.02755v3

Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis

This study presents a quantitative evaluation of the code quality and security of five prominent Large Language Models (LLMs): Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder 8B. While prior research has assessed the functional performance of LLM-generated code, this research tested LLM output from 4,442 Java coding assignments through comprehensive static analysis using SonarQube. The findings suggest that although LLMs can generate functional code, they also introduce a range of software defects, including bugs, security vulnerabilities, and code smells. These defects do not appear to be isolated; rather, they may represent shared weaknesses stemming from systemic limitations within current LLM code generation methods. In particular, critically severe issues, such as hard-coded passwords and path traversal vulnerabilities, were observed across multiple models. These results indicate that LLM-generated code requires verification in order to be considered production-ready. This study found no direct correlation between a model's functional performance (measured by Pass@1 rate of unit tests) and the overall quality and security of its generated code, measured by the number of SonarQube issues in benchmark solutions that passed the functional tests. This suggests that functional benchmark performance score is not a good indicator of overall code quality and security. The goal of this study is not to rank LLM performance but to highlight that all evaluated models appear to share certain weaknesses. Consequently, these findings support the view that static analysis can be a valuable instrument for detecting latent defects and an important safeguard for organizations that deploy AI in software development.

Updated: 2025-08-20 14:16:21

Domains: cs.SE,cs.LG

Download: http://arxiv.org/abs/2508.14727v1

Emerson-Lei and Manna-Pnueli Games for LTLf+ and PPLTL+ Synthesis

Recently, the Manna-Pnueli Hierarchy has been used to define the temporal logics LTLfp and PPLTLp, which make it possible to use finite-trace LTLf/PPLTL techniques in infinite-trace settings while achieving the expressiveness of full LTL. In this paper, we present the first actual solvers for reactive synthesis in these logics. These are based on games on graphs that leverage DFA-based techniques from LTLf/PPLTL to construct the game arena. We start with a symbolic solver based on Emerson-Lei games, which reduces lower-class properties (guarantee, safety) to higher ones (recurrence, persistence) before solving the game. We then introduce Manna-Pnueli games, which natively embed Manna-Pnueli objectives into the arena. These games are solved by composing solutions to a DAG of simpler Emerson-Lei games, resulting in a provably more efficient approach. We implemented the solvers and evaluated their performance in practice on a range of representative formulas. The results show that Manna-Pnueli games often offer significant advantages, though not universally, indicating that combining both approaches could further enhance practical performance.

Updated: 2025-08-20 14:07:43

Domains: cs.LO,cs.AI,cs.FL

Download: http://arxiv.org/abs/2508.14725v1

Transplant Then Regenerate: A New Paradigm for Text Data Augmentation

Data augmentation is a critical technique in deep learning. Traditional methods like back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation through their "knowledge emergence" capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by an LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.
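
The transplant-then-regenerate flow reduces to two prompt rounds. The prompt wording and the llm() stand-in below are hypothetical illustrations of the paradigm, not the paper's actual prompts or code.

```python
def transplant_then_regenerate(seed_text: str, llm) -> str:
    # Step 1 (transplant): embed the seed text in an LLM-expanded context.
    expand_prompt = (
        "Continue the passage below, writing text that naturally surrounds it:\n"
        f"{seed_text}"
    )
    expanded_context = llm(expand_prompt)
    # Step 2 (regenerate): rewrite the seed given that context, preserving its
    # core meaning while varying content and style.
    regen_prompt = (
        f"Context:\n{expanded_context}\n\n"
        "Rewrite the following passage so it fits the context, keeping its "
        f"core meaning:\n{seed_text}"
    )
    return llm(regen_prompt)

# Toy stand-in LLM so the sketch runs end to end without an API call.
fake_llm = lambda prompt: f"<generated for: {prompt[:20]}...>"
print(transplant_then_regenerate("The movie was great.", fake_llm))
```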

Updated: 2025-08-20 14:05:18

Domains: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.14723v1

MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search

Large Language Models (LLMs) are increasingly deployed across diverse applications that demand balancing multiple, often conflicting, objectives -- such as helpfulness, harmlessness, or humor. Aligning outputs to user-specific preferences in such multi-objective settings typically requires fine-tuning models for each objective or preference configuration, which is computationally expensive and inflexible. We introduce MAVIS -- Multi-Objective Alignment via Value-Guided Inference-Time Search -- a lightweight inference-time alignment framework that enables dynamic control over LLM behavior without modifying the base model's weights. MAVIS trains a set of small value models, each corresponding to a distinct objective. At inference time, these value models are combined using user-specified weights to produce a tilting function that adjusts the base model's output distribution toward desired trade-offs. The value models are trained using a simple iterative algorithm that ensures monotonic improvement of the KL-regularized policy. We show empirically that MAVIS outperforms baselines that fine-tune per-objective models and combine them post hoc, and even approaches the performance of the idealized setting where models are fine-tuned for a user's exact preferences.
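
One plausible reading of the tilting function is an exponential reweighting of the base model's token distribution by the weighted value scores, p(y) proportional to p_base(y) * exp(sum_i w_i * V_i(y) / beta). This formula and the toy numbers below are our interpretation of the abstract, not the paper's exact construction.

```python
import math

def tilted_distribution(p_base, value_scores, weights, beta=1.0):
    """Tilt p_base(token) by exp(sum_i w_i * V_i(token) / beta), then renormalize."""
    tilted = {
        tok: p * math.exp(sum(w * value_scores[obj][tok] for obj, w in weights.items()) / beta)
        for tok, p in p_base.items()
    }
    z = sum(tilted.values())
    return {tok: v / z for tok, v in tilted.items()}

p_base = {"a": 0.5, "b": 0.5}
values = {"helpful": {"a": 1.0, "b": 0.0}, "harmless": {"a": 0.0, "b": 1.0}}
# Putting all user weight on "helpful" pushes mass toward token "a".
print(tilted_distribution(p_base, values, {"helpful": 1.0, "harmless": 0.0}))
```

Changing the user-specified weights at inference time shifts the trade-off without touching the base model's parameters, which is the point of the framework.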

Updated: 2025-08-20 13:57:38

Domains: cs.LG

Download: http://arxiv.org/abs/2508.13415v2

The NordDRG AI Benchmark for Large Language Models

Large language models (LLMs) are being piloted for clinical coding and decision support, yet no open benchmark targets the hospital-funding layer where Diagnosis-Related Groups (DRGs) determine reimbursement. In most OECD systems, DRGs route a substantial share of multi-trillion-dollar health spending through governed grouper software, making transparency and auditability first-order concerns. We release NordDRG-AI-Benchmark, the first public, rule-complete test bed for DRG reasoning. The package includes (i) machine-readable approximately 20-sheet NordDRG definition tables and (ii) expert manuals and change-log templates that capture governance workflows. It exposes two suites: a 13-task Logic benchmark (code lookup, cross-table inference, grouping features, multilingual terminology, and CC/MCC validity checks) and a 13-task Grouper benchmark that requires full DRG grouper emulation with strict exact-match scoring on both the DRG and the triggering drg_logic.id. Lightweight reference agents (LogicAgent, GrouperAgent) enable artefact-only evaluation. Under an artefact-only (no web) setting, on the 13 Logic tasks GPT-5 Thinking and Opus 4.1 score 13/13, o3 scores 12/13; mid-tier models (GPT-5 Thinking Mini, o4-mini, GPT-5 Fast) achieve 6-8/13, and remaining models score 5/13 or below. On full grouper emulation across 13 tasks, GPT-5 Thinking solves 7/13, o3 6/13, o4-mini 3/13; GPT-5 Thinking Mini solves 1/13, and all other tested endpoints score 0/13. To our knowledge, this is the first public report of an LLM partially emulating the complete NordDRG grouper logic with governance-grade traceability. Coupling a rule-complete release with exact-match tasks and open scoring provides a reproducible yardstick for head-to-head and longitudinal evaluation in hospital funding. Benchmark materials available in Github.
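
Strict exact-match scoring on both the DRG and the triggering drg_logic.id can be sketched as below; the record fields and example codes are illustrative placeholders, not actual NordDRG entries.

```python
def exact_match_score(predictions, gold):
    """Count cases where both the DRG and the triggering drg_logic.id match exactly."""
    hits = sum(
        1 for p, g in zip(predictions, gold)
        if p["drg"] == g["drg"] and p["drg_logic_id"] == g["drg_logic_id"]
    )
    return hits, len(gold)

preds = [{"drg": "F60C", "drg_logic_id": 17}, {"drg": "A01", "drg_logic_id": 3}]
gold = [{"drg": "F60C", "drg_logic_id": 17}, {"drg": "A01", "drg_logic_id": 4}]
print(exact_match_score(preds, gold))  # (1, 2): right DRG with the wrong rule scores zero
```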

Updated: 2025-08-20 13:47:47

Domains: cs.AI

Download: http://arxiv.org/abs/2506.13790v3

Behind the Myth of Exploration in Policy Gradients

In order to compute near-optimal policies with policy-gradient algorithms, it is common in practice to include intrinsic exploration terms in the learning objective. Although the effectiveness of these terms is usually justified by an intrinsic need to explore environments, we propose a novel analysis with the lens of numerical optimization. Two criteria are introduced on the learning objective and two others on its stochastic gradient estimates, and are afterwards used to discuss the quality of the policy after optimization. The analysis sheds light on two separate effects of exploration techniques. First, they make it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Second, they modify the gradient estimates, increasing the probability that the stochastic parameter updates eventually provide an optimal policy. We empirically illustrate these effects with exploration strategies based on entropy bonuses, identifying limitations and suggesting directions for future work.
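
A common form of the intrinsic exploration term discussed here is an entropy bonus added to the learning objective, J(theta) = E[return] + beta * H(pi). The toy policies below illustrate how the bonus rewards stochastic policies; the numbers are invented.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def regularized_objective(expected_return, probs, beta):
    """J = E[return] + beta * H(pi): the entropy bonus favors stochastic policies."""
    return expected_return + beta * entropy(probs)

uniform = [0.25] * 4              # maximum-entropy policy over 4 actions
greedy = [1.0, 0.0, 0.0, 0.0]     # deterministic policy, zero entropy
print(regularized_objective(1.0, uniform, beta=0.1))  # 1.0 + 0.1 * ln(4)
print(regularized_objective(1.2, greedy, beta=0.1))   # 1.2
```

In the paper's framing, this additive term both smooths the objective (removing local optima) and reshapes the gradient estimates.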

Updated: 2025-08-20 13:43:37

Domains: cs.LG,stat.ML

Download: http://arxiv.org/abs/2402.00162v3

Data-Driven Probabilistic Evaluation of Logic Properties with PAC-Confidence on Mealy Machines

Cyber-Physical Systems (CPS) are complex systems that require powerful models for tasks like verification, diagnosis, or debugging. Often, suitable models are not available and manual extraction is difficult. Data-driven approaches then provide a solution to, e.g., diagnosis tasks and verification problems based on data collected from the system. In this paper, we consider CPS with a discrete abstraction in the form of a Mealy machine. We propose a data-driven approach to determine the safety probability of the system on a finite horizon of n time steps. The approach is based on the Probably Approximately Correct (PAC) learning paradigm. Thus, we elaborate a connection between discrete logic and probabilistic reachability analysis of systems, especially providing an additional confidence on the determined probability. The learning process follows an active learning paradigm, where new learning data is sampled in a guided way after an initial learning set is collected. We validate the approach with a case study on an automated lane-keeping system.
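
For intuition, a standard Hoeffding-style PAC bound gives the number of sampled traces needed to estimate a safety probability to within +/- epsilon at confidence 1 - delta; the paper's exact bound and its active-learning refinement may differ.

```python
import math

def pac_sample_size(epsilon: float, delta: float) -> int:
    """Samples needed so P(|p_hat - p| > epsilon) <= delta (additive Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# e.g., estimating the n-step safety probability within +/-0.05 at 99% confidence:
print(pac_sample_size(0.05, 0.01))  # 1060
```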

Updated: 2025-08-20 13:38:52

Domains: cs.AI

Download: http://arxiv.org/abs/2508.14710v1

ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine

Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.

Updated: 2025-08-20 13:30:20

Domains: cs.CL,cs.AI,cs.CV,cs.LG,cs.MM

Download: http://arxiv.org/abs/2508.14706v1

Learning in Repeated Multi-Objective Stackelberg Games with Payoff Manipulation

We study payoff manipulation in repeated multi-objective Stackelberg games, where a leader may strategically influence a follower's deterministic best response, e.g., by offering a share of their own payoff. We assume that the follower's utility function, representing preferences over multiple objectives, is unknown but linear, and its weight parameter must be inferred through interaction. This introduces a sequential decision-making challenge for the leader, who must balance preference elicitation with immediate utility maximisation. We formalise this problem and propose manipulation policies based on expected utility (EU) and long-term expected utility (longEU), which guide the leader in selecting actions and offering incentives that trade off short-term gains with long-term impact. We prove that under infinite repeated interactions, longEU converges to the optimal manipulation. Empirical results across benchmark environments demonstrate that our approach improves cumulative leader utility while promoting mutually beneficial outcomes, all without requiring explicit negotiation or prior knowledge of the follower's utility function.
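
The expected-utility (EU) policy can be sketched as averaging the leader's net payoff over a belief about the follower's unknown weight, assuming a deterministic best response. All payoffs, weights, and the candidate incentives below are invented for illustration.

```python
def follower_best_response(weight, incentive):
    # Deterministic best response of a linear-utility follower; the leader's
    # offered payoff share tips the follower toward cooperation.
    return "cooperate" if weight + incentive > 0.5 else "defect"

def expected_utility(incentive, belief):
    payoff = {"cooperate": 1.2, "defect": 0.2}   # leader's gross payoffs (toy)
    return sum(
        prob * (payoff[follower_best_response(w, incentive)] - incentive)
        for w, prob in belief.items()
    )

belief = {0.2: 0.5, 0.6: 0.5}   # two candidate preference weights, equally likely
best = max([0.0, 0.1, 0.4], key=lambda a: expected_utility(a, belief))
print(best)  # 0.4: paying enough to sway both follower types beats paying nothing
```

The longEU variant would additionally value what each action reveals about the weight, trading immediate utility for information.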

Updated: 2025-08-20 13:29:24

Domains: cs.GT,cs.AI

Download: http://arxiv.org/abs/2508.14705v1

MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown-tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise-level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.
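
The three evaluator kinds can be pictured as a common score interface; the class and method names below are invented for illustration and are not MCP-Universe's actual API.

```python
class FormatEvaluator:
    """Checks agent-format compliance, e.g., that the answer is a JSON object."""
    def score(self, answer: str) -> bool:
        a = answer.strip()
        return a.startswith("{") and a.endswith("}")

class StaticEvaluator:
    """Compares against time-invariant ground truth fixed at benchmark build time."""
    def __init__(self, expected: str):
        self.expected = expected
    def score(self, answer: str) -> bool:
        return answer.strip() == self.expected

class DynamicEvaluator:
    """Fetches real-time ground truth (e.g., a live API lookup) at scoring time."""
    def __init__(self, fetch_ground_truth):
        self.fetch = fetch_ground_truth
    def score(self, answer: str) -> bool:
        return answer.strip() == self.fetch()

evals = [FormatEvaluator(), StaticEvaluator("42")]
print([e.score("42") for e in evals])  # [False, True]
```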

Updated: 2025-08-20 13:28:58

Domains: cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.14704v1

A Lightweight Incentive-Based Privacy-Preserving Smart Metering Protocol for Value-Added Services

The emergence of smart grids and advanced metering infrastructure (AMI) has revolutionized energy management. Unlike traditional power grids, smart grids benefit from two-way communication through AMI, which surpasses earlier automated meter reading (AMR). AMI enables diverse demand- and supply-side utilities such as accurate billing, outage detection, real-time grid control, load forecasting, and value-added services. Smart meters play a key role by delivering consumption values at predefined intervals to the utility provider (UP). However, such reports may raise privacy concerns, as adversaries can infer lifestyle patterns, political orientations, and the types of electrical devices in a household, or even sell the data to third parties (TP) such as insurers. In this paper, we propose a lightweight, privacy-preserving smart metering protocol for incentive-based value-added services. The scheme employs local differential privacy, hash chains, blind digital signatures, pseudonyms, temporal aggregation, and anonymous overlay networks to report coarse-grained values with adjustable granularity to the UP. This protects consumers' privacy while preserving data utility. The scheme prevents identity disclosure while enabling automatic token redemption. From a performance perspective, our results show that with a 1024-bit RSA key, a 7-day duration, and four reports per day, our protocol runs in approximately 0.51s and consumes about 4.5 MB of memory. From a privacy perspective, the protocol resists semi-trusted and untrusted adversaries.
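
The local-differential-privacy step can be sketched with standard-library Laplace sampling: perturb the reading with noise scaled to sensitivity/epsilon, then coarsen to the reporting granularity. Parameter values are illustrative, and the protocol's other layers (hash chains, blind signatures, pseudonyms, anonymous overlay) are out of scope for this sketch.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of Laplace(0, scale) using only the standard library.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def ldp_report(reading_kwh, epsilon, sensitivity, granularity):
    """Perturb a meter reading for local DP, then round to the report granularity."""
    noisy = reading_kwh + laplace_noise(sensitivity / epsilon)
    return round(noisy / granularity) * granularity

print(ldp_report(12.3, epsilon=1.0, sensitivity=1.0, granularity=0.5))
```

A larger granularity (coarser buckets) or smaller epsilon (more noise) trades utility for privacy, matching the paper's "adjustable granularity" knob.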

Updated: 2025-08-20 13:28:39

Domains: cs.CR

Download: http://arxiv.org/abs/2508.14703v1

Common Data Format (CDF): A Standardized Format for Match-Data in Football (Soccer)

During football matches, a variety of different parties (e.g., companies) each collect (possibly overlapping) data about the match ranging from basic information (e.g., starting players) to detailed positional data. This data is provided to clubs, federations, and other organizations that are increasingly interested in leveraging this data to inform their decision making. Unfortunately, analyzing such data poses significant barriers because each provider may (1) collect different data, (2) use different specifications even within the same category of data, (3) represent the data differently, and (4) deliver the data in a different manner (e.g., file format, protocol). Consequently, working with these data requires a significant investment of time and money. The goal of this work is to propose a uniform and standardized format for football data called the Common Data Format (CDF). The CDF specifies a minimal schema for five types of match data: match sheet data, video footage, event data, tracking data, and match meta data. It aims to ensure that the provided data is clear, sufficiently contextualized (e.g., its provenance is clear), and complete such that it enables common downstream analysis tasks. Concretely, this paper will detail the technical specifications of the CDF, the representational choices that were made to help ensure the clarity of the provided data, and a concrete approach for delivering data in the CDF. This represents Version 1.0.0 of the CDF.
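
A hypothetical minimal record covering the five CDF data types might look like the dict below. The field names are placeholders chosen for illustration; the real schema is defined in the CDF specification itself.

```python
# Illustrative structure only: one container per CDF data type.
match_record = {
    "match_sheet": {"home": "Club A", "away": "Club B", "lineups": []},
    "video": {"uri": "match.mp4", "fps": 25},
    "events": [{"type": "pass", "t": 12.4, "player_id": 7}],
    "tracking": [{"t": 12.4, "player_id": 7, "x": 52.1, "y": 33.8}],
    "meta": {"provider": "provider-x", "collected_at": "2025-08-20"},
}
print(sorted(match_record))
```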

Updated: 2025-08-20 13:24:51

Domains: cs.DB,cs.AI

Download: http://arxiv.org/abs/2505.15820v4

Foe for Fraud: Transferable Adversarial Attacks in Credit Card Fraud Detection

Credit card fraud detection (CCFD) is a critical application of Machine Learning (ML) in the financial sector, where accurately identifying fraudulent transactions is essential for mitigating financial losses. ML models have demonstrated their effectiveness in the fraud detection task, particularly on tabular datasets. While adversarial attacks have been extensively studied in computer vision and deep learning, their impact on ML models, particularly those trained on CCFD tabular datasets, remains largely unexplored. These latent vulnerabilities pose significant threats to the security and stability of the financial industry, especially in high-value transactions where losses could be substantial. To address this gap, in this paper, we present a holistic framework that investigates the robustness of CCFD ML models against adversarial perturbations under different circumstances. Specifically, gradient-based attack methods are applied to tabular credit card transaction data in both black-box and white-box adversarial attack settings. Our findings confirm that tabular data is also susceptible to subtle perturbations, highlighting the need for heightened awareness among financial technology practitioners regarding ML model security and trustworthiness. Furthermore, experiments transferring adversarial samples from gradient-based attack methods to non-gradient-based models also corroborate our findings. Our results demonstrate that such attacks remain effective, emphasizing the necessity of developing robust defenses for CCFD algorithms.
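
A gradient-sign (FGSM-style) perturbation on a tabular feature vector can be sketched against a hand-coded logistic-regression model so the example is self-contained; real attacks would use the trained detector's gradients, and all feature values and weights here are invented.

```python
import math

def fgsm_tabular(x, w, b, y, epsilon):
    """One FGSM step against logistic regression P(fraud|x) = sigmoid(w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    # For cross-entropy loss, d(loss)/dx_i = (p - y) * w_i, so we step along its sign.
    return [xi + epsilon * math.copysign(1.0, (p - y) * wi) for xi, wi in zip(x, w)]

x = [0.2, 1.5, -0.3]        # transaction features (illustrative)
w = [0.8, -0.5, 1.2]        # model weights (illustrative)
adv = fgsm_tabular(x, w, b=0.1, y=1.0, epsilon=0.05)
print(adv)  # each feature nudged by +/- epsilon in the loss-increasing direction
```

Unlike images, tabular features often have hard semantic constraints (integer counts, category codes), which is one reason transferable tabular attacks are notable.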

Updated: 2025-08-20 13:23:28

标题: 欺诈的敌人:信用卡欺诈检测中的可转移对抗性攻击

摘要: 信用卡欺诈检测(CCFD)是机器学习(ML)在金融领域的一个关键应用,准确识别欺诈交易对于减少金融损失至关重要。ML模型在欺诈检测任务中表现出了其有效性,特别是在表格数据集中。虽然对抗性攻击在计算机视觉和深度学习中得到了广泛研究,但对ML模型的影响,尤其是那些在CCFD表格数据集上训练的模型,仍然很少被探讨。这些潜在的漏洞对金融行业的安全和稳定性构成重大威胁,尤其是在高价值交易中,损失可能是巨大的。为了填补这一空白,本文提出了一个全面的框架,研究了CCFD ML模型在不同情况下对抗性扰动的稳健性。具体地,基于梯度的攻击方法被纳入到表格信用卡交易数据中,同时考虑黑盒和白盒对抗性攻击设置。我们的研究结果证实,表格数据也容易受到微小扰动的影响,强调了金融科技从业者对ML模型安全性和可信度的意识的必要性。此外,将基于梯度的攻击方法产生的对抗样本转移到非梯度模型的实验也验证了我们的发现。我们的结果表明,这种攻击仍然有效,强调了开发CCFD算法强大防御的必要性。

更新时间: 2025-08-20 13:23:28

领域: cs.CR,cs.AI

下载: http://arxiv.org/abs/2508.14699v1

Testing Components of the Attention Schema Theory in Artificial Neural Networks

Growing evidence suggests that the brain uses an attention schema, or a simplified model of attention, to help control what it attends to. One proposed benefit of this model is to allow agents to model the attention states of other agents, and thus predict and interact with other agents. The effects of an attention schema may be examined in artificial agents. Although attention mechanisms in artificial agents are different from those in biological brains, there may be some principles in common. In both cases, select features or representations are emphasized for better performance. Here, using neural networks with transformer attention mechanisms, we asked whether the addition of an attention schema affected the ability of agents to make judgements about and cooperate with each other. First, we found that an agent with an attention schema is better at categorizing the attention states of other agents (higher accuracy). Second, an agent with an attention schema develops a pattern of attention that is easier for other agents to categorize. Third, in a joint task where two agents must predict each other to paint a scene together, adding an attention schema improves performance. Finally, the performance improvements are not caused by a general increase in network complexity. Instead, improvement is specific to tasks involving judging, categorizing, or predicting the attention of other agents. These results support the hypothesis that an attention schema has computational properties beneficial to mutual interpretability and interactive behavior. We speculate that the same principles might pertain to biological attention and attention schemas in people.

Updated: 2025-08-20 13:19:18

标题: 在人工神经网络中测试注意力模式理论的组件

摘要: 越来越多的证据表明,大脑使用注意模式或简化的注意模式来帮助控制它关注的事物。这个模式的一个可能好处是允许代理模拟其他代理的注意状态,从而预测并与其他代理互动。注意模式的影响可以在人工代理中进行研究。虽然人工代理中的注意机制与生物大脑中的不同,但可能存在一些共同的原则。在两种情况下,选择特征或表示以提高性能。在这里,我们使用具有变压器注意机制的神经网络,探讨了注意模式的添加是否影响代理对彼此进行判断和合作的能力。首先,我们发现具有注意模式的代理更擅长对其他代理的注意状态进行分类(更高的准确性)。其次,具有注意模式的代理会发展出其他代理更容易分类的注意模式。第三,在一个共同任务中,两个代理必须预测对方一起绘制一个场景时,添加注意模式会提高性能。最后,性能的提高不是由于网络复杂性的普遍增加。相反,改进是特定于涉及判断、分类或预测其他代理注意力的任务。这些结果支持一个假设,即注意模式具有有益于相互可解释性和互动行为的计算属性。我们猜测这些原则可能也适用于人类的生物注意力和注意模式。

更新时间: 2025-08-20 13:19:18

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2411.00983v3

ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signal

Pre-trained foundation models have demonstrated remarkable success in vision and language, yet their potential for general machine signal modeling (covering acoustic, vibration, and other industrial sensor data) remains under-explored. Existing approaches using sub-band-based encoders have achieved competitive results but are limited by fixed input lengths and the absence of explicit frequency positional encoding. In this work, we propose a novel foundation model that integrates an advanced band-split architecture with relative frequency positional embeddings, enabling precise spectral localization across arbitrary sampling configurations. The model supports inputs of arbitrary length without padding or segmentation, producing a concise embedding that retains both temporal and spectral fidelity. We evaluate our method on SIREN (https://github.com/yucongzh/SIREN), a newly introduced large-scale benchmark for machine signal encoding that unifies multiple datasets, including all DCASE task 2 challenges (2020-2025) and widely used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in anomaly detection and fault identification, confirming the effectiveness and generalization capability of the proposed model. We open-source ECHO at https://github.com/yucongzh/ECHO.

Updated: 2025-08-20 13:10:44

标题: ECHO: 针对可变长度信号的频率感知分层编码

摘要: 预训练的基础模型在视觉和语言方面取得了显著成功,但它们在一般机器信号建模方面的潜力——涵盖声学、振动和其他工业传感器数据——尚未得到充分挖掘。现有的基于子带编码器的方法取得了竞争性的结果,但受限于固定的输入长度,以及缺乏明确的频率位置编码。在这项工作中,我们提出了一个新颖的基础模型,将先进的分带架构与相对频率位置嵌入相结合,实现了在任意采样配置下精确的频谱定位。该模型支持任意长度的输入,无需填充或分段,生成一个保留了时域和频谱保真度的简洁嵌入。我们在SIREN(https://github.com/yucongzh/SIREN)上评估了我们的方法,这是一个新引入的大规模机器信号编码基准,统一了多个数据集,包括所有DCASE任务2挑战(2020-2025)和广泛使用的工业信号语料库。实验结果表明,在异常检测和故障识别方面保持了一致的最先进性能,验证了所提出模型的有效性和泛化能力。我们在https://github.com/yucongzh/ECHO上开源了ECHO。

更新时间: 2025-08-20 13:10:44

领域: cs.SD,cs.AI,cs.LG

下载: http://arxiv.org/abs/2508.14689v1

Benchmarking graph construction by large language models for coherence-driven inference

We devise an algorithm to generate propositions that objectively instantiate graphs supporting coherence-driven inference. We also benchmark the ability of large language models (LLMs) to reconstruct coherence graphs from (a simple transformation of) propositions expressed in natural language, with promising results from a single prompt to reasoning-optimized LLMs. For example, o1/3/4-mini achieve perfect reconstruction half of the time on sparse graphs. Coherence-driven inference on consistency evaluations by LLMs may advance machine cognition capabilities.
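
One plausible way to score an LLM-reconstructed coherence graph against the ground-truth graph of propositions is edge-set F1. This metric is an illustrative assumption on our part (the abstract itself reports rates of perfect reconstruction); the proposition names below are hypothetical.

```python
def edge_f1(true_edges, pred_edges):
    """Edge-set F1 for scoring a reconstructed coherence graph
    against the ground-truth graph over the same propositions."""
    true_s = {frozenset(e) for e in true_edges}   # treat edges as undirected
    pred_s = {frozenset(e) for e in pred_edges}
    tp = len(true_s & pred_s)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_s)
    recall = tp / len(true_s)
    return 2 * precision * recall / (precision + recall)

# ground-truth coherence relations vs. an LLM's reconstruction (toy example)
truth = [("p1", "p2"), ("p2", "p3"), ("p1", "p4")]
pred = [("p2", "p1"), ("p2", "p3")]
print(round(edge_f1(truth, pred), 2))  # → 0.8
```

Perfect reconstruction, as reported for o1/3/4-mini on sparse graphs, corresponds to an F1 of 1.0 under this scoring.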

Updated: 2025-08-20 13:10:14

标题: 面向连贯性驱动推理的大型语言模型图构建基准测试

摘要: 我们设计了一个算法,用于生成客观实例化支持基于连贯性推理的图形的命题。我们还对大型语言模型(LLMs)从自然语言表达的命题(经过简单转换)重建连贯性图的能力进行了基准测试,并从单一提示到经过优化的推理LLMs获得了有希望的结果。例如,o1/3/4-mini 在稀疏图上一半的时间实现了完美重建。LLMs对一致性评估的连贯性推理可能推进机器认知能力的发展。

更新时间: 2025-08-20 13:10:14

领域: cs.AI

下载: http://arxiv.org/abs/2502.13953v2

Addressing Graph Anomaly Detection via Causal Edge Separation and Spectrum

In the real world, anomalous entities often add more legitimate connections while hiding direct links with other anomalous entities, leading to heterophilic structures in anomalous networks that most GNN-based techniques fail to address. Several works have been proposed to tackle this issue in the spatial domain. However, these methods overlook the complex relationships among node structure encoding, node features, and their contextual environment, and lack principled guidance; research on solving heterophily problems in the spectral domain remains limited. This study analyzes the spectral distribution of nodes with different degrees of heterophily and discovers that the heterophily of anomalous nodes causes the spectral energy to shift from low to high frequencies. To address the above challenges, we propose CES2-GAD, a spectral neural network based on causal edge separation for anomaly detection on heterophilic graphs. First, CES2-GAD separates the original graph into homophilic and heterophilic edges using causal interventions. Subsequently, various hybrid-spectrum filters are used to capture signals from the segmented graphs. Finally, representations from multiple signals are concatenated and fed into a classifier to predict anomalies. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed method.
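
The low-to-high spectral energy shift can be made concrete with the graph Fourier transform: project a node signal onto the eigenvectors of the normalized Laplacian and measure how much energy falls in the upper spectrum. The sketch below uses a toy ring graph (not the paper's datasets): a constant, homophilic signal concentrates its energy at low frequencies, while an alternating, heterophilic signal concentrates it at high frequencies.

```python
import numpy as np

def spectral_energy(adj, signal):
    """Distribution of a node signal's energy over the spectrum
    of the symmetric normalized graph Laplacian."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(lap)
    coeffs = eigvecs.T @ signal            # graph Fourier transform
    energy = coeffs ** 2
    return eigvals, energy / energy.sum()

# 6-node ring graph
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0

smooth = np.ones(n)                                   # homophilic signal
alternating = np.array([1., -1., 1., -1., 1., -1.])   # heterophilic signal

for sig in (smooth, alternating):
    lam, e = spectral_energy(adj, sig)
    high = e[lam > 1.0].sum()              # energy above mid-spectrum
    print(round(high, 3))                  # smooth → 0.0, alternating → 1.0
```

The constant signal is an eigenvector for eigenvalue 0, so all of its energy sits at the bottom of the spectrum; the alternating signal is an eigenvector for the top eigenvalue 2, mirroring the high-frequency shift the paper attributes to heterophilic anomalies.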

Updated: 2025-08-20 12:59:22

标题: 通过因果边缘分离和频谱解决图异常检测

摘要: 在现实世界中,异常实体通常会添加更多合法连接,同时隐藏与其他异常实体的直接链接,导致异质结构在异常网络中,大多数基于图神经网络的技术无法解决。已经提出了几种方法来解决空间领域中的这个问题。然而,这些方法忽视了节点结构编码、节点特征和它们的上下文环境之间复杂的关系,并依赖于原则性的指导,解决谱领域异质问题的研究仍然有限。本研究分析了具有不同异质度的节点的谱分布,并发现异常节点的异质性导致谱能量从低频转移到高频。为了解决上述挑战,我们提出了一种基于因果边分离的谱神经网络CES2-GAD,用于在异质图上进行异常检测。首先,CES2-GAD将使用因果干预将原始图分成同质和异质边。随后,使用各种混合谱滤波器来捕捉来自分段图的信号。最后,将多个信号的表示连接并输入分类器以预测异常。通过对真实世界数据集的大量实验,已经证明了我们提出的方法的有效性。

更新时间: 2025-08-20 12:59:22

领域: cs.LG

下载: http://arxiv.org/abs/2508.14684v1

Improving Fairness in Graph Neural Networks via Counterfactual Debiasing

Graph Neural Networks (GNNs) have been successful in modeling graph-structured data. However, similar to other machine learning models, GNNs can exhibit bias in predictions based on attributes like race and gender. Moreover, bias in GNNs can be exacerbated by the graph structure and message-passing mechanisms. Recent cutting-edge methods propose mitigating bias by filtering out sensitive information from input or representations, like edge dropping or feature masking. Yet, we argue that such strategies may unintentionally eliminate non-sensitive features, leading to a compromised balance between predictive accuracy and fairness. To tackle this challenge, we present a novel approach utilizing counterfactual data augmentation for bias mitigation. This method involves creating diverse neighborhoods using counterfactuals before message passing, facilitating unbiased node representations learning from the augmented graph. Subsequently, an adversarial discriminator is employed to diminish bias in predictions by conventional GNN classifiers. Our proposed technique, Fair-ICD, ensures the fairness of GNNs under moderate conditions. Experiments on standard datasets using three GNN backbones demonstrate that Fair-ICD notably enhances fairness metrics while preserving high predictive performance.

Updated: 2025-08-20 12:59:05

标题: 通过反事实去偏见改善图神经网络的公平性

摘要: 图神经网络(GNNs)在建模图结构数据方面取得了成功。然而,类似于其他机器学习模型,GNNs可能在基于种族和性别等属性的预测中表现出偏见。此外,GNNs中的偏见可能会因图结构和消息传递机制而加剧。最近的前沿方法提出通过过滤输入或表示中的敏感信息(如删除边或特征屏蔽)来减轻偏见。然而,我们认为这种策略可能会意外地消除非敏感特征,导致预测准确性和公平性之间的平衡受损。为了解决这一挑战,我们提出了一种利用反事实数据增强来减轻偏见的新方法。该方法涉及在消息传递之前使用反事实创建多样化的邻域,促进从增强图中学习无偏见的节点表示。随后,通过传统GNN分类器来减少预测中的偏见。我们提出的技术Fair-ICD在适度条件下确保了GNNs的公平性。使用三种GNN主干的标准数据集进行的实验表明,Fair-ICD显著提高了公平性指标,同时保持了高预测性能。

更新时间: 2025-08-20 12:59:05

领域: cs.LG

下载: http://arxiv.org/abs/2508.14683v1

Hybrid-Hierarchical Fashion Graph Attention Network for Compatibility-Oriented and Personalized Outfit Recommendation

The rapid expansion of the fashion industry and the growing variety of products have made it increasingly challenging for users to identify compatible items on e-commerce platforms. Effective fashion recommendation systems are therefore crucial for filtering irrelevant options and suggesting suitable ones. However, simultaneously addressing outfit compatibility and personalized recommendations remains a significant challenge, as these aspects are typically treated independently in existing studies, thereby overlooking the complex interactions between items and user preferences. This research introduces a new framework named FGAT, which leverages a hierarchical graph representation together with graph attention mechanisms to address this problem. The framework constructs a three-tier graph of users, outfits, and items, integrating visual and textual features to jointly model outfit compatibility and user preferences. By dynamically weighting node importance during representation propagation, the graph attention mechanism captures key interactions and produces precise embeddings for both user preferences and outfit compatibility. Evaluated on the POG dataset, FGAT outperforms strong baselines such as HFGN, achieving notable improvements in accuracy, precision, HR, recall, and NDCG. These results demonstrate that combining multimodal visual and textual features with a hierarchical graph structure and attention mechanisms significantly enhances the effectiveness and efficiency of personalized fashion recommendation systems.

Updated: 2025-08-20 12:50:16

标题: 混合-分层时尚图注意力网络用于基于兼容性和个性化的服装推荐

摘要: 时尚行业的快速扩张和产品种类的增多使得用户在电子商务平台上识别兼容物品变得越来越具有挑战性。因此,有效的时尚推荐系统对于过滤不相关选项并建议适合的选项至关重要。然而,同时处理服装兼容性和个性化推荐仍然是一个重大挑战,因为这些方面通常在现有研究中被独立处理,从而忽视了物品和用户偏好之间复杂的相互作用。本研究介绍了一个名为FGAT的新框架,该框架利用分层图表示以及图注意机制来解决这一问题。该框架构建了一个用户、服装和物品的三层图,整合视觉和文本特征来共同建模服装兼容性和用户偏好。通过在表示传播过程中动态加权节点重要性,图注意机制捕捉关键交互并为用户偏好和服装兼容性产生精确的嵌入。在POG数据集上评估,FGAT优于HFGN等强基线,实现了准确性、精度、HR、召回率和NDCG的显著改进。这些结果表明,将多模态视觉和文本特征与分层图结构和注意机制相结合显著增强了个性化时尚推荐系统的有效性和效率。

更新时间: 2025-08-20 12:50:16

领域: cs.LG,cs.IR

下载: http://arxiv.org/abs/2508.11105v2

Multi-scale species richness estimation with deep learning

Biodiversity assessments are critically affected by the spatial scale at which species richness is measured. How species richness accumulates with sampling area depends on natural and anthropogenic processes whose effects can change depending on the spatial scale considered. These accumulation dynamics, described by the species-area relationship (SAR), are challenging to assess because most biodiversity surveys are restricted to sampling areas much smaller than the scales at which these processes operate. Here, we combine sampling theory and deep learning to predict local species richness within arbitrarily large sampling areas, enabling, for the first time, estimation of spatial differences in SARs. We demonstrate our approach by predicting vascular plant species richness across Europe and evaluate predictions against an independent dataset of plant community inventories. The resulting model, named deep SAR, delivers multi-scale species richness maps, improving coarse-grain richness estimates by 32% compared to conventional methods while also delivering finer-grain estimates. In addition to its predictive capabilities, we show how our deep SAR model can provide fundamental insights into the multi-scale effects of key biodiversity processes. The capacity of our approach to deliver comprehensive species richness estimates across the full spectrum of ecologically relevant scales is essential for robust biodiversity assessments and forecasts under global change.
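
The species-area relationship underlying this work is classically modeled as the Arrhenius power law S = c * A**z, and a log-log least-squares fit recovers the accumulation exponent z from richness counts at nested areas. The survey numbers below are made up for illustration; they are not from the paper's data.

```python
import numpy as np

# Arrhenius species-area relationship: S = c * A**z.
# Fitting log S = log c + z * log A by least squares recovers the
# accumulation exponent z from richness surveys at nested areas.
areas = np.array([1., 10., 100., 1000.])       # sampling areas (e.g. km^2)
richness = np.array([12., 30., 75., 190.])      # hypothetical species counts

z, log_c = np.polyfit(np.log(areas), np.log(richness), 1)
print(round(z, 2), round(np.exp(log_c), 1))     # → 0.4 12.0
```

Spatial differences in SARs, which the paper estimates with deep learning, correspond to the exponent z (and intercept c) varying from place to place rather than being a single global constant.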

Updated: 2025-08-20 12:43:56

标题: 用深度学习进行多尺度物种丰富度估计

摘要: 生物多样性评估受到物种丰富度测量的空间尺度的重大影响。物种丰富度随着采样区域的增加而积累取决于自然和人为过程,其效果可能会因考虑的空间尺度而改变。这些积累动态由物种-面积关系(SAR)描述,这些关系很难评估,因为大多数生物多样性调查受限于比这些过程操作的尺度小得多的采样区域。在这里,我们结合采样理论和深度学习,预测任意大采样区域内的本地物种丰富度,首次能够估计SAR的空间差异。我们通过预测欧洲各地的维管植物物种丰富度来展示我们的方法,并将预测结果与植物群落清单的独立数据集进行评估。结果模型,称为深度SAR,提供多尺度物种丰富度地图,相对于常规方法,将粗粒度丰富度估计提高了32%,同时提供了更细粒度的估计。除了其预测能力,我们展示了我们的深度SAR模型如何提供关键生物多样性过程的多尺度效应的基本见解。我们的方法能够提供全面的物种丰富度估计,涵盖生态相关尺度的整个范围,这对于在全球变化下进行强大的生物多样性评估和预测至关重要。

更新时间: 2025-08-20 12:43:56

领域: q-bio.PE,cs.LG,92-08, 92B05, 92B15, 92B20, 92D40 (Primary) 62P10, 62P12 (Secondary)

下载: http://arxiv.org/abs/2507.06358v2

ELATE: Evolutionary Language model for Automated Time-series Engineering

Time-series prediction involves forecasting future values using machine learning models. Feature engineering, whereby existing features are transformed to make new ones, is critical for enhancing model performance, but is often manual and time-intensive. Existing automation attempts rely on exhaustive enumeration, which can be computationally costly and lacks domain-specific insights. We introduce ELATE (Evolutionary Language model for Automated Time-series Engineering), which leverages a language model within an evolutionary framework to automate feature engineering for time-series data. ELATE employs time-series statistical measures and feature importance metrics to guide and prune features, while the language model proposes new, contextually relevant feature transformations. Our experiments demonstrate that ELATE improves forecasting accuracy by an average of 8.4% across various domains.
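
A minimal sketch of one ELATE-style generation under stated assumptions: candidate transformations are proposed (here by a hard-coded stand-in for the language model), scored with a simple importance statistic (absolute correlation with the target, one plausible choice; the paper does not commit to this exact metric), and pruned to the top k.

```python
import numpy as np

def propose_transforms():
    """Stand-in for the LLM proposal step: in ELATE these would be
    contextually relevant transformations suggested by a language model."""
    return {
        "lag1":  lambda s: np.r_[s[0], s[:-1]],
        "diff":  lambda s: np.r_[0.0, np.diff(s)],
        "roll3": lambda s: np.convolve(s, np.ones(3) / 3, mode="same"),
    }

def importance(feature, target):
    """Absolute Pearson correlation as a simple feature-importance score."""
    if np.std(feature) == 0:
        return 0.0
    return abs(np.corrcoef(feature, target)[0, 1])

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))   # toy time series (random walk)
target = np.r_[series[1:], series[-1]]     # next-step value to forecast

# one evolutionary generation: propose, score, prune to the best k
scored = {name: importance(f(series), target)
          for name, f in propose_transforms().items()}
kept = sorted(scored, key=scored.get, reverse=True)[:2]
print(kept)
```

On this random-walk toy, level-preserving features (lag, rolling mean) score far higher than the increments, so pruning keeps them; in ELATE the surviving features seed the next round of LLM proposals.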

Updated: 2025-08-20 12:36:29

标题: ELATE:用于自动时间序列工程的进化语言模型

摘要: 时间序列预测涉及使用机器学习模型来预测未来值。特征工程,即通过转换现有特征来生成新特征,对于增强模型性能至关重要,但通常是手动且耗时的。现有的自动化尝试依赖于穷举法,这可能计算成本高昂且缺乏领域特定的见解。我们介绍了ELATE(自动化时间序列工程的进化语言模型),它利用语言模型在进化框架中自动化时间序列数据的特征工程。ELATE利用时间序列统计量和特征重要性度量来指导和剪枝特征,而语言模型提出新的、上下文相关的特征转换。我们的实验表明,ELATE在各个领域平均提高了8.4%的预测准确性。

更新时间: 2025-08-20 12:36:29

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.14667v1

Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning (ZeroTIR), training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is demonstrating that, as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at https://github.com/yyht/openrlhf_async_pipline.
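
The outcome-based reward at the heart of this kind of training can be sketched as: execute the model's generated Python, compare its printed answer to the ground truth, and grant reward 1.0 only on a match. The in-process `exec` below is purely illustrative; the paper's framework uses a decoupled execution environment, and a production setup would sandbox generated code in a separate process.

```python
import contextlib
import io

def execute_and_reward(code: str, expected: str) -> float:
    """Run model-generated Python in fresh globals, capture stdout,
    and grant an outcome-based reward of 1.0 iff the printed answer
    matches the expected one. Crashes and wrong answers score 0.0."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # fresh globals; NOT a real sandbox
    except Exception:
        return 0.0
    return 1.0 if buf.getvalue().strip() == expected else 0.0

# hypothetical model outputs for "sum of the first 100 positive integers"
good = "print(sum(range(1, 101)))"
bad = "print(sum(range(100)))"

print(execute_and_reward(good, "5050"), execute_and_reward(bad, "5050"))  # → 1.0 0.0
```

Because only the final outcome is rewarded, the policy is never shown tool-use demonstrations; spontaneous code execution has to emerge because it raises this reward.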

Updated: 2025-08-20 12:20:55

标题: 代理RL缩放定律:具有自发代码执行的代理RL用于数学问题求解

摘要: 大型语言模型(LLMs)通常在需要精确、可验证计算的数学推理任务中遇到困难。虽然基于结果的奖励的强化学习(RL)增强了基于文本的推理能力,但了解代理如何自主学习利用外部工具如代码执行仍然至关重要。我们研究了基于结果的奖励的RL用于工具整合推理(Tool-Integrated Reasoning,ZeroTIR),训练基于LLMs的模型自发生成和执行Python代码,解决数学问题,而无需受监督的工具使用示例。我们的主要贡献是我们证明随着RL训练的进行,关键指标可预测地扩展。具体而言,我们观察到强烈的正相关关系,即增加训练步骤会导致自发代码执行频率、平均响应长度以及关键的最终任务准确性的增加。这表明了在训练中投入的计算努力与有效的、工具增强的推理策略的出现之间存在可量化的关系。我们实现了一个稳健的框架,其中包含一个解耦的代码执行环境,并验证了我们的发现跨标准RL算法和框架。实验证明ZeroTIR在具有挑战性的数学基准测试中显着超越了非工具ZeroRL基线。我们的研究结果提供了关于如何在代理RL中获得和扩展自主工具使用的基础性理解,为未来研究提供了可再现的基准。代码已发布在\href{https://github.com/yyht/openrlhf_async_pipline}{https://github.com/yyht/openrlhf\_async\_pipline}。

更新时间: 2025-08-20 12:20:55

领域: cs.AI

下载: http://arxiv.org/abs/2505.07773v4

Entropy-Constrained Strategy Optimization in Urban Floods: A Multi-Agent Framework with LLM and Knowledge Graph Integration

In recent years, the increasing frequency of extreme urban rainfall events has posed significant challenges to emergency scheduling systems. Urban flooding often leads to severe traffic congestion and service disruptions, threatening public safety and mobility. However, effective decision making remains hindered by three key challenges: (1) managing trade-offs among competing goals (e.g., traffic flow, task completion, and risk mitigation) requires dynamic, context-aware strategies; (2) rapidly evolving environmental conditions render static rules inadequate; and (3) LLM-generated strategies frequently suffer from semantic instability and execution inconsistency. Existing methods fail to align perception, global optimization, and multi-agent coordination within a unified framework. To tackle these challenges, we introduce H-J, a hierarchical multi-agent framework that integrates knowledge-guided prompting, entropy-constrained generation, and feedback-driven optimization. The framework establishes a closed-loop pipeline spanning from multi-source perception to strategic execution and continuous refinement. We evaluate H-J on real-world urban topology and rainfall data under three representative conditions: extreme rainfall, intermittent bursts, and daily light rain. Experiments show that H-J outperforms rule-based and reinforcement-learning baselines in traffic smoothness, task success rate, and system robustness. These findings highlight the promise of uncertainty-aware, knowledge-constrained LLM-based approaches for enhancing resilience in urban flood response.

Updated: 2025-08-20 12:13:03

标题: 城市洪水中的熵约束策略优化:带有LLM和知识图谱集成的多智能体框架

摘要: 近年来,极端城市降雨事件的频率不断增加,给应急调度系统带来了巨大挑战。城市洪水往往导致严重的交通拥堵和服务中断,威胁公共安全和流动性。然而,有效的决策仍受到三个关键挑战的阻碍:(1)在竞争目标之间进行权衡(例如交通流量、任务完成和风险缓解)需要动态的、上下文感知的策略;(2)快速变化的环境条件使静态规则不足以应对;(3)基于LLM生成的策略经常受到语义不稳定性和执行不一致性的影响。现有方法未能在统一框架内实现感知、全局优化和多智能体协调的一致性。为了解决这些挑战,我们引入了H-J,一个集成了知识引导提示、熵约束生成和反馈驱动优化的分层多智能体框架。该框架建立了一个从多源感知到战略执行和持续改进的闭环管道。我们在真实的城市拓扑和降雨数据下评估了H-J在三种典型条件下:极端降雨、间歇性爆发和日常小雨。实验证明,H-J在交通流畅性、任务成功率和系统稳健性方面优于基于规则和强化学习的基线。这些发现突出了基于不确定性感知、知识约束的LLM方法在增强城市洪水响应韧性方面的潜力。

更新时间: 2025-08-20 12:13:03

领域: cs.AI

下载: http://arxiv.org/abs/2508.14654v1

Understanding Data Influence with Differential Approximation

Data plays a pivotal role in the groundbreaking advancements in artificial intelligence. The quantitative analysis of data significantly contributes to model training, enhancing both the efficiency and quality of data utilization. However, existing data analysis tools often lag in accuracy. For instance, many of these tools even assume that the loss function of neural networks is convex. These limitations make it challenging to implement current methods effectively. In this paper, we introduce a new formulation to approximate a sample's influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. Specifically, we formulate the sample-wise influence as the cumulative sum of its changes/differences across successive training iterations. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Despite being a second-order method, Diff-In maintains computational complexity comparable to that of first-order methods and remains scalable. This efficiency is achieved by computing the product of the Hessian and gradient, which can be efficiently approximated using finite differences of first-order gradients. We assess the approximation accuracy of Diff-In both theoretically and empirically. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators. Extensive experiments further confirm its superior performance across multiple benchmark datasets in three data-centric tasks: data cleaning, data deletion, and coreset selection. Notably, our experiments on data pruning for large-scale vision-language pre-training show that Diff-In can scale to millions of data points and outperforms strong baselines.
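
The efficiency claim rests on approximating the Hessian-gradient product with finite differences of first-order gradients, Hv ≈ (g(θ + h v) − g(θ − h v)) / (2h), so a second-order quantity is obtained at first-order cost. On a toy quadratic loss (where the Hessian is known exactly, and which is our own illustrative choice) the approximation can be checked directly:

```python
import numpy as np

# Hessian-vector product via finite differences of first-order
# gradients: H v ≈ (g(θ + h v) − g(θ − h v)) / (2h). This is the trick
# that keeps a second-order influence estimator at first-order cost.

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # toy quadratic loss: L(θ) = ½ θᵀAθ

def grad(theta):
    return A @ theta                      # exact gradient of the quadratic

def hvp_fd(theta, v, h=1e-5):
    return (grad(theta + h * v) - grad(theta - h * v)) / (2 * h)

theta = np.array([0.5, -1.0])
v = grad(theta)                           # direction: the gradient itself
print(np.allclose(hvp_fd(theta, v), A @ v, atol=1e-6))  # → True
```

Only two extra gradient evaluations are needed per Hessian-vector product, which is what lets a Diff-In-style estimator accumulate per-step influence differences at a cost comparable to first-order methods.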

Updated: 2025-08-20 11:59:32

标题: 理解数据影响力的差分逼近

摘要: 数据在人工智能的突破性进展中发挥着至关重要的作用。对数据的定量分析显著促进了模型训练,提高了数据利用的效率和质量。然而,现有的数据分析工具在准确性方面经常滞后。例如,许多这些工具甚至假定神经网络的损失函数是凸的。这些限制使得有效实施当前方法具有挑战性。在本文中,我们介绍了一种新的公式,通过累积连续学习步骤之间的影响差异来近似样本的影响,我们将其称为Diff-In。具体来说,我们将样本影响形式化为其在连续训练迭代中的变化/差异的累积和。通过使用二阶近似,我们可以高精度地近似这些差异项,同时消除了现有方法所需的模型凸性。尽管是一个二阶方法,Diff-In保持了与一阶方法相当的计算复杂度,并且具有可扩展性。这种效率是通过计算Hessian矩阵和梯度的乘积实现的,可以通过一阶梯度的有限差分有效地近似。我们在理论上和实证上评估了Diff-In的近似准确性。我们的理论分析表明,与现有的影响估计器相比,Diff-In实现了显著更低的近似误差。大量实验进一步证实了其在三个数据中心任务(数据清理、数据删除和核心集选择)中跨多个基准数据集的优越性能。值得注意的是,我们对大规模视觉语言预训练的数据修剪实验显示,Diff-In可以扩展到数百万个数据点,并且优于强基线。

更新时间: 2025-08-20 11:59:32

领域: cs.LG,cs.CV

下载: http://arxiv.org/abs/2508.14648v1

OneLoc: Geo-Aware Generative Recommender Systems for Local Life Service

Local life service is a vital scenario in the Kuaishou App, where video recommendation is intrinsically linked with stores' location information. Recommendation in this scenario is therefore challenging, because we must take the user's interests and real-time location into account simultaneously. In the face of such complex scenarios, end-to-end generative recommendation has emerged as a new paradigm, such as OneRec in the short video scenario, OneSug in the search scenario, and EGA in the advertising scenario. However, in local life service, an end-to-end generative recommendation model has not yet been developed, as some key challenges remain to be solved. The first challenge is how to make full use of geographic information. The second is how to balance multiple objectives, including user interests, the distance between the user and stores, and other business objectives. To address these challenges, we propose OneLoc. Specifically, we leverage geographic information from different perspectives: (1) a geo-aware semantic ID incorporates both video and geographic information for tokenization, (2) geo-aware self-attention in the encoder leverages both video location similarity and the user's real-time location, and (3) a neighbor-aware prompt captures rich context information surrounding the user for generation. To balance multiple objectives, we use reinforcement learning and propose two reward functions, i.e., a geographic reward and a GMV reward. With the above design, OneLoc achieves outstanding offline and online performance. In fact, OneLoc has been deployed in the local life service of the Kuaishou App, where it serves 400 million active users daily and has achieved improvements of 21.016% in gross merchandise value (GMV) and 17.891% in order volume.

Updated: 2025-08-20 11:57:48

标题: OneLoc:面向本地生活服务的地理感知生成式推荐系统

摘要: 本地生活服务是快手App中的一个重要场景,视频推荐与商店的位置信息密切相关。因此,在我们的场景中,推荐是具有挑战性的,因为我们需要同时考虑用户的兴趣和实时位置。面对这样复杂的情景,端到端生成式推荐已经成为一种新的范式,如短视频场景中的OneRec,搜索场景中的OneSug,以及广告场景中的EGA。然而,在本地生活服务中,端到端生成式推荐模型尚未开发,因为有一些关键挑战需要解决。第一个挑战是如何充分利用地理信息。第二个挑战是如何平衡多个目标,包括用户兴趣,用户与商店之间的距离,以及其他一些业务目标。为了解决这些挑战,我们提出了OneLoc。具体来说,我们从不同角度利用地理信息:(1)地理感知语义ID将视频和地理信息合并进行标记化,(2)编码器中的地理感知自注意力利用视频位置相似性和用户的实时位置,(3)邻居感知提示捕获围绕用户的丰富上下文信息进行生成。为了平衡多个目标,我们使用强化学习并提出了两个奖励函数,即地理奖励和GMV奖励。通过以上设计,OneLoc取得了出色的离线和在线表现。事实上,OneLoc已经部署在快手App的本地生活服务中。它每天为4亿活跃用户提供服务,GMV和订单数量分别实现了21.016%和17.891%的改善。

更新时间: 2025-08-20 11:57:48

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2508.14646v1

LeanGeo: Formalizing Competitional Geometry problems in Lean

Geometry problems are a crucial testbed for AI reasoning capabilities. Most existing geometry solving systems cannot express problems within a unified framework, thus are difficult to integrate with other mathematical fields. Besides, since most geometric proofs rely on intuitive diagrams, verifying geometry problems is particularly challenging. To address these gaps, we introduce LeanGeo, a unified formal system for formalizing and solving competition-level geometry problems within the Lean 4 theorem prover. LeanGeo features a comprehensive library of high-level geometric theorems with Lean's foundational logic, enabling rigorous proof verification and seamless integration with Mathlib. We also present LeanGeo-Bench, a formal geometry benchmark in LeanGeo, comprising problems from the International Mathematical Olympiad (IMO) and other advanced sources. Our evaluation demonstrates the capabilities and limitations of state-of-the-art Large Language Models on this benchmark, highlighting the need for further advancements in automated geometric reasoning. We open source the theorem library and the benchmark of LeanGeo at https://github.com/project-numina/LeanGeo/tree/master.

Updated: 2025-08-20 11:55:19

标题: LeanGeo: 在Lean中形式化竞技几何问题

摘要: 几何问题是评估人工智能推理能力的重要试验场。大多数现有的几何解决系统无法在统一框架内表达问题,因此很难与其他数学领域集成。此外,由于大多数几何证明依赖直观的图表,验证几何问题尤为具有挑战性。为了解决这些差距,我们引入了LeanGeo,这是一个在Lean 4定理证明器中形式化和解决竞赛级几何问题的统一形式系统。LeanGeo拥有一个包含高级几何定理的全面库,结合Lean的基础逻辑,实现了严格的证明验证和与Mathlib的无缝集成。我们还推出了LeanGeo-Bench,在LeanGeo中的一个形式几何基准,包括来自国际数学奥林匹克竞赛(IMO)和其他高级来源的问题。我们的评估展示了现有最先进大型语言模型在这一基准上的能力和局限性,突出了自动几何推理领域进一步发展的需求。我们在https://github.com/project-numina/LeanGeo/tree/master上开源了LeanGeo的定理库和基准。

更新时间: 2025-08-20 11:55:19

领域: cs.AI

下载: http://arxiv.org/abs/2508.14644v1

MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL

Graph Few-Shot Class-Incremental Learning (GFSCIL) enables models to continually learn from limited samples of novel tasks after initial training on a large base dataset. Existing GFSCIL approaches typically utilize Prototypical Networks (PNs) for metric-based class representations and fine-tune the model during the incremental learning stage. However, these PN-based methods oversimplify learning by fine-tuning on the novel query set and fail to integrate Graph Continual Learning (GCL) techniques due to architectural constraints. To address these challenges, we propose a more rigorous and practical setting for GFSCIL that excludes query sets during the incremental training phase. Building on this foundation, we introduce Model-Agnostic Meta Graph Continual Learning (MEGA), aimed at effectively alleviating catastrophic forgetting in GFSCIL. Specifically, by calculating the incremental second-order gradient during the meta-training stage, we enable the model to learn high-quality priors that enhance incremental learning by aligning its behavior across both the meta-training and incremental learning stages. Extensive experiments on four mainstream graph datasets demonstrate that MEGA achieves state-of-the-art results and enhances the effectiveness of various GCL methods in GFSCIL. We believe that our proposed MEGA serves as a model-agnostic GFSCIL paradigm, paving the way for future research.

Updated: 2025-08-20 11:45:29

标题: MEGA:GFSCIL中的二阶梯度对齐以减轻灾难性遗忘

摘要: Graph Few-Shot Class-Incremental Learning (GFSCIL)使模型能够在初始在大型基础数据集上进行训练后,持续从新任务的有限样本中学习。现有的GFSCIL方法通常利用原型网络(PNs)进行基于度量的类表示,并在增量学习阶段对模型进行微调。然而,这些基于PN的方法通过新领域查询集微调过于简化学习,并且由于架构限制而无法集成图持续学习(GCL)技术。为了解决这些挑战,我们提出了一个更严格和实用的GFSCIL设置,该设置在增量训练阶段不包括查询集。在此基础上,我们引入了面向模型的元图持续学习(MEGA),旨在有效减轻GFSCIL的灾难性遗忘。具体来说,在元训练阶段计算增量二阶梯度,我们赋予模型学习高质量先验的能力,通过将其行为在元训练和增量学习阶段之间进行对齐,增强增量学习。对四个主流图数据集的大量实验证明,MEGA取得了最先进的结果,并增强了GFSCIL中各种GCL方法的有效性。我们相信我们提出的MEGA作为一个面向模型的GFSCIL范式,为未来研究铺平了道路。

更新时间: 2025-08-20 11:45:29

领域: cs.LG

下载: http://arxiv.org/abs/2504.13691v3

Can LLM Agents Solve Collaborative Tasks? A Study on Urgency-Aware Planning and Coordination

The ability to coordinate actions across multiple agents is critical for solving complex, real-world problems. Large Language Models (LLMs) have shown strong capabilities in communication, planning, and reasoning, raising the question of whether they can also support effective collaboration in multi-agent settings. In this work, we investigate the use of LLM agents to solve a structured victim rescue task that requires division of labor, prioritization, and cooperative planning. Agents operate in a fully known graph-based environment and must allocate resources to victims with varying needs and urgency levels. We systematically evaluate their performance using a suite of coordination-sensitive metrics, including task success rate, redundant actions, room conflicts, and urgency-weighted efficiency. This study offers new insights into the strengths and failure modes of LLMs in physically grounded multi-agent collaboration tasks, contributing to future benchmarks and architectural improvements.

Updated: 2025-08-20 11:44:10

标题: LLM代理能否解决协作任务?关于紧急性感知规划和协调的研究

摘要: 协调多个代理在解决复杂实际问题中至关重要。大型语言模型(LLMs)在沟通、规划和推理方面表现出强大的能力,引发了一个问题,即它们是否也能支持多代理环境中的有效合作。在这项工作中,我们研究了使用LLM代理来解决需要分工、优先级和合作规划的结构化受害者救援任务。代理在一个完全已知的基于图的环境中运行,并必须将资源分配给具有不同需求和紧急程度的受害者。我们系统评估了它们的表现,使用一系列协调敏感的度量标准,包括任务成功率、冗余行动、房间冲突和紧急程度加权效率。这项研究为理解LLMs在基于物理的多代理协作任务中的优势和失败模式提供了新的见解,有助于未来的基准测试和架构改进。

更新时间: 2025-08-20 11:44:10

领域: cs.RO,cs.AI

下载: http://arxiv.org/abs/2508.14635v1

Clinical semantics for lung cancer prediction

Background: Existing clinical prediction models often represent patient data using features that ignore the semantic relationships between clinical concepts. This study integrates domain-specific semantic information by mapping the SNOMED medical term hierarchy into a low-dimensional hyperbolic space using Poincaré embeddings, with the aim of improving lung cancer onset prediction. Methods: Using a retrospective cohort from the Optum EHR dataset, we derived a clinical knowledge graph from the SNOMED taxonomy and generated Poincaré embeddings via Riemannian stochastic gradient descent. These embeddings were then incorporated into two deep learning architectures, a ResNet and a Transformer model. Models were evaluated for discrimination (area under the receiver operating characteristic curve) and calibration (average absolute difference between observed and predicted probabilities) performance. Results: Incorporating pre-trained Poincaré embeddings resulted in modest and consistent improvements in discrimination performance compared to baseline models using randomly initialized Euclidean embeddings. ResNet models, particularly those using a 10-dimensional Poincaré embedding, showed enhanced calibration, whereas Transformer models maintained stable calibration across configurations. Discussion: Embedding clinical knowledge graphs into hyperbolic space and integrating these representations into deep learning models can improve lung cancer onset prediction by preserving the hierarchical structure of clinical terminologies used for prediction. This approach demonstrates a feasible method for combining data-driven feature extraction with established clinical knowledge.
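
The two hyperbolic ingredients, the Poincaré-ball distance and a Riemannian SGD step that rescales the Euclidean gradient by the inverse metric before retracting into the ball, can be sketched in a few lines. The two example points and the squared-distance toy objective are illustrative assumptions, not SNOMED data or the paper's training objective.

```python
import numpy as np

def poincare_dist(u, v):
    """Distance in the Poincaré ball model of hyperbolic space."""
    uu, vv = u @ u, v @ v
    duv = np.sum((u - v) ** 2)
    return np.arccosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))

def rsgd_step(x, euclid_grad, lr=0.1, eps=1e-5):
    """Riemannian SGD update: rescale the Euclidean gradient by the
    inverse Poincaré metric ((1 − ‖x‖²)/2)², then retract into the ball."""
    scale = ((1 - x @ x) ** 2) / 4.0
    x_new = x - lr * scale * euclid_grad
    norm = np.linalg.norm(x_new)
    if norm >= 1:
        x_new = x_new / norm * (1 - eps)   # project back inside the unit ball
    return x_new

parent = np.array([0.1, 0.0])   # e.g. a broad SNOMED concept (hypothetical)
child = np.array([0.6, 0.3])    # e.g. a specific subtype (hypothetical)

g = 2 * (child - parent)        # Euclidean gradient of a squared-distance toy objective
child_new = rsgd_step(child, g)
print(poincare_dist(parent, child_new) < poincare_dist(parent, child))  # → True
```

Because distances blow up near the boundary, broad concepts settle near the origin and fine-grained descendants push outward, which is how these embeddings preserve the SNOMED hierarchy in few dimensions.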

Updated: 2025-08-20 11:29:47

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.14627v1

Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests

A/B testing is a core tool for decision-making in business experimentation, particularly in digital platforms and marketplaces. Practitioners often prioritize lift in performance metrics while seeking to control the costs of false discoveries. This paper develops a decision-theoretic framework for maximizing expected profit subject to a constraint on the cost-weighted false discovery rate (FDR). We propose an empirical Bayes approach that uses a greedy knapsack algorithm to rank experiments based on the ratio of expected lift to cost, incorporating the local false discovery rate (lfdr) as a key statistic. The resulting oracle rule is valid and rank-optimal. In large-scale settings, we establish the asymptotic validity of a data-driven implementation and demonstrate superior finite-sample performance over existing FDR-controlling methods. An application to A/B tests run on the Optimizely platform highlights the business value of the approach.
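
The greedy selection rule described above can be illustrated with a toy sketch: rank experiments by expected lift per unit cost, then admit them while the running cost-weighted FDR (accumulated from each experiment's lfdr) stays under the budget. The field names and the simple running-average constraint are illustrative assumptions; the paper's oracle rule and lfdr estimation are more involved:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    expected_lift: float  # posterior mean lift
    cost: float           # cost weight of a false discovery
    lfdr: float           # local false discovery rate

def select_experiments(experiments, alpha):
    """Greedy knapsack: rank by lift-to-cost ratio, admit while the
    cumulative cost-weighted FDR stays below the budget alpha."""
    ranked = sorted(experiments, key=lambda e: e.expected_lift / e.cost,
                    reverse=True)
    chosen, fdr_num, fdr_den = [], 0.0, 0.0
    for e in ranked:
        num, den = fdr_num + e.cost * e.lfdr, fdr_den + e.cost
        if num / den <= alpha:
            chosen.append(e)
            fdr_num, fdr_den = num, den
    return chosen
```

Note that an experiment with a high lift ratio but a large lfdr can be skipped while a later, safer one is still admitted, which is the sense in which the rule trades off profit against the cost of false discoveries.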

Updated: 2025-08-20 11:28:11

Subjects: stat.ME,cs.LG,stat.AP,stat.ML

Download: http://arxiv.org/abs/2407.01036v3

Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering

Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.
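
The Shannon and Simpson indices used as ground truth here are standard diversity measures, computable directly from a genus inventory; a minimal version (using the Gini-Simpson form, 1 minus the sum of squared proportions):

```python
import math
from collections import Counter

def diversity_indices(genus_labels):
    """Shannon and (Gini-)Simpson diversity from a list of genus labels."""
    counts = Counter(genus_labels)
    n = sum(counts.values())
    props = [c / n for c in counts.values()]
    shannon = -sum(p * math.log(p) for p in props)
    simpson = 1.0 - sum(p * p for p in props)
    return shannon, simpson
```

Both indices rise with the number of genera and with how evenly trees are spread among them, which is why recovering their distributions is a meaningful test of the unsupervised clusters.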

Updated: 2025-08-20 11:25:04

Subjects: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.13814v2

A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References

This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both evaluation and training objective in supervised speech separation, when the training references contain noise, as is the case with the de facto benchmark WSJ0-2Mix. A derivation of the SI-SDR with noisy references reveals that noise limits the achievable SI-SDR, or leads to undesired noise in the separated outputs. To address this, a method is proposed to enhance references and augment the mixtures with WHAM!, aiming to train models that avoid learning noisy references. Two models trained on these enhanced datasets are evaluated with the non-intrusive NISQA.v2 metric. Results show reduced noise in separated speech but suggest that processing references may introduce artefacts, limiting overall quality gains. Negative correlation is found between SI-SDR and perceived noisiness across models on the WSJ0-2Mix and Libri2Mix test sets, underlining the conclusion from the derivation.
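
The SI-SDR metric under study has a compact definition: project the estimate onto the reference to get a scaled target, then measure the energy ratio between that target and the residual. A numpy sketch (the zero-mean convention is common but implementation-dependent):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-Invariant Signal-to-Distortion Ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling of the reference toward the estimate
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps)
                           / (np.sum(noise ** 2) + eps))
```

The projection makes the metric blind to overall gain, which is exactly why, as the derivation in the paper shows, noise baked into the reference caps the achievable score rather than being absorbed by rescaling.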

Updated: 2025-08-20 11:22:11

Subjects: eess.AS,cs.AI,cs.SD

Download: http://arxiv.org/abs/2508.14623v1

Redundant feature screening method for human activity recognition based on attention purification mechanism

In the field of sensor-based Human Activity Recognition (HAR), deep neural networks provide advanced technical support. Many studies have proven that recognition accuracy can be improved by increasing the depth or width of the network. However, for wearable devices, the balance between network performance and resource consumption is crucial. With minimum resource consumption as the basic principle, we propose a universal attention feature purification mechanism, called MSAP, which is suitable for multi-scale networks. The mechanism effectively solves the feature redundancy caused by the superposition of multi-scale features by means of an inter-scale attention screening and connection method. In addition, we have designed a network correction module that integrates seamlessly between layers of individual network modules to mitigate inherent problems in deep networks. We also built an embedded deployment system that is in line with the current level of wearable technology to test the practical feasibility of the HAR model, and further demonstrate the efficiency of the method. Extensive experiments on four public datasets show that the proposed model effectively reduces redundant features in filtered data and provides excellent performance with little resource consumption.

Updated: 2025-08-20 11:16:23

Subjects: cs.LG

Download: http://arxiv.org/abs/2503.23537v2

A Fuzzy-Enhanced Explainable AI Framework for Flight Continuous Descent Operations Classification

Continuous Descent Operations (CDO) involve smooth, idle-thrust descents that avoid level-offs, reducing fuel burn, emissions, and noise while improving efficiency and passenger comfort. Despite its operational and environmental benefits, limited research has systematically examined the factors influencing CDO performance. Moreover, many existing methods in related areas, such as trajectory optimization, lack the transparency required in aviation, where explainability is critical for safety and stakeholder trust. This study addresses these gaps by proposing a Fuzzy-Enhanced Explainable AI (FEXAI) framework that integrates fuzzy logic with machine learning and SHapley Additive exPlanations (SHAP) analysis. For this purpose, a comprehensive dataset of 29 features, including 11 operational and 18 weather-related features, was collected from 1,094 flights using Automatic Dependent Surveillance-Broadcast (ADS-B) data. Machine learning models and SHAP were then applied to classify flights' CDO adherence levels and rank features by importance. The three most influential features, as identified by SHAP scores, were then used to construct a fuzzy rule-based classifier, enabling the extraction of interpretable fuzzy rules. All models achieved classification accuracies above 90%, with FEXAI providing meaningful, human-readable rules for operational users. Results indicated that the average descent rate within the arrival route, the number of descent segments, and the average change in directional heading during descent were the strongest predictors of CDO performance. The FEXAI method proposed in this study presents a novel pathway for operational decision support and could be integrated into aviation tools to enable real-time advisories that maintain CDO adherence under varying operational conditions.

Updated: 2025-08-20 11:08:16

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.14618v1

Measuring IIA Violations in Similarity Choices with Bayesian Models

Similarity choice data occur when humans make choices among alternatives based on their similarity to a target, e.g., in the context of information retrieval and in embedding learning settings. Classical metric-based models of similarity choice assume independence of irrelevant alternatives (IIA), a property that allows for a simpler formulation. While IIA violations have been detected in many discrete choice settings, the similarity choice setting has received scant attention. This is because the target-dependent nature of the choice complicates IIA testing. We propose two statistical methods to test for IIA: a classical goodness-of-fit test and a Bayesian counterpart based on the framework of Posterior Predictive Checks (PPC). This Bayesian approach, our main technical contribution, quantifies the degree of IIA violation beyond its mere significance. We curate two datasets: one with choice sets designed to elicit IIA violations, and another with randomly generated choice sets from the same item universe. Our tests confirmed significant IIA violations on both datasets, and notably, we find a comparable degree of violation between them. Further, we devise a new PPC test for population homogeneity. Results show that the population is indeed homogenous, suggesting that the IIA violations are driven by context effects -- specifically, interactions within the choice sets. These results highlight the need for new similarity choice models that account for such context effects.

Updated: 2025-08-20 11:02:26

Subjects: cs.LG,stat.ML,I.2.6

Download: http://arxiv.org/abs/2508.14615v1

The importance of visual modelling languages in generative software engineering

Multimodal GPTs represent a watershed in the interplay between Software Engineering and Generative Artificial Intelligence. GPT-4 accepts image and text inputs, rather than simply natural language. We investigate relevant use cases stemming from these enhanced capabilities of GPT-4. To the best of our knowledge, no other work has investigated similar use cases involving Software Engineering tasks carried out via multimodal GPTs prompted with a mix of diagrams and natural language.

Updated: 2025-08-20 10:59:45

Subjects: cs.SE,cs.AI

Download: http://arxiv.org/abs/2411.17976v4

KIRETT: Knowledge-Graph-Based Smart Treatment Assistant for Intelligent Rescue Operations

Over the years, the need for rescue operations throughout the world has increased rapidly. Demographic changes and the resulting risk of injury or health disorders form the basis for emergency calls. In such scenarios, first responders rush to reach the patient in need, provide first aid, and save lives. They must be able to deliver personalized and optimized healthcare in the shortest possible time and estimate the patient's condition with the help of freshly recorded vital data in an emergency situation. However, in such a time-dependent situation, first responders and medical experts cannot fully draw on their knowledge and need assistance and recommendations for further medical treatment. To achieve this, knowledge calculated, evaluated, and processed on the spot must be made available to improve treatments by first responders. The Knowledge Graph presented in this article serves as a central knowledge representation, providing first responders with innovative knowledge management that enables intelligent treatment recommendations through artificial intelligence-based pre-recognition of the situation.

Updated: 2025-08-20 10:56:09

Subjects: cs.AI,cs.ET

Download: http://arxiv.org/abs/2508.07834v2

Learnable Kernel Density Estimation for Graphs

This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multi-scale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and complexity. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.

Updated: 2025-08-20 10:50:41

Subjects: cs.LG,stat.ML,I.2; I.5.1; I.5.2

Download: http://arxiv.org/abs/2505.21285v2

Comparison of parallel SMC and MCMC for Bayesian deep learning

This work systematically compares parallel implementations of consistent (asymptotically unbiased) Bayesian deep learning algorithms: sequential Monte Carlo sampler (SMC$_\parallel$) or Markov chain Monte Carlo (MCMC$_\parallel$). We provide a proof of convergence for SMC$_\parallel$ showing that it theoretically achieves the same level of convergence as a single monolithic SMC sampler, while the reduced communication lowers wall-clock time. It is well-known that the first samples from MCMC need to be discarded to eliminate initialization bias, and that the number of discarded samples must grow like the logarithm of the number of parallel chains to control that bias for MCMC$_\parallel$. A systematic empirical numerical study on MNIST, CIFAR, and IMDb, reveals that parallel implementations of both methods perform comparably to non-parallel implementations in terms of performance and total cost, and also comparably to each other. However, both methods still require a large wall-clock time, and suffer from catastrophic non-convergence if they aren't run for long enough.

Updated: 2025-08-20 10:50:33

Subjects: stat.ML,cs.LG,stat.CO

Download: http://arxiv.org/abs/2402.06173v3

UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling

Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. For missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features and compensates. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.

Updated: 2025-08-20 10:46:01

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.14604v1

DualNILM: Energy Injection Identification Enabled Disaggregation with Deep Multi-Task Learning

Non-Intrusive Load Monitoring (NILM) offers a cost-effective method to obtain fine-grained appliance-level energy consumption in smart homes and building applications. However, the increasing adoption of behind-the-meter energy sources, such as solar panels and battery storage, poses new challenges for conventional NILM methods that rely solely on at-the-meter data. The injected energy from the behind-the-meter sources can obscure the power signatures of individual appliances, leading to a significant decline in NILM performance. To address this challenge, we present DualNILM, a deep multi-task learning framework designed for the dual tasks of appliance state recognition and injected energy identification in NILM. By integrating sequence-to-point and sequence-to-sequence strategies within a Transformer-based architecture, DualNILM can effectively capture multi-scale temporal dependencies in the aggregate power consumption patterns, allowing for accurate appliance state recognition and energy injection identification. We conduct validation of DualNILM using both self-collected and synthesized open NILM datasets that include both appliance-level energy consumption and energy injection. Extensive experimental results demonstrate that DualNILM maintains an excellent performance for the dual tasks in NILM, much outperforming conventional methods.

Updated: 2025-08-20 10:35:38

Subjects: cs.LG,eess.SP,I.2.6; J.7; I.5.4

Download: http://arxiv.org/abs/2508.14600v1

EoH-S: Evolution of Heuristic Set using LLMs for Automated Heuristic Design

Automated Heuristic Design (AHD) using Large Language Models (LLMs) has achieved notable success in recent years. Despite the effectiveness of existing approaches, they only design a single heuristic to serve all problem instances, often inducing poor generalization across different distributions or settings. To address this issue, we propose Automated Heuristic Set Design (AHSD), a new formulation for LLM-driven AHD. The aim of AHSD is to automatically generate a small-sized complementary heuristic set to serve diverse problem instances, such that each problem instance could be optimized by at least one heuristic in this set. We show that the objective function of AHSD is monotone and supermodular. Then, we propose Evolution of Heuristic Set (EoH-S) to apply the AHSD formulation for LLM-driven AHD. With two novel mechanisms of complementary population management and complementary-aware memetic search, EoH-S could effectively generate a set of high-quality and complementary heuristics. Comprehensive experimental results on three AHD tasks with diverse instances spanning various sizes and distributions demonstrate that EoH-S consistently outperforms existing state-of-the-art AHD methods and achieves up to 60\% performance improvements.

Updated: 2025-08-20 10:33:40

Subjects: cs.AI

Download: http://arxiv.org/abs/2508.03082v2

Towards the Use of Saliency Maps for Explaining Low-Quality Electrocardiograms to End Users

When using medical images for diagnosis, either by clinicians or artificial intelligence (AI) systems, it is important that the images are of high quality. When an image is of low quality, the medical exam that produced the image often needs to be redone. In telemedicine, a common problem is that the quality issue is only flagged once the patient has left the clinic, meaning they must return in order to have the exam redone. This can be especially difficult for people living in remote regions, who make up a substantial portion of the patients at Portal Telemedicina, a digital healthcare organization based in Brazil. In this paper, we report on ongoing work regarding (i) the development of an AI system for flagging and explaining low-quality medical images in real-time, (ii) an interview study to understand the explanation needs of stakeholders using the AI system at OurCompany, and, (iii) a longitudinal user study design to examine the effect of including explanations on the workflow of the technicians in our clinics. To the best of our knowledge, this would be the first longitudinal study on evaluating the effects of XAI methods on end-users -- stakeholders that use AI systems but do not have AI-specific expertise. We welcome feedback and suggestions on our experimental setup.

Updated: 2025-08-20 10:08:27

Subjects: cs.LG,cs.AI,cs.HC,eess.SP

Download: http://arxiv.org/abs/2207.02726v2

An Open-Source HW-SW Co-Development Framework Enabling Efficient Multi-Accelerator Systems

Heterogeneous accelerator-centric compute clusters are emerging as efficient solutions for diverse AI workloads. However, current integration strategies often compromise data movement efficiency and encounter compatibility issues in hardware and software. This prevents a unified approach that balances performance and ease of use. To this end, we present SNAX, an open-source integrated HW-SW framework enabling efficient multi-accelerator platforms through a novel hybrid-coupling scheme, consisting of loosely coupled asynchronous control and tightly coupled data access. SNAX brings reusable hardware modules designed to enhance compute accelerator utilization, and its customizable MLIR-based compiler to automate key system management tasks, jointly enabling rapid development and deployment of customized multi-accelerator compute clusters. Through extensive experimentation, we demonstrate SNAX's efficiency and flexibility in a low-power heterogeneous SoC. Accelerators can easily be integrated and programmed to achieve > 10x improvement in neural network performance compared to other accelerator systems while maintaining accelerator utilization of > 90% in full system operation.

Updated: 2025-08-20 10:04:21

Subjects: cs.AR,cs.AI

Download: http://arxiv.org/abs/2508.14582v1

MetaWild: A Multimodal Dataset for Animal Re-Identification with Environmental Metadata

Identifying individual animals within large wildlife populations is essential for effective wildlife monitoring and conservation efforts. Recent advancements in computer vision have shown promise in animal re-identification (Animal ReID) by leveraging data from camera traps. However, existing Animal ReID datasets rely exclusively on visual data, overlooking environmental metadata that ecologists have identified as highly correlated with animal behavior and identity, such as temperature and circadian rhythms. Moreover, the emergence of multimodal models capable of jointly processing visual and textual data presents new opportunities for Animal ReID, but existing datasets fail to leverage these models' text-processing capabilities, limiting their full potential. Additionally, to facilitate the use of metadata in existing ReID methods, we propose the Meta-Feature Adapter (MFA), a lightweight module that can be incorporated into existing vision-language model (VLM)-based Animal ReID methods, allowing ReID models to leverage both environmental metadata and visual information to improve ReID performance. Experiments on MetaWild show that combining baseline ReID models with MFA to incorporate metadata consistently improves performance compared to using visual information alone, validating the effectiveness of incorporating metadata in re-identification. We hope that our proposed dataset can inspire further exploration of multimodal approaches for Animal ReID.

Updated: 2025-08-20 10:02:32

Subjects: cs.CV,cs.LG

Download: http://arxiv.org/abs/2501.13368v2

A Comprehensive Evaluation of the Sensitivity of Density-Ratio Estimation Based Fairness Measurement in Regression

The prevalence of algorithmic bias in Machine Learning (ML)-driven approaches has inspired growing research on measuring and mitigating bias in the ML domain. Accordingly, prior research studied how to measure fairness in regression which is a complex problem. In particular, recent research proposed to formulate it as a density-ratio estimation problem and relied on a Logistic Regression-driven probabilistic classifier-based approach to solve it. However, there are several other methods to estimate a density ratio, and to the best of our knowledge, prior work did not study the sensitivity of such fairness measurement methods to the choice of underlying density ratio estimation algorithm. To fill this gap, this paper develops a set of fairness measurement methods with various density-ratio estimation cores and thoroughly investigates how different cores would affect the achieved level of fairness. Our experimental results show that the choice of density-ratio estimation core could significantly affect the outcome of fairness measurement method, and even, generate inconsistent results with respect to the relative fairness of various algorithms. These observations suggest major issues with density-ratio estimation based fairness measurement in regression and a need for further research to enhance their reliability.
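
The probabilistic-classifier core the abstract refers to works by fitting a classifier to distinguish samples drawn from the numerator and denominator distributions, then converting its posterior odds back into a density ratio. A minimal sketch, using a logistic model trained with plain gradient descent (the learning rate, step count, and function names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def density_ratio_lr(x_num, x_den, lr=0.1, steps=2000):
    """Estimate r(x) = p_num(x) / p_den(x) with a logistic-regression
    probabilistic classifier: label numerator samples 1, denominator
    samples 0, then convert posterior odds into a density ratio."""
    X = np.vstack([x_num, x_den])
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    y = np.concatenate([np.ones(len(x_num)), np.zeros(len(x_den))])
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)  # mean log-loss gradient
    prior = len(x_num) / len(x_den)       # correct for class sizes
    def ratio(x):
        x = np.hstack([x, np.ones((len(x), 1))])
        p = 1.0 / (1.0 + np.exp(-x @ w))
        return (p / (1.0 - p + 1e-12)) / prior
    return ratio
```

The sensitivity question the paper raises comes from the fact that this logistic core could be swapped for kernel-based or deep estimators of the same ratio, each inducing a different bias in the downstream fairness measure.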

Updated: 2025-08-20 09:54:55

领域: cs.LG

下载: http://arxiv.org/abs/2508.14576v1

Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning

One of the main challenges in neural sign language production (SLP) lies in the high intra-class variability of signs, arising from signer morphology and stylistic variety in the training data. To improve robustness to such variations, we propose two enhancements to the standard Progressive Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity, aiming to filter out anatomical and stylistic features that do not convey relevant semantic information. On the Phoenix14T dataset, the contrastive loss alone yields a 16% improvement in Probability of Correct Keypoint over the PT baseline. When combined with quaternion-based pose encoding, the model achieves a 6% reduction in Mean Bone Angle Error. These results point to the benefit of incorporating skeletal structure modeling and semantically guided contrastive objectives on sign pose representations into the training of Transformer-based SLP models.
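The quaternion-space pose encoding trained with a geodesic loss can be sketched as follows; the loss below is the standard geodesic angle between unit quaternions, and the names are illustrative rather than taken from the paper's code:

```python
import math

def normalize(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def geodesic_loss(q1, q2):
    """Geodesic angle between two unit quaternions (w, x, y, z).
    The absolute value handles the double cover: q and -q encode the same rotation."""
    dot = abs(sum(a * b for a, b in zip(normalize(q1), normalize(q2))))
    dot = min(1.0, dot)               # guard against rounding error
    return 2.0 * math.acos(dot)

identity = (1.0, 0.0, 0.0, 0.0)
# 90-degree bone rotation about the z-axis: (cos 45°, 0, 0, sin 45°)
rot90_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
```

Unlike a Euclidean loss on joint coordinates, this penalty measures angular error directly on the rotation manifold, which is what makes joint movements cleaner to supervise.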

Updated: 2025-08-20 09:52:51

Subjects: cs.CL,cs.LG

Download: http://arxiv.org/abs/2508.14574v1

STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples

Evaluating large language models (LLMs) has become increasingly challenging as model capabilities advance rapidly. While recent models often achieve higher scores on standard benchmarks, these improvements do not consistently reflect enhanced real-world reasoning capabilities. Moreover, widespread overfitting to public benchmarks and the high computational cost of full evaluations have made it both expensive and less effective to distinguish meaningful differences between models. To address these challenges, we propose the \textbf{S}tructured \textbf{T}ransition \textbf{E}valuation \textbf{M}ethod (STEM), a lightweight and interpretable evaluation framework for efficiently estimating the relative capabilities of LLMs. STEM identifies \textit{significant transition samples} (STS) by analyzing consistent performance transitions among LLMs of the same architecture but varying parameter scales. These samples enable STEM to effectively estimate the capability position of an unknown model. The Qwen3 model family is applied to construct the STS pool on six diverse and representative benchmarks to assess generalizability. Experimental results indicate that STEM reliably captures performance trends and aligns with ground-truth rankings of model capability. These findings highlight STEM as a practical and scalable method for fine-grained, architecture-agnostic evaluation of LLMs.
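The core idea, keeping only samples whose correctness flips exactly once as model scale grows and then locating an unknown model by how many of those samples it solves, can be sketched as follows (the results matrix is toy data, not from the paper):

```python
def is_transition_sample(outcomes):
    """True if correctness over models ordered small -> large looks like
    0...0 1...1 with exactly one 0 -> 1 transition."""
    s = "".join(str(int(o)) for o in outcomes)
    return "0" in s and "1" in s and s == "0" * s.count("0") + "1" * s.count("1")

# rows: benchmark samples; columns: same-architecture models, small -> large
results = [
    [0, 0, 1, 1],   # significant transition sample (STS)
    [0, 1, 1, 1],   # STS
    [1, 1, 1, 1],   # always solved: uninformative
    [0, 1, 0, 1],   # inconsistent: discarded
    [0, 0, 0, 1],   # STS
]
sts_rows = [i for i, row in enumerate(results) if is_transition_sample(row)]

def estimate_position(unknown_correct):
    """Score an unknown model by how many STS it solves; harder STS
    (later transitions) being solved implies a higher capability position."""
    return sum(unknown_correct[i] for i in sts_rows)
```

A sample like `[0, 1, 0, 1]` is discarded because its flips are inconsistent with scale, which is what keeps the pool small and interpretable.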

Updated: 2025-08-20 09:52:00

Subjects: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.12096v2

Leuvenshtein: Efficient FHE-based Edit Distance Computation with Single Bootstrap per Cell

This paper presents a novel approach to calculating the Levenshtein (edit) distance within the framework of Fully Homomorphic Encryption (FHE), specifically targeting third-generation schemes like TFHE. Edit distance computations are essential in applications across finance and genomics, such as DNA sequence alignment. We introduce an optimised algorithm, called Leuvenshtein, that significantly reduces the cost of edit distance calculations. This algorithm specifically reduces the number of programmable bootstraps (PBS) needed per cell of the calculation, lowering it from approximately 94 operations -- required by the conventional Wagner-Fisher algorithm -- to just 1. Additionally, we propose an efficient method for performing equality checks on characters, reducing ASCII character comparisons to only 2 PBS operations. Finally, we explore the potential for further performance improvements by utilising preprocessing when one of the input strings is unencrypted. Our Leuvenshtein algorithm achieves up to $278\times$ faster performance compared to the best available TFHE implementation and up to $39\times$ faster than an optimised implementation of the Wagner-Fisher algorithm. Moreover, when offline preprocessing is possible due to the presence of one unencrypted input on the server side, an additional $3\times$ speedup can be achieved.
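For reference, the plaintext Wagner-Fisher dynamic program fills an (m+1)×(n+1) table cell by cell; it is the per-cell work of this recurrence that Leuvenshtein compresses to a single bootstrap under FHE. A plaintext sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Wagner-Fisher edit distance with a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1          # the equality check costing 2 PBS under TFHE
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]
```

Under FHE every `min` and equality above becomes a ciphertext operation, which is why reducing the per-cell bootstrap count from ~94 to 1 dominates the speedup.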

Updated: 2025-08-20 09:40:06

Subjects: cs.CR,E.3

Download: http://arxiv.org/abs/2508.14568v1

Cooperative SGD with Dynamic Mixing Matrices

One of the most common methods to train machine learning algorithms today is the stochastic gradient descent (SGD). In a distributed setting, SGD-based algorithms have been shown to converge theoretically under specific circumstances. A substantial number of works in the distributed SGD setting assume a fixed topology for the edge devices. These papers also assume that the contribution of nodes to the global model is uniform. However, experiments have shown that such assumptions are suboptimal and a non-uniform aggregation strategy coupled with a dynamically shifting topology and client selection can significantly improve the performance of such models. This paper details a unified framework that covers several Local-Update SGD-based distributed algorithms with dynamic topologies and provides improved or matching theoretical guarantees on convergence compared to existing work.
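A minimal sketch of local-update SGD with a dynamically shifting, doubly stochastic mixing matrix: each node takes a local gradient step on its own quadratic objective, then gossips with neighbours whose identity changes every round. The topologies and constants are illustrative, not from the paper:

```python
import numpy as np

def ring_matrix(n, shift):
    """Doubly stochastic mixing matrix for a ring whose neighbours depend on `shift`."""
    W = 0.5 * np.eye(n)
    for i in range(n):
        W[i, (i + shift) % n] += 0.25
        W[i, (i - shift) % n] += 0.25
    return W

targets = np.array([1.0, 2.0, 3.0, 4.0])   # node i minimises (x - targets[i])^2 / 2
x = np.zeros(4)
lr = 0.05
for t in range(400):
    x = x - lr * (x - targets)             # one local SGD step per node
    W = ring_matrix(4, shift=1 + t % 2)    # dynamic topology: neighbours alternate
    x = W @ x                              # mixing / aggregation step

# all nodes should settle near the minimiser of the average objective, 2.5
consensus_error = float(np.max(np.abs(x - targets.mean())))
```

Because every mixing matrix is doubly stochastic, the average of the node parameters follows plain gradient descent on the average objective, while the alternating topologies contract disagreement between nodes.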

Updated: 2025-08-20 09:37:07

Subjects: cs.LG,cs.DC

Download: http://arxiv.org/abs/2508.14565v1

Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs

Recent advances in large language models (LLMs) and reasoning frameworks have opened new possibilities for improving the perspective-taking capabilities of autonomous agents. However, tasks that involve active perception, collaborative reasoning, and perspective taking (understanding what another agent can see or knows) pose persistent challenges for current LLM-based systems. This study investigates the potential of structured examples derived from transformed solution graphs generated by the Fast Downward planner to improve the performance of LLM-based agents within a ReAct framework. We propose a structured solution-processing pipeline that generates three distinct categories of examples: optimal goal paths (G-type), informative node paths (E-type), and step-by-step optimal decision sequences contrasting alternative actions (L-type). These solutions are further converted into ``thought-action'' examples by prompting an LLM to explicitly articulate the reasoning behind each decision. While L-type examples slightly reduce clarification requests and overall action steps, they do not yield consistent improvements. Agents are successful in tasks requiring basic attentional filtering but struggle in scenarios that require mentalising about occluded spaces or weighing the costs of epistemic actions. These findings suggest that structured examples alone are insufficient for robust perspective-taking, underscoring the need for explicit belief tracking, cost modelling, and richer environments to enable socially grounded collaboration in LLM-based agents.

Updated: 2025-08-20 09:36:53

Subjects: cs.AI,cs.CL,cs.HC,I.2.9; I.2.10; I.2.7; J.4

Download: http://arxiv.org/abs/2508.14564v1

Generalizable Spectral Embedding with an Application to UMAP

Spectral Embedding (SE) is a popular method for dimensionality reduction, applicable across diverse domains. Nevertheless, its current implementations face three prominent drawbacks that curtail its broader applicability: generalizability (i.e., out-of-sample extension), scalability, and eigenvector separation. Existing SE implementations often address two of these drawbacks; however, they fall short in addressing the remaining one. In this paper, we introduce Sep-SpectralNet (eigenvector-separated SpectralNet), a SE implementation designed to address all three limitations. Sep-SpectralNet extends SpectralNet with an efficient post-processing step to achieve eigenvector separation, while ensuring both generalizability and scalability. This method expands the applicability of SE to a wider range of tasks and can enhance its performance in existing applications. We empirically demonstrate Sep-SpectralNet's ability to consistently approximate and generalize SE, while maintaining SpectralNet's scalability. Additionally, we show how Sep-SpectralNet can be leveraged to enable generalizable UMAP visualization. Our code is publicly available.
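The object being approximated and generalized here is ordinary spectral embedding: the eigenvectors of the graph Laplacian for the smallest non-zero eigenvalues. A minimal numpy sketch on a toy two-cluster graph (the graph itself is an illustrative example):

```python
import numpy as np

def spectral_embedding(A, dim):
    """Embed nodes with the Laplacian eigenvectors for the smallest
    non-zero eigenvalues (the quantities Sep-SpectralNet approximates with a network)."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    return eigvecs[:, 1:1 + dim]               # skip the constant eigenvector

# two 4-cliques joined by a single weak edge
A = np.zeros((8, 8))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[3, 4] = A[4, 3] = 0.1

emb = spectral_embedding(A, dim=1)[:, 0]
# the Fiedler vector separates the two cliques by sign
signs_left = np.sign(emb[:4])
signs_right = np.sign(emb[4:])
```

Out-of-sample extension is exactly what this eigendecomposition cannot do on its own: a new node changes `L`, which is why a parametric approximator such as Sep-SpectralNet is attractive.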

Updated: 2025-08-20 09:31:59

Subjects: cs.LG,stat.ML

Download: http://arxiv.org/abs/2501.11305v2

Causal Mechanism Estimation in Multi-Sensor Systems Across Multiple Domains

To gain deeper insights into a complex sensor system through the lens of causality, we present common and individual causal mechanism estimation (CICME), a novel three-step approach to inferring causal mechanisms from heterogeneous data collected across multiple domains. By leveraging the principle of Causal Transfer Learning (CTL), CICME is able to reliably detect domain-invariant causal mechanisms when provided with sufficient samples. The identified common causal mechanisms are further used to guide the estimation of the remaining causal mechanisms in each domain individually. The performance of CICME is evaluated on linear Gaussian models under scenarios inspired by a manufacturing process. Building upon existing continuous optimization-based causal discovery methods, we show that CICME leverages the benefits of applying causal discovery on the pooled data and repeatedly on data from individual domains, and it even outperforms both baseline methods under certain scenarios.

Updated: 2025-08-20 09:29:46

Subjects: cs.LG,stat.ML

Download: http://arxiv.org/abs/2507.17792v4

Improving OCR using internal document redundancy

Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this aim, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
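At its core the method extends a plain Gaussian mixture fitted by EM over character-shape features. A minimal 1-D EM sketch, without the paper's intra-cluster realignment and normality-testing steps, with invented "glyph width" data:

```python
import numpy as np

def em_gmm_1d(x, iters=100):
    """Vanilla EM for a two-component 1-D Gaussian mixture: the starting point
    that the paper augments with realignment and normality tests."""
    mu = np.array([x.min(), x.max()])   # deterministic, spread-out initialisation
    sigma = np.full(2, x.std())
    pi = np.full(2, 0.5)
    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return np.sort(mu)

rng = np.random.default_rng(1)
# two "character shape" clusters, e.g. widths of two glyph classes
x = np.concatenate([rng.normal(0.0, 0.5, 500), rng.normal(5.0, 0.5, 500)])
means = em_gmm_1d(x)
```

In the paper's setting each cluster ideally collects all instances of one character within the document, so correcting an OCR output reduces to reading off the cluster label.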

Updated: 2025-08-20 09:21:43

Subjects: cs.CV,cs.LG,eess.IV

Download: http://arxiv.org/abs/2508.14557v1

Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions

We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB, the best reported to date, and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.

Updated: 2025-08-20 09:19:11

Subjects: cs.SD,cs.AI,eess.AS

Download: http://arxiv.org/abs/2508.14556v1

Towards LLM-generated explanations for Component-based Knowledge Graph Question Answering Systems

Over time, software systems have reached a level of complexity that makes it difficult for their developers and users to explain particular decisions made by them. In this paper, we focus on the explainability of component-based systems for Question Answering (QA). These components often conduct processes driven by AI methods, in which behavior and decisions cannot be clearly explained or justified, such that even for QA experts, interpreting the executed process and its results is hard. To address this challenge, we present an approach that considers the components' input and output data flows as a source for representing the behavior and provide explanations for the components, enabling users to comprehend what happened. In the QA framework used here, the data flows of the components are represented as SPARQL queries (inputs) and RDF triples (outputs). Hence, we are also providing valuable insights on verbalization regarding these data types. In our experiments, the approach generates explanations while following template-based settings (baseline) or via the use of Large Language Models (LLMs) with different configurations (automatic generation). Our evaluation shows that the explanations generated via LLMs achieve high quality and mostly outperform template-based approaches according to the users' ratings. Therefore, it enables us to automatically explain the behavior and decisions of QA components to humans while using RDF and SPARQL as a context for explanations.
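The template-based baseline can be pictured as straightforward verbalization of a component's RDF output triples; a toy sketch in which the predicates, templates, and triples are invented for illustration:

```python
def verbalize_triple(subj, pred, obj):
    """Turn one RDF triple into a sentence via a per-predicate template,
    falling back to plain 'subject predicate object.'"""
    templates = {
        "rdf:type": "{s} was recognized as a {o}.",
        "qa:annotatedBy": "{s} was annotated by the {o} component.",
    }
    tpl = templates.get(pred, "{s} {p} {o}.")
    return tpl.format(s=subj, p=pred, o=obj)

# toy output triples of a named-entity-recognition QA component
triples = [
    ("'Albert Einstein'", "rdf:type", "qa:NamedEntity"),
    ("'Albert Einstein'", "qa:annotatedBy", "NER"),
]
explanation = " ".join(verbalize_triple(*t) for t in triples)
```

The LLM-based variant the paper evaluates replaces these rigid templates with a prompt that receives the same triples (and the input SPARQL query) as context.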

Updated: 2025-08-20 09:14:48

Subjects: cs.SE,cs.AI

Download: http://arxiv.org/abs/2508.14553v1

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks. Specifically, Critique-GRPO improves average pass@1 scores across all compared methods by approximately +4.4% on Qwen2.5-7B-Base and +3.8% on Qwen3-8B. Notably, Critique-GRPO enables effective self-improvement through self-critiquing, achieving significant gains over GRPO, e.g., +16.7% pass@1 improvement on AIME 2024.
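The shaping idea, amplify learning from correct (especially unfamiliar) refinements and penalize incorrect ones, can be sketched as a weight on each refinement's learning signal. The exact function below is an invented illustration, not the paper's formula:

```python
def shaped_weight(correct: bool, familiarity: float,
                  bonus: float = 1.0, penalty: float = 1.0) -> float:
    """Weight applied to a refinement's learning signal.

    familiarity in [0, 1] stands in for how likely the policy already was
    to produce this refinement; correct-but-unfamiliar refinements get the
    largest weight, incorrect ones a negative weight.
    """
    if correct:
        return 1.0 + bonus * (1.0 - familiarity)
    return -penalty

w_unfamiliar_correct = shaped_weight(True, familiarity=0.1)   # 1.9
w_familiar_correct = shaped_weight(True, familiarity=0.9)     # 1.1
w_incorrect = shaped_weight(False, familiarity=0.5)           # -1.0
```

Any monotone function with this ordering serves the same purpose: push probability mass toward corrections the model would not have found on its own, while still penalizing wrong ones.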

Updated: 2025-08-20 09:10:05

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2506.03106v5

Enhancing Temporal Sensitivity of Large Language Model for Recommendation with Counterfactual Tuning

Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users' historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose \underline{C}ounterfactual \underline{E}nhanced \underline{T}emporal Framework for LLM-Based \underline{Rec}ommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs' awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items). Extensive experiments on real-world datasets demonstrate the effectiveness of our CETRec. Our code is available at https://anonymous.4open.science/r/CETRec-B9CE/.

Updated: 2025-08-20 09:09:56

Subjects: cs.CL,cs.AI,cs.IR

Download: http://arxiv.org/abs/2507.03047v2

Poisson Midpoint Method for Log Concave Sampling: Beyond the Strong Error Lower Bounds

We study the problem of sampling from strongly log-concave distributions over $\mathbb{R}^d$ using the Poisson midpoint discretization (a variant of the randomized midpoint method) for overdamped/underdamped Langevin dynamics. We prove its convergence in the 2-Wasserstein distance ($W_2$), achieving a cubic speedup in dependence on the target accuracy ($\epsilon$) over the Euler-Maruyama discretization, surpassing existing bounds for randomized midpoint methods. Notably, in the case of underdamped Langevin dynamics, we demonstrate the complexity of $W_2$ convergence is much smaller than the complexity lower bounds for convergence in $L^2$ strong error established in the literature.

Updated: 2025-08-20 09:06:53

Subjects: math.PR,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2506.07614v3

Adaptively Robust LLM Inference Optimization under Prediction Uncertainty

We study the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. LLM inference is an online, multi-task, and heavily energy-consuming service process in which a pre-trained LLM processes input requests and generates output tokens sequentially. Therefore, it is vital to improve its scheduling efficiency and reduce the power consumption while a large number of prompt requests are arriving. A key challenge in LLM inference scheduling is that while the prompt length is known upon arrival, the output length, which critically impacts memory usage and processing time, is unknown. To address this uncertainty, we propose algorithms that leverage machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request. We first design a conservative algorithm, $\mathcal{A}_{\max}$, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to potential overestimation. To overcome this limitation, we propose $\mathcal{A}_{\min}$, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inferencing. We prove that $\mathcal{A}_{\min}$ achieves a log-scale competitive ratio. Through numerical simulations, we demonstrate that $\mathcal{A}_{\min}$ often performs nearly as well as the hindsight scheduler, highlighting both its efficiency and robustness in practical scenarios. Moreover, $\mathcal{A}_{\min}$ relies solely on the lower bound of the prediction interval, an advantageous design choice since upper bounds on output length are typically more challenging to predict accurately.
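The refinement step of $\mathcal{A}_{\min}$, start from the predicted lower bound and grow the estimate while generation keeps running past it, can be sketched with a simple doubling rule; the doubling itself is an illustrative choice, not necessarily the paper's exact update:

```python
def refine_estimate(lo: int, actual: int):
    """Start from the interval's lower bound; whenever generation is still
    running past the current estimate, double it. Returns the final estimate
    and how many refinements were needed."""
    est, refinements = max(1, lo), 0
    while est < actual:          # generation exceeded the reserved budget
        est *= 2                 # refine: reserve more memory / time
        refinements += 1
    return est, refinements

est, steps = refine_estimate(lo=4, actual=13)
```

With this rule the final reservation overshoots the true length by at most a factor of two, and the number of refinements is logarithmic in actual/lo, which is the flavour of guarantee behind the log-scale competitive ratio.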

Updated: 2025-08-20 08:55:26

Subjects: cs.LG,cs.AI,math.OC

Download: http://arxiv.org/abs/2508.14544v1

Post-hoc LLM-Supported Debugging of Distributed Processes

In this paper, we address the problem of manual debugging, which nowadays remains resource-intensive and in some parts archaic. This problem is especially evident in increasingly complex and distributed software systems. Therefore, the objective of this work is to introduce an approach that can possibly be applied to any system, at both the macro- and micro-level, to ease this debugging process. This approach utilizes a system's process data, in conjunction with generative AI, to generate natural-language explanations. These explanations are generated from the actual process data, interface information, and documentation to guide the developers more efficiently to understand the behavior and possible errors of a process and its sub-processes. Here, we present a demonstrator that employs this approach on a component-based Java system. However, our approach is language-agnostic. Ideally, the generated explanations will provide a good understanding of the process, even if developers are not familiar with all the details of the considered system. Our demonstrator is provided as an open-source web application that is freely accessible to all users.

Updated: 2025-08-20 08:45:53

Subjects: cs.SE,cs.AI

Download: http://arxiv.org/abs/2508.14540v1

Near Optimal Non-asymptotic Sample Complexity of 1-Identification

Motivated by an open direction in existing literature, we study the 1-identification problem, a fundamental multi-armed bandit formulation on pure exploration. The goal is to determine whether there exists an arm whose mean reward is at least a known threshold $\mu_0$, or to output None if it believes such an arm does not exist. The agent needs to guarantee its output is correct with probability at least $1-\delta$. Degenne & Koolen (2019) established the asymptotically tight sample complexity for the 1-identification problem, but they commented that the non-asymptotic analysis remains unclear. We design a new algorithm, Sequential-Exploration-Exploitation (SEE), and conduct theoretical analysis from the non-asymptotic perspective. Novel to the literature, we achieve near-optimality, in the sense of matching upper and lower bounds on the pulling complexity. The gap between the upper and lower bounds is up to a polynomial logarithmic factor. Numerical results also indicate the effectiveness of our algorithm compared to existing benchmarks.
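The decision rule of 1-identification can be illustrated with a simple Hoeffding-based sampling loop (not the paper's SEE algorithm): pull arms round-robin, return an arm once its lower confidence bound clears the threshold, and return None once every arm's upper bound falls below it. A toy Bernoulli sketch with illustrative constants:

```python
import math
import random

def one_identification(arms, mu0, delta=0.05, max_pulls=5000):
    """arms: list of Bernoulli success probabilities (the simulator).
    Returns an arm index believed to have mean >= mu0, or None."""
    rng = random.Random(0)
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    alive = set(range(len(arms)))
    for _ in range(max_pulls):
        for i in list(alive):
            counts[i] += 1
            sums[i] += 1.0 if rng.random() < arms[i] else 0.0
            mean = sums[i] / counts[i]
            radius = math.sqrt(math.log(2.0 / delta) / (2.0 * counts[i]))
            if mean - radius >= mu0:       # confidently above the threshold
                return i
            if mean + radius < mu0:        # confidently below: stop pulling it
                alive.discard(i)
        if not alive:
            return None                    # no arm reaches the threshold
    return None

found = one_identification([0.2, 0.8], mu0=0.5)
missing = one_identification([0.1, 0.2], mu0=0.5)
```

The sample-complexity question the paper answers is how few pulls a rule like this fundamentally needs, non-asymptotically, before one of the two stopping conditions fires.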

Updated: 2025-08-20 08:44:30

Subjects: cs.LG,stat.ML

Download: http://arxiv.org/abs/2506.06978v2

FedEve: On Bridging the Client Drift and Period Drift for Cross-device Federated Learning

Federated learning (FL) is a machine learning paradigm that allows multiple clients to collaboratively train a shared model without exposing their private data. Data heterogeneity is a fundamental challenge in FL, which can result in poor convergence and performance degradation. Client drift, resulting from the multiple local updates in FedAvg, has been recognized as one of the factors contributing to this issue. However, in cross-device FL, a different form of drift arises due to the partial client participation, but it has not been well studied. This drift, which we refer to as period drift, occurs because the clients participating in each communication round may exhibit a data distribution that deviates from that of all clients. It could be more harmful than client drift since the optimization objective shifts with every round. In this paper, we investigate the interaction between period drift and client drift, finding that period drift can have a particularly detrimental effect on cross-device FL as the degree of data heterogeneity increases. To tackle these issues, we propose a predict-observe framework and present an instantiated method, FedEve, where these two types of drift can compensate each other to mitigate their overall impact. We provide theoretical evidence that our approach can reduce the variance of model updates. Extensive experiments demonstrate that our method outperforms alternatives on non-iid data in cross-device settings.
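The predict-observe idea can be pictured as the server blending a prediction of the next update (from its own running state) with the noisy aggregate observed from this round's partial cohort. The simple exponential blend below is invented for illustration, not the paper's exact estimator:

```python
def predict_observe(observed_updates, alpha=0.7):
    """Blend each round's observed aggregate with the server's running
    prediction to damp period drift from partial participation."""
    prediction, blended = 0.0, []
    for obs in observed_updates:
        est = alpha * prediction + (1.0 - alpha) * obs   # correct prediction with observation
        blended.append(est)
        prediction = est                                 # next round's prediction
    return blended

true_update = 1.0
# per-round cohort averages oscillate around the truth (period drift)
observed = [true_update + d for d in (0.8, -0.8, 0.8, -0.8, 0.8, -0.8, 0.8, -0.8)]
blended = predict_observe(observed)

raw_err = max(abs(o - true_update) for o in observed)
blend_err = max(abs(b - true_update) for b in blended[2:])   # after warm-up
```

The blend trades a little bias during warm-up for a large reduction in round-to-round variance, which is the mechanism by which the two drifts can partially cancel.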

Updated: 2025-08-20 08:42:34

Subjects: cs.LG,cs.DC

Download: http://arxiv.org/abs/2508.14539v1

Beyond ReLU: Chebyshev-DQN for Enhanced Deep Q-Networks

The performance of Deep Q-Networks (DQN) is critically dependent on the ability of its underlying neural network to accurately approximate the action-value function. Standard function approximators, such as multi-layer perceptrons, may struggle to efficiently represent the complex value landscapes inherent in many reinforcement learning problems. This paper introduces a novel architecture, the Chebyshev-DQN (Ch-DQN), which integrates a Chebyshev polynomial basis into the DQN framework to create a more effective feature representation. By leveraging the powerful function approximation properties of Chebyshev polynomials, we hypothesize that the Ch-DQN can learn more efficiently and achieve higher performance. We evaluate our proposed model on the CartPole-v1 benchmark and compare it against a standard DQN with a comparable number of parameters. Our results demonstrate that the Ch-DQN with a moderate polynomial degree (N=4) achieves significantly better asymptotic performance, outperforming the baseline by approximately 39\%. However, we also find that the choice of polynomial degree is a critical hyperparameter, as a high degree (N=8) can be detrimental to learning. This work validates the potential of using orthogonal polynomial bases in deep reinforcement learning while also highlighting the trade-offs involved in model complexity.
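A Chebyshev feature layer of this kind can be built with the standard three-term recurrence. The minimal sketch below constructs only the polynomial features for a scalar input assumed pre-scaled to [-1, 1]; it is not the full Ch-DQN architecture.

```python
def chebyshev_features(x, degree):
    """Return [T_0(x), ..., T_degree(x)] using the recurrence
    T_0 = 1, T_1 = x, T_k = 2*x*T_{k-1} - T_{k-2}.
    Assumes x has been scaled into [-1, 1]."""
    feats = [1.0, x]
    for _ in range(2, degree + 1):
        feats.append(2 * x * feats[-1] - feats[-2])
    return feats[: degree + 1]
```

In a Ch-DQN-style network these features would replace (or augment) the raw state before the dense layers; the paper's finding that N=4 helps while N=8 hurts suggests the feature dimension is a hyperparameter worth tuning per task.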

Updated: 2025-08-20 08:41:15

标题: 超越ReLU:用于增强深度Q网络的切比雪夫-DQN

摘要: 深度Q网络(DQN)的性能在很大程度上取决于其底层神经网络准确逼近动作值函数的能力。标准的函数逼近器,如多层感知器,可能难以有效地表示许多强化学习问题中固有的复杂价值景观。本文介绍了一种新颖的架构,Chebyshev-DQN(Ch-DQN),该架构将Chebyshev多项式基础集成到DQN框架中,以创建更有效的特征表示。通过利用Chebyshev多项式强大的函数逼近特性,我们假设Ch-DQN可以更有效地学习并实现更高的性能。我们在CartPole-v1基准上评估了我们提出的模型,并与具有可比参数数量的标准DQN进行了比较。我们的结果表明,具有适度多项式次数(N=4)的Ch-DQN实现了显着更好的渐近性能,大约比基准模型优越了39%。然而,我们还发现多项式次数的选择是一个关键的超参数,因为高次数(N=8)可能对学习产生不利影响。这项工作验证了在深度强化学习中使用正交多项式基础的潜力,同时也突出了模型复杂性涉及的权衡。

更新时间: 2025-08-20 08:41:15

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.14536v1

DOPA: Stealthy and Generalizable Backdoor Attacks from a Single Client under Challenging Federated Constraints

Federated Learning (FL) is increasingly adopted for privacy-preserving collaborative training, but its decentralized nature makes it particularly susceptible to backdoor attacks. Existing attack methods, however, often rely on idealized assumptions and fail to remain effective under real-world constraints, such as limited attacker control, non-IID data distributions, and the presence of diverse defense mechanisms. To address this gap, we propose DOPA (Divergent Optimization Path Attack), a novel framework that simulates heterogeneous local training dynamics and seeks consensus across divergent optimization trajectories to craft universally effective and stealthy backdoor triggers. By leveraging consistency signals across simulated paths to guide optimization, DOPA overcomes the challenge of heterogeneity-induced instability and achieves practical attack viability under stringent federated constraints. We validate DOPA on a comprehensive suite of 12 defense strategies, two model architectures (ResNet18/VGG16), two datasets (CIFAR-10/TinyImageNet), and both mild and extreme non-IID settings. Despite operating under a single-client, black-box, and sparsely participating threat model, DOPA consistently achieves high attack success, minimal accuracy degradation, low runtime, and long-term persistence. These results demonstrate a more practical attack paradigm, offering new perspectives for designing robust defense strategies in federated learning systems.
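One way to read "consistency signals across simulated paths" is a sign-agreement filter over per-path gradients. The sketch below is our illustrative interpretation, not the paper's actual optimizer: coordinates whose gradient sign agrees across all simulated training paths are kept, the rest are zeroed.

```python
def consensus_direction(path_grads):
    """Illustrative consistency signal: average each coordinate only when
    its sign agrees across all simulated paths; otherwise zero it out."""
    dim = len(path_grads[0])
    out = []
    for d in range(dim):
        vals = [g[d] for g in path_grads]
        if all(v > 0 for v in vals) or all(v < 0 for v in vals):
            out.append(sum(vals) / len(vals))
        else:
            out.append(0.0)
    return out
```

Updating a trigger only along directions that all simulated heterogeneous paths agree on is one plausible way to obtain a trigger that survives the instability induced by non-IID local training.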

Updated: 2025-08-20 08:39:12

标题: DOPA:在具有挑战性的联合约束下,从单个客户端进行隐蔽且可推广的后门攻击

摘要: 联邦学习(FL)越来越被采用用于隐私保护的协作训练,但其分散性质使其特别容易受到后门攻击的影响。然而,现有的攻击方法往往依赖于理想化的假设,并且在现实世界的限制条件下往往无法保持有效,如受限的攻击者控制、非独立同分布的数据分布以及存在多样化的防御机制。为了弥补这一差距,我们提出了DOPA(Divergent Optimization Path Attack),这是一个模拟异构本地训练动态并寻求在不同优化轨迹上达成共识,以制定普遍有效且隐蔽的后门触发器的新框架。通过利用模拟路径之间的一致性信号来引导优化,DOPA克服了由异质性引起的不稳定性挑战,并在严格的联邦约束条件下实现了实际攻击可行性。我们在12种防御策略、两种模型架构(ResNet18/VGG16)、两个数据集(CIFAR-10/TinyImageNet)以及轻微和极端非独立同分布设置下验证了DOPA。尽管在单客户端、黑盒和稀疏参与的威胁模型下运行,DOPA始终实现了高攻击成功率、最小的准确性降级、低运行时间和长期持续性。这些结果展示了一种更实用的攻击范式,为设计联邦学习系统中强大的防御策略提供了新的视角。

更新时间: 2025-08-20 08:39:12

领域: cs.CR

下载: http://arxiv.org/abs/2508.14530v1

Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN)

Generative AI (GenAI) is expected to play a pivotal role in enabling autonomous optimization in future wireless networks. Within the ORAN architecture, Large Language Models (LLMs) can be specialized to generate xApps and rApps by leveraging specifications and API definitions from the RAN Intelligent Controller (RIC) platform. However, fine-tuning base LLMs for telecom-specific tasks remains expensive and resource-intensive. Retrieval-Augmented Generation (RAG) offers a practical alternative through in-context learning, enabling domain adaptation without full retraining. While traditional RAG systems rely on vector-based retrieval, emerging variants such as GraphRAG and Hybrid GraphRAG incorporate knowledge graphs or dual retrieval strategies to support multi-hop reasoning and improve factual grounding. Despite their promise, these methods lack systematic, metric-driven evaluations, particularly in high-stakes domains such as ORAN. In this study, we conduct a comparative evaluation of Vector RAG, GraphRAG, and Hybrid GraphRAG using ORAN specifications. We assess performance across varying question complexities using established generation metrics: faithfulness, answer relevance, context relevance, and factual correctness. Results show that both GraphRAG and Hybrid GraphRAG outperform traditional RAG. Hybrid GraphRAG improves factual correctness by 8%, while GraphRAG improves context relevance by 11%.

Updated: 2025-08-20 08:37:28

标题: 基准测试向量、图和混合检索增强生成(RAG)管道,用于开放式无线接入网络(ORAN)

摘要: 生成式人工智能(GenAI)被期望在未来无线网络中扮演关键角色,实现自主优化。在ORAN架构中,大型语言模型(LLMs)可以通过利用来自RAN智能控制器(RIC)平台的规范和API定义进行特化,生成xApps和rApps。然而,为电信特定任务微调基础LLMs仍然昂贵且资源密集。检索增强生成(RAG)通过上下文学习提供了一种实际的替代方案,实现领域适应而无需完全重新训练。尽管传统的RAG系统依赖于基于向量的检索,但新兴的变体如GraphRAG和混合图形RAG则整合了知识图或双重检索策略,以支持多跳推理并提高事实基础。尽管具有潜力,但这些方法缺乏系统性、基于度量的评估,特别是在ORAN等高风险领域。在本研究中,我们使用ORAN规范对Vector RAG、GraphRAG和Hybrid GraphRAG进行了比较评估。我们通过已建立的生成度量评估了在不同问题复杂度下的性能:忠实度、答案相关性、上下文相关性和事实正确性。结果显示,GraphRAG和Hybrid GraphRAG均优于传统RAG。Hybrid GraphRAG将事实正确性提高了8%,而GraphRAG将上下文相关性提高了11%。

更新时间: 2025-08-20 08:37:28

领域: cs.AI,cs.DC,cs.ET,cs.NI

下载: http://arxiv.org/abs/2507.03608v2

CoFacS -- Simulating a Complete Factory to Study the Security of Interconnected Production

While the digitization of industrial factories provides tremendous improvements for the production of goods, it also renders such systems vulnerable to serious cyber-attacks. To research, test, and validate security measures protecting industrial networks against such cyber-attacks, the security community relies on testbeds to simulate industrial systems, as utilizing live systems endangers costly components or even human life. However, existing testbeds focus on individual parts of typically complex production lines in industrial factories. Consequently, the impact of cyber-attacks on industrial networks as well as the effectiveness of countermeasures cannot be evaluated in an end-to-end manner. To address this issue and facilitate research on novel security mechanisms, we present CoFacS, the first COmplete FACtory Simulation that replicates an entire production line and affords the integration of real-life industrial applications. To showcase that CoFacS accurately captures real-world behavior, we validate it against a physical model factory widely used in security research. We show that CoFacS has a maximum deviation of 0.11% from the physical reference, which enables us to study the impact of physical attacks or network-based cyber-attacks. Moreover, we highlight how CoFacS enables security research through two case studies surrounding attack detection and the resilience of 5G-based industrial communication against jamming.

Updated: 2025-08-20 08:36:55

标题: CoFacS -- 模拟完整工厂以研究生产互连安全问题

摘要: 工业工厂的数字化虽然为产品生产提供了巨大的改进,但也使这些系统容易受到严重的网络攻击。为了研究、测试和验证保护工业网络免受此类网络攻击的安全措施,安全社区依赖于测试平台来模拟工业系统,因为利用实际系统会危及昂贵的组件甚至人员生命。然而,现有的测试平台主要关注工业工厂中通常复杂生产线的个别部分。因此,无法以端到端的方式评估网络攻击对工业网络的影响以及对策的有效性。为了解决这个问题并促进新型安全机制的研究,我们提出了CoFacS,即第一个完整的工厂模拟,复制整个生产线并集成实际工业应用程序。为了展示CoFacS准确捕捉真实世界行为,我们将其与广泛用于安全研究的物理模型工厂进行验证。我们表明CoFacS与物理参考的最大偏差为0.11%,这使我们能够研究物理攻击或基于网络的网络攻击的影响。此外,我们强调CoFacS如何通过两个围绕攻击检测和基于5G的工业通信抗干扰能力的案例研究促进安全研究。

更新时间: 2025-08-20 08:36:55

领域: cs.CR,cs.NI

下载: http://arxiv.org/abs/2508.14526v1

EffiFusion-GAN: Efficient Fusion Generative Adversarial Network for Speech Enhancement

We introduce EffiFusion-GAN (Efficient Fusion Generative Adversarial Network), a lightweight yet powerful model for speech enhancement. The model integrates depthwise separable convolutions within a multi-scale block to capture diverse acoustic features efficiently. An enhanced attention mechanism with dual normalization and residual refinement further improves training stability and convergence. Additionally, dynamic pruning is applied to reduce model size while maintaining performance, making the framework suitable for resource-constrained environments. Experimental evaluation on the public VoiceBank+DEMAND dataset shows that EffiFusion-GAN achieves a PESQ score of 3.45, outperforming existing models under the same parameter settings.

Updated: 2025-08-20 08:36:43

标题: EffiFusion-GAN:用于语音增强的高效融合生成对抗网络

摘要: 我们介绍了EffiFusion-GAN(高效融合生成对抗网络),这是一个轻量但功能强大的语音增强模型。该模型在多尺度块中集成了深度可分离卷积,以有效地捕获多样化的声学特征。增强的注意机制结合了双重归一化和残差细化,进一步提高了训练稳定性和收敛性。此外,动态剪枝用于减小模型尺寸同时保持性能,使该框架适用于资源受限环境。在公开VoiceBank+DEMAND数据集上的实验评估显示,EffiFusion-GAN实现了3.45的PESQ分数,在相同参数设置下优于现有模型。

更新时间: 2025-08-20 08:36:43

领域: cs.SD,cs.AI,eess.AS

下载: http://arxiv.org/abs/2508.14525v1

Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections

Recently, neural networks have gained attention for creating parametric and invertible multidimensional data projections. Parametric projections allow for embedding previously unseen data without recomputing the projection as a whole, while invertible projections enable the generation of new data points. However, these properties have never been explored simultaneously for arbitrary projection methods. We evaluate three autoencoder (AE) architectures for creating parametric and invertible projections. Based on a given projection, we train AEs to learn a mapping into 2D space and an inverse mapping into the original space. We perform a quantitative and qualitative comparison on four datasets of varying dimensionality and pattern complexity using t-SNE. Our results indicate that AEs with a customized loss function can create smoother parametric and inverse projections than feed-forward neural networks while giving users control over the strength of the smoothing effect.

Updated: 2025-08-20 08:35:22

标题: 评估自编码器在参数化和可逆多维投影中的应用

摘要: 最近,神经网络在创建参数化和可逆的多维数据投影方面引起了关注。参数化投影允许嵌入以前未见过的数据而无需重新计算整个投影,而可逆投影则实现了生成新数据点的功能。然而,这些属性从未同时应用于任意投影方法。我们评估了三种自动编码器(AE)架构,用于创建参数化和可逆的投影。基于给定的投影,我们训练AE学习将数据映射到2D空间和将逆映射到原始空间。我们使用t-SNE在四个不同维度和模式复杂度的数据集上进行定量和定性比较。我们的结果表明,具有自定义损失函数的AE可以比前馈神经网络创建更平滑的参数化和逆向投影,同时使用户能够控制平滑效果的强度。

更新时间: 2025-08-20 08:35:22

领域: cs.LG

下载: http://arxiv.org/abs/2504.16831v2

Boosting Payment Channel Network Liquidity with Topology Optimization and Transaction Selection

Payment channel networks (PCNs) are a promising technology that alleviates blockchain scalability by shifting the transaction load from the blockchain to the PCN. Nevertheless, the network topology has to be carefully designed to maximise the transaction throughput in PCNs. Additionally, users in PCNs also have to make optimal decisions on which transactions to forward and which to reject to prolong the lifetime of their channels. In this work, we consider an input sequence of transactions over $p$ parties. Each transaction consists of a transaction size, source, and target, and can be either accepted or rejected (entailing a cost). The goal is to design a PCN topology among the $p$ cooperating parties, along with the channel capacities, and then output a decision for each transaction in the sequence to minimise the cost of creating and augmenting channels, as well as the cost of rejecting transactions. Our main contribution is an $\mathcal{O}(p)$ approximation algorithm for the problem with $p$ parties. We further show that with some assumptions on the distribution of transactions, we can reduce the approximation ratio to $\mathcal{O}(\sqrt{p})$. We complement our theoretical analysis with an empirical study of our assumptions and approach in the context of the Lightning Network.
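A simple baseline for the accept/reject decision can make the cost model concrete. The greedy rule below is a toy stand-in for the paper's approximation algorithm: accept a transaction whenever the remaining channel balance covers it, otherwise pay the rejection cost.

```python
def process_transactions(capacity, txs, reject_cost=1.0):
    """Greedy baseline (not the paper's algorithm): accept a transaction if
    the remaining channel balance covers its size, else incur reject_cost."""
    balance, cost = capacity, 0.0
    accepted = []
    for size in txs:
        if size <= balance:
            balance -= size
            accepted.append(size)
        else:
            cost += reject_cost
    return accepted, cost
```

The paper's setting is harder than this sketch suggests: channel capacities are themselves design variables, channels form a topology over p parties, and augmenting a channel also has a cost, which is what makes an O(p)-approximation nontrivial.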

Updated: 2025-08-20 08:34:20

标题: 优化拓扑结构和交易选择以提高支付通道网络流动性

Abstract: Payment channel networks (PCNs) are a promising technology that alleviates blockchain scalability by shifting the transaction load from the blockchain to the PCN. Nevertheless, the network topology has to be carefully designed to maximise transaction throughput in PCNs. In addition, users in PCNs must make optimal decisions about which transactions to forward and which to reject in order to prolong the lifetime of their channels. In this work, we consider an input sequence of transactions over $p$ parties. Each transaction consists of a size, a source, and a target, and can be either accepted or rejected (incurring a cost). The goal is to design a PCN topology among the $p$ cooperating parties, along with the channel capacities, and then output a decision for each transaction in the sequence so as to minimise the cost of creating and augmenting channels as well as the cost of rejecting transactions. Our main contribution is an $\mathcal{O}(p)$ approximation algorithm for the problem with $p$ parties. We further show that, under some assumptions on the distribution of transactions, the approximation ratio can be reduced to $\mathcal{O}(\sqrt{p})$. We complement our theoretical analysis with an empirical study of our assumptions and approach in the context of the Lightning Network.

更新时间: 2025-08-20 08:34:20

领域: cs.DC,cs.CR

下载: http://arxiv.org/abs/2508.14524v1

Great GATsBi: Hybrid, Multimodal, Trajectory Forecasting for Bicycles using Anticipation Mechanism

Accurate prediction of road user movement is increasingly required by many applications, ranging from advanced driver assistance systems to autonomous driving, and is especially crucial for road safety. Even though bicycles account for a large share of traffic accident fatalities, they have received little attention, as previous work focused mainly on pedestrians and motorized vehicles. In this work, we present the Great GATsBi, a domain-knowledge-based, hybrid, multimodal trajectory prediction framework for bicycles. The model incorporates both physics-based modeling (inspired by motorized vehicles) and social-based modeling (inspired by pedestrian movements) to explicitly account for the dual nature of bicycle movement. The social interactions are modeled with a graph attention network and include decayed historical, but also anticipated, future trajectory data of a bicycle's neighborhood, following recent insights from psychological and social studies. The results indicate that the proposed ensemble of physics models -- performing well in the short-term predictions -- and social models -- performing well in the long-term predictions -- exceeds state-of-the-art performance. We also conducted a controlled mass-cycling experiment to demonstrate the framework's performance when forecasting bicycle trajectories and modeling social interactions with road users.
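The "decayed historical" weighting of neighbor trajectories can be sketched with a normalised exponential decay; the decay factor and the exact weighting scheme are our assumptions, not the paper's.

```python
def decay_weights(n, gamma=0.8):
    """Exponentially decayed weights for the n most recent trajectory steps,
    newest first, normalised to sum to 1 (illustrative of decayed history)."""
    w = [gamma ** i for i in range(n)]
    s = sum(w)
    return [x / s for x in w]
```

With such weights, recent neighbor positions dominate the social interaction features while older observations fade out smoothly rather than being cut off at a fixed horizon.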

Updated: 2025-08-20 08:31:35

标题: 《伟大的GATsBi:使用预期机制的混合、多模态、自行车轨迹预测》

摘要: 准确预测道路用户的移动越来越受到许多应用程序的需求,从先进的驾驶辅助系统到自动驾驶,尤其对于道路安全至关重要。尽管大多数交通事故的死亡人数都归因于自行车,但它们却受到了很少的关注,因为先前的工作主要集中在行人和机动车辆上。在这项工作中,我们提出了Great GATsBi,这是一个基于领域知识的混合多模态轨迹预测框架,用于自行车。该模型结合了基于物理的建模(受到机动车辆启发)和基于社交的建模(受到行人移动启发),明确考虑了自行车运动的双重性质。社交互动通过图注意力网络建模,包括自行车邻域的衰减历史数据,以及预期的未来轨迹数据,遵循最近心理和社会研究的见解。结果表明,所提出的物理模型集合 - 在短期预测中表现良好 - 和社交模型 - 在长期预测中表现良好 - 超过了现有技术的性能。我们还进行了一项受控的大规模骑行实验,以展示该框架在预测自行车轨迹和建模与道路用户的社交互动时的性能。

更新时间: 2025-08-20 08:31:35

领域: cs.LG

下载: http://arxiv.org/abs/2508.14523v1

Markov Chain-based Model of Blockchain Radio Access Networks

Security has always been a priority for researchers, service providers, and network operators when it comes to radio access networks (RAN). One wireless access approach that has captured attention is blockchain-enabled RAN (B-RAN) due to its secure nature. This research introduces a framework that integrates blockchain technology into RAN while also addressing the limitations of state-of-the-art models. The proposed framework utilizes queuing and Markov chain theory to model key aspects of B-RAN. An extensive evaluation of the model's performance is provided, including an analysis of timing factors and a focused assessment of its security aspects. The results demonstrate reduced latency and comparable security, making the presented framework suitable for diverse application scenarios.
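The core quantity in a Markov chain model of this kind is the stationary distribution of the chain's states (e.g. pending, confirmed, serviced requests). A minimal power-iteration sketch, with a made-up two-state transition matrix, looks like this:

```python
def stationary(P, iters=500):
    """Stationary distribution of a row-stochastic transition matrix P,
    computed by repeated multiplication (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi
```

From the stationary distribution one can read off long-run occupancy of each state, which feeds directly into latency estimates such as mean time spent waiting for block confirmation.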

Updated: 2025-08-20 08:28:30

标题: 马尔可夫链基于的区块链无线接入网络模型

摘要: 安全一直是研究人员、服务提供商和网络运营商在无线接入网络(RAN)方面的优先考虑。一种引起关注的无线接入方法是基于区块链的RAN(B-RAN),因其安全性而备受关注。本研究介绍了一个将区块链技术整合到RAN中的框架,同时解决了现有模型的局限性。所提出的框架利用排队和马尔可夫链理论来建模B-RAN的方面。提供了对模型性能的全面评估,包括对时序因素的分析以及对其安全性方面的专注评估。结果显示降低了延迟并具有可比较的安全性,使所提出的框架适用于各种应用场景。

更新时间: 2025-08-20 08:28:30

领域: eess.SY,cs.CR,cs.SY

下载: http://arxiv.org/abs/2508.14519v1

Hands-On: Segmenting Individual Signs from Continuous Sequences

This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
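Decoding frame-level BIO tags back into sign segments is the inverse of the labeling scheme described above. A minimal decoder (tag names and the end-exclusive convention are our choices):

```python
def bio_to_segments(tags):
    """Decode frame-level B/I/O tags into (start, end) segments, end exclusive.
    'B' opens a segment, 'I' continues it, 'O' closes any open segment."""
    segments, start = [], None
    for t, tag in enumerate(tags):
        if tag == "B":
            if start is not None:       # back-to-back signs: close previous
                segments.append((start, t))
            start = t
        elif tag == "O":
            if start is not None:
                segments.append((start, t))
                start = None
        # "I" extends the currently open segment
    if start is not None:               # segment running to the sequence end
        segments.append((start, len(tags)))
    return segments
```

The explicit "B" tag is what lets the scheme separate two adjacent signs with no "O" frame between them, which a binary sign/no-sign labeling could not do.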

Updated: 2025-08-20 08:22:19

标题: 动手操作:从连续序列中分割单个符号

摘要: 这项工作应对了持续手语分割的挑战,这是手语翻译和数据注释的关键任务,具有巨大的影响。我们提出了一种基于转换器的架构,模拟了手语的时间动态,并将分割帧作为一个序列标记问题,使用Begin-In-Out(BIO)标记方案。我们的方法利用了HaMeR手部特征,并补充了3D角度。广泛的实验表明,我们的模型在DGS语料库上取得了最先进的结果,而我们的特征超过了BSLCorpus上的先前基准。

更新时间: 2025-08-20 08:22:19

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2504.08593v4

MISS: Multi-Modal Tree Indexing and Searching with Lifelong Sequential Behavior for Retrieval Recommendation

Large-scale industrial recommendation systems typically employ a two-stage paradigm of retrieval and ranking to handle huge amounts of information. Recent research focuses on improving the performance of retrieval model. A promising way is to introduce extensive information about users and items. On one hand, lifelong sequential behavior is valuable. Existing lifelong behavior modeling methods in ranking stage focus on the interaction of lifelong behavior and candidate items from retrieval stage. In retrieval stage, it is difficult to utilize lifelong behavior because of a large corpus of candidate items. On the other hand, existing retrieval methods mostly relay on interaction information, potentially disregarding valuable multi-modal information. To solve these problems, we represent the pioneering exploration of leveraging multi-modal information and lifelong sequence model within the advanced tree-based retrieval model. We propose Multi-modal Indexing and Searching with lifelong Sequence (MISS), which contains a multi-modal index tree and a multi-modal lifelong sequence modeling module. Specifically, for better index structure, we propose multi-modal index tree, which is built using the multi-modal embedding to precisely represent item similarity. To precisely capture diverse user interests in user lifelong sequence, we propose collaborative general search unit (Co-GSU) and multi-modal general search unit (MM-GSU) for multi-perspective interests searching.

Updated: 2025-08-20 08:22:02

标题: MISS:多模态树索引与检索,以终身顺序行为进行检索推荐

摘要: 大规模工业推荐系统通常采用检索和排名的两阶段范式来处理大量信息。最近的研究集中在改善检索模型的性能上。一种有前景的方法是引入有关用户和物品的广泛信息。一方面,终身序列行为是有价值的。现有的终身行为建模方法在排名阶段侧重于终身行为与检索阶段的候选物品的交互。在检索阶段,由于大量候选物品的语料库,利用终身行为是困难的。另一方面,现有的检索方法主要依赖于交互信息,可能忽视有价值的多模态信息。为了解决这些问题,我们提出了在先进的基于树的检索模型中利用多模态信息和终身序列模型的开拓性探索。我们提出了包含多模态索引树和多模态终身序列建模模块的终身序列(MISS)的多模态索引和搜索。具体而言,为了更好的索引结构,我们提出了多模态索引树,它是使用多模态嵌入构建的,以精确表示物品的相似性。为了精确捕获用户终身序列中的多样化兴趣,我们提出了协作一般搜索单元(Co-GSU)和多模态一般搜索单元(MM-GSU)进行多视角兴趣搜索。

更新时间: 2025-08-20 08:22:02

领域: cs.IR,cs.AI

下载: http://arxiv.org/abs/2508.14515v1

BadBlocks: Low-Cost and Stealthy Backdoor Attacks Tailored for Text-to-Image Diffusion Models

In recent years, diffusion models have achieved remarkable progress in the field of image generation. However, recent studies have shown that diffusion models are susceptible to backdoor attacks, in which attackers can manipulate the output by injecting covert triggers, such as specific visual patterns or textual phrases, into the training dataset. Fortunately, with the continuous advancement of defense techniques, defenders have become increasingly capable of identifying and mitigating most backdoor attacks using visual inspection and neural network-based detection methods. However, in this paper, we identify a novel type of backdoor threat that is more lightweight and covert than existing approaches, which we name BadBlocks. It requires only about 30% of the computational resources and 20% of the GPU time typically needed by previous backdoor attacks, yet it successfully injects backdoors and evades the most advanced defense frameworks. BadBlocks enables attackers to selectively contaminate specific blocks within the UNet architecture of diffusion models while maintaining normal functionality in the remaining components. Experimental results demonstrate that BadBlocks achieves a high attack success rate and low perceptual quality loss, even under extremely constrained computational resources and GPU time. Moreover, BadBlocks is able to bypass existing defense frameworks, especially attention-based backdoor detection methods, highlighting it as a novel and noteworthy threat. Ablation studies further demonstrate that effective backdoor injection does not require fine-tuning the entire network and highlight the pivotal role of certain neural network layers in backdoor mapping. Overall, BadBlocks significantly lowers the barrier to conducting backdoor attacks, enabling attackers to inject backdoors into large-scale diffusion models even with consumer-grade GPUs.
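Selective block fine-tuning boils down to choosing which parameter groups stay trainable. The sketch below filters parameter names by block; the `unet.<block>.` naming scheme is hypothetical, not taken from any specific diffusion library.

```python
def trainable_param_names(all_names, target_blocks):
    """Keep only parameters belonging to the chosen UNet blocks; everything
    else would stay frozen. The naming scheme is a hypothetical example."""
    return [n for n in all_names
            if any(n.startswith(f"unet.{b}.") for b in target_blocks)]
```

In a real training loop the same filter would drive `requires_grad` flags, which is also why the attack is cheap: gradients are computed and stored only for the contaminated blocks.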

Updated: 2025-08-20 08:11:26

标题: 坏块:低成本和隐蔽的后门攻击,专为文本到图像扩散模型定制

摘要: 最近几年,扩散模型在图像生成领域取得了显著进展。然而,最近的研究表明,扩散模型容易受到后门攻击的影响,攻击者可以通过向训练数据集中注入特定的视觉模式或文本短语等隐蔽触发器来操纵输出。幸运的是,随着防御技术的不断发展,防御者已经越来越能够通过视觉检查和基于神经网络的检测方法识别和减轻大多数后门攻击。然而,在本文中,我们确定了一种更轻量级和隐蔽的新型后门威胁,我们称之为BadBlocks,它只需要先前后门攻击通常需要的大约30%的计算资源和20%的GPU时间,但却成功注入后门并避开了最先进的防御框架。BadBlocks使攻击者能够有选择地污染扩散模型UNet架构中的特定块,同时在其余组件中保持正常功能。实验结果表明,BadBlocks在极度受限的计算资源和GPU时间下实现了高攻击成功率和低感知质量损失。此外,BadBlocks能够绕过现有的防御框架,特别是基于注意力的后门检测方法,突出了它作为一种新颖且值得关注的威胁。消融研究进一步表明,有效的后门注入并不需要对整个网络进行微调,并突出了某些神经网络层在后门映射中的关键作用。总的来说,BadBlocks显著降低了进行后门攻击的障碍。它使攻击者能够即使使用消费级GPU也能向大规模扩散模型注入后门。

更新时间: 2025-08-20 08:11:26

领域: cs.CR,cs.CV

下载: http://arxiv.org/abs/2508.03221v3

CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description

Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and cause semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation-Execution Description Language (EDL)-to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages: Text-to-EDL and EDL-to-SQL, leveraging LLMs' strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks-SpiderUnion and BirdUnion-demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
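The schema-retrieval stage can be illustrated with a toy relevance score. The token-overlap ranking below is a deliberately simple stand-in for CRED-SQL's cluster-based retrieval, which would use learned embeddings instead:

```python
def retrieve_schema(nlq, tables, top_k=2):
    """Toy schema retrieval: rank tables by token overlap between the
    question and the table's column names (a stand-in for cluster-based
    embedding retrieval)."""
    q = set(nlq.lower().split())
    scored = sorted(tables.items(),
                    key=lambda kv: -len(q & set(w.lower() for w in kv[1])))
    return [name for name, _ in scored[:top_k]]
```

Whatever the scoring function, the point of the stage is the same: hand the downstream Text-to-EDL step a small, relevant slice of the schema instead of thousands of candidate columns.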

Updated: 2025-08-20 08:11:10

标题: CRED-SQL:通过集群检索和执行描述增强现实世界大规模数据库文本到SQL解析

摘要: 最近对大型语言模型(LLMs)的进展显著提高了文本到SQL系统的准确性。然而,一个关键挑战仍然存在:自然语言问题(NLQs)和它们相应的SQL查询之间的语义不匹配。在大规模数据库中,语义相似的属性会阻碍模式链接和SQL生成过程中的语义漂移,最终降低模型的准确性。为了解决这些挑战,我们引入了CRED-SQL,这是一个针对大型数据库设计的框架,集成了集群检索和执行描述。CRED-SQL首先执行基于集群的大规模模式检索,以精确定位与给定NLQ最相关的表和列,缓解了模式不匹配问题。然后引入了一个中间自然语言表示形式-执行描述语言(EDL)-来弥合NLQs和SQL之间的差距。这种重构将任务分解为两个阶段:文本到EDL和EDL到SQL,利用LLMs的强大一般推理能力,同时减少语义偏差。在两个大型跨领域基准测试-SpiderUnion和BirdUnion上进行的广泛实验表明,CRED-SQL实现了新的最先进性能(SOTA),验证了其有效性和可扩展性。我们的代码可在https://github.com/smduan/CRED-SQL.git上找到。

更新时间: 2025-08-20 08:11:10

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2508.12769v3

AFLoRA: Adaptive Federated Fine-Tuning of Large Language Models with Resource-Aware Low-Rank Adaption

Federated fine-tuning has emerged as a promising approach to adapt foundation models to downstream tasks using decentralized data. However, real-world deployment remains challenging due to the high computational and communication demands of fine-tuning Large Language Models (LLMs) on clients with data and system resources that are heterogeneous and constrained. In such settings, the global model's performance is often bottlenecked by the weakest clients and further degraded by the non-IID nature of local data. Although existing methods leverage parameter-efficient techniques such as Low-Rank Adaptation (LoRA) to reduce communication and computation overhead, they often fail to simultaneously ensure accurate aggregation of low-rank updates and maintain low system costs, thereby hindering overall performance. To address these challenges, we propose AFLoRA, an adaptive and lightweight federated fine-tuning framework for LLMs. AFLoRA decouples shared and client-specific updates to reduce overhead and improve aggregation accuracy, incorporates diagonal matrix-based rank pruning to better utilize local resources, and employs rank-aware aggregation with public data refinement to strengthen generalization under data heterogeneity. Extensive experiments demonstrate that AFLoRA outperforms state-of-the-art methods in both accuracy and efficiency, providing a practical solution for efficient LLM adaptation in heterogeneous real-world environments.
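Diagonal matrix-based rank pruning can be made concrete with the LoRA update itself: the low-rank delta B·diag(g)·A, where a 0/1 gate vector g switches individual ranks off. The plain-list sketch below is our illustration of the mechanism, not AFLoRA's implementation:

```python
def lora_delta(A, B, gate):
    """Low-rank update B @ diag(gate) @ A with plain nested lists.
    A: r rows of length d_in, B: d_out rows of length r, gate: length-r
    0/1 mask; a zero gate entry prunes that rank entirely (illustrative
    of diagonal matrix-based rank pruning)."""
    r, d_in = len(A), len(A[0])
    d_out = len(B)
    out = [[0.0] * d_in for _ in range(d_out)]
    for i in range(d_out):
        for k in range(r):
            g = B[i][k] * gate[k]
            if g == 0.0:
                continue  # pruned rank: contributes nothing
            for j in range(d_in):
                out[i][j] += g * A[k][j]
    return out
```

A weak client can set more gate entries to zero, shrinking both its local compute and the size of the update it uploads, while the server still aggregates all deltas in the same d_out x d_in space.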

Updated: 2025-08-20 08:08:03

标题: AFLoRA:资源感知低秩调整的自适应联邦微调大型语言模型

摘要: 联邦微调已经成为一种有前途的方法,用于通过分散的数据将基础模型适应到下游任务。然而,由于在具有异构和受限制数据和系统资源的客户端上对大型语言模型(LLMs)进行微调所需的计算和通信需求较高,真实世界中的部署仍然具有挑战性。在这种情况下,全局模型的性能通常受到最弱客户端的限制,并且由于本地数据的非IID性质而进一步降低。尽管现有方法利用参数高效的技术(如低秩适应LoRA)来减少通信和计算开销,但它们通常无法同时确保低秩更新的准确聚合并保持低系统成本,从而阻碍整体性能。为了解决这些挑战,我们提出了AFLoRA,一种自适应和轻量级的用于LLMs的联邦微调框架。AFLoRA将共享和客户端特定的更新解耦以减少开销并提高聚合精度,结合基于对角矩阵的秩剪枝以更好地利用本地资源,并利用基于秩的聚合与公共数据细化以增强在数据异构性下的泛化能力。大量实验证明,AFLoRA在准确性和效率方面优于最先进的方法,在实际环境中提供了一种有效的LLM适应解决方案。

更新时间: 2025-08-20 08:08:03

领域: cs.LG

下载: http://arxiv.org/abs/2505.24773v2

TolerantECG: A Foundation Model for Imperfect Electrocardiogram

The electrocardiogram (ECG) is an essential and effective tool for diagnosing heart diseases. However, its effectiveness can be compromised by noise or unavailability of one or more leads of the standard 12-lead recordings, resulting in diagnostic errors or uncertainty. To address these challenges, we propose TolerantECG, a foundation model for ECG signals that is robust to noise and capable of functioning with arbitrary subsets of the standard 12-lead ECG. TolerantECG training combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations alongside their corresponding knowledge-retrieval-based text report descriptions and corrupted or lead-missing signals. Comprehensive benchmarking results demonstrate that TolerantECG consistently ranks as the best or second-best performer across various ECG signal conditions and class levels in the PTB-XL dataset, and achieves the highest performance on the MIT-BIH Arrhythmia Database.
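Training on "corrupted or lead-missing signals" implies an augmentation that drops leads at random. A minimal sketch of such lead dropout (the zero-filling convention and seeded default are our assumptions):

```python
import random

def drop_leads(ecg, keep, rng=None):
    """Zero out all but `keep` randomly chosen leads of a multi-lead
    recording, simulating lead-missing inputs (illustrative augmentation)."""
    rng = rng or random.Random(0)  # seeded default for reproducibility
    kept = set(rng.sample(range(len(ecg)), keep))
    return [lead if i in kept else [0.0] * len(lead)
            for i, lead in enumerate(ecg)]
```

Exposing the encoder to every such subset during training is what lets the model accept arbitrary subsets of the 12 leads at inference time.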

Updated: 2025-08-20 08:07:02

标题: TolerantECG:一种不完美心电图的基础模型

摘要: 心电图(ECG)是诊断心脏疾病的重要有效工具。然而,其效果可能会受到噪音或标准12导联记录中一个或多个导联不可用的影响,导致诊断错误或不确定性。为解决这些挑战,我们提出了TolerantECG,这是一个对噪音具有鲁棒性并能够与标准12导联ECG的任意子集一起工作的基础模型。TolerantECG训练结合了对比和自监督学习框架,同时学习ECG信号表示以及它们对应的基于知识检索的文本报告描述和受损或缺失导联的信号。全面的基准测试结果表明,在PTB-XL数据集中,TolerantECG在各种ECG信号条件和类别水平上始终排名为最佳或次佳表现者,并在MIT-BIH心律失常数据库上取得最佳表现。

更新时间: 2025-08-20 08:07:02

领域: cs.LG,cs.AI,eess.SP

下载: http://arxiv.org/abs/2507.09887v3

Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting

Accurate electricity price forecasting (EPF) is crucial for effective decision-making in power trading on the spot market. While recent advances in generative artificial intelligence (GenAI) and pre-trained large language models (LLMs) have inspired the development of numerous time series foundation models (TSFMs) for time series forecasting, their effectiveness in EPF remains uncertain. To address this gap, we benchmark several state-of-the-art pretrained models--Chronos-Bolt, Chronos-T5, TimesFM, Moirai, Time-MoE, and TimeGPT--against established statistical and machine learning (ML) methods for EPF. Using 2024 day-ahead auction (DAA) electricity prices from Germany, France, the Netherlands, Austria, and Belgium, we generate daily forecasts with a one-day horizon. Chronos-Bolt and Time-MoE emerge as the strongest among the TSFMs, performing on par with traditional models. However, the biseasonal MSTL model, which captures daily and weekly seasonality, stands out for its consistent performance across countries and evaluation metrics, with no TSFM statistically outperforming it.
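The statistical baselines in such a benchmark include seasonal-naive forecasting, which the MSTL model extends with explicit daily and weekly components. A sketch of the seasonal-naive baseline (not the MSTL model itself):

```python
def seasonal_naive(history, season, horizon):
    """Forecast by repeating the last full season of observations --
    a standard EPF baseline (e.g. season=24 for hourly day-ahead prices)."""
    last = history[-season:]
    return [last[h % season] for h in range(horizon)]
```

That day-ahead prices are so strongly bi-seasonal is precisely why a well-tuned MSTL decomposition remains hard for generic time series foundation models to beat.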

Updated: 2025-08-20 07:59:08

标题: 基准测试预训练时间序列模型用于电价预测

摘要: 精确的电价预测(EPF)对于在现货市场上进行有效决策至关重要。虽然最近生成式人工智能(GenAI)和预训练大型语言模型(LLMs)的进展促进了许多时间序列基础模型(TSFMs)的发展,用于时间序列预测,但它们在EPF中的有效性仍然不确定。为了填补这一空白,我们将几种最先进的预训练模型--Chronos-Bolt、Chronos-T5、TimesFM、Moirai、Time-MoE和TimeGPT--与已建立的统计和机器学习(ML)方法进行了EPF的基准测试。使用来自德国、法国、荷兰、奥地利和比利时的2024天提前拍卖(DAA)电价,我们生成了一天的前瞻性预测。Chronos-Bolt和Time-MoE在TSFMs中表现最强,与传统模型表现相当。然而,捕捉每日和每周季节性的双季节性MSTL模型在各国和评估指标中的一致表现脱颖而出,没有TSFM在统计上胜过它。

更新时间: 2025-08-20 07:59:08

领域: cs.LG,cs.AI,q-fin.ST

下载: http://arxiv.org/abs/2506.08113v2

Improving Actor-Critic Training with Steerable Action-Value Approximation Errors

Off-policy actor-critic algorithms have shown strong potential in deep reinforcement learning for continuous control tasks. Their success primarily comes from leveraging pessimistic state-action value function updates, which reduce function approximation errors and stabilize learning. However, excessive pessimism can limit exploration, preventing the agent from effectively refining its policies. Conversely, optimism can encourage exploration but may lead to high-risk behaviors and unstable learning if not carefully managed. To address this trade-off, we propose Utility Soft Actor-Critic (USAC), a novel framework that allows independent, interpretable control of pessimism and optimism for both the actor and the critic. USAC dynamically adapts its exploration strategy based on the uncertainty of critics using a utility function, enabling a task-specific balance between optimism and pessimism. This approach goes beyond binary choices of pessimism or optimism, making the method both theoretically meaningful and practically feasible. Experiments across a variety of continuous control tasks show that adjusting the degree of pessimism or optimism significantly impacts performance. When configured appropriately, USAC consistently outperforms state-of-the-art algorithms, demonstrating its practical utility and feasibility.
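A utility over a critic ensemble can interpolate between pessimism and optimism with a single coefficient. The mean-plus-scaled-spread form below is an illustrative choice, not necessarily USAC's exact utility function:

```python
def utility_value(q_values, beta):
    """Blend ensemble mean and spread: mean + beta * std.
    beta < 0 acts pessimistically (penalise uncertain actions),
    beta > 0 optimistically (seek them out); illustrative utility."""
    n = len(q_values)
    mean = sum(q_values) / n
    var = sum((q - mean) ** 2 for q in q_values) / n
    return mean + beta * var ** 0.5
```

Because actor and critic each get their own beta, the degree of pessimism is a continuous, interpretable dial rather than the binary min/max choice of earlier ensemble methods.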

Updated: 2025-08-20 07:56:10

标题: 利用可转向的动作值逼近误差改进演员-评论家训练

摘要: 离线策略演员-评论者算法在连续控制任务的深度强化学习中展现出强大潜力。它们的成功主要来自于利用悲观的状态-动作价值函数更新,这有助于减少函数逼近误差并稳定学习。然而,过度的悲观主义可能会限制探索能力,阻碍代理有效地优化其策略。相反,乐观主义可以鼓励探索,但如果管理不当可能会导致高风险行为和不稳定的学习。为了解决这种权衡,我们提出了一种新颖的框架:Utility Soft Actor-Critic (USAC),它允许独立、可解释地控制演员和评论者的悲观主义和乐观主义。USAC根据评论者的不确定性利用效用函数动态调整其探索策略,实现乐观主义和悲观主义之间的任务特定平衡。这种方法超越了悲观主义或乐观主义的二元选择,使该方法在理论上具有意义并且在实践上可行。在各种连续控制任务中的实验表明,调整悲观主义或乐观主义的程度显著影响性能。当适当配置时,USAC始终优于最先进的算法,展示了其实用性和可行性。

更新时间: 2025-08-20 07:56:10

领域: cs.LG,stat.ML

下载: http://arxiv.org/abs/2406.03890v2

PB-IAD: Utilizing multimodal foundation models for semantic industrial anomaly detection in dynamic manufacturing environments

The detection of anomalies in manufacturing processes is crucial to ensure product quality and identify process deviations. Statistical and data-driven approaches remain the standard in industrial anomaly detection, yet their adaptability and usability are constrained by the dependence on extensive annotated datasets and limited flexibility under dynamic production conditions. Recent advances in the perception capabilities of foundation models provide promising opportunities for their adaptation to this downstream task. This paper presents PB-IAD (Prompt-based Industrial Anomaly Detection), a novel framework that leverages the multimodal and reasoning capabilities of foundation models for industrial anomaly detection. Specifically, PB-IAD addresses three key requirements of dynamic production environments: data sparsity, agile adaptability, and domain user centricity. In addition to the anomaly detection, the framework includes a prompt template that is specifically designed for iteratively implementing domain-specific process knowledge, as well as a pre-processing module that translates domain user inputs into effective system prompts. This user-centric design allows domain experts to customise the system flexibly without requiring data science expertise. The proposed framework is evaluated by utilizing GPT-4.1 across three distinct manufacturing scenarios, two data modalities, and an ablation study to systematically assess the contribution of semantic instructions. Furthermore, PB-IAD is benchmarked against state-of-the-art methods for anomaly detection such as PatchCore. The results demonstrate superior performance, particularly in data-sparse scenarios and low-shot settings, achieved solely through semantic instructions.

Updated: 2025-08-20 07:53:13

标题: PB-IAD: 在动态制造环境中利用多模式基础模型进行语义工业异常检测

摘要: 在制造过程中检测异常对于确保产品质量和识别工艺偏差至关重要。统计和数据驱动方法仍然是工业异常检测的标准,然而它们的适应性和可用性受到对大量标注数据集的依赖以及在动态生产条件下灵活性有限的制约。最近基础模型感知能力的进步为将其适配到这一下游任务提供了有希望的机会。本文提出了PB-IAD(基于提示的工业异常检测),一种利用基础模型的多模态和推理能力进行工业异常检测的新框架。具体来说,PB-IAD满足动态生产环境的三个关键需求:数据稀疏性、灵活的适应性和以领域用户为中心。除异常检测外,该框架还包括一个专门设计用于迭代融入领域特定工艺知识的提示模板,以及一个将领域用户输入转换为有效系统提示的预处理模块。这种以用户为中心的设计使领域专家无需数据科学专业知识即可灵活定制系统。我们利用GPT-4.1在三种不同的制造场景、两种数据模态下对所提框架进行了评估,并通过一项消融研究系统地评估语义指令的贡献。此外,还将PB-IAD与PatchCore等最先进的异常检测方法进行了基准比较。结果表明,仅通过语义指令即可实现卓越的性能,特别是在数据稀疏场景和低样本设置中。

更新时间: 2025-08-20 07:53:13

领域: cs.CV,cs.AI

下载: http://arxiv.org/abs/2508.14504v1

Artificial Intelligence-Based Multiscale Temporal Modeling for Anomaly Detection in Cloud Services

This study proposes an anomaly detection method based on the Transformer architecture with integrated multiscale feature perception, aiming to address the limitations of temporal modeling and scale-aware feature representation in cloud service environments. The method first employs an improved Transformer module to perform temporal modeling on high-dimensional monitoring data, using a self-attention mechanism to capture long-range dependencies and contextual semantics. Then, a multiscale feature construction path is introduced to extract temporal features at different granularities through downsampling and parallel encoding. An attention-weighted fusion module is designed to dynamically adjust the contribution of each scale to the final decision, enhancing the model's robustness in anomaly pattern modeling. In the input modeling stage, standardized multidimensional time series are constructed, covering core signals such as CPU utilization, memory usage, and task scheduling states, while positional encoding is used to strengthen the model's temporal awareness. A systematic experimental setup is designed to evaluate performance, including comparative experiments and hyperparameter sensitivity analysis, focusing on the impact of optimizers, learning rates, anomaly ratios, and noise levels. Experimental results show that the proposed method outperforms mainstream baseline models in key metrics, including precision, recall, AUC, and F1-score, and maintains strong stability and detection performance under various perturbation conditions, demonstrating its superior capability in complex cloud environments.
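The multiscale path described above, downsampling at several granularities and then fusing per-scale features with attention weights, can be illustrated with a minimal pure-Python sketch; the per-scale features and scores below are hypothetical stand-ins for the learned Transformer encoders:

```python
import math

def downsample(series, factor):
    """Average-pool a 1-D series by the given factor (drops any remainder)."""
    n = len(series) // factor
    return [sum(series[i * factor:(i + 1) * factor]) / factor for i in range(n)]

def attention_fusion(scale_features, scores):
    """Fuse per-scale feature vectors with softmax attention weights."""
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    weights = [e / z for e in exp]
    dim = len(scale_features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, scale_features))
             for d in range(dim)]
    return fused, weights

series = [0.1, 0.2, 0.1, 0.3, 5.0, 0.2, 0.1, 0.2]   # one spike: anomaly candidate
scales = [1, 2, 4]
features = []
for s in scales:
    ds = downsample(series, s)
    features.append([max(ds), sum(ds) / len(ds)])   # [peak, mean] per scale
# score each scale by its peak; the finest scale sees the spike most sharply
fused, weights = attention_fusion(features, scores=[f[0] for f in features])
```

Coarser pooling dilutes the spike, so the attention weights concentrate on the finer scales, which is the dynamic scale-contribution behaviour the abstract describes.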

Updated: 2025-08-20 07:52:36

标题: 基于人工智能的多尺度时间建模用于云服务异常检测

摘要: 这项研究提出了一种基于Transformer架构的异常检测方法,具有集成的多尺度特征感知,旨在解决云服务环境中时间建模和尺度感知特征表示的局限性。该方法首先利用改进的Transformer模块对高维监控数据进行时间建模,使用自注意力机制捕获长距离依赖关系和上下文语义。然后,引入多尺度特征构建路径,通过降采样和并行编码提取不同粒度的时间特征。设计了一个注意力加权融合模块,动态调整每个尺度对最终决策的贡献,增强模型在异常模式建模中的稳健性。在输入建模阶段,构建了标准化的多维时间序列,涵盖了诸如CPU利用率、内存使用和任务调度状态等核心信号,同时使用位置编码加强模型的时间意识。设计了系统化的实验设置来评估性能,包括比较实验和超参数敏感性分析,重点关注优化器、学习率、异常比率和噪声水平的影响。实验结果表明,所提出的方法在关键指标(如精度、召回率、AUC和F1分数)上优于主流基线模型,并在各种扰动条件下保持强大的稳定性和检测性能,展示了其在复杂云环境中的优越能力。

更新时间: 2025-08-20 07:52:36

领域: cs.LG

下载: http://arxiv.org/abs/2508.14503v1

Deep Exploration with PAC-Bayes

Reinforcement learning (RL) for continuous control under delayed rewards is an under-explored problem despite its significance in real-world applications. Many complex skills are based on intermediate ones as prerequisites. For instance, a humanoid locomotor must learn how to stand before it can learn to walk. To cope with delayed reward, an agent must perform deep exploration. However, existing deep exploration methods are designed for small discrete action spaces, and their generalization to state-of-the-art continuous control remains unproven. We address the deep exploration problem for the first time from a PAC-Bayesian perspective in the context of actor-critic learning. To do this, we quantify the error of the Bellman operator through a PAC-Bayes bound, where a bootstrapped ensemble of critic networks represents the posterior distribution, and their targets serve as a data-informed function-space prior. We derive an objective function from this bound and use it to train the critic ensemble. Each critic trains an individual soft actor network, implemented as a shared trunk and critic-specific heads. The agent performs deep exploration by acting epsilon-softly on a randomly chosen actor head. Our proposed algorithm, named {\it PAC-Bayesian Actor-Critic (PBAC)}, is the only algorithm to consistently discover delayed rewards on continuous control tasks with varying difficulty.
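The acting rule described above, epsilon-soft behaviour with respect to a randomly chosen actor head, can be sketched as follows. The two toy heads and their preferences are assumptions; the paper samples one head per episode, while this sketch samples per call for brevity:

```python
import random

def pbac_style_act(actor_heads, state, actions, epsilon, rng):
    """Deep exploration: sample one actor head (a posterior sample from the
    bootstrapped ensemble) and act epsilon-softly with respect to it."""
    head = rng.choice(actor_heads)
    if rng.random() < epsilon:
        return rng.choice(actions)                      # soft random move
    return max(actions, key=lambda a: head(state, a))   # greedy w.r.t. the head

# two hypothetical actor heads with opposing preferences over three actions
heads = [lambda s, a: [0.9, 0.1, 0.0][a],
         lambda s, a: [0.0, 0.2, 0.8][a]]
rng = random.Random(0)
picks = [pbac_style_act(heads, state=None, actions=[0, 1, 2], epsilon=0.1, rng=rng)
         for _ in range(200)]
```

Because different heads are greedy for different actions, the agent keeps visiting both preferred actions, giving the temporally extended exploration that a single epsilon-greedy policy would not.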

Updated: 2025-08-20 07:52:22

标题: 深度探索与PAC-Bayes

摘要: 延迟奖励下的连续控制强化学习(RL)是一个尚未充分探索的问题,尽管在现实世界应用中具有重要意义。许多复杂技能都是以中间技能作为先决条件的。例如,一个人形机器人必须学会站立才能学会行走。为了应对延迟奖励,代理必须进行深度探索。然而,现有的深度探索方法是为小离散动作空间设计的,它们是否适用于最先进的连续控制尚未得到证实。我们首次从PAC-Bayesian角度解决了深度探索问题,并将其应用于演员-评论家学习。为此,我们通过PAC-Bayes边界量化了贝尔曼算子的误差,其中一组自举的评论家网络代表后验分布,它们的目标作为数据驱动的函数空间先验。我们从这个边界中导出一个目标函数,并用它来训练评论者集合。每个评论家训练一个独立的软演员网络,实现为共享干线和评论家特定的头。代理通过在随机选择的演员头上以epsilon-soft方式行动来进行深度探索。我们提出的算法,命名为PAC-Bayesian Actor-Critic(PBAC),是唯一一个能够在难度不同的连续控制任务中持续发现延迟奖励的算法。

更新时间: 2025-08-20 07:52:22

领域: cs.LG

下载: http://arxiv.org/abs/2402.03055v5

EvoCurr: Self-evolving Curriculum with Behavior Code Generation for Complex Decision-making

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, including programming, planning, and decision-making. However, their performance often degrades when faced with highly complex problem instances that require deep reasoning over long horizons. In such cases, direct problem-solving approaches can lead to inefficiency or failure due to the lack of structured intermediate guidance. To address this, we propose a novel self-evolve framework, EvoCurr, in which a dedicated curriculum-generation LLM constructs a sequence of problem instances with gradually increasing difficulty, tailored to the solver LLM's learning progress. The curriculum dynamically adapts easing challenges when the solver struggles and escalating them when success is consistent, thus maintaining an optimal learning trajectory. This approach enables the solver LLM, implemented as a code-generation model producing Python decision-tree scripts, to progressively acquire the skills needed for complex decision-making tasks. Experimental results on challenging decision-making benchmarks show that our method significantly improves task success rates and solution efficiency compared to direct-solving baselines. These findings suggest that LLM-driven curriculum learning holds strong potential for enhancing automated reasoning in real-world, high-complexity domains.
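The difficulty-adaptation loop described above (ease when the solver struggles, escalate when success is consistent) can be sketched as a simple controller over a recent-outcome window; the thresholds and level bounds are illustrative assumptions:

```python
def update_difficulty(level, recent_outcomes, up=0.8, down=0.3,
                      min_level=1, max_level=10):
    """Curriculum step: escalate difficulty when success is consistent,
    ease it when the solver struggles, otherwise hold steady."""
    rate = sum(recent_outcomes) / len(recent_outcomes)
    if rate >= up:
        return min(level + 1, max_level)
    if rate <= down:
        return max(level - 1, min_level)
    return level

history = [[1, 1, 1, 1, 1],   # consistent success -> escalate
           [0, 1, 0, 0, 0],   # struggling -> ease
           [1, 0, 1, 1, 0]]   # mixed -> hold
levels = [5]
for window in history:
    levels.append(update_difficulty(levels[-1], window))
```

In EvoCurr the controller role is itself played by a curriculum-generation LLM; this sketch only captures the feedback rule that keeps the solver on a productive difficulty trajectory.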

Updated: 2025-08-20 07:50:49

标题: EvoCurr:面向复杂决策的自进化课程与行为代码生成

摘要: 大型语言模型(LLMs)在编程、规划和决策制定等多个领域展示出了显著的能力。然而,当面对需要长程深入推理的高度复杂问题实例时,它们的性能通常会下降。在这种情况下,由于缺乏结构化的中间指导,直接的问题求解方法可能导致低效或失败。为了解决这个问题,我们提出了一个新颖的自进化框架EvoCurr,其中一个专门的课程生成LLM根据求解器LLM的学习进度,构建难度逐渐增加的问题实例序列。课程会动态调整:在求解器遇到困难时降低挑战难度,在其持续成功时提升难度,从而保持最佳的学习轨迹。这种方法使求解器LLM(实现为生成Python决策树脚本的代码生成模型)能够逐步获得完成复杂决策任务所需的技能。在具有挑战性的决策基准上的实验结果显示,与直接求解基线相比,我们的方法显著提高了任务成功率和求解效率。这些发现表明,LLM驱动的课程学习在增强现实世界高复杂性领域的自动推理方面具有强大的潜力。

更新时间: 2025-08-20 07:50:49

领域: cs.AI

下载: http://arxiv.org/abs/2508.09586v2

No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets

Benchmark datasets have proved pivotal to the success of graph learning, and good benchmark datasets are crucial to guide the development of the field. Recent research has highlighted problems with graph-learning datasets and benchmarking practices -- revealing, for example, that methods which ignore the graph structure can outperform graph-based approaches. Such findings raise two questions: (1) What makes a good graph-learning dataset, and (2) how can we evaluate dataset quality in graph learning? Our work addresses these questions. As the classic evaluation setup uses datasets to evaluate models, it does not apply to dataset evaluation. Hence, we start from first principles. Observing that graph-learning datasets uniquely combine two modes -- graph structure and node features --, we introduce Rings, a flexible and extensible mode-perturbation framework to assess the quality of graph-learning datasets based on dataset ablations -- i.e., quantifying differences between the original dataset and its perturbed representations. Within this framework, we propose two measures -- performance separability and mode complementarity -- as evaluation tools, each assessing the capacity of a graph dataset to benchmark the power and efficacy of graph-learning methods from a distinct angle. We demonstrate the utility of our framework for dataset evaluation via extensive experiments on graph-level tasks and derive actionable recommendations for improving the evaluation of graph-learning methods. Our work opens new research directions in data-centric graph learning, and it constitutes a step toward the systematic evaluation of evaluations.
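The dataset ablations described above knock out one of the two modes while keeping the other intact. A minimal sketch of two such mode perturbations (the edge-list/feature-dict representation and the specific perturbation names are assumptions for illustration):

```python
import random

def perturb_mode(edges, features, mode, rng):
    """Mode-perturbation-style dataset ablation: remove one of the two
    modes (graph structure or node features) and keep the other."""
    if mode == "empty-graph":            # remove structure, keep features
        return [], dict(features)
    if mode == "shuffled-features":      # keep structure, decouple features from nodes
        nodes = list(features)
        vals = [features[n] for n in nodes]
        rng.shuffle(vals)
        return list(edges), dict(zip(nodes, vals))
    return list(edges), dict(features)   # "original"

edges = [(0, 1), (1, 2), (2, 3)]
feats = {0: [1.0], 1: [0.5], 2: [0.2], 3: [0.9]}
rng = random.Random(1)
e1, f1 = perturb_mode(edges, feats, "empty-graph", rng)
e2, f2 = perturb_mode(edges, feats, "shuffled-features", rng)
```

Comparing model performance on the original dataset against these perturbed versions is what quantities like performance separability are then computed from.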

Updated: 2025-08-20 07:43:47

标题: 没有适用于所有情况的度量标准:朝着基于原则的图学习数据集评估前进

摘要: 基准数据集已被证明对图学习的成功至关重要,良好的基准数据集对指导该领域的发展至关重要。最近的研究突出了图学习数据集和基准实践中存在的问题,例如发现忽略图结构的方法可能优于基于图的方法。这些发现引发了两个问题:(1)什么是好的图学习数据集,以及(2)我们如何评估图学习中数据集的质量?我们的工作解决了这些问题。由于经典评估设置是用数据集来评估模型,它并不适用于数据集评估。因此,我们从基本原理出发。观察到图学习数据集独特地结合了两种模式——图结构和节点特征——我们引入了Rings,一个灵活且可扩展的模式扰动框架,基于数据集消融(即量化原始数据集与其扰动表示之间的差异)来评估图学习数据集的质量。在这个框架内,我们提出了两个度量——性能可分离性和模式互补性——作为评估工具,各自从不同角度评估一个图数据集衡量图学习方法能力与效果的基准能力。我们通过在图级任务上的大量实验展示了该框架在数据集评估中的实用性,并提出了改进图学习方法评估的可操作建议。我们的工作开辟了以数据为中心的图学习的新研究方向,并朝着对评估本身进行系统评估迈出了一步。

更新时间: 2025-08-20 07:43:47

领域: cs.LG,cs.SI,stat.ML

下载: http://arxiv.org/abs/2502.02379v3

Exact Shapley Attributions in Quadratic-time for FANOVA Gaussian Processes

Shapley values are widely recognized as a principled method for attributing importance to input features in machine learning. However, the exact computation of Shapley values scales exponentially with the number of features, severely limiting the practical application of this powerful approach. The challenge is further compounded when the predictive model is probabilistic - as in Gaussian processes (GPs) - where the outputs are random variables rather than point estimates, necessitating additional computational effort in modeling higher-order moments. In this work, we demonstrate that for an important class of GPs known as FANOVA GP, which explicitly models all main effects and interactions, *exact* Shapley attributions for both local and global explanations can be computed in *quadratic time*. For local, instance-wise explanations, we define a stochastic cooperative game over function components and compute the exact stochastic Shapley value in quadratic time only, capturing both the expected contribution and uncertainty. For global explanations, we introduce a deterministic, variance-based value function and compute exact Shapley values that quantify each feature's contribution to the model's overall sensitivity. Our methods leverage a closed-form (stochastic) M\"{o}bius representation of the FANOVA decomposition and introduce recursive algorithms, inspired by Newton's identities, to efficiently compute the mean and variance of Shapley values. Our work enhances the utility of explainable AI, as demonstrated by empirical studies, by providing more scalable, axiomatically sound, and uncertainty-aware explanations for predictions generated by structured probabilistic models.
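For intuition on why FANOVA structure makes exact Shapley values cheap: with only main effects and pairwise interactions, each interaction's contribution is split equally between its two participants, so attributions follow in quadratic time from the evaluated components. This deterministic sketch ignores the stochastic GP moments the paper also handles, and the component values are hypothetical:

```python
def shapley_from_fanova(main_effects, pairwise_effects):
    """Exact Shapley attributions for a FANOVA-style additive model
    f(x) = f0 + sum_i f_i(x_i) + sum_{i<j} f_ij(x_i, x_j),
    evaluated at one input: each pairwise interaction is split equally
    between its two participants (quadratic in the number of features)."""
    phi = dict(main_effects)                        # start with main effects
    for (i, j), v in pairwise_effects.items():
        phi[i] = phi.get(i, 0.0) + v / 2.0
        phi[j] = phi.get(j, 0.0) + v / 2.0
    return phi

# hypothetical evaluated components at a single input x
mains = {0: 1.0, 1: -0.5, 2: 0.25}
pairs = {(0, 1): 0.4, (1, 2): -0.2}
phi = shapley_from_fanova(mains, pairs)
total = sum(mains.values()) + sum(pairs.values())   # f(x) - f0
```

The efficiency axiom (attributions summing to f(x) - f0) holds by construction, which is what makes the closed-form Möbius route attractive.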

Updated: 2025-08-20 07:39:14

标题: FANOVA高斯过程中的二次时间精确Shapley归因

摘要: Shapley值被广泛认为是在机器学习中归因输入特征重要性的一种原则性方法。然而,Shapley值的精确计算随特征数量呈指数增长,严重限制了这种强大方法的实际应用。当预测模型是概率性的——如高斯过程(GPs)——输出是随机变量而非点估计时,挑战进一步加剧,需要在建模高阶矩方面付出额外的计算工作。在这项工作中,我们证明了对于一类重要的GPs,即显式建模所有主效应和交互作用的FANOVA GP,局部和全局解释的*精确*Shapley归因都可以在*二次时间*内计算出来。对于局部的逐实例解释,我们在函数分量上定义了一个随机合作博弈,并仅用二次时间计算精确的随机Shapley值,同时捕捉预期贡献和不确定性。对于全局解释,我们引入了一个确定性的、基于方差的价值函数,并计算精确的Shapley值,量化每个特征对模型整体灵敏度的贡献。我们的方法利用了FANOVA分解的闭式(随机)Möbius表示,并引入了受牛顿恒等式启发的递归算法,以高效计算Shapley值的均值和方差。正如实证研究所表明的,我们的工作通过为结构化概率模型生成的预测提供更具可扩展性、公理上合理且考虑不确定性的解释,增强了可解释AI的实用性。

更新时间: 2025-08-20 07:39:14

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2508.14499v1

LLM4FS: Leveraging Large Language Models for Feature Selection

Recent advances in large language models (LLMs) have provided new opportunities for decision-making, particularly in the task of automated feature selection. In this paper, we first comprehensively evaluate LLM-based feature selection methods, covering the state-of-the-art DeepSeek-R1, GPT-o3-mini, and GPT-4.5. Then, we propose a new hybrid strategy called LLM4FS that integrates LLMs with traditional data-driven methods. Specifically, it feeds data samples into LLMs and directly calls traditional data-driven techniques such as random forest and forward sequential selection. Notably, our analysis reveals that the hybrid strategy combines the contextual understanding of LLMs with the high statistical reliability of traditional data-driven methods to achieve excellent feature selection performance, surpassing either approach on its own. Finally, we point out the limitations of its application in decision-making. Our code is available at https://github.com/xianchaoxiu/LLM4FS.
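One simple way to combine an LLM-proposed feature ordering with a data-driven importance score, sketched here as rank averaging; the feature names, scores, and the specific aggregation rule are illustrative assumptions, not the paper's exact procedure:

```python
def hybrid_feature_ranks(llm_ranking, data_scores):
    """Blend an LLM-proposed feature ordering with data-driven importance
    scores by averaging the two rank positions per feature (lower is better)."""
    llm_rank = {f: r for r, f in enumerate(llm_ranking)}
    data_order = sorted(data_scores, key=data_scores.get, reverse=True)
    data_rank = {f: r for r, f in enumerate(data_order)}
    avg = {f: (llm_rank[f] + data_rank[f]) / 2.0 for f in llm_rank}
    return sorted(avg, key=avg.get)

llm_order = ["age", "income", "zip", "height"]                        # hypothetical LLM output
rf_scores = {"income": 0.5, "age": 0.3, "height": 0.15, "zip": 0.05}  # e.g. random forest
ranking = hybrid_feature_ranks(llm_order, rf_scores)
```

Features favoured by both sources rise to the top, which is one way a hybrid can beat either ranking alone.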

Updated: 2025-08-20 07:35:22

标题: LLM4FS:利用大型语言模型进行特征选择

摘要: 大型语言模型(LLMs)的最新进展为决策提供了新的机遇,特别是在自动特征选择任务中。本文首先全面评估了基于LLM的特征选择方法,涵盖了最先进的DeepSeek-R1、GPT-o3-mini和GPT-4.5。然后,我们提出了一种称为LLM4FS的新混合策略,将LLMs与传统数据驱动方法相结合。具体来说,该策略将数据样本输入LLMs,并直接调用随机森林和前向顺序选择等传统数据驱动技术。值得注意的是,我们的分析表明,该混合策略结合了LLMs的上下文理解能力和传统数据驱动方法的高统计可靠性,实现了出色的特征选择性能,甚至超过了单独使用LLMs或传统数据驱动方法。最后,我们指出了其在决策应用中的局限性。我们的代码可在https://github.com/xianchaoxiu/LLM4FS获取。

更新时间: 2025-08-20 07:35:22

领域: cs.LG

下载: http://arxiv.org/abs/2503.24157v3

Semantic Energy: Detecting LLM Hallucination Beyond Entropy

Large Language Models (LLMs) are being increasingly deployed in real-world applications, but they remain susceptible to hallucinations, which produce fluent yet incorrect responses and lead to erroneous decision-making. Uncertainty estimation is a feasible approach to detect such hallucinations. For example, semantic entropy estimates uncertainty by considering the semantic diversity across multiple sampled responses, thus identifying hallucinations. However, semantic entropy relies on post-softmax probabilities and fails to capture the model's inherent uncertainty, causing it to be ineffective in certain scenarios. To address this issue, we introduce Semantic Energy, a novel uncertainty estimation framework that leverages the inherent confidence of LLMs by operating directly on logits of penultimate layer. By combining semantic clustering with a Boltzmann-inspired energy distribution, our method better captures uncertainty in cases where semantic entropy fails. Experiments across multiple benchmarks show that Semantic Energy significantly improves hallucination detection and uncertainty estimation, offering more reliable signals for downstream applications such as hallucination detection.
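The key distinction above, operating on raw logits rather than post-softmax probabilities, can be shown with a Boltzmann-style energy. The toy logit vectors are assumptions; the point is that two logit vectors with identical softmax (hence identical entropy) can carry very different magnitude, which the energy retains:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    z = logsumexp(xs)
    return [math.exp(x - z) for x in xs]

def energy(logits, temperature=1.0):
    """Boltzmann-style energy E = -T * logsumexp(logits / T), computed on the
    raw logits, so it retains the magnitude information that post-softmax
    probabilities discard. Lower energy = more confident."""
    return -temperature * logsumexp([l / temperature for l in logits])

strong = [5.0, 0.1, 0.1]           # large-magnitude, confident logits
weak = [l - 5.0 for l in strong]   # identical softmax, smaller magnitude
```

Any entropy computed after the softmax assigns these two cases the same uncertainty, while the energy separates them; that is the failure mode of semantic entropy the abstract targets.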

Updated: 2025-08-20 07:33:50

标题: 语义能量:超越熵的大型语言模型幻觉检测

摘要: 大型语言模型(LLMs)越来越多地被部署在现实世界的应用中,但它们仍然容易出现幻觉,产生流畅但错误的响应,并导致错误的决策。不确定性估计是一种可行的方法来检测这种幻觉。例如,语义熵通过考虑多个采样响应之间的语义多样性来估计不确定性,从而识别幻觉。然而,语义熵依赖于后softmax概率,无法捕获模型固有的不确定性,导致在某些情况下无效。为了解决这个问题,我们引入了语义能量,这是一个新颖的不确定性估计框架,通过直接在倒数第二层的logit上操作,利用LLMs的固有信心。通过将语义聚类与受玻尔兹曼启发的能量分布相结合,我们的方法更好地捕获了语义熵失效的情况下的不确定性。跨多个基准测试的实验表明,语义能量显著改善了幻觉检测和不确定性估计,为幻觉检测等下游应用提供了更可靠的信号。

更新时间: 2025-08-20 07:33:50

领域: cs.LG

下载: http://arxiv.org/abs/2508.14496v1

DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy

Recently, autoencoders (AEs) have gained interest for creating parametric and invertible projections of multidimensional data. Parametric projections make it possible to embed new, unseen samples without recalculating the entire projection, while invertible projections allow the synthesis of new data instances. However, existing methods perform poorly when dealing with out-of-distribution samples in either the data or embedding space. Thus, we propose DE-VAE, an uncertainty-aware variational AE using differential entropy (DE) to improve the learned parametric and invertible projections. Given a fixed projection, we train DE-VAE to learn a mapping into 2D space and an inverse mapping back to the original space. We conduct quantitative and qualitative evaluations on four well-known datasets, using UMAP and t-SNE as baseline projection methods. Our findings show that DE-VAE can create parametric and inverse projections with comparable accuracy to other current AE-based approaches while enabling the analysis of embedding uncertainty.
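The differential-entropy signal described above has a closed form for the diagonal-Gaussian posteriors a VAE typically predicts; a minimal sketch, with the log-variance values chosen as illustrative assumptions:

```python
import math

def gaussian_differential_entropy(log_var):
    """Differential entropy of a diagonal-Gaussian posterior
    N(mu, diag(exp(log_var))): h = 0.5 * sum_d (log(2*pi*e) + log_var_d).
    Larger entropy = less certain placement of the sample in the embedding."""
    c = math.log(2.0 * math.pi * math.e)
    return 0.5 * sum(c + lv for lv in log_var)

tight = gaussian_differential_entropy([-4.0, -4.0])   # confident 2-D embedding
loose = gaussian_differential_entropy([1.0, 1.0])     # out-of-distribution-like
```

Thresholding or visualizing this per-sample entropy is one way the learned parametric projection can expose which embeddings (or inverse projections) should not be trusted.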

Updated: 2025-08-20 07:31:03

标题: DE-VAE:使用变分自动编码器在参数化和逆向投影中揭示不确定性,利用差分熵

摘要: 最近,自编码器(AEs)因为能够创建多维数据的参数化和可逆投影而引起了人们的兴趣。参数化投影使得能够嵌入新的、未见样本,而无需重新计算整个投影;而可逆投影则允许合成新的数据实例。然而,现有方法在处理数据空间或嵌入空间中的分布之外的样本时表现不佳。因此,我们提出了DE-VAE,这是一种利用差分熵(DE)改进学习的参数化和可逆投影的不确定性感知变分自编码器。给定一个固定的投影,我们训练DE-VAE学习将数据映射到2D空间,并且再进行逆映射回原始空间。我们在四个知名数据集上进行了定量和定性评估,使用UMAP和t-SNE作为基准投影方法。我们的研究结果表明,DE-VAE能够创建具有可比较精度的参数化和逆投影,与其他当前基于AE的方法相当,同时还能够分析嵌入的不确定性。

更新时间: 2025-08-20 07:31:03

领域: cs.LG

下载: http://arxiv.org/abs/2508.12145v2

Synaptic bundle theory for spike-driven sensor-motor system: More than eight independent synaptic bundles collapse reward-STDP learning

Neuronal spikes directly drive muscles and endow animals with agile movements, but applying the spike-based control signals to actuators in artificial sensor-motor systems inevitably causes a collapse of learning. We developed a system that can vary \emph{the number of independent synaptic bundles} in sensor-to-motor connections. This paper demonstrates the following four findings: (i) Learning collapses once the number of motor neurons or the number of independent synaptic bundles exceeds a critical limit. (ii) The probability of learning failure is increased by a smaller number of motor neurons, while (iii) if learning succeeds, a smaller number of motor neurons leads to faster learning. (iv) The number of weight updates that move in the opposite direction of the optimal weight can quantitatively explain these results. The functions of spikes remain largely unknown. Identifying the parameter range in which learning systems using spikes can be constructed will make it possible to study the functions of spikes that were previously inaccessible due to the difficulty of learning.

Updated: 2025-08-20 07:29:33

标题: 面向脉冲驱动感知-运动系统的突触束理论:超过八个独立突触束会使奖励-STDP学习崩溃

摘要: 神经元脉冲直接驱动肌肉,赋予动物敏捷的运动能力,但将基于脉冲的控制信号应用于人工感知-运动系统中的执行器,不可避免地会导致学习崩溃。我们开发了一个可以改变感知-运动连接中\emph{独立突触束数量}的系统。本文展示了以下四个发现:(i) 一旦运动神经元数量或独立突触束数量超过临界限度,学习就会崩溃。(ii) 运动神经元数量越少,学习失败的概率越高;而(iii) 如果学习成功,运动神经元数量越少,学习速度越快。(iv) 朝最优权重相反方向移动的权重更新次数可以定量解释这些结果。脉冲的功能在很大程度上仍然未知。确定能够构建基于脉冲的学习系统的参数范围,将使研究此前因学习困难而无法触及的脉冲功能成为可能。

更新时间: 2025-08-20 07:29:33

领域: q-bio.NC,cs.AI,nlin.AO

下载: http://arxiv.org/abs/2508.14492v1

Social Debiasing for Fair Multi-modal LLMs

Multi-modal Large Language Models (MLLMs) have dramatically advanced the research field and delivered powerful vision-language understanding capabilities. However, these models often inherit deep-rooted social biases from their training data, leading to uncomfortable responses with respect to attributes such as race and gender. This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive counterfactual dataset with multiple social concepts (CMSC), which complements existing datasets by providing 18 diverse and balanced social concepts; and ii) proposing a counter-stereotype debiasing (CSD) strategy that mitigates social biases in MLLMs by leveraging the opposites of prevalent stereotypes. CSD incorporates both a novel bias-aware data sampling method and a loss rescaling method, enabling the model to effectively reduce biases. We conduct extensive experiments with four prevalent MLLM architectures. The results demonstrate the advantage of the CMSC dataset and the edge of CSD strategy in reducing social biases compared to existing competing methods, without compromising the overall performance on general multi-modal reasoning benchmarks.

Updated: 2025-08-20 07:24:46

标题: 社交去偏见以实现公平的多模态LLMs

摘要: 多模态大型语言模型(MLLMs)极大地推动了研究领域的发展,并提供了强大的视觉-语言理解能力。然而,这些模型往往会从训练数据中继承根深蒂固的社会偏见,导致在种族和性别等属性方面产生令人不适的回应。本文通过以下方式解决MLLMs中的社会偏见问题:i) 引入一个包含多个社会概念的综合反事实数据集(CMSC),提供18个多样且平衡的社会概念,以补充现有数据集;ii) 提出一种反刻板印象去偏(CSD)策略,通过利用普遍刻板印象的对立面来减轻MLLMs中的社会偏见。CSD结合了一种新颖的偏见感知数据采样方法和损失重缩放方法,使模型能够有效减少偏见。我们对四种流行的MLLM架构进行了广泛的实验。结果表明,与现有竞争方法相比,CMSC数据集和CSD策略在减少社会偏见方面具有优势,同时不会损害在一般多模态推理基准上的整体性能。

更新时间: 2025-08-20 07:24:46

领域: cs.CL,cs.AI

下载: http://arxiv.org/abs/2408.06569v2

A Comprehensive Benchmark on Spectral GNNs: The Impact on Efficiency, Memory, and Effectiveness

With recent advancements in graph neural networks (GNNs), spectral GNNs have received increasing popularity by virtue of their ability to retrieve graph signals in the spectral domain. These models feature uniqueness in efficient computation as well as rich expressiveness, which stems from advanced management and profound understanding of graph data. However, few systematic studies have been conducted to assess spectral GNNs, particularly in benchmarking their efficiency, memory consumption, and effectiveness in a unified and fair manner. There is also a pressing need to select spectral models suitable for learning specific graph data and deploying them to massive web-scale graphs, which is currently constrained by the varied model designs and training settings. In this work, we extensively benchmark spectral GNNs with a focus on the spectral perspective, demystifying them as spectral graph filters. We analyze and categorize 35 GNNs with 27 corresponding filters, spanning diverse formulations and utilizations of the graph data. Then, we implement the filters within a unified spectral-oriented framework with dedicated graph computations and efficient training schemes. In particular, our implementation enables the deployment of spectral GNNs over million-scale graphs and various tasks with comparable performance and less overhead. Thorough experiments are conducted on the graph filters with comprehensive metrics on effectiveness and efficiency, offering novel observations and practical guidelines that are only available from our evaluations across graph scales. Different from the prevailing belief, our benchmark reveals an intricate landscape regarding the effectiveness and efficiency of spectral graph filters, demonstrating the potential to achieve desirable performance through tailored spectral manipulation of graph data.
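The common core behind the 27 filters the abstract mentions is a spectral filter applied to a graph signal; many spectral GNNs reduce to a polynomial of the Laplacian. A minimal sketch on a 3-node path graph (the filter coefficients are an illustrative low-pass choice, not any specific model from the benchmark):

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def poly_graph_filter(laplacian, x, coeffs):
    """Apply a polynomial spectral filter h(L) x = sum_k theta_k * L^k x,
    the common core of many spectral GNNs, evaluated iteratively so no
    eigendecomposition is needed."""
    out = [coeffs[0] * xi for xi in x]       # theta_0 * x
    power = list(x)
    for theta in coeffs[1:]:
        power = matvec(laplacian, power)     # L^k x
        out = [o + theta * p for o, p in zip(out, power)]
    return out

# unnormalized Laplacian L = D - A of the path graph 0 - 1 - 2
L = [[1.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 1.0]]
smooth = poly_graph_filter(L, [1.0, 0.0, 1.0], [1.0, -0.5])   # low-pass: x - 0.5*Lx
```

Iterating sparse matrix-vector products like this, rather than diagonalizing L, is exactly what makes such filters deployable on million-scale graphs.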

Updated: 2025-08-20 07:15:59

标题: 一项关于谱图神经网络的综合基准测试:对效率、内存和效果的影响

摘要: 随着图神经网络(GNNs)的最新进展,谱GNNs凭借其在谱域中检索图信号的能力而越来越受欢迎。这些模型兼具高效计算的独特性和丰富的表达能力,这源于对图数据的先进处理和深刻理解。然而,很少有系统性研究对谱GNNs进行评估,特别是以统一和公平的方式对其效率、内存消耗和有效性进行基准测试。当前还迫切需要选择适合学习特定图数据的谱模型并将其部署到海量网络规模的图上,而这受到各种模型设计和训练设置差异的制约。在这项工作中,我们以谱视角为重点对谱GNNs进行了广泛的基准测试,将它们解构为谱图滤波器。我们分析并归类了35个GNNs及其对应的27个滤波器,涵盖了对图数据的多种公式化和利用方式。然后,我们在一个统一的面向谱的框架中实现这些滤波器,配备专门的图计算和高效的训练方案。特别是,我们的实现使谱GNNs能够以可比的性能和更少的开销部署到百万规模的图和各种任务上。我们对这些图滤波器进行了彻底的实验,采用关于有效性和效率的全面度量,提供了只有通过跨图规模的评估才能获得的新观察和实用指南。与普遍看法不同,我们的基准测试揭示了谱图滤波器在有效性和效率方面的复杂格局,展示了通过对图数据进行有针对性的谱操作来实现理想性能的潜力。

更新时间: 2025-08-20 07:15:59

领域: cs.LG,cs.AI

下载: http://arxiv.org/abs/2406.09675v3

On the notion of missingness for path attribution explainability methods in medical settings: Guiding the selection of medically meaningful baselines

The explainability of deep learning models remains a significant challenge, particularly in the medical domain where interpretable outputs are critical for clinical trust and transparency. Path attribution methods such as Integrated Gradients rely on a baseline input representing the absence of relevant features ("missingness"). Commonly used baselines, such as all-zero inputs, are often semantically meaningless, especially in medical contexts where missingness can itself be informative. While alternative baseline choices have been explored, existing methods lack a principled approach to dynamically select baselines tailored to each input. In this work, we examine the notion of missingness in the medical setting, analyze its implications for baseline selection, and introduce a counterfactual-guided approach to address the limitations of conventional baselines. We argue that a clinically normal but input-close counterfactual represents a more accurate representation of a meaningful absence of features in medical data. To implement this, we use a Variational Autoencoder to generate counterfactual baselines, though our concept is generative-model-agnostic and can be applied with any suitable counterfactual method. We evaluate the approach on three distinct medical data sets and empirically demonstrate that counterfactual baselines yield more faithful and medically relevant attributions compared to standard baseline choices.
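The role of the baseline can be made concrete with Integrated Gradients itself, where the baseline is an explicit argument of the attribution. The toy "risk score" and the near-input counterfactual baseline below are assumptions for illustration (the paper generates such baselines with a VAE):

```python
def integrated_gradients(f, x, baseline, steps=100):
    """Path attributions of f at x with respect to a chosen baseline,
    via a midpoint Riemann sum and central-difference gradients.
    The baseline encodes what 'feature missingness' means."""
    n = len(x)
    attr = [0.0] * n
    eps = 1e-5
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            hi = point[:i] + [point[i] + eps] + point[i + 1:]
            lo = point[:i] + [point[i] - eps] + point[i + 1:]
            attr[i] += (f(hi) - f(lo)) / (2 * eps)
    return [(xi - b) * a / steps for xi, b, a in zip(x, baseline, attr)]

f = lambda z: z[0] ** 2 + 2.0 * z[1]   # toy 'risk score' over two measurements
x = [1.0, 1.0]
ig_zero = integrated_gradients(f, x, baseline=[0.0, 0.0])
# hypothetical 'clinically normal' counterfactual baseline close to the input
ig_cf = integrated_gradients(f, x, baseline=[0.8, 1.0])
```

With the all-zero baseline, the second feature receives a large attribution even though its value is unremarkable; against the counterfactual baseline, attribution flows only to the feature that actually deviates from 'normal'.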

Updated: 2025-08-20 07:13:41

标题: 在医疗环境中路径归因可解释性方法中缺失性概念的研究:引导选择具有医学意义的基线

摘要: 深度学习模型的可解释性仍然是一个重要挑战,特别是在医学领域,解释性输出对临床信任和透明度至关重要。路径归因方法如整合梯度依赖于基线输入,表示缺少相关特征("缺失")。常用的基线,如全零输入,在医学背景下通常意义不明确,特别是在缺失本身可能具有信息意义的情况下。虽然已经探索了替代基线选择,但现有方法缺乏一种针对每个输入动态选择基线的原则方法。在这项工作中,我们考察了医学环境中缺失的概念,分析了其对基线选择的影响,并介绍了一种反事实引导的方法来解决传统基线的局限性。我们认为,在医学数据中,临床正常但输入接近的反事实代表了对特征有意义缺失的更准确表达。为了实现这一点,我们使用变分自动编码器生成反事实基线,尽管我们的概念是生成模型不可知的,可以与任何合适的反事实方法一起应用。我们在三个不同的医学数据集上评估了这种方法,并实证证明,与标准基线选择相比,反事实基线产生更忠实和医学相关的归因。

更新时间: 2025-08-20 07:13:41

领域: cs.LG

下载: http://arxiv.org/abs/2508.14482v1

FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.

Updated: 2025-08-20 07:13:27

标题: FMSD-TTS:用于卫藏、安多和康方言语音数据集生成的少样本多说话人多方言文本转语音合成

摘要: 藏语是一种资源匮乏的语言,其三种主要方言(卫藏、安多和康)之间的平行语音语料极少,限制了语音建模的进展。为了解决这个问题,我们提出了FMSD-TTS,一种少样本、多说话人、多方言的文本转语音框架,可以从有限的参考音频和明确的方言标签中合成平行的方言语音。我们的方法采用了一种新颖的说话人-方言融合模块和方言专用动态路由网络(DSDR-Net),可以捕捉跨方言的细粒度声学和语言变化,同时保留说话人身份。广泛的客观和主观评估表明,FMSD-TTS在方言表现力和说话人相似度方面明显优于基线。我们进一步通过一个具有挑战性的语音到语音方言转换任务验证了合成语音的质量和实用性。我们的贡献包括:(1)一种针对藏语多方言语音合成定制的新型少样本TTS系统,(2)公开发布由FMSD-TTS生成的大规模合成藏语语音语料库,以及(3)一个用于标准化评估说话人相似度、方言一致性和音频质量的开源评估工具包。

更新时间: 2025-08-20 07:13:27

领域: cs.SD,cs.AI,cs.CL,eess.AS

下载: http://arxiv.org/abs/2505.14351v3

Fast Symbolic Regression Benchmarking

Symbolic regression (SR) uncovers mathematical models from data. Several benchmarks have been proposed to compare the performance of SR algorithms. However, existing ground-truth rediscovery benchmarks overemphasize the recovery of "the one" expression form or rely solely on computer algebra systems (such as SymPy) to assess success. Furthermore, existing benchmarks continue the expression search even after its discovery. We improve upon these issues by introducing curated lists of acceptable expressions, and a callback mechanism for early termination. As a starting point, we use the symbolic regression for scientific discovery (SRSD) benchmark problems proposed by Yoshitomo et al., and benchmark the two SR packages SymbolicRegression.jl and TiSR. The new benchmarking method increases the rediscovery rate of SymbolicRegression.jl from 26.7%, as reported by Yoshitomo et at., to 44.7%. Performing the benchmark takes 41.2% less computational expense. TiSR's rediscovery rate is 69.4%, while performing the benchmark saves 63% time.
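The two mechanisms the abstract introduces, a curated list of acceptable expression forms and a callback that terminates the search on discovery, can be sketched as follows. The whitespace-stripping "normalization" and the candidate stream are placeholder assumptions; a real harness would compare expressions via a computer algebra system or a richer rule set:

```python
def normalize(expr):
    """Cheap canonicalization stand-in (whitespace removal)."""
    return expr.replace(" ", "")

def make_stop_callback(acceptable):
    """Return a callback the SR loop calls on every new candidate; it records
    success and signals early termination once any acceptable form is found."""
    accepted = {normalize(a) for a in acceptable}
    state = {"found": None}
    def callback(candidate):
        if normalize(candidate) in accepted:
            state["found"] = candidate
            return True    # stop the search
        return False
    return callback, state

# hypothetical curated list for a target like y = x1*x2 + x1
acceptable = ["x1*x2 + x1", "x1*(x2 + 1)"]
callback, state = make_stop_callback(acceptable)
stream = ["x1 + x2", "x1*x2", "x1 * (x2 + 1)", "x1*x2 + x1"]
evaluated = 0
for cand in stream:
    evaluated += 1
    if callback(cand):
        break
```

Accepting the factored form immediately, instead of waiting for "the one" expansion, is what saves both rediscovery misses and post-discovery search time.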

Updated: 2025-08-20 07:12:44

标题: 快速符号回归基准测试

摘要: 符号回归(SR)从数据中发现数学模型。已经提出了几个基准来比较SR算法的性能。然而,现有的真实性重新发现基准过分强调恢复“一个”表达形式,或者仅依赖于计算代数系统(如SymPy)来评估成功。此外,现有的基准在发现表达式后继续搜索。我们通过引入可接受表达式的策划清单和用于提前终止的回调机制来改进这些问题。作为起点,我们使用由Yoshitomo等人提出的用于科学发现的符号回归(SRSD)基准问题,并对两个SR包SymbolicRegression.jl和TiSR进行基准测试。新的基准测试方法将SymbolicRegression.jl的重新发现率从Yoshitomo等人报告的26.7%提高到44.7%。进行基准测试的计算开销减少了41.2%。TiSR的重新发现率为69.4%,进行基准测试可节省63%的时间。

更新时间: 2025-08-20 07:12:44

领域: cs.LG

下载: http://arxiv.org/abs/2508.14481v1

Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks

We study how gating mechanisms in recurrent neural networks (RNNs) implicitly induce adaptive learning-rate behavior, even when training is carried out with a fixed, global learning rate. This effect arises from the coupling between state-space time scales--parametrized by the gates--and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs, we obtain a first-order expansion that makes explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates not only control information flow, but also act as data-driven preconditioners that adapt optimization trajectories in parameter space. We further draw formal analogies with learning-rate schedules, momentum, and adaptive methods such as Adam, pointing to possible redundancies. Empirical simulations corroborate these claims: in canonical synthetic sequence tasks (adding, copy) we show that gates induce lag-dependent effective learning rates and directional concentration of gradient flow, with multi-gate models matching or exceeding the anisotropic structure produced by Adam. These results highlight that optimizer-driven and gate-driven adaptivity are complementary but not equivalent mechanisms. Overall, this work provides a unified dynamical-systems perspective on how gating couples state evolution with parameter updates, explaining why gated architectures achieve robust trainability and stability in practice.
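The coupling described above is easiest to see in the leaky-integrator case, where the gate sets a closed-form, lag-dependent factor on any gradient flowing through the state. A small sketch with a finite-difference check (the sequence values and gate setting are illustrative):

```python
def leaky_run(x, alpha, h0=0.0):
    """Leaky integrator: h_t = (1 - alpha) * h_{t-1} + alpha * x_t."""
    h = h0
    for xt in x:
        h = (1.0 - alpha) * h + alpha * xt
    return h

def lag_sensitivity(alpha, lag):
    """Closed-form Jacobian entry d h_T / d x_{T-lag} = alpha * (1 - alpha)**lag.
    The gate alpha rescales any gradient flowing through the state by a
    lag-dependent factor: an implicit, gate-controlled learning-rate schedule."""
    return alpha * (1.0 - alpha) ** lag

# finite-difference check: perturb the input `lag` steps before the end
x, alpha, lag, eps = [0.5, -1.0, 2.0, 0.3], 0.25, 2, 1e-6
xp = list(x)
xp[len(x) - 1 - lag] += eps
fd = (leaky_run(xp, alpha) - leaky_run(x, alpha)) / eps
```

Because this factor multiplies every parameter gradient routed through the state, tuning the gate and tuning the effective step size are entangled, which is the state/parameter time-scale coupling the paper formalizes for full gated RNNs.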

Updated: 2025-08-20 07:10:59

标题: 《循环神经网络中状态与参数之间的时间尺度耦合》

摘要: 我们研究了递归神经网络(RNNs)中的门控机制如何在训练过程中隐含地引入自适应学习率行为,即使训练过程中使用固定的全局学习率。这种效应源于状态空间时间尺度(由门参数化)与梯度下降期间的参数空间动态之间的耦合。通过推导泄漏积分器和门控RNNs的精确雅可比矩阵,我们得到了一个一阶展开,明确了常数、标量和多维门如何重塑梯度传播、调节有效步长,并在参数更新中引入各向异性。这些研究结果揭示,门不仅控制信息流,还作为数据驱动的预处理器,调整参数空间中的优化轨迹。我们进一步提出了与学习率调度、动量和自适应方法(如Adam)的形式类比,指出可能存在冗余。经验模拟证实了这些说法:在经典的合成序列任务(加法、复制)中,我们展示了门诱导的依赖滞后的有效学习率和梯度流的定向集中,多门模型可以匹配或超过Adam生成的各向异性结构。这些结果凸显了优化器驱动和门控适应性是互补但不等价的机制。总的来说,这项工作提供了一个统一的动力系统视角,解释了为什么门控架构在实践中实现了稳健的可训练性和稳定性。

更新时间: 2025-08-20 07:10:59

领域: cs.LG,math.DS

下载: http://arxiv.org/abs/2508.12121v2

MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis

Video Generation Models (VGMs) have become powerful backbones for Vision-Language-Action (VLA) models, leveraging large-scale pretraining for robust dynamics modeling. However, current methods underutilize their distribution modeling capabilities for predicting future states. Two challenges hinder progress: integrating generative processes into feature learning is both technically and conceptually underdeveloped, and naive frame-by-frame video diffusion is computationally inefficient for real-time robotics. To address these, we propose Manipulate in Dream (MinD), a dual-system world model for real-time, risk-aware planning. MinD uses two asynchronous diffusion processes: a low-frequency visual generator (LoDiff) that predicts future scenes and a high-frequency diffusion policy (HiDiff) that outputs actions. Our key insight is that robotic policies do not require fully denoised frames but can rely on low-resolution latents generated in a single denoising step. To connect early predictions to actions, we introduce DiffMatcher, a video-action alignment module with a novel co-training strategy that synchronizes the two diffusion models. MinD achieves a 63% success rate on RL-Bench, 60% on real-world Franka tasks, and operates at 11.3 FPS, demonstrating the efficiency of single-step latent features for control signals. Furthermore, MinD identifies 74% of potential task failures in advance, providing real-time safety signals for monitoring and intervention. This work establishes a new paradigm for efficient and reliable robotic manipulation using generative world models.

Updated: 2025-08-20 07:07:13

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2506.18897v2

The Man Behind the Sound: Demystifying Audio Private Attribute Profiling via Multimodal Large Language Model Agents

Our research uncovers a novel privacy risk associated with multimodal large language models (MLLMs): the ability to infer sensitive personal attributes from audio data -- a technique we term audio private attribute profiling. This capability poses a significant threat, as audio can be covertly captured without direct interaction or visibility. Moreover, compared to images and text, audio carries unique characteristics, such as tone and pitch, which can be exploited for more detailed profiling. However, two key challenges exist in understanding MLLM-employed private attribute profiling from audio: (1) the lack of audio benchmark datasets with sensitive attribute annotations and (2) the limited ability of current MLLMs to infer such attributes directly from audio. To address these challenges, we introduce AP^2, an audio benchmark dataset that consists of two subsets collected and composed from real-world data, and both are annotated with sensitive attribute labels. Additionally, we propose Gifts, a hybrid multi-agent framework that leverages the complementary strengths of audio-language models (ALMs) and large language models (LLMs) to enhance inference capabilities. Gifts employs an LLM to guide the ALM in inferring sensitive attributes, then forensically analyzes and consolidates the ALM's inferences, overcoming severe hallucinations of existing ALMs in generating long-context responses. Our evaluations demonstrate that Gifts significantly outperforms baseline approaches in inferring sensitive attributes. Finally, we investigate model-level and data-level defense strategies to mitigate the risks of audio private attribute profiling. Our work validates the feasibility of audio-based privacy attacks using MLLMs, highlighting the need for robust defenses, and provides a dataset and framework to facilitate future research.

Updated: 2025-08-20 07:04:41

Categories: cs.CR,cs.SD,eess.AS

Download: http://arxiv.org/abs/2507.10016v2

Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning

Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.
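As a minimal numpy sketch of the orthogonal re-parameterization idea (random features and QR-based orthogonalization stand in here for the paper's learned, orthogonally re-parameterized projection layers; the dimensions and client names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_projection(dim_client: int, dim_global: int, rng) -> np.ndarray:
    """Build a (semi-)orthogonal projection via QR, mapping client features
    (dim_client) into the global model's feature space (dim_global)."""
    a = rng.standard_normal((dim_global, dim_client))
    q, _ = np.linalg.qr(a)      # columns of q are orthonormal: q.T @ q = I
    return q                    # shape (dim_global, dim_client)

# One projection per heterogeneous client architecture, mirroring the paper's
# server-side design.
proj = {name: orthogonal_projection(d, 64, rng)
        for name, d in {"cnn": 32, "mlp": 48}.items()}

feats = {"cnn": rng.standard_normal(32), "mlp": rng.standard_normal(48)}
aligned = {k: proj[k] @ v for k, v in feats.items()}  # all features now 64-dim

# Orthogonality preserves feature norms, one way to limit projection-induced
# knowledge bias across clients.
print(np.allclose(np.linalg.norm(aligned["cnn"]), np.linalg.norm(feats["cnn"])))
```

Because each projection has orthonormal columns, no client's features are stretched or shrunk relative to another's before distillation.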

Updated: 2025-08-20 06:59:30

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.10348v2

In2x at WMT25 Translation Task

This paper presents the open-system submission by the In2x research team for the WMT25 General Machine Translation Shared Task. Our submission focuses on Japanese-related translation tasks, aiming to explore a generalizable paradigm for extending large language models (LLMs) to other languages. This paradigm encompasses aspects such as data construction methods and reward model design. The ultimate goal is to enable large language model systems to achieve exceptional performance in low-resource or less commonly spoken languages.

Updated: 2025-08-20 06:52:42

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.14472v1

Each to Their Own: Exploring the Optimal Embedding in RAG

Recently, as Large Language Models (LLMs) have fundamentally impacted various fields, the methods for incorporating up-to-date information into LLMs or adding external knowledge to construct domain-specific models have garnered wide attention. Retrieval-Augmented Generation (RAG), serving as an inference-time scaling method, is notable for its low cost and minimal effort for parameter tuning. However, due to heterogeneous training data and model architecture, the variant embedding models used in RAG exhibit different benefits across various areas, often leading to different similarity calculation results and, consequently, varying response quality from LLMs. To address this problem, we propose and examine two approaches to enhance RAG by combining the benefits of multiple embedding models, named Mixture-Embedding RAG and Confident RAG. Mixture-Embedding RAG simply sorts and selects retrievals from multiple embedding models based on standardized similarity; however, it does not outperform vanilla RAG. In contrast, Confident RAG generates responses multiple times using different embedding models and then selects the responses with the highest confidence level, demonstrating average improvements of approximately 10% and 5% over vanilla LLMs and RAG, respectively. The consistent results across different LLMs and embedding models indicate that Confident RAG is an efficient plug-and-play approach for various domains. We will release our code upon publication.
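The two proposed variants can be sketched in a few lines (the retrieval scores, confidence values, and model names `emb_a`/`emb_b` are illustrative stand-ins, not the paper's models):

```python
import numpy as np

def mixture_embedding_select(retrievals, k=3):
    """Mixture-Embedding RAG (sketch): pool retrievals from several embedding
    models, standardize each model's similarity scores so they are comparable,
    then keep the top-k documents overall."""
    pooled = []
    for scored_docs in retrievals.values():             # [(similarity, doc), ...]
        sims = np.array([s for s, _ in scored_docs])
        z = (sims - sims.mean()) / (sims.std() + 1e-9)  # per-model z-scores
        pooled.extend(zip(z, (d for _, d in scored_docs)))
    pooled.sort(key=lambda t: -t[0])
    return [doc for _, doc in pooled[:k]]

def confident_rag_select(candidates):
    """Confident RAG (sketch): generate one response per embedding model,
    then keep the response with the highest confidence score."""
    return max(candidates, key=lambda c: c["confidence"])["response"]

retrievals = {"emb_a": [(0.91, "doc1"), (0.40, "doc2")],
              "emb_b": [(0.55, "doc3"), (0.52, "doc1")]}
candidates = [{"response": "answer-A", "confidence": -0.31},
              {"response": "answer-B", "confidence": -0.12}]

print(mixture_embedding_select(retrievals, k=2))
print(confident_rag_select(candidates))   # highest-confidence response wins
```

The contrast the abstract reports is visible in the structure: Mixture-Embedding merges before generation, while Confident RAG pays for multiple generations and selects afterwards.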

Updated: 2025-08-20 06:44:38

Categories: cs.CL,cs.AI

Download: http://arxiv.org/abs/2507.17442v2

Input Time Scaling

Current Large Language Models (LLMs) are usually post-trained on large-scale, carefully curated datasets (data & training scaling) and perform reasoning at test time (inference time scaling). In this work, we present a new scaling paradigm, Input Time Scaling, which complements previous scaling methods by putting resources on queries (input time). During training and testing, we combine meta-knowledge from LLMs to refine inputs with different strategies. We also find a new phenomenon, training-testing co-design: query strategies must be applied during both training and testing, as applying them only at training or only at testing seriously degrades performance. We are also surprised to find that seemingly low-quality datasets can yield high performance. Adding irrelevant information to the queries, or randomly selecting examples from a minimally filtered dataset, can even perform best. These findings contradict the widely held inductive bias, "garbage in, garbage out": curating datasets of seemingly high-quality data can even limit the performance ceiling. In addition, models trained on more data of similar quality (15k vs. 1k) perform worse, so simple dataset-size scaling should also be inspected carefully. The good news is that our findings are compatible with the Less is More phenomenon: a small set of examples is enough to evoke high-level reasoning ability. With experiments on models trained from Qwen2.5-32B-Instruct, we reach SOTA performance among 32B models on AIME24 (76.7%) and AIME25 (76.7%) pass@1, and further achieve AIME24 (76.7%) and AIME25 (80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the best results are 86.7% on AIME24 and 76.7% on AIME25. To facilitate reproducibility and further research, we are working on open-sourcing our datasets, data pipelines, evaluation results, and checkpoints.

Updated: 2025-08-20 06:41:59

Categories: cs.LG,cs.AI,cs.CL

Download: http://arxiv.org/abs/2508.13654v2

SketchDNN: Joint Continuous-Discrete Diffusion for CAD Sketch Generation

We present SketchDNN, a generative model for synthesizing CAD sketches that jointly models both continuous parameters and discrete class labels through a unified continuous-discrete diffusion process. Our core innovation is Gaussian-Softmax diffusion, where logits perturbed with Gaussian noise are projected onto the probability simplex via a softmax transformation, facilitating blended class labels for discrete variables. This formulation addresses two key challenges, namely, the heterogeneity of primitive parameterizations and the permutation invariance of primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state-of-the-art in CAD sketch generation on the SketchGraphs dataset.
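The Gaussian-Softmax step is compact enough to sketch directly (a toy numpy version under assumed notation; `sigma` plays the role of the per-timestep noise scale, and the 4-way "primitive type" label is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_softmax_noise(logits: np.ndarray, sigma: float) -> np.ndarray:
    """One Gaussian-Softmax noising step (sketch): perturb class logits with
    Gaussian noise, then project onto the probability simplex via softmax,
    giving a blended (soft) class label for a discrete variable."""
    noisy = logits + sigma * rng.standard_normal(logits.shape)
    z = noisy - noisy.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

# A near-one-hot "line" label among 4 primitive types, noised progressively:
logits = np.array([8.0, 0.0, 0.0, 0.0])
for sigma in (0.1, 2.0, 8.0):
    print(sigma, gaussian_softmax_noise(logits, sigma).round(3))
# Larger sigma pushes the blended label toward uniform over the simplex.
```

The softmax projection is what keeps every noised discrete label a valid point on the simplex, so continuous and discrete variables can share one diffusion process.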

Updated: 2025-08-20 06:40:48

Categories: cs.CV,cs.LG

Download: http://arxiv.org/abs/2507.11579v2

PathGPT: Reframing Path Recommendation as a Natural Language Generation Task with Retrieval-Augmented Language Models

Path recommendation (PR) aims to generate travel paths that are customized to a user's specific preferences and constraints. Conventional approaches often employ explicit optimization objectives or specialized machine learning architectures; however, these methods typically exhibit limited flexibility and generalizability, necessitating costly retraining to accommodate new scenarios. This paper introduces an alternative paradigm that conceptualizes PR as a natural language generation task. We present PathGPT, a retrieval-augmented large language model (LLM) system that leverages historical trajectory data and natural language user constraints to generate plausible paths. The proposed methodology first converts raw trajectory data into a human-interpretable textual format, which is then stored in a database. Subsequently, a hybrid retrieval system extracts path-specific context from this database to inform a pretrained LLM. The primary contribution of this work is a novel framework that demonstrates how integrating established information retrieval and generative model components can enable adaptive, zero-shot path generation across diverse scenarios. Extensive experiments on large-scale trajectory datasets indicate that PathGPT's performance is competitive with specialized, learning-based methods, underscoring its potential as a flexible and generalizable path generation system that avoids the need for retraining inherent in previous data-driven models.

Updated: 2025-08-20 06:37:23

Categories: cs.IR,cs.AI,cs.LG

Download: http://arxiv.org/abs/2504.05846v2

LaViPlan : Language-Guided Visual Path Planning with RLVR

Out-of-distribution (OOD) scenarios in autonomous driving pose critical challenges, as planners often fail to generalize beyond their training experience, leading to unsafe or unexpected behavior. Vision-Language Models (VLMs) have shown promise in handling such scenarios by providing high-level scene understanding and user-aligned decisions. However, existing VLMs often exhibit a misalignment between their language-based reasoning and the low-level trajectories required for action-level planning. In this paper, we propose LaViPlan, a framework that leverages Reinforcement Learning with Verifiable Rewards (RLVR) to fine-tune VLMs using planning-oriented metrics. Experimental results show that LaViPlan improves planning performance across both in-domain and out-of-domain datasets. While linguistic fidelity slightly decreases after RLVR-based fine-tuning, qualitative evaluation indicates that the outputs remain coherent. We also conduct ablation studies to analyze the effects of sampling ratio and reasoning guidance, highlighting how these design choices influence performance. These findings demonstrate the potential of RLVR as a post-training paradigm for aligning language-guided reasoning with action-level planning in autonomous driving.

Updated: 2025-08-20 06:32:37

Categories: cs.RO,cs.LG

Download: http://arxiv.org/abs/2507.12911v4

DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
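A toy instance of the primal/dual construction (an illustrative example of ours, not taken from the paper):

```python
# Primal task: given the known part b and hidden part a of the input,
# solve a + x = b for x. Dual task: reconstruct the hidden a from the
# primal output x and the known b. Reconstruction error then acts as an
# annotation-free, self-supervised reward for the primal answer.

def primal(a: float, b: float) -> float:
    return b - a                    # the model's candidate solution x

def dual(x: float, b: float) -> float:
    return b - x                    # recover the hidden component a

def reward(a_hidden: float, b: float, x_pred: float) -> float:
    return -abs(dual(x_pred, b) - a_hidden)   # 0 means perfect reconstruction

a, b = 3.0, 10.0
print(reward(a, b, primal(a, b)))   # 0.0  -> correct answer, max reward
print(reward(a, b, 6.5))            # -0.5 -> penalized without any gold label
```

Note the verifier never needs a label for x itself: only the hidden part of the input, which the system already holds, is used to score the answer, which is what frees DuPO from RLVR-style labels.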

Updated: 2025-08-20 06:31:18

Categories: cs.LG,cs.CL

Download: http://arxiv.org/abs/2508.14460v1

When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

As large language models become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio language models to generate harmful content. Our method uses imperceptible perturbations in audio inputs that remain benign to human listeners. The first stage uses a novel reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the target model to circumvent its own safety protocols and generate harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use Projected Gradient Descent (PGD) to optimize subtle perturbations that are embedded into benign audio carriers, such as weather queries or greeting messages. Validated under the rigorous StrongREJECT, LlamaGuard, as well as Human Evaluation safety evaluation framework, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior.
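Stage 2's optimizer is standard PGD, which can be sketched on a toy differentiable objective (the quadratic `grad_fn` is a stand-in for gradients through a real audio-language model; `eps`, `alpha`, and `steps` are invented values):

```python
import numpy as np

def pgd_attack(x0, grad_fn, eps=0.01, alpha=0.002, steps=50):
    """Projected Gradient Descent (sketch of the Stage-2 payload-injection
    mechanics): take signed gradient steps that lower the attack loss, then
    project the perturbation back into the ball ||x - x0||_inf <= eps so the
    carrier stays (by assumption) imperceptible."""
    x = x0.copy()
    for _ in range(steps):
        x = x - alpha * np.sign(grad_fn(x))   # descend the attack loss
        x = x0 + np.clip(x - x0, -eps, eps)   # project onto the eps-ball
    return x

# Toy stand-in for "make the model emit the target response": a quadratic
# pull toward a target vector, in place of gradients through an ALM.
target = np.full(16, 0.5)
grad_fn = lambda x: 2.0 * (x - target)

x0 = np.zeros(16)                 # the benign audio carrier
x_adv = pgd_attack(x0, grad_fn)
print(np.abs(x_adv - x0).max())   # never exceeds eps = 0.01
```

The projection step is the "benign to human listeners" constraint in miniature: optimization pressure toward the harmful target is capped by the imperceptibility ball.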

Updated: 2025-08-20 06:08:28

Categories: cs.SD,cs.AI,cs.CR,eess.AS

Download: http://arxiv.org/abs/2508.03365v2

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

Updated: 2025-08-20 06:00:57

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.14444v1

Detecting Reading-Induced Confusion Using EEG and Eye Tracking

Humans regularly navigate an overwhelming amount of information via text media, whether reading articles, browsing social media, or interacting with chatbots. Confusion naturally arises when new information conflicts with or exceeds a reader's comprehension or prior knowledge, posing a challenge for learning. In this study, we present a multimodal investigation of reading-induced confusion using EEG and eye tracking. We collected neural and gaze data from 11 adult participants as they read short paragraphs sampled from diverse, real-world sources. By isolating the N400 event-related potential (ERP), a well-established neural marker of semantic incongruence, and integrating behavioral markers from eye tracking, we provide a detailed analysis of the neural and behavioral correlates of confusion during naturalistic reading. Using machine learning, we show that multimodal (EEG + eye tracking) models improve classification accuracy by 4-22% over unimodal baselines, reaching an average weighted participant accuracy of 77.3% and a best accuracy of 89.6%. Our results highlight the dominance of the brain's temporal regions in these neural signatures of confusion, suggesting avenues for wearable, low-electrode brain-computer interfaces (BCI) for real-time monitoring. These findings lay the foundation for developing adaptive systems that dynamically detect and respond to user confusion, with potential applications in personalized learning, human-computer interaction, and accessibility.

Updated: 2025-08-20 05:56:17

Categories: cs.HC,cs.AI

Download: http://arxiv.org/abs/2508.14442v1

Linkage Attacks Expose Identity Risks in Public ECG Data Sharing

The increasing availability of publicly shared electrocardiogram (ECG) data raises critical privacy concerns, as its biometric properties make individuals vulnerable to linkage attacks. Unlike prior studies that assume idealized adversarial capabilities, we evaluate ECG privacy risks under realistic conditions where attackers operate with partial knowledge. Using data from 109 participants across diverse real-world datasets, our approach achieves 85% accuracy in re-identifying individuals in public datasets while maintaining a 14.2% overall misclassification rate at an optimal confidence threshold, with 15.6% of unknown individuals misclassified as known and 12.8% of known individuals misclassified as unknown. These results highlight the inadequacy of simple anonymization techniques in preventing re-identification, demonstrating that even limited adversarial knowledge enables effective identity linkage. Our findings underscore the urgent need for privacy-preserving strategies, such as differential privacy, access control, and encrypted computation, to mitigate re-identification risks while ensuring the utility of shared biosignal data in healthcare applications.
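The core linkage mechanic, matching a query against enrolled templates with an open-set rejection threshold, can be sketched as follows (toy 2-D "embeddings" and threshold; the paper's attacker model and ECG features are much richer):

```python
import numpy as np

def link_identity(query, enrolled, tau=0.9):
    """Open-set linkage sketch: match an ECG embedding against enrolled
    templates by cosine similarity; queries whose best match falls below
    the confidence threshold tau are declared 'unknown'."""
    best_id, best_sim = None, -1.0
    for pid, tmpl in enrolled.items():
        sim = tmpl @ query / (np.linalg.norm(tmpl) * np.linalg.norm(query))
        if sim > best_sim:
            best_id, best_sim = pid, sim
    return best_id if best_sim >= tau else "unknown"

# Toy vectors standing in for learned ECG representations:
enrolled = {"p1": np.array([1.0, 0.0]), "p2": np.array([0.0, 1.0])}
print(link_identity(np.array([0.98, 0.05]), enrolled))  # re-identified as p1
print(link_identity(np.array([0.70, 0.70]), enrolled))  # rejected as unknown
```

The two misclassification rates the abstract reports correspond to the two failure modes of this threshold: unknowns whose similarity clears tau, and knowns whose similarity falls below it.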

Updated: 2025-08-20 05:52:10

Categories: cs.CR,cs.LG

Download: http://arxiv.org/abs/2508.15850v1

Personalized Counterfactual Framework: Generating Potential Outcomes from Wearable Data

Wearable sensor data offer opportunities for personalized health monitoring, yet deriving actionable insights from their complex, longitudinal data streams is challenging. This paper introduces a framework to learn personalized counterfactual models from multivariate wearable data. This enables exploring what-if scenarios to understand potential individual-specific outcomes of lifestyle choices. Our approach first augments individual datasets with data from similar patients via multi-modal similarity analysis. We then use a temporal PC (Peter-Clark) algorithm adaptation to discover predictive relationships, modeling how variables at time t-1 influence physiological changes at time t. Gradient Boosting Machines are trained on these discovered relationships to quantify individual-specific effects. These models drive a counterfactual engine projecting physiological trajectories under hypothetical interventions (e.g., activity or sleep changes). We evaluate the framework via one-step-ahead predictive validation and by assessing the plausibility and impact of interventions. Evaluation showed reasonable predictive accuracy (e.g., mean heart rate MAE 4.71 bpm) and high counterfactual plausibility (median 0.9643). Crucially, these interventions highlighted significant inter-individual variability in response to hypothetical lifestyle changes, showing the framework's potential for personalized insights. This work provides a tool to explore personalized health dynamics and generate hypotheses on individual responses to lifestyle changes.
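A miniature version of the what-if engine, with a linear next-step model standing in for the paper's PC-plus-GBM pipeline and a synthetic two-variable wearable log (all dynamics and values below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_next_step(X):
    """Fit a linear model of day-t variables from day-(t-1) variables plus an
    intercept -- a stand-in for the paper's GBMs over PC-discovered edges."""
    prev = np.hstack([X[:-1], np.ones((len(X) - 1, 1))])
    A, *_ = np.linalg.lstsq(prev, X[1:], rcond=None)   # X[1:] ~= prev @ A
    return A

def counterfactual_rollout(A, x0, idx, val, horizon=7):
    """Project a trajectory while pinning one variable to a hypothetical
    value, i.e. do(variable = val) -- the what-if engine in miniature."""
    x, traj = x0.copy(), []
    for _ in range(horizon):
        x = np.append(x, 1.0) @ A
        x[idx] = val
        traj.append(x.copy())
    return np.array(traj)

# Synthetic wearable log, columns = [steps, resting_hr]: resting HR drifts
# toward a level that decreases with the previous day's steps (toy dynamics).
days = np.zeros((60, 2)); days[0] = [0.5, 0.7]
for t in range(1, 60):
    steps = np.clip(0.5 + 0.1 * rng.standard_normal(), 0.0, 1.0)
    days[t] = [steps, 0.9 * days[t - 1, 1] + 0.1 * (1.0 - days[t - 1, 0])]

A = fit_next_step(days)
active = counterfactual_rollout(A, days[-1], idx=0, val=0.9)  # high activity
idle = counterfactual_rollout(A, days[-1], idx=0, val=0.1)    # low activity
print(active[-1, 1] < idle[-1, 1])   # more steps -> lower projected HR
```

Pinning one column each step is the `do`-operation; comparing the two rollouts for the same person is exactly the kind of individual-specific what-if the framework is built to answer.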

Updated: 2025-08-20 05:04:17

Categories: cs.LG

Download: http://arxiv.org/abs/2508.14432v1

Bi-directional Model Cascading with Proxy Confidence

Model Cascading, recently applied successfully to LLMs, is a simple but powerful technique that improves the efficiency of inference by selectively applying models of varying sizes. Models are used in sequence from smallest to largest, only deferring samples to large, costly models when smaller models are not sufficiently confident. Existing approaches to deferral use only limited small model confidence estimates because of the inaccessibility of the large model, although large model confidence is known to be important. We therefore propose a bi-directional approach to deferral that considers the confidence of small and large models in the cascade simultaneously through the use of a proxy for the large model. This requires a richer representation of model confidence to enable comparative calibration: we use an analysis of hidden states to improve post-invocation confidence of the small model, which in itself improves cascading results over prior approaches. We then combine this with a tiny proxy model to estimate pre-invocation confidence of the large model. We examine the proposed cascading system over challenging, multiple-choice datasets, finding improvements over standard cascading baselines reflected in reductions in deferrals to more costly models.
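The deferral rule can be sketched as follows (the `small`, `proxy`, and `large` callables and both thresholds are hypothetical stand-ins for the paper's trained components):

```python
def bidirectional_cascade(x, small, proxy, large, t_small=0.8, t_proxy=0.5):
    """Bi-directional deferral (sketch): answer with the small model unless
    (a) its post-invocation confidence is low AND (b) the tiny proxy predicts
    the large model would be confident enough to justify its cost."""
    label, conf_small = small(x)        # small model runs first, cheaply
    if conf_small >= t_small:
        return label, "small"
    if proxy(x) >= t_proxy:             # pre-invocation estimate for large
        return large(x)[0], "large"
    return label, "small"               # large unlikely to help; keep small

# Hypothetical stand-ins for the three trained components:
small = lambda x: (("yes" if x > 0 else "no"), min(abs(x), 1.0))
proxy = lambda x: 0.9                   # e.g. trained on small-model hidden states
large = lambda x: (("yes" if x > -0.1 else "no"), 0.99)

print(bidirectional_cascade(2.0, small, proxy, large))   # confident -> small
print(bidirectional_cascade(0.05, small, proxy, large))  # deferred -> large
```

The second gate is the paper's contribution in miniature: an unconfident small model no longer defers unconditionally, but only when the proxy suggests the expensive call is likely to pay off.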

Updated: 2025-08-20 04:48:09

Categories: cs.LG

Download: http://arxiv.org/abs/2504.19391v2

The Agent Behavior: Model, Governance and Challenges in the AI Digital Age

Advancements in AI have led to agents in networked environments increasingly mirroring human behavior, thereby blurring the boundary between artificial and human actors in specific contexts. This shift brings significant challenges in trust, responsibility, ethics, security, and beyond. The difficulty of supervising agent behaviors may lead to issues such as data contamination and unclear accountability. To address these challenges, this paper proposes the "Network Behavior Lifecycle" model, which divides network behavior into 6 stages and systematically analyzes the behavioral differences between humans and agents at each stage. Based on these insights, the paper further introduces the "Agent for Agent (A4A)" paradigm and the "Human-Agent Behavioral Disparity (HABD)" model, which examine the fundamental distinctions between human and agent behaviors across 5 dimensions: decision mechanism, execution efficiency, intention-behavior consistency, behavioral inertia, and irrational patterns. The effectiveness of the model is verified through real-world cases such as red-team penetration and blue-team defense. Finally, the paper discusses future research directions in dynamic cognitive governance architecture, behavioral disparity quantification, and meta-governance protocol stacks, aiming to provide a theoretical foundation and technical roadmap for secure and trustworthy human-agent collaboration.

Updated: 2025-08-20 04:24:55

Subjects: cs.AI

Download: http://arxiv.org/abs/2508.14415v1

Disentanglement in T-space for Faster and Distributed Training of Diffusion Models with Fewer Latent-states

We challenge a fundamental assumption of diffusion models, namely, that a large number of latent-states or time-steps is required for training so that the reverse generative process is close to a Gaussian. We first show that with careful selection of a noise schedule, diffusion models trained over a small number of latent states (i.e. $T \sim 32$) match the performance of models trained over a much larger number of latent states ($T \sim 1,000$). Second, we push this limit (on the minimum number of latent states required) to a single latent state, which we refer to as complete disentanglement in T-space. We show that high quality samples can be easily generated by the disentangled model obtained by combining several independently trained single latent-state models. We provide extensive experiments to show that the proposed disentangled model provides 4-6$\times$ faster convergence measured across a variety of metrics on two different datasets.

Updated: 2025-08-20 04:21:26

Subjects: cs.LG,cs.CV

Download: http://arxiv.org/abs/2508.14413v1

Self-Disguise Attack: Induce the LLM to disguise itself for AIGT detection evasion

AI-generated text (AIGT) detection evasion aims to reduce the detection probability of AIGT, helping to identify weaknesses in detectors and enhance their effectiveness and reliability in practical applications. Although existing evasion methods perform well, they suffer from high computational costs and text quality degradation. To address these challenges, we propose Self-Disguise Attack (SDA), a novel approach that enables Large Language Models (LLMs) to actively disguise their output, reducing the likelihood of detection by classifiers. The SDA comprises two main components: the adversarial feature extractor and the retrieval-based context examples optimizer. The former generates disguise features that enable LLMs to understand how to produce more human-like text. The latter retrieves the most relevant examples from an external knowledge base as in-context examples, further enhancing the self-disguise ability of LLMs and mitigating the impact of the disguise process on the diversity of the generated text. The SDA directly employs prompts containing disguise features and optimized context examples to guide the LLM in generating detection-resistant text, thereby reducing resource consumption. Experimental results demonstrate that the SDA effectively reduces the average detection accuracy of various AIGT detectors across texts generated by three different LLMs, while maintaining the quality of AIGT.

Updated: 2025-08-20 04:17:03

Subjects: cs.CR,cs.CL

Download: http://arxiv.org/abs/2508.15848v1

Automated Optimization Modeling through Expert-Guided Large Language Model Reasoning

Optimization Modeling (OM) is essential for solving complex decision-making problems. However, the process remains time-consuming and error-prone, heavily relying on domain experts. While Large Language Models (LLMs) show promise in addressing these challenges through their natural language understanding and reasoning capabilities, current approaches face three critical limitations: high benchmark labeling error rates reaching up to 42\%, narrow evaluation scope that only considers optimal values, and computational inefficiency due to heavy reliance on multi-agent systems or model fine-tuning. In this work, we first enhance existing datasets through systematic error correction and more comprehensive annotation. Additionally, we introduce LogiOR, a new optimization modeling benchmark from the logistics domain, containing more complex problems with standardized annotations. Furthermore, we present ORThought, a novel framework that leverages expert-level optimization modeling principles through chain-of-thought reasoning to automate the OM process. Through extensive empirical evaluation, we demonstrate that ORThought outperforms existing approaches, including multi-agent frameworks, with particularly significant advantages on complex optimization problems. Finally, we provide a systematic analysis of our method, identifying critical success factors and failure modes, providing valuable insights for future research on LLM-based optimization modeling.

Updated: 2025-08-20 04:14:54

Subjects: cs.AI

Download: http://arxiv.org/abs/2508.14410v1

Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLMs

Large language models (LLMs) have been shown to possess a degree of self-recognition capability: the ability to identify whether a given text was generated by themselves. Prior work has demonstrated that this capability is reliably expressed under the Pair Presentation Paradigm (PPP), where the model is presented with two texts and asked to choose which one it authored. However, performance deteriorates sharply under the Individual Presentation Paradigm (IPP), where the model is given a single text to judge authorship. Although this phenomenon has been observed, its underlying causes have not been systematically analyzed. In this paper, we first replicate existing findings to confirm that LLMs struggle to distinguish self- from other-generated text under IPP. We then investigate the reasons for this failure and attribute it to a phenomenon we term Implicit Territorial Awareness (ITA): the model's latent ability to distinguish self- and other-texts in representational space, which remains unexpressed in its output behavior. To awaken the ITA of LLMs, we propose Cognitive Surgery (CoSur), a novel framework comprising four main modules: representation extraction, territory construction, authorship discrimination and cognitive editing. Experimental results demonstrate that our proposed method improves the performance of three different LLMs in the IPP scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%, respectively.

Updated: 2025-08-20 04:08:18

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.14408v1

Generative AI in K-12 Education: The CyberScholar Initiative

This paper focuses on the piloting of CyberScholar, a Generative AI assistant tool that aims to provide formative feedback on writing in K-12 contexts. Specifically, this study explores how students worked with CyberScholar in diverse subject areas, including English Language Arts, Social Studies, and Modern World History classes in Grades 7, 8, 10, and 11 in three schools in the Midwest and one in the Northwest of the United States. The study examines CyberScholar's potential to support K-12 students' writing in diverse subject areas requiring written assignments. Data were collected through implementation observations, surveys, and interviews with 121 participating students and 4 teachers. Thematic qualitative analysis revealed that the feedback tool was perceived as valuable for supporting student writing through detailed feedback, enhanced interactivity, and alignment with rubric criteria. Students appreciated the tool's guidance in refining their writing. For students, the assistant tool suggests restructuring feedback as a dynamic, dialogic process rather than a static evaluation, a shift that aligns with the cyber-social learning idea, self-regulation, and metacognition. On the teaching side, the findings indicate a shift in teachers' roles, from serving primarily as evaluators to guiding AI feedback processes that foster better student writing and critical thinking.

Updated: 2025-08-20 03:58:04

Subjects: cs.CY,cs.AI,cs.HC

Download: http://arxiv.org/abs/2502.19422v3

Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models

Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. However, the advancement of T2I diffusion models presents significant risks, as the models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. To mitigate these risks, concept removal methods have been proposed. These methods aim to modify diffusion models to prevent the generation of malicious and unwanted concepts. Despite these efforts, existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric. In this benchmark, we conduct a thorough evaluation of concept removals, with the experimental observations and discussions offering valuable insights in the field.

Updated: 2025-08-20 03:56:07

Subjects: cs.CV,cs.CR

Download: http://arxiv.org/abs/2406.14855v3

Precision over Noise: Tailoring S3 Public Access Detection to Reduce False Positives in Cloud Security Platforms

Excessive and spurious alert generation by cloud security solutions is a root cause of analyst fatigue and operational inefficiencies. In this study, the long-standing issue of false positives in Amazon S3 public-access alerts, as generated by a licensed cloud-native security solution, is examined. In a simulated production test environment consisting of over 1,000 Amazon S3 buckets with diverse access configurations, over 80\% of the alerts generated by default rules were classified as false positives, demonstrating the severity of the detection issue. This severely impacted detection accuracy and generated a heavier workload for analysts due to redundant manual triage efforts. To address this problem, custom detection logic was created using the solution's native rule-customization capabilities. A unified rule titled ``S3 Public Access Validation and Data Exposure'' consolidates different forms of alerts into a single context-aware logic that systematically scans ACL configurations, bucket policies, indicators of public exposure, and the presence of sensitive data, and then flags only those S3 buckets that pose a genuine security risk by being publicly exposed on the internet without authentication. The results demonstrate a significant reduction in false positives, more precise alert fidelity, and significant time savings for security analysts, demonstrating an actionable and reproducible solution for enhancing the accuracy of security alerting in compliance-focused cloud environments.
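The consolidated rule itself is not reproduced in the abstract; the sketch below illustrates, under assumed field names, the kind of context-aware check it describes: a bucket is flagged only when an ACL grant or bucket-policy statement actually opens it to the internet, Block Public Access is off, and sensitive data is present. The `bucket` dict schema (`acl_grantees`, `policy_statements`, `block_public_access`, `has_sensitive_data`) is hypothetical, not the vendor's or AWS's actual API.

```python
# Hypothetical schema: each bucket is a dict with optional keys
#   acl_grantees        - list of ACL grantee URIs
#   policy_statements   - list of {"Effect": ..., "Principal": ...} dicts
#   block_public_access - bool, account/bucket-level override
#   has_sensitive_data  - bool, result of a sensitive-data scan
PUBLIC_GRANTEES = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def is_publicly_exposed(bucket):
    """Flag a bucket only if it is anonymously reachable AND holds sensitive data."""
    # Block Public Access overrides public ACLs and policies when enabled.
    if bucket.get("block_public_access", False):
        return False
    acl_public = any(g in PUBLIC_GRANTEES for g in bucket.get("acl_grantees", []))
    policy_public = any(
        s.get("Effect") == "Allow" and s.get("Principal") == "*"
        for s in bucket.get("policy_statements", [])
    )
    return (acl_public or policy_public) and bucket.get("has_sensitive_data", False)
```

Combining the exposure and sensitivity conditions in a single predicate is what suppresses the alerts that single-signal default rules fire on, such as a public bucket that intentionally serves only static assets.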

Updated: 2025-08-20 03:55:19

Subjects: cs.CR

Download: http://arxiv.org/abs/2508.14402v1

NoteIt: A System Converting Instructional Videos to Interactable Notes Through Multimodal Video Understanding

Users often take notes for instructional videos to access key knowledge later without revisiting long videos. Automated note generation tools enable users to obtain informative notes efficiently. However, notes generated by existing research or off-the-shelf tools fail to preserve the information conveyed in the original videos comprehensively, nor can they satisfy users' expectations for diverse presentation formats and interactive features when using notes digitally. In this work, we present NoteIt, a system, which automatically converts instructional videos to interactable notes using a novel pipeline that faithfully extracts hierarchical structure and multimodal key information from videos. With NoteIt's interface, users can interact with the system to further customize the content and presentation formats of the notes according to their preferences. We conducted both a technical evaluation and a comparison user study (N=36). The solid performance in objective metrics and the positive user feedback demonstrated the effectiveness of the pipeline and the overall usability of NoteIt. Project website: https://zhaorunning.github.io/NoteIt/

Updated: 2025-08-20 03:45:18

Subjects: cs.HC,cs.AI

Download: http://arxiv.org/abs/2508.14395v1

DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement

Relation extraction enables the construction of structured knowledge for many downstream applications. While large language models (LLMs) have shown great promise in this domain, most existing methods concentrate on relation classification, which predicts the semantic relation type between a related entity pair. However, we observe that LLMs often struggle to reliably determine whether a relation exists, especially in cases involving complex sentence structures or intricate semantics, which leads to spurious predictions. Such hallucinations can introduce noisy edges in knowledge graphs, compromising the integrity of structured knowledge and downstream reliability. To address these challenges, we propose DEPTH, a framework that integrates Dependency-aware sEntence simPlification and Two-tiered Hierarchical refinement into the relation extraction pipeline. Given a sentence and its candidate entity pairs, DEPTH operates in two stages: (1) the Grounding module extracts relations for each pair by leveraging their shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) the Refinement module aggregates all local predictions and revises them based on a holistic understanding of the sentence, correcting omissions and inconsistencies. We further introduce a causality-driven reward model that mitigates reward hacking by disentangling spurious correlations, enabling robust fine-tuning via reinforcement learning with human feedback. Experiments on six benchmarks demonstrate that DEPTH reduces the average hallucination rate to 7.0\% while achieving a 17.2\% improvement in average F1 score over state-of-the-art baselines.
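The shortest-dependency-path extraction used by the Grounding module is a standard operation; a minimal sketch (not the authors' code) runs BFS over head links from a dependency parse. The `(tokens, heads)` input format, with `-1` marking the root, is an assumption for illustration.

```python
from collections import deque

def shortest_dependency_path(tokens, heads, src, dst):
    """Shortest path between token indices src and dst in a dependency parse.

    tokens: list of words; heads: head index per token (-1 for the root).
    A real pipeline would take these from a dependency parser's output.
    """
    # Head links, viewed as undirected edges, form the dependency graph.
    graph = {i: [] for i in range(len(tokens))}
    for i, h in enumerate(heads):
        if h >= 0:
            graph[i].append(h)
            graph[h].append(i)
    # Breadth-first search recovers the shortest dependency path.
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(tokens[node])
                node = prev[node]
            return path[::-1]
        for nb in graph[node]:
            if nb not in prev:
                prev[nb] = node
                queue.append(nb)
    return None  # disconnected parse
```

For a parse of "Alice, founder of Acme, retired", the path between "Alice" and "Acme" keeps only "Alice founder of Acme", discarding tokens irrelevant to the candidate relation, which is the syntactic-noise reduction the abstract describes.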

Updated: 2025-08-20 03:35:24

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.14391v1

Credence Calibration Game? Calibrating Large Language Models through Structured Play

As Large Language Models (LLMs) are increasingly deployed in decision-critical domains, it becomes essential to ensure that their confidence estimates faithfully correspond to their actual correctness. Existing calibration methods have primarily focused on post-hoc adjustments or auxiliary model training; however, many of these approaches necessitate additional supervision or parameter updates. In this work, we propose a novel prompt-based calibration framework inspired by the Credence Calibration Game. Our method establishes a structured interaction loop wherein LLMs receive feedback based on the alignment of their predicted confidence with correctness. Through feedback-driven prompting and natural language summaries of prior performance, our framework dynamically improves model calibration. Extensive experiments across models and game configurations demonstrate consistent improvements in evaluation metrics. Our results highlight the potential of game-based prompting as an effective strategy for LLM calibration. Code and data are available at https://anonymous.4open.science/r/LLM-Calibration/.
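The abstract does not give the game's exact scoring rule or prompt templates; as one plausible instantiation, each round can be scored with the Brier score and recent performance summarized in natural language for the next prompt. The function names below are illustrative.

```python
def brier(confidence, correct):
    """Squared error between stated confidence and the 0/1 outcome."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2

def summarize_rounds(rounds):
    """Turn a history of (confidence, correct) pairs into feedback text."""
    n = len(rounds)
    avg_conf = sum(c for c, _ in rounds) / n
    accuracy = sum(1 for _, ok in rounds if ok) / n
    avg_brier = sum(brier(c, ok) for c, ok in rounds) / n
    direction = ("overconfident" if avg_conf > accuracy
                 else "underconfident" if avg_conf < accuracy
                 else "well calibrated")
    return (f"Over {n} rounds: accuracy {accuracy:.2f}, "
            f"mean confidence {avg_conf:.2f} ({direction}), "
            f"mean Brier score {avg_brier:.3f}.")
```

Feeding such a summary back into the prompt is what makes the loop calibration-improving without any parameter updates: the model sees, in plain language, whether its stated credences have been running above or below its accuracy.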

Updated: 2025-08-20 03:33:38

Subjects: cs.CL,cs.AI

Download: http://arxiv.org/abs/2508.14390v1

Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, in both indoor and outdoor settings, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.

Updated: 2025-08-20 03:31:58

Subjects: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.04928v3

Online Incident Response Planning under Model Misspecification through Bayesian Learning and Belief Quantization

Effective responses to cyberattacks require fast decisions, even when information about the attack is incomplete or inaccurate. However, most decision-support frameworks for incident response rely on a detailed system model that describes the incident, which restricts their practical utility. In this paper, we address this limitation and present an online method for incident response planning under model misspecification, which we call MOBAL: Misspecified Online Bayesian Learning. MOBAL iteratively refines a conjecture about the model through Bayesian learning as new information becomes available, which facilitates model adaptation as the incident unfolds. To determine effective responses online, we quantize the conjectured model into a finite Markov model, which enables efficient response planning through dynamic programming. We prove that Bayesian learning is asymptotically consistent with respect to the information feedback. Additionally, we establish bounds on misspecification and quantization errors. Experiments on the CAGE-2 benchmark show that MOBAL outperforms the state of the art in terms of adaptability and robustness to model misspecification.
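As a rough illustration of the two ingredients named in the title (not the MOBAL implementation), Bayesian learning over a finite set of conjectured models and belief quantization can be sketched as follows; the grid resolution and two-model setup are assumptions.

```python
def bayes_update(prior, likelihoods):
    """Posterior over candidate models given one observation's likelihoods."""
    posterior = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(posterior)
    return [p / total for p in posterior]

def quantize_belief(belief, levels=10):
    """Snap each probability onto a grid of 1/levels steps, then renormalize.

    Quantizing the belief keeps the induced Markov model finite, so response
    planning can proceed by dynamic programming over belief grid points.
    """
    snapped = [round(b * levels) / levels for b in belief]
    total = sum(snapped) or 1.0  # guard against an all-zero snap
    return [s / total for s in snapped]
```

Iterating `bayes_update` as incident telemetry arrives, then `quantize_belief` before each planning step, mirrors the refine-then-quantize loop the abstract describes.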

Updated: 2025-08-20 03:25:59

Subjects: cs.LG,cs.AI,cs.CR,cs.SY,eess.SY

Download: http://arxiv.org/abs/2508.14385v1

Offline Imitation Learning upon Arbitrary Demonstrations by Pre-Training Dynamics Representations

Limited data has become a major bottleneck in scaling up offline imitation learning (IL). In this paper, we propose enhancing IL performance under limited expert data by introducing a pre-training stage that learns dynamics representations, derived from factorizations of the transition dynamics. We first theoretically justify that the optimal decision variable of offline IL lies in the representation space, significantly reducing the parameters to learn in the downstream IL. Moreover, the dynamics representations can be learned from arbitrary data collected with the same dynamics, allowing the reuse of massive non-expert data and mitigating the limited data issues. We present a tractable loss function inspired by noise contrastive estimation to learn the dynamics representations at the pre-training stage. Experiments on MuJoCo demonstrate that our proposed algorithm can mimic expert policies with as few as a single trajectory. Experiments on real quadrupeds show that we can leverage pre-trained dynamics representations from simulator data to learn to walk from a few real-world demonstrations.
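The abstract does not spell out its loss; a common noise-contrastive form consistent with the description, where the representation of a true transition should outscore negatives drawn from other transitions, looks like this (an assumed instantiation, not the paper's exact objective):

```python
import math

def info_nce_loss(pos_score, neg_scores):
    """Negative log-probability that softmax assigns to the positive score.

    pos_score: similarity of a true (state, action, next_state) triple;
    neg_scores: similarities of mismatched next-states from other transitions.
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # shift for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score
```

Because only transition tuples are needed, the loss can be minimized on arbitrary non-expert trajectories collected under the same dynamics, which is what lets the pre-training stage reuse massive unlabeled data.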

Updated: 2025-08-20 03:23:20

Subjects: cs.RO,cs.LG

Download: http://arxiv.org/abs/2508.14383v1

Action-Constrained Imitation Learning

Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications. In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with larger action space. The fundamental challenge of ACIL lies in the unavoidable mismatch of occupancy measure between the expert and the imitator caused by the action constraints. We tackle this mismatch through \textit{trajectory alignment} and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints. Specifically, we recast trajectory alignment as a planning problem and solve it via Model Predictive Control, which aligns the surrogate trajectories with the expert trajectories based on the Dynamic Time Warping (DTW) distance. Through extensive experiments, we demonstrate that learning from the dataset generated by DTWIL significantly enhances performance across multiple robot control tasks and outperforms various benchmark imitation learning algorithms in terms of sample efficiency. Our code is publicly available at https://github.com/NYCU-RL-Bandits-Lab/ACRL-Baselines.
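The DTW distance that DTWIL aligns surrogate trajectories under is the textbook dynamic program; a minimal 1-D sketch (not the authors' implementation):

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D trajectories."""
    inf = float("inf")
    n, m = len(a), len(b)
    # dp[i][j] = minimal cost of aligning a[:i] with b[:j]
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # a[i-1] repeats
                                  dp[i][j - 1],      # b[j-1] repeats
                                  dp[i - 1][j - 1])  # advance both
    return dp[n][m]
```

Unlike a pointwise distance, DTW lets the surrogate trajectory dwell longer in some states at zero cost, which is exactly what an action-constrained imitator must do to follow a faster expert's state path.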

Updated: 2025-08-20 03:19:07

Subjects: cs.RO,cs.LG

Download: http://arxiv.org/abs/2508.14379v1

ETA: Energy-based Test-time Adaptation for Depth Completion

We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some ``source'' data, often predict erroneous outputs when transferred to ``target'' data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method ``Energy-based Test-time Adaptation'', or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% outdoors and 10.23% indoors. Project Page: https://fuzzythecat.github.io/eta.

Updated: 2025-08-20 03:11:51

Subjects: cs.CV,cs.AI,cs.LG

Download: http://arxiv.org/abs/2508.05989v2

ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students' Cognitive Abilities

Large language models (LLMs) have demonstrated potential in educational applications, yet their capacity to accurately assess the cognitive alignment of reading materials with students' developmental stages remains insufficiently explored. This gap is particularly critical given the foundational educational principle of the Zone of Proximal Development (ZPD), which emphasizes the need to match learning resources with Students' Cognitive Abilities (SCA). Despite the importance of this alignment, there is a notable absence of comprehensive studies investigating LLMs' ability to evaluate reading comprehension difficulty across different student age groups, especially in the context of Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel benchmark specifically designed to assess stage-level Chinese reading comprehension difficulty. The benchmark is annotated by 60 Special Grade teachers, a group that represents the top 0.15% of all in-service teachers nationwide. Experimental results reveal that LLMs perform poorly in zero-shot learning scenarios, with Qwen-max and GLM even falling below the probability of random guessing. When provided with in-context examples, LLMs' performance improves substantially, with some models achieving nearly double the accuracy of their zero-shot baselines. These results reveal that LLMs possess emerging abilities to assess reading difficulty, while also exposing limitations in their current training for educationally aligned judgment. Notably, even the best-performing models display systematic directional biases, suggesting difficulties in accurately aligning material difficulty with SCA. Furthermore, significant variations in model performance across different genres underscore the complexity of the task. We envision that ZPD-SCA can provide a foundation for evaluating and improving LLMs in cognitively aligned educational applications.

Updated: 2025-08-20 03:08:47

Categories: cs.CL,cs.AI,cs.CY

Download: http://arxiv.org/abs/2508.14377v1

Computing-In-Memory Dataflow for Minimal Buffer Traffic

Computing-In-Memory (CIM) offers a potential solution to the memory wall issue and can achieve high energy efficiency by minimizing data movement, making it a promising architecture for edge AI devices. Lightweight models like MobileNet and EfficientNet, which utilize depthwise convolution for feature extraction, have been developed for these devices. However, CIM macros often face challenges in accelerating depthwise convolution, including underutilization of CIM memory and heavy buffer traffic. The latter, in particular, has been overlooked despite its significant impact on latency and energy consumption. To address this, we introduce a novel CIM dataflow that significantly reduces buffer traffic by maximizing data reuse and improving memory utilization during depthwise convolution. The proposed dataflow is grounded in solid theoretical principles, fully demonstrated in this paper. When applied to MobileNet and EfficientNet models, our dataflow reduces buffer traffic by 77.4-87.0%, leading to a total reduction in data traffic energy and latency by 10.1-17.9% and 15.6-27.8%, respectively, compared to the baseline (conventional weight-stationary dataflow).

Updated: 2025-08-20 03:05:40

Categories: cs.AR,cs.AI

Download: http://arxiv.org/abs/2508.14375v1

One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks

We study the approximation capabilities and convergence behaviors of one-layer transformers on noiseless and noisy in-context reasoning for next-token prediction. Existing theoretical results focus on understanding in-context reasoning behaviors either at the first gradient step or when the number of samples is infinite. Furthermore, neither convergence rates nor generalization abilities were previously known. Our work addresses these gaps by showing that there exists a class of one-layer transformers that are provably Bayes-optimal with both linear and ReLU attention. When trained with gradient descent, we show via a finite-sample analysis that the expected loss of these transformers converges at a linear rate to the Bayes risk. Moreover, we prove that the trained models generalize to unseen samples and exhibit learning behaviors that were empirically observed in previous works. Our theoretical findings are further supported by extensive empirical validations.

Updated: 2025-08-20 03:05:36

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2505.15009v2

Hilbert geometry of the symmetric positive-definite bicone: Application to the geometry of the extended Gaussian family

The extended Gaussian family is the closure of the Gaussian family, obtained by completing it with the counterpart elements induced by degenerate covariance matrices, degenerate precision matrices, or a mix of both degeneracies. The parameter space of the extended Gaussian family forms a symmetric positive semi-definite matrix bicone, i.e., two partial symmetric positive semi-definite matrix cones joined at their bases. In this paper, we study the Hilbert geometry of such an open bounded convex symmetric positive-definite bicone. We report the closed-form formula for the corresponding Hilbert metric distance and exhaustively study its invariance properties. We also touch upon potential applications of this geometry for dealing with extended Gaussian distributions.
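
For readers unfamiliar with the Hilbert metric, it is defined through the cross-ratio of two interior points with the two boundary points where their chord exits the convex body. A minimal sketch on the simplest convex body, an open interval (not the paper's bicone, whose closed-form formula is given in the paper itself):

```python
import math

def hilbert_distance_interval(x, y, a=0.0, b=1.0):
    """Hilbert metric on the open interval (a, b): the absolute log of
    the cross-ratio of (x, y) with the boundary points (a, b)."""
    assert a < x < b and a < y < b
    if x == y:
        return 0.0
    cross_ratio = ((x - a) * (b - y)) / ((y - a) * (b - x))
    return abs(math.log(cross_ratio))
```

Along a single chord the distance is additive, d(x, z) = d(x, y) + d(y, z) for x < y < z, which is the kind of projective-invariance property the paper studies in the matrix-bicone setting.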

Updated: 2025-08-20 02:57:02

Categories: cs.CG,cs.LG,math.PR

Download: http://arxiv.org/abs/2508.14369v1

Evaluation and Optimization of Leave-one-out Cross-validation for the Lasso

I develop an algorithm to produce the piecewise quadratic that computes leave-one-out cross-validation for the lasso as a function of its hyperparameter. The algorithm can be used to find exact hyperparameters that optimize leave-one-out cross-validation either globally or locally, and its practicality is demonstrated on real-world data sets.
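
The exact piecewise-quadratic construction is the paper's contribution; as a point of contrast, the brute-force baseline it improves on can be sketched with a minimal coordinate-descent lasso and an explicit leave-one-out loop over a hyperparameter grid (all code illustrative, not the paper's algorithm):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j's contribution removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, n * lam) / col_sq[j]
    return beta

def loocv_mse(X, y, lam):
    """Leave-one-out CV error at a single hyperparameter value."""
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        beta = lasso_cd(X[mask], y[mask], lam)
        errs.append((y[i] - X[i] @ beta) ** 2)
    return float(np.mean(errs))
```

Evaluating `loocv_mse` on a grid and taking the argmin is the expensive approach; the paper's piecewise-quadratic representation yields the whole curve, and hence exact global or local optimizers, without gridding.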

Updated: 2025-08-20 02:53:54

Categories: stat.ML,cs.LG,stat.CO

Download: http://arxiv.org/abs/2508.14368v1

Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?

Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. However, existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process, which leads to suboptimal representations. In this paper, we theoretically analyze the limitations of current contrastive methods for 3D representation learning and propose a new framework, namely CMCR, to address these shortcomings. Our approach improves upon traditional methods by better integrating both modality-shared and modality-specific features. Specifically, we introduce masked image modeling and occupancy estimation tasks to guide the network in learning more comprehensive modality-specific features. Furthermore, we propose a novel multi-modal unified codebook that learns an embedding space shared across different modalities. In addition, we introduce geometry-enhanced masked image modeling to further boost 3D representation learning. Extensive experiments demonstrate that our method mitigates the challenges faced by traditional approaches and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks. Code will be available at https://github.com/Eaphan/CMCR.

Updated: 2025-08-20 02:53:13

Categories: cs.CV,cs.AI

Download: http://arxiv.org/abs/2412.08973v2

Adaptive Experiments Under Data Sparse Settings: Applications for Educational Platforms

Adaptive experimentation is increasingly used in educational platforms to personalize learning through dynamic content and feedback. However, standard adaptive strategies such as Thompson Sampling often underperform in real-world educational settings where content variations are numerous and student participation is limited, resulting in sparse data. In particular, Thompson Sampling can lead to imbalanced content allocation and delayed convergence on which aspects of content are most effective for student learning. To address these challenges, we introduce Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS), an algorithm that refines the sampling strategy to improve content-related decision-making in data-sparse environments. WAPTS is guided by the principle of lenient regret, allowing near-optimal allocations to accelerate learning while still exploring promising content. We evaluate WAPTS in a learnersourcing scenario where students rate peer-generated learning materials, and demonstrate that it enables earlier and more reliable identification of promising treatments.
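
The abstract does not spell out the WAPTS update rule, but the baseline it modifies, Bernoulli Thompson Sampling, can be sketched as follows; the `kappa` parameter here is a hypothetical stand-in for the allocation-probability adjustment, not the paper's actual formula:

```python
import numpy as np

def weighted_thompson(reward_probs, n_rounds=500, kappa=1.0, seed=0):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors.  `kappa`
    sharpens (kappa > 1) or flattens (kappa < 1) the posterior draws,
    a hypothetical placeholder for WAPTS's weighted allocation
    adjustment.  Returns per-arm pull counts."""
    rng = np.random.default_rng(seed)
    k = len(reward_probs)
    alpha = np.ones(k)
    beta = np.ones(k)
    counts = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        draws = rng.beta(kappa * alpha, kappa * beta)  # posterior samples
        arm = int(np.argmax(draws))
        reward = rng.random() < reward_probs[arm]      # Bernoulli outcome
        alpha[arm] += reward
        beta[arm] += 1 - reward
        counts[arm] += 1
    return counts
```

In a data-sparse regime (small `n_rounds`, many arms), the allocation counts produced by plain Thompson Sampling are exactly the imbalanced quantities the paper is concerned with.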

Updated: 2025-08-20 02:46:43

Categories: cs.LG,stat.ML

Download: http://arxiv.org/abs/2501.03999v3

Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs

Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP, and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs, and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.
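
As a drastically simplified illustration of the two combined techniques, worst-case evaluation plus subgradient ascent, consider a one-step setting with a softmax policy and a small set of candidate reward models (no partial observability or memory; all details illustrative):

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def robust_ascent(rewards, n_steps=500, lr=0.1):
    """Subgradient ascent on min_m E_{a ~ pi_theta}[rewards[m, a]],
    where `rewards` is (n_models, n_actions).  Each step evaluates the
    policy against every model, picks the worst case, and follows the
    policy gradient of that model's value."""
    theta = np.linspace(2.0, -2.0, rewards.shape[1])  # deliberately bad start
    for _ in range(n_steps):
        pi = softmax(theta)
        values = rewards @ pi
        m = int(np.argmin(values))               # worst-case model
        grad = pi * (rewards[m] - values[m])     # softmax policy gradient
        theta = theta + lr * grad
    pi = softmax(theta)
    return float((rewards @ pi).min()), pi
```

With two models rewarding opposite actions, the ascent drives the policy toward the equalizing mixture, raising the worst-case value, which is the qualitative behavior the paper's far richer finite-memory scheme targets.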

Updated: 2025-08-20 02:45:49

Categories: cs.AI,cs.LG

Download: http://arxiv.org/abs/2505.09518v3

Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol (MCP) Ecosystem

The Model Context Protocol (MCP) is an emerging standard designed to enable seamless interaction between Large Language Model (LLM) applications and external tools or resources. Within a short period, thousands of MCP services have already been developed and deployed. However, the client-server integration architecture inherent in MCP may expand the attack surface against LLM agent systems, introducing new vulnerabilities that attackers can exploit by designing malicious MCP servers. In this paper, we present the first systematic study of attack vectors targeting the MCP ecosystem. Our analysis identifies four categories of attacks, i.e., Tool Poisoning Attacks, Puppet Attacks, Rug Pull Attacks, and Exploitation via Malicious External Resources. To evaluate the feasibility of these attacks, we conduct experiments following the typical steps of launching an attack through malicious MCP servers: upload-download-attack. Specifically, we first construct malicious MCP servers and successfully upload them to three widely used MCP aggregation platforms. The results indicate that current audit mechanisms are insufficient to identify and prevent the proposed attack methods. Next, through a user study and interviews with 20 participants, we demonstrate that users struggle to identify malicious MCP servers and often unknowingly install them from aggregator platforms. Finally, we demonstrate that these attacks can trigger harmful behaviors within the user's local environment, such as accessing private files or controlling devices to transfer digital assets, by deploying a proof-of-concept (PoC) framework against five leading LLMs. Additionally, based on interview results, we discuss four key challenges faced by the current security ecosystem surrounding MCP servers. These findings underscore the urgent need for robust security mechanisms to defend against malicious MCP servers.

Updated: 2025-08-20 02:42:06

Categories: cs.CR,cs.SE

Download: http://arxiv.org/abs/2506.02040v3

MetAdv: A Unified and Interactive Adversarial Testing Platform for Autonomous Driving

Evaluating and ensuring the adversarial robustness of autonomous driving (AD) systems is a critical and unresolved challenge. This paper introduces MetAdv, a novel adversarial testing platform that enables realistic, dynamic, and interactive evaluation by tightly integrating virtual simulation with physical vehicle feedback. At its core, MetAdv establishes a hybrid virtual-physical sandbox, within which we design a three-layer closed-loop testing environment with dynamic adversarial test evolution. This architecture facilitates end-to-end adversarial evaluation, ranging from high-level unified adversarial generation, through mid-level simulation-based interaction, to low-level execution on physical vehicles. Additionally, MetAdv supports a broad spectrum of AD tasks and algorithmic paradigms (e.g., modular deep learning pipelines, end-to-end learning, and vision-language models). It supports flexible 3D vehicle modeling and seamless transitions between simulated and physical environments, with built-in compatibility for commercial platforms such as Apollo and Tesla. A key feature of MetAdv is its human-in-the-loop capability: beyond flexible environment configuration for customized evaluation, it enables real-time capture of physiological signals and behavioral feedback from drivers, offering new insights into human-machine trust under adversarial conditions. We believe MetAdv can offer a scalable and unified framework for adversarial assessment, paving the way for safer AD.

Updated: 2025-08-20 02:30:56

Categories: cs.RO,cs.AI

Download: http://arxiv.org/abs/2508.06534v2

Enhancing Depression-Diagnosis-Oriented Chat with Psychological State Tracking

Depression-diagnosis-oriented chat aims to guide patients in self-expression to collect key symptoms for depression detection. Recent work focuses on combining task-oriented dialogue and chitchat to simulate interview-based depression diagnosis. However, these methods cannot adequately capture the changing information, feelings, or symptoms of the patient during dialogues. Moreover, no explicit framework has been explored to guide the dialogue, which results in useless exchanges that degrade the experience. In this paper, we propose to integrate Psychological State Tracking (POST) within a large language model (LLM) to explicitly guide depression-diagnosis-oriented chat. Specifically, the state is adapted from a psychological theoretical model and consists of four components, namely Stage, Information, Summary and Next. We fine-tune an LLM to generate the dynamic psychological state, which is then used to assist response generation at each turn to simulate the psychiatrist. Experimental results on the existing benchmark show that our proposed method boosts the performance of all subtasks in depression-diagnosis-oriented chat.
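
The four-component state could be represented, for instance, as a small structure serialized into the LLM prompt before each turn; the field names follow the abstract, while everything else here is illustrative rather than the paper's schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PsychologicalState:
    """The four POST components named in the abstract; the concrete
    field contents are illustrative, not the paper's definitions."""
    stage: str        # where the interview currently is
    information: str  # facts gathered so far
    summary: str      # running summary of the patient's presentation
    next: str         # what the simulated psychiatrist should probe next

    def to_prompt(self) -> str:
        # serialized into the prompt that conditions the next response
        return json.dumps(asdict(self), ensure_ascii=False)
```

A fine-tuned LLM would regenerate such a state each turn and condition its reply on it, which is the "explicit guidance" role the state plays in the proposed pipeline.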

Updated: 2025-08-20 02:28:53

Categories: cs.HC,cs.AI,cs.CL,cs.CY

Download: http://arxiv.org/abs/2403.09717v2

Structure As Search: Unsupervised Permutation Learning for Combinatorial Optimization

We propose a non-autoregressive framework for the Travelling Salesman Problem where solutions emerge directly from learned permutations, without requiring explicit search. By applying a similarity transformation to Hamiltonian cycles, the model learns to approximate permutation matrices via continuous relaxations. Our unsupervised approach achieves competitive performance against classical heuristics, demonstrating that the inherent structure of the problem can effectively guide combinatorial optimization without sequential decision-making.
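
A common way to realize such continuous relaxations of permutation matrices, which may or may not match the paper's exact construction, is Sinkhorn normalization: alternately normalizing rows and columns of a score matrix projects it toward the Birkhoff polytope of doubly-stochastic matrices, whose vertices are exactly the permutation matrices:

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Approximately project a real score matrix onto the set of
    doubly-stochastic matrices by alternating row and column
    normalization in log space (numerically stable)."""
    log_p = logits.copy()
    for _ in range(n_iters):
        log_p -= np.logaddexp.reduce(log_p, axis=1, keepdims=True)  # rows
        log_p -= np.logaddexp.reduce(log_p, axis=0, keepdims=True)  # cols
    return np.exp(log_p)
```

Because the relaxation is differentiable, the score matrix can be trained end-to-end, and a hard tour can be read off afterwards (e.g., by a Hungarian assignment), which is how "solutions emerge directly from learned permutations" without explicit search.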

Updated: 2025-08-20 02:25:21

Categories: cs.LG,cs.AI

Download: http://arxiv.org/abs/2507.04164v2

Learning Point Cloud Representations with Pose Continuity for Depth-Based Category-Level 6D Object Pose Estimation

Category-level object pose estimation aims to predict the 6D pose and 3D size of objects within given categories. Existing approaches for this task rely solely on 6D poses as supervisory signals without explicitly capturing the intrinsic continuity of poses, leading to inconsistencies in predictions and reduced generalization to unseen poses. To address this limitation, we propose HRC-Pose, a novel depth-only framework for category-level object pose estimation, which leverages contrastive learning to learn point cloud representations that preserve the continuity of 6D poses. HRC-Pose decouples object pose into rotation and translation components, which are separately encoded and leveraged throughout the network. Specifically, we introduce a contrastive learning strategy for multi-task, multi-category scenarios based on our 6D pose-aware hierarchical ranking scheme, which contrasts point clouds from multiple categories by considering rotational and translational differences as well as categorical information. We further design pose estimation modules that separately process the learned rotation-aware and translation-aware embeddings. Our experiments demonstrate that HRC-Pose successfully learns continuous feature spaces. Results on REAL275 and CAMERA25 benchmarks show that our method consistently outperforms existing depth-only state-of-the-art methods and runs in real-time, demonstrating its effectiveness and potential for real-world applications. Our code is at https://github.com/zhujunli1993/HRC-Pose.

Updated: 2025-08-20 02:09:02

Categories: cs.CV,cs.AI,cs.RO

Download: http://arxiv.org/abs/2508.14358v1

A Little Human Data Goes A Long Way

Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effects of incrementally replacing human generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human generated data points. We show that matching the performance gain of just a little additional human data (only 200 points) requires an order of magnitude more synthetic data and estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value to having a small proportion of the dataset being human generated.

Updated: 2025-08-20 01:59:58

Categories: cs.CL,cs.AI,cs.LG

Download: http://arxiv.org/abs/2410.13098v3

Organ-Agents: Virtual Human Physiology Simulator via LLMs

Recent advances in large language models (LLMs) have enabled new possibilities in simulating complex physiological systems. We introduce Organ-Agents, a multi-agent framework that simulates human physiology via LLM-driven agents. Each Simulator models a specific system (e.g., cardiovascular, renal, immune). Training consists of supervised fine-tuning on system-specific time-series data, followed by reinforcement-guided coordination using dynamic reference selection and error correction. We curated data from 7,134 sepsis patients and 7,895 controls, generating high-resolution trajectories across 9 systems and 125 variables. Organ-Agents achieved high simulation accuracy on 4,509 held-out patients, with per-system MSEs <0.16 and robustness across SOFA-based severity strata. External validation on 22,689 ICU patients from two hospitals showed moderate degradation under distribution shifts with stable simulation. Organ-Agents faithfully reproduces critical multi-system events (e.g., hypotension, hyperlactatemia, hypoxemia) with coherent timing and phase progression. Evaluation by 15 critical care physicians confirmed realism and physiological plausibility (mean Likert ratings 3.9 and 3.7). Organ-Agents also enables counterfactual simulations under alternative sepsis treatment strategies, generating trajectories and APACHE II scores aligned with matched real-world patients. In downstream early warning tasks, classifiers trained on synthetic data showed minimal AUROC drops (<0.04), indicating preserved decision-relevant patterns. These results position Organ-Agents as a credible, interpretable, and generalizable digital twin for precision diagnosis, treatment simulation, and hypothesis testing in critical care.

Updated: 2025-08-20 01:58:45

Categories: cs.LG,cs.AI,cs.CV

Download: http://arxiv.org/abs/2508.14357v1

Reinitializing weights vs units for maintaining plasticity in neural networks

Loss of plasticity is a phenomenon in which a neural network loses its ability to learn when trained for an extended time on non-stationary data. It is a crucial problem to overcome when designing systems that learn continually. An effective technique for preventing loss of plasticity is reinitializing parts of the network. In this paper, we compare two different reinitialization schemes: reinitializing units vs reinitializing weights. We propose a new algorithm, which we name selective weight reinitialization, for reinitializing the least useful weights in a network. We compare our algorithm to continual backpropagation and ReDo, two previously proposed algorithms that reinitialize units in the network. Through our experiments in continual supervised learning problems, we identify two settings when reinitializing weights is more effective at maintaining plasticity than reinitializing units: (1) when the network has a small number of units and (2) when the network includes layer normalization. Conversely, reinitializing weights and units are equally effective at maintaining plasticity when the network is of sufficient size and does not include layer normalization. We found that reinitializing weights maintains plasticity in a wider variety of settings than reinitializing units.
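
The abstract does not define "least useful", but a magnitude-based stand-in conveys the mechanics of selective weight reinitialization (the paper's actual utility measure may differ):

```python
import numpy as np

def selective_weight_reinit(w, fraction=0.1, scale=0.01, rng=None):
    """Reinitialize the `fraction` of weights with the smallest
    magnitude -- a simple stand-in for the paper's utility measure.
    Returns the new weight array and a boolean mask marking the
    reinitialized entries."""
    rng = rng if rng is not None else np.random.default_rng(0)
    flat = np.abs(w).ravel()
    k = max(1, int(fraction * flat.size))
    idx = np.argpartition(flat, k - 1)[:k]          # k least-useful weights
    mask = np.zeros(w.size, dtype=bool)
    mask[idx] = True
    new_w = w.ravel().copy()
    new_w[mask] = rng.normal(scale=scale, size=k)   # fresh small values
    return new_w.reshape(w.shape), mask.reshape(w.shape)
```

Reinitializing individual weights rather than whole units is exactly the finer-grained alternative the paper compares against continual backpropagation and ReDo.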

Updated: 2025-08-20 01:53:57

Categories: cs.NE,cs.AI

Download: http://arxiv.org/abs/2508.00212v2

SBGD: Improving Graph Diffusion Generative Model via Stochastic Block Diffusion

Graph diffusion generative models (GDGMs) have emerged as powerful tools for generating high-quality graphs. However, their broader adoption faces challenges in scalability and size generalization. GDGMs struggle to scale to large graphs due to their high memory requirements, as they typically operate in the full graph space, requiring the entire graph to be stored in memory during training and inference. This constraint limits their feasibility for large-scale real-world graphs. GDGMs also exhibit poor size generalization, with limited ability to generate graphs of sizes different from those in the training data, restricting their adaptability across diverse applications. To address these challenges, we propose the stochastic block graph diffusion (SBGD) model, which refines graph representations into a block graph space. This space incorporates structural priors based on real-world graph patterns, significantly reducing memory complexity and enabling scalability to large graphs. The block representation also improves size generalization by capturing fundamental graph structures. Empirical results show that SBGD achieves significant memory improvements (up to 6x) while maintaining comparable or even superior graph generation performance relative to state-of-the-art methods. Furthermore, experiments demonstrate that SBGD better generalizes to unseen graph sizes. The significance of SBGD extends beyond being a scalable and effective GDGM; it also exemplifies the principle of modularization in generative modeling, offering a new avenue for exploring generative models by decomposing complex tasks into more manageable components.
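
The block-graph space builds on stochastic-block-style structural priors. A minimal stochastic block model sampler illustrates the kind of coarse structure involved (illustrative background, not the SBGD model itself):

```python
import numpy as np

def sample_sbm(block_sizes, p_matrix, seed=0):
    """Sample an undirected, loop-free graph from a stochastic block
    model: a node in block i links to a node in block j with
    probability p_matrix[i][j].  Returns a 0/1 adjacency matrix."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    n = labels.size
    probs = np.asarray(p_matrix)[labels[:, None], labels[None, :]]
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # sample upper triangle
    return (upper | upper.T).astype(int)              # symmetrize
```

Working at the block level means a model only needs to represent block memberships and a small block-probability matrix rather than the full adjacency, which is the intuition behind the memory savings and size generalization claimed above.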

Updated: 2025-08-20 01:47:46

Categories: cs.LG

Download: http://arxiv.org/abs/2508.14352v1

CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search

Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN's effectiveness across six widely used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves the best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular) and ties for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN's success reach well beyond ANNS optimization: it validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at https://github.com/deepreinforce-ai/CRINN

Updated: 2025-08-20 01:47:01

Categories: cs.LG,cs.AI,cs.CL,cs.DB

Download: http://arxiv.org/abs/2508.02091v2

A Non-Asymptotic Convergent Analysis for Scored-Based Graph Generative Model via a System of Stochastic Differential Equations

Score-based graph generative models (SGGMs) have proven effective in critical applications such as drug discovery and protein synthesis. However, their theoretical behavior, particularly regarding convergence, remains underexplored. Unlike common score-based generative models (SGMs), which are governed by a single stochastic differential equation (SDE), SGGMs involve a system of coupled SDEs. In SGGMs, the graph structure and node features are governed by separate but interdependent SDEs. This distinction makes existing convergence analyses from SGMs inapplicable for SGGMs. In this work, we present the first non-asymptotic convergence analysis for SGGMs, focusing on the convergence bound (the risk of generative error) across three key graph generation paradigms: (1) feature generation with a fixed graph structure, (2) graph structure generation with fixed node features, and (3) joint generation of both graph structure and node features. Our analysis reveals several unique factors specific to SGGMs (e.g., the topological properties of the graph structure) which affect the convergence bound. Additionally, we offer theoretical insights into the selection of hyperparameters (e.g., sampling steps and diffusion length) and advocate for techniques like normalization to improve convergence. To validate our theoretical findings, we conduct a controlled empirical study using synthetic graph models, and the results align with our theoretical predictions. This work deepens the theoretical understanding of SGGMs, demonstrates their applicability in critical domains, and provides practical guidance for designing effective models.
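
The coupled-SDE structure described above can be caricatured with a two-variable Euler-Maruyama integration, one scalar standing in for the graph-structure process and one for the node-feature process; the specific drift and diffusion coefficients below are illustrative, not the paper's:

```python
import numpy as np

def euler_maruyama_coupled(x0, h0, n_steps=1000, dt=1e-3, seed=0):
    """Euler-Maruyama for a toy coupled pair of OU-type SDEs:
        dX = -(X - 0.5 H) dt + 0.1 dW1   (structure variable)
        dH = -(H - 0.5 X) dt + 0.1 dW2   (feature variable)
    The drift coupling mirrors the separate-but-interdependent SDEs
    governing structure and features; coefficients are illustrative."""
    rng = np.random.default_rng(seed)
    x, h = float(x0), float(h0)
    xs, hs = [x], [h]
    sqdt = np.sqrt(dt)
    for _ in range(n_steps):
        dw1, dw2 = rng.normal(scale=sqdt, size=2)  # independent increments
        x, h = (x - (x - 0.5 * h) * dt + 0.1 * dw1,
                h - (h - 0.5 * x) * dt + 0.1 * dw2)
        xs.append(x)
        hs.append(h)
    return np.array(xs), np.array(hs)
```

Because each equation's drift depends on the other variable, neither process can be analyzed in isolation, which is precisely why single-SDE convergence results for ordinary score-based models do not transfer to the coupled setting.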

Updated: 2025-08-20 01:44:42

Subjects: cs.LG,stat.ML

Download: http://arxiv.org/abs/2508.14351v1

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 against default baselines across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. In addition to the default baseline provided by KernelBench, CUDA-L1 demonstrates x2.77 over Torch Compile, x2.88 over Torch Compile with reduce-overhead mode, x2.81 over CUDA Graph implementations, and remarkably x7.72 over cuDNN libraries. Furthermore, the model also demonstrates portability across different GPU architectures. Beyond these benchmark results, CUDA-L1 demonstrates several properties: it 1) discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations; 3) identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance. These capabilities demonstrate that RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
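The headline numbers (average x3.12, median x1.42) are per-kernel speedups aggregated across the benchmark suite; a minimal sketch of that aggregation follows. The function name is illustrative, not part of CUDA-L1 or KernelBench.

```python
from statistics import median


def summarize_speedups(baseline_ms, optimized_ms):
    """Per-kernel speedup = baseline time / optimized time.

    Returns (mean, median). With heavy-tailed speedups (a few x120
    outliers), the mean can sit far above the median, which is why
    both are reported.
    """
    speedups = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    return sum(speedups) / len(speedups), median(speedups)
```

For example, kernel timings of [10, 8, 120] ms optimized to [10, 4, 1] ms give speedups [1, 2, 120]: a mean of 41.0 driven by one outlier, but a median of only 2.0.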

Updated: 2025-08-20 01:41:45

Subjects: cs.AI,cs.DC,cs.LG

Download: http://arxiv.org/abs/2507.14111v7

HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation

Sign Language Recognition (SLR) models face significant performance limitations due to insufficient training data availability. In this article, we address the challenge of limited data in SLR by introducing a novel and lightweight sign generation model based on CMLPe. This model, coupled with a synthetic data pretraining approach, consistently improves recognition accuracy, establishing new state-of-the-art results for the LSFB and DiSPLaY datasets using our Mamba-SL and Transformer-SL classifiers. Our findings reveal that synthetic data pretraining outperforms traditional augmentation methods in some cases and yields complementary benefits when implemented alongside them. Our approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.

Updated: 2025-08-20 01:38:24

Subjects: cs.CV,cs.LG

Download: http://arxiv.org/abs/2508.14345v1

Inter-Class Relational Loss for Small Object Detection: A Case Study on License Plates

In one-stage multi-object detection tasks, various intersection over union (IoU)-based solutions aim at smooth and stable convergence near the targets during training. However, IoU-based losses fail to correctly update the gradient of small objects due to an extremely flat gradient. During the update of multiple objects, the learning of small objects' gradients suffers more because of insufficient gradient updates. Therefore, we propose an inter-class relational loss to efficiently update the gradient of small objects while not sacrificing the learning efficiency of other objects, based on the simple fact that an object has a spatial relationship to another object (e.g., a car plate is attached to a car in a similar position). When the predicted car plate's bounding box is not within its car, a loss penalty is added to guide the learning, which is inversely proportional to the overlapped area of the car's and predicted car plate's bounding boxes. By leveraging the spatial relationship at the inter-class level, the loss guides small object predictions using larger objects and enhances latent information in deeper feature maps. In this paper, we present twofold contributions using license plate detection as a case study: (1) a new small vehicle multi-license plate dataset (SVMLP), featuring diverse real-world scenarios with high-quality annotations; and (2) a novel inter-class relational loss function designed to promote effective detection performance. We highlight that the proposed ICR loss penalty can be easily added to existing IoU-based losses and enhances their performance. These contributions improve the standard mean Average Precision (mAP) metric, achieving gains of 10.3% and 1.6% in mAP$^{\text{test}}_{50}$ for YOLOv12-T and UAV-DETR, respectively, without any additional hyperparameter tuning. Code and dataset will be available soon.
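The described penalty, zero when the predicted plate box lies inside its car box and inversely proportional to the car/plate overlap otherwise, can be sketched as follows. This is a hypothetical form for intuition; the paper's exact formulation may differ.

```python
def overlap_area(box_a, box_b):
    """Intersection area of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(w, 0.0) * max(h, 0.0)


def icr_penalty(car_box, plate_box, eps=1.0):
    """Inter-class relational penalty (illustrative form).

    Zero when the predicted plate is fully contained in its car's box;
    otherwise it grows as the car/plate overlap shrinks, so a plate
    predicted far from its car is penalized most.
    """
    inter = overlap_area(car_box, plate_box)
    plate_area = overlap_area(plate_box, plate_box)  # box's own area
    if inter >= plate_area:       # plate fully inside the car box
        return 0.0
    return 1.0 / (inter + eps)    # inversely proportional to the overlap
```

Because the penalty depends only on box coordinates, it can be added on top of any existing IoU-based loss term without changing the detector architecture.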

Updated: 2025-08-20 01:37:17

Subjects: cs.CV,cs.AI

Download: http://arxiv.org/abs/2508.14343v1

Generative AI Against Poaching: Latent Composite Flow Matching for Wildlife Conservation

Poaching poses significant threats to wildlife and biodiversity. A valuable step in reducing poaching is to forecast poacher behavior, which can inform patrol planning and other conservation interventions. Existing poaching prediction methods based on linear models or decision trees lack the expressivity to capture complex, nonlinear spatiotemporal patterns. Recent advances in generative modeling, particularly flow matching, offer a more flexible alternative. However, training such models on real-world poaching data faces two central obstacles: imperfect detection of poaching events and limited data. To address imperfect detection, we integrate flow matching with an occupancy-based detection model and train the flow in latent space to infer the underlying occupancy state. To mitigate data scarcity, we adopt a composite flow initialized from a linear-model prediction rather than from random noise, as is standard in diffusion models, thereby injecting prior knowledge and improving generalization. Evaluations on datasets from two national parks in Uganda show consistent gains in predictive accuracy.
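The composite-flow idea, starting the flow from a linear-model prediction instead of noise, can be sketched with the standard straight-line flow-matching path and its conditional velocity target. This is a minimal illustration, not the paper's latent-space occupancy model.

```python
def composite_flow_path(x_prior, x_target, t):
    """Straight-line flow-matching path x_t = (1 - t) * x_prior + t * x_target.

    In a composite flow, x_prior is a linear-model prediction rather
    than a Gaussian noise sample, so the path starts from an informed
    guess and the network only has to learn the correction.
    """
    return [(1 - t) * p + t * q for p, q in zip(x_prior, x_target)]


def target_velocity(x_prior, x_target):
    """Conditional velocity the network regresses: d x_t / d t = x_target - x_prior."""
    return [q - p for p, q in zip(x_prior, x_target)]
```

With an accurate prior, the target velocities are small, which is what makes this initialization helpful in the low-data regime.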

Updated: 2025-08-20 01:35:51

Subjects: cs.LG,cs.AI,cs.MA

Download: http://arxiv.org/abs/2508.14342v1

Dominated Actions in Imperfect-Information Games

Dominance is a fundamental concept in game theory. In strategic-form games dominated strategies can be identified in polynomial time. As a consequence, iterative removal of dominated strategies can be performed efficiently as a preprocessing step for reducing the size of a game before computing a Nash equilibrium. For imperfect-information games in extensive form, we could convert the game to strategic form and then iteratively remove dominated strategies in the same way; however, this conversion may cause an exponential blowup in game size. In this paper we define and study the concept of dominated actions in imperfect-information games. Our main result is a polynomial-time algorithm for determining whether an action is dominated (strictly or weakly) by any mixed strategy in n-player games, which can be extended to an algorithm for iteratively removing dominated actions. This allows us to efficiently reduce the size of the game tree as a preprocessing step for Nash equilibrium computation. We explore the role of dominated actions empirically in the "All In or Fold" No-Limit Texas Hold'em poker variant.
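For intuition, strict dominance of an action by a mixture of two other actions can be checked by brute force over mixing weights. The paper's contribution is an exact polynomial-time algorithm (over arbitrary mixed strategies in imperfect-information games); the grid search below is only an illustration on a strategic-form payoff table.

```python
def strictly_dominated_by_mixture(payoffs, a, b, c, grid=101):
    """Check whether action `a` is strictly dominated by some mix of
    actions `b` and `c`, by scanning mixing weights w in [0, 1].

    `payoffs[action][s]` is the payoff of `action` against opponent
    state/strategy `s`. Illustrative brute force, not the paper's LP.
    """
    states = range(len(payoffs[a]))
    for i in range(grid):
        w = i / (grid - 1)  # weight on action b
        if all(w * payoffs[b][s] + (1 - w) * payoffs[c][s] > payoffs[a][s]
               for s in states):
            return True
    return False
```

In the classic example with M = [1, 1], T = [3, 0], B = [0, 3], neither T nor B alone dominates M, but the 50/50 mixture earns 1.5 in both states, so M is strictly dominated by a mixed strategy.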

Updated: 2025-08-20 01:33:06

Subjects: cs.GT,cs.AI,cs.MA,econ.TH

Download: http://arxiv.org/abs/2504.09716v3

A Comparative Evaluation of Teacher-Guided Reinforcement Learning Techniques for Autonomous Cyber Operations

Autonomous Cyber Operations (ACO) rely on Reinforcement Learning (RL) to train agents to make effective decisions in the cybersecurity domain. However, existing ACO applications require agents to learn from scratch, leading to slow convergence and poor early-stage performance. While teacher-guided techniques have demonstrated promise in other domains, they have not yet been applied to ACO. In this study, we implement four distinct teacher-guided techniques in the simulated CybORG environment and conduct a comparative evaluation. Our results demonstrate that teacher integration can significantly improve training efficiency in terms of early policy performance and convergence speed, highlighting its potential benefits for autonomous cybersecurity.
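One common family of teacher-guided techniques is action advice with an annealed teacher probability: early in training the agent defers to the teacher, then gradually relies on its own policy. The sketch below is illustrative and is not claimed to be one of the four techniques the study evaluates.

```python
import random


def pick_action(student_policy, teacher_policy, state, step,
                decay_steps=10_000, rng=random):
    """Teacher-guided action selection with a linear anneal.

    p_teacher starts at 1.0 and decays to 0.0 over `decay_steps`,
    so early-stage behavior tracks the teacher (good initial
    performance) while late training is purely the student.
    """
    p_teacher = max(0.0, 1.0 - step / decay_steps)
    if rng.random() < p_teacher:
        return teacher_policy(state)
    return student_policy(state)
```

At step 0 the teacher always acts; at or beyond `decay_steps` the student always acts, so the guidance cannot cap the student's final policy.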

Updated: 2025-08-20 01:30:27

Subjects: cs.LG,cs.AI

Download: http://arxiv.org/abs/2508.14340v1

On the Interplay between Graph Structure and Learning Algorithms in Graph Neural Networks

This paper studies the interplay between learning algorithms and graph structure for graph neural networks (GNNs). Existing theoretical studies on the learning dynamics of GNNs primarily focus on the convergence rates of learning algorithms under the interpolation regime (noise-free) and offer only a crude connection between these dynamics and the actual graph structure (e.g., maximum degree). This paper aims to bridge this gap by investigating the excess risk (generalization performance) of learning algorithms in GNNs within the generalization regime (with noise). Specifically, we extend the conventional settings from the learning theory literature to the context of GNNs and examine how graph structure influences the performance of learning algorithms such as stochastic gradient descent (SGD) and Ridge regression. Our study makes several key contributions toward understanding the interplay between graph structure and learning in GNNs. First, we derive the excess risk profiles of SGD and Ridge regression in GNNs and connect these profiles to the graph structure through spectral graph theory. With this established framework, we further explore how different graph structures (regular vs. power-law) impact the performance of these algorithms through comparative analysis. Additionally, we extend our analysis to multi-layer linear GNNs, revealing an increasing non-isotropic effect on the excess risk profile, thereby offering new insights into the over-smoothing issue in GNNs from the perspective of learning algorithms. Our empirical results align with our theoretical predictions, collectively showcasing a coupling relation among graph structure, GNNs and learning algorithms, and providing insights on GNN algorithm design and selection in practice.

Updated: 2025-08-20 01:26:56

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.14338v1

NeRC: Neural Ranging Correction through Differentiable Moving Horizon Location Estimation

GNSS localization using everyday mobile devices is challenging in urban environments, as ranging errors caused by the complex propagation of satellite signals and low-quality onboard GNSS hardware undermine positioning accuracy. Researchers have pinned their hopes on data-driven methods to regress such ranging errors from raw measurements. However, the laborious annotation of ranging errors impedes their progress. This paper presents a robust end-to-end Neural Ranging Correction (NeRC) framework, where localization-related metrics serve as the task objective for training the neural modules. Instead of seeking impractical ranging error labels, we train the neural network using ground-truth locations that are relatively easy to obtain. This functionality is supported by differentiable moving horizon location estimation (MHE) that handles a horizon of measurements for positioning and backpropagates the gradients for training. Even better, as a blessing of end-to-end learning, we propose a new training paradigm using Euclidean Distance Field (EDF) cost maps, which alleviates the demands on labeled locations. We evaluate the proposed NeRC on public benchmarks and our collected datasets, demonstrating its distinguished improvement in positioning accuracy. We also deploy NeRC on the edge to verify its real-time performance for mobile devices.

Updated: 2025-08-20 01:23:32

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.14336v1

Modeling Relational Logic Circuits for And-Inverter Graph Convolutional Network

The automation of logic circuit design enhances chip performance, energy efficiency, and reliability, and is widely applied in the field of Electronic Design Automation (EDA). And-Inverter Graphs (AIGs) efficiently represent, optimize, and verify the functional characteristics of digital circuits, enhancing the efficiency of EDA development. Due to the complex structure and large number of nodes in real-world AIGs, accurate modeling is challenging: existing work lacks the ability to jointly model functional and structural characteristics, and has insufficient dynamic information propagation capability. To address these challenges, we propose AIGer. Specifically, AIGer consists of two components: 1) a node logic feature initialization embedding component and 2) an AIGs feature learning network component. The node logic feature initialization embedding component projects logic nodes, such as AND and NOT, into independent semantic spaces to enable effective node embedding for subsequent processing. Building upon this, the AIGs feature learning network component employs a heterogeneous graph convolutional network, designing dynamic relationship weight matrices and differentiated information aggregation approaches to better represent the original structure and information of AIGs. The combination of these two components enhances AIGer's ability to jointly model functional and structural characteristics and improves its message passing capability. Experimental results indicate that AIGer outperforms the current best models in the Signal Probability Prediction (SSP) task, improving MAE and MSE by 18.95% and 44.44%, respectively. In the Truth Table Distance Prediction (TTDP) task, AIGer achieves improvements of 33.57% and 14.79% in MAE and MSE, respectively, compared to the best-performing models.
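The AIG structure AIGer operates on can be illustrated with a toy evaluator in which every gate is a two-input AND and each fan-in carries an optional inversion flag. This sketches the data structure only, not AIGer's learned model.

```python
def eval_aig(nodes, inputs):
    """Evaluate an And-Inverter Graph.

    `nodes` maps a gate name to ((src_a, inv_a), (src_b, inv_b)): each
    fan-in is a source name plus a boolean inversion flag. `inputs`
    maps primary input names to boolean values. Returns the value of
    every gate.
    """
    values = dict(inputs)

    def evaluate(name):
        if name in values:               # primary input or cached gate
            return values[name]
        (sa, ia), (sb, ib) = nodes[name]
        a = evaluate(sa) ^ ia            # XOR applies the inversion flag
        b = evaluate(sb) ^ ib
        values[name] = a and b           # every gate is a 2-input AND
        return values[name]

    return {name: bool(evaluate(name)) for name in nodes}
```

For example, a single AND gate with both fan-ins inverted computes NOR: it outputs True only when both inputs are False, which is how AIGs express all Boolean functions using only AND and inversion.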

Updated: 2025-08-20 01:16:52

Subjects: cs.AI

Download: http://arxiv.org/abs/2508.11991v3

Multi-view Graph Condensation via Tensor Decomposition

Graph Neural Networks (GNNs) have demonstrated remarkable results in various real-world applications, including drug discovery, object detection, social media analysis, recommender systems, and text classification. In contrast to their vast potential, training them on large-scale graphs presents significant computational challenges due to the resources required for their storage and processing. Graph Condensation has emerged as a promising solution to reduce these demands by learning a synthetic compact graph that preserves the essential information of the original one while maintaining the GNN's predictive performance. Despite their efficacy, current graph condensation approaches frequently rely on a computationally intensive bi-level optimization. Moreover, they fail to maintain a mapping between synthetic and original nodes, limiting the interpretability of the model's decisions. In this sense, a wide range of decomposition techniques have been applied to learn linear or multi-linear functions from graph data, offering a more transparent and less resource-intensive alternative. However, their applicability to graph condensation remains unexplored. This paper addresses this gap and proposes a novel method called Multi-view Graph Condensation via Tensor Decomposition (GCTD) to investigate to what extent such techniques can synthesize an informative smaller graph and achieve comparable downstream task performance. Extensive experiments on six real-world datasets demonstrate that GCTD effectively reduces graph size while preserving GNN performance, achieving up to a 4.0% improvement in accuracy on three out of six datasets and competitive performance on large graphs compared to existing approaches. Our code is available at https://anonymous.4open.science/r/gctd-345A.

Updated: 2025-08-20 01:02:18

Subjects: cs.LG

Download: http://arxiv.org/abs/2508.14330v1

Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early-Stopping

We study nonparametric regression using an over-parameterized two-layer neural network trained with algorithmic guarantees in this paper. We consider the setting where the training features are drawn uniformly from the unit sphere in $\mathbb{R}^d$, and the target function lies in an interpolation space commonly studied in statistical learning theory. We demonstrate that training the neural network with a novel Preconditioned Gradient Descent (PGD) algorithm, equipped with early stopping, achieves a sharp regression rate of $\mathcal O(n^{-\frac{2\alpha s'}{2\alpha s'+1}})$ when the target function is in the interpolation space $[\mathcal H_K]^{s'}$ with $s' \ge 3$. This rate is even sharper than the currently known nearly-optimal rate of $\mathcal O(n^{-\frac{2\alpha s'}{2\alpha s'+1}})\log^2(1/\delta)$~\citep{Li2024-edr-general-domain}, where $n$ is the size of the training data and $\delta \in (0,1)$ is a small probability. This rate is also sharper than the standard kernel regression rate of $\mathcal O(n^{-\frac{2\alpha}{2\alpha+1}})$ obtained under the regular Neural Tangent Kernel (NTK) regime when training the neural network with vanilla gradient descent (GD), where $2\alpha = d/(d-1)$. Our analysis is based on two key technical contributions. First, we present a principled decomposition of the network output at each PGD step into a function in the reproducing kernel Hilbert space (RKHS) of a newly induced integral kernel, and a residual function with small $L^{\infty}$-norm. Second, leveraging this decomposition, we apply local Rademacher complexity theory to tightly control the complexity of the function class comprising all the neural network functions obtained in the PGD iterates. Our results further suggest that PGD enables the neural network to escape the linear NTK regime and achieve improved generalization by inducing a new integral kernel of lower kernel complexity.
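The preconditioning idea can be illustrated on a diagonal quadratic, where a preconditioner approximating $H^{-1}$ equalizes convergence speed across well- and ill-conditioned directions. This is a toy illustration of preconditioned gradient descent, not the paper's PGD algorithm for neural networks.

```python
def pgd_quadratic(h_diag, b, precond_diag, lr=0.1, steps=200):
    """Preconditioned gradient descent on f(w) = 0.5 * w^T H w - b^T w
    with diagonal H and a diagonal preconditioner P:

        w <- w - lr * P * grad f(w),   grad f(w) = H w - b.

    With P ~ H^{-1}, every coordinate contracts at the same rate,
    so ill-conditioned directions no longer dominate the runtime.
    """
    w = [0.0] * len(b)
    for _ in range(steps):
        grad = [h * wi - bi for h, wi, bi in zip(h_diag, w, b)]
        w = [wi - lr * p * g for wi, p, g in zip(w, precond_diag, grad)]
    return w
```

With curvatures [1, 100] and preconditioner [1, 0.01], both coordinates contract by the same factor 0.9 per step and reach the minimizer w* = b / h quickly; vanilla GD at the same learning rate would diverge in the stiff direction.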

Updated: 2025-08-20 00:20:54

Subjects: stat.ML,cs.LG,math.ST,stat.TH

Download: http://arxiv.org/abs/2407.11353v3

Power Stabilization for AI Training Datacenters

Large Artificial Intelligence (AI) training workloads spanning several tens of thousands of GPUs present unique power management challenges. These arise due to the high variability in power consumption during the training. Given the synchronous nature of these jobs, during every iteration there is a computation-heavy phase, where each GPU works on the local data, and a communication-heavy phase where all the GPUs synchronize on the data. Because compute-heavy phases require much more power than communication phases, large power swings occur. The amplitude of these power swings is ever increasing with the increase in the size of training jobs. An even bigger challenge arises from the frequency spectrum of these power swings, which, if it aligns with critical resonance frequencies of the utility grid, can cause physical damage to the power grid infrastructure. Therefore, to continue scaling AI training workloads safely, we need to stabilize the power of such workloads. This paper introduces the challenge with production data and explores innovative solutions across the stack: software, GPU hardware, and datacenter infrastructure. We present the pros and cons of each of these approaches and finally present a multi-pronged approach to solving the challenge. The proposed solutions are rigorously tested using a combination of real hardware and Microsoft's in-house cloud power simulator, providing critical insights into the efficacy of these interventions under real-world conditions.
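One simple stabilization strategy in the general family the paper surveys is a ramp-rate limiter that bounds how fast drawn power may change between samples, flattening the compute/communication square wave. The sketch below is illustrative only and is not claimed to be the paper's implementation.

```python
def limit_ramp(power_trace, max_step):
    """Ramp-rate limiter over a sampled power trace (e.g. watts per tick).

    Each output sample may differ from the previous one by at most
    `max_step`, so an abrupt compute<->communication transition becomes
    a bounded ramp, shrinking the swing amplitude the grid sees.
    """
    out = [power_trace[0]]
    for p in power_trace[1:]:
        prev = out[-1]
        delta = max(-max_step, min(max_step, p - prev))
        out.append(prev + delta)
    return out
```

For a square wave [100, 1000, 1000, 100] with a 300 W/tick limit, the output is [100, 400, 700, 400]: the 900 W step is spread over several ticks, which also shifts the swing's energy to lower frequencies.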

Updated: 2025-08-20 00:04:06

Subjects: cs.AR,cs.AI,cs.DC

Download: http://arxiv.org/abs/2508.14318v1

By Xinhai (Sean) Zou.